PREFACE


Changes to the Fourth Edition 


• I have reorganized many main results that were included in the body of the
text by labeling them as theorems in order to help students find and
reference these results.


• I have pulled the important definitions and assumptions out of the body of the
text and labeled them as such so that they stand out better.


• When a new topic is introduced, I introduce it with a motivating example before
delving into the mathematical formalities. Then I return to the example to
illustrate the newly introduced material.


• I moved the material on the law of large numbers and the central limit theorem
to a new Chapter 6. It seemed more natural to deal with the main large-sample 
results together. 


• I moved the section on Markov chains into Chapter 3. Every time I cover this
material with my own students, I stumble over not being able to refer to random 
variables, distributions, and conditional distributions. I have actually postponed 
this material until after introducing distributions, and then gone back to cover 
Markov chains. I feel that the time has come to place it in a more natural 
location. I also added some material on stationary distributions of Markov 
chains. 


• I have moved the lengthy proofs of several theorems to the ends of their
respective sections in order to improve the flow of the presentation of ideas.

• I rewrote Section 7.1 to make the introduction to inference clearer.

• I rewrote Section 9.1 as a more complete introduction to hypothesis testing,
including likelihood ratio tests. For instructors not interested in the more
mathematical theory of hypothesis testing, it should now be easier to skip from
Section 9.1 directly to Section 9.5.


Some other changes that readers will notice: 


• I have replaced the notation in which the intersection of two sets A and B had
been represented AB with the more popular A ∩ B. The old notation, although
mathematically sound, seemed a bit arcane for a text at this level.

• I added the statements of Stirling’s formula and Jensen’s inequality.

• I moved the law of total probability and the discussion of partitions of a sample
space from Section 2.3 to Section 2.1.

• I define the cumulative distribution function (c.d.f.) as the preferred name of
what used to be called only the distribution function (d.f.).

• I added some discussion of histograms in Chapters 3 and 6.


• I rearranged the topics in Sections 3.8 and 3.9 so that simple functions of random
variables appear first and the general formulations appear at the end to make
it easier for instructors who want to avoid some of the more mathematically
challenging parts.

• I emphasized the closeness of a hypergeometric distribution with a large
number of available items to a binomial distribution.


• I gave a brief introduction to Chernoff bounds. These are becoming increasingly
important in computer science, and their derivation requires only material that
is already in the text.

• I changed the definition of confidence interval to refer to the random interval
rather than the observed interval. This makes statements less cumbersome, and
it corresponds to more modern usage.

• I added a brief discussion of the method of moments in Section 7.6.

• I added brief introductions to Newton’s method and the EM algorithm in
Chapter 7.

• I introduced the concept of pivotal quantity to facilitate construction of
confidence intervals in general.

• I added the statement of the large-sample distribution of the likelihood ratio
test statistic. I then used this as an alternative way to test the null hypothesis
that two normal means are equal when it is not assumed that the variances are
equal.

• I moved the Bonferroni inequality into the main text (Chapter 1) and later
(Chapter 11) used it as a way to construct simultaneous tests and confidence
intervals.


How to Use This Book 


The text is somewhat long for complete coverage in a one-year course at the under- 
graduate level and is designed so that instructors can make choices about which topics 
are most important to cover and which can be left for more in-depth study. As an ex- 
ample, many instructors wish to deemphasize the classical counting arguments that 
are detailed in Sections 1.7-1.9. An instructor who only wants enough information 
to be able to cover the binomial and/or multinomial distributions can safely dis- 
cuss only the definitions and theorems on permutations, combinations, and possibly 
multinomial coefficients. Just make sure that the students realize what these values 
count, otherwise the associated distributions will make no sense. The various exam- 
ples in these sections are helpful, but not necessary, for understanding the important 
distributions. Another example is Section 3.9 on functions of two or more random 
variables. The use of Jacobians for general multivariate transformations might be 
more mathematics than the instructors of some undergraduate courses are willing 
to cover. The entire section could be skipped without causing problems later in the 
course, but some of the more straightforward cases early in the section (such as con- 
volution) might be worth introducing. The material in Sections 9.2-9.4 on optimal 
tests in one-parameter families is pretty mathematics, but it is of interest primarily 
to graduate students who require a very deep understanding of hypothesis testing 
theory. The rest of Chapter 9 covers everything that an undergraduate course really 
needs. 

In addition to the text, the publisher has an Instructor’s Solutions Manual,
available for download from the Instructor Resource Center at
www.pearsonhighered.com/irc, which includes some specific advice about many of
the sections of the text.
I have taught a year-long probability and statistics sequence from earlier editions of 
this text for a group of mathematically well-trained juniors and seniors. In the first 
semester, I covered what was in the earlier edition but is now in the first five chap- 
ters (including the material on Markov chains) and parts of Chapter 6. In the second 
semester, I covered the rest of the new Chapter 6, Chapters 7-9, Sections 11.1-11.5, 
and Chapter 12. I have also taught a one-semester probability and random processes
course for engineers and computer scientists. I covered what was in the old edition
and is now in Chapters 1-6 and 12, including Markov chains, but not Jacobians. This 
latter course did not emphasize mathematical derivation to the same extent as the 
course for mathematics students. 

A number of sections are designated with an asterisk (*). The asterisk indicates
that later sections do not depend materially on that section. The designation
is not intended to suggest that instructors skip these sections; it means only that
skipping one of them will not cause students to miss definitions or results that
they will need later. The sections are 2.4, 3.10, 4.8, 7.7, 7.8, 7.9, 8.6, 8.8, 9.2, 9.3, 9.4, 9.8, 9.9, 10.6,
10.7, 10.8, 11.4, 11.7, 11.8, and 12.5. Aside from cross-references between sections 
within this list, occasional material from elsewhere in the text does refer back to 
some of the sections in this list. Each of the dependencies is quite minor, however. 
Most of the dependencies involve references from Chapter 12 back to one of the 
optional sections. The reason for this is that the optional sections address some of 
the more difficult material, and simulation is most useful for solving those difficult 
problems that cannot be solved analytically. Except for passing references that help 
put material into context, the dependencies are as follows: 


• The sample distribution function (Section 10.6) is reintroduced during the
discussion of the bootstrap in Section 12.6. The sample distribution function
is also a useful tool for displaying simulation results. It could be introduced as
early as Example 12.3.7 simply by covering the first subsection of Section 10.6.

• The material on robust estimation (Section 10.7) is revisited in some simulation
exercises in Section 12.2 (Exercises 4, 5, 7, and 8).

• Example 12.3.4 makes reference to the material on two-way analysis of variance
(Sections 11.7 and 11.8).


Supplements 


The text is accompanied by the following supplementary material: 


• Instructor’s Solutions Manual contains fully worked solutions to all exercises
in the text. Available for download from the Instructor Resource Center at
www.pearsonhighered.com/irc.

• Student Solutions Manual contains fully worked solutions to all odd exercises in
the text. Available for purchase from MyPearsonStore at www.mypearsonstore.com.
(ISBN-13: 978-0-321-71598-2; ISBN-10: 0-321-71598-5)


1.1 The History of Probability


The use of probability to measure uncertainty and variability dates back hundreds 
of years. Probability has found application in areas as diverse as medicine, gam- 
bling, weather forecasting, and the law. 


The concepts of chance and uncertainty are as old as civilization itself. People have 
always had to cope with uncertainty about the weather, their food supply, and other 
aspects of their environment, and have striven to reduce this uncertainty and its 
effects. Even the idea of gambling has a long history. By about the year 3500 B.C.,
games of chance played with bone objects that could be considered precursors of 
dice were apparently highly developed in Egypt and elsewhere. Cubical dice with 
markings virtually identical to those on modern dice have been found in Egyptian 
tombs dating from 2000 B.C. We know that gambling with dice has been popular ever
since that time and played an important part in the early development of probability 
theory. 

It is generally believed that the mathematical theory of probability was started by 
the French mathematicians Blaise Pascal (1623-1662) and Pierre Fermat (1601-1665) 
when they succeeded in deriving exact probabilities for certain gambling problems 
involving dice. Some of the problems that they solved had been outstanding for about 
300 years. However, numerical probabilities of various dice combinations had been 
calculated previously by Girolamo Cardano (1501-1576) and Galileo Galilei
(1564-1642).

The theory of probability has been developed steadily since the seventeenth 
century and has been widely applied in diverse fields of study. Today, probability 
theory is an important tool in most areas of engineering, science, and management. 
Many research workers are actively engaged in the discovery and establishment of 
new applications of probability in fields such as medicine, meteorology, photography 
from satellites, marketing, earthquake prediction, human behavior, the design of 
computer systems, finance, genetics, and law. In many legal proceedings involving 
antitrust violations or employment discrimination, both sides will present probability 
and statistical calculations to help support their cases. 


References


The ancient history of gambling and the origins of the mathematical theory of prob- 
ability are discussed by David (1988), Ore (1960), Stigler (1986), and Todhunter 
(1865). 

Some introductory books on probability theory, which discuss many of the same 
topics that will be studied in this book, are Feller (1968); Hoel, Port, and Stone (1971); 
Meyer (1970); and Olkin, Gleser, and Derman (1980). Other introductory books, 
which discuss both probability theory and statistics at about the same level as they 
will be discussed in this book, are Brunk (1975); Devore (1999); Fraser (1976); Hogg 
and Tanis (1997); Kempthorne and Folks (1971); Larsen and Marx (2001); Larson 
(1974); Lindgren (1976); Miller and Miller (1999); Mood, Graybill, and Boes (1974); 
Rice (1995); and Wackerly, Mendenhall, and Scheaffer (2008).


1.2 Interpretations of Probability 


This section describes three common operational interpretations of probability. 
Although the interpretations may seem incompatible, it is fortunate that the calcu- 
lus of probability (the subject matter of the first six chapters of this book) applies 
equally well no matter which interpretation one prefers. 


In addition to the many formal applications of probability theory, the concept of 
probability enters our everyday life and conversation. We often hear and use such 
expressions as “It probably will rain tomorrow afternoon,” “It is very likely that 
the plane will arrive late,” or “The chances are good that he will be able to join us 
for dinner this evening.” Each of these expressions is based on the concept of the 
probability, or the likelihood, that some specific event will occur. 

Despite the fact that the concept of probability is such a common and natural 
part of our experience, no single scientific interpretation of the term probability is 
accepted by all statisticians, philosophers, and other authorities. Through the years, 
each interpretation of probability that has been proposed by some authorities has 
been criticized by others. Indeed, the true meaning of probability is still a highly 
controversial subject and is involved in many current philosophical discussions per- 
taining to the foundations of statistics. Three different interpretations of probability 
will be described here. Each of these interpretations can be very useful in applying 
probability theory to practical problems. 


The Frequency Interpretation of Probability 


In many problems, the probability that some specific outcome of a process will be 
obtained can be interpreted to mean the relative frequency with which that outcome 
would be obtained if the process were repeated a large number of times under similar 
conditions. For example, the probability of obtaining a head when a coin is tossed is 
considered to be 1/2 because the relative frequency of heads should be approximately 
1/2 when the coin is tossed a large number of times under similar conditions. In other 
words, it is assumed that the proportion of tosses on which a head is obtained would 
be approximately 1/2. 
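
The idea of a relative frequency settling near 1/2 can be illustrated by
simulation. The following Python sketch is only an illustration and is not part
of the formal development. It assumes that a pseudorandom number generator is an
acceptable stand-in for a physical coin; the function name
relative_frequency_of_heads and the seed value are hypothetical choices made here.

```python
import random


def relative_frequency_of_heads(num_tosses, seed=0):
    """Simulate num_tosses tosses of a coin and return the fraction of heads."""
    # ASSUMPTION: a seeded pseudorandom generator stands in for a real coin.
    rng = random.Random(seed)
    heads = sum(1 for _ in range(num_tosses) if rng.random() < 0.5)
    return heads / num_tosses


# The relative frequency typically moves closer to 1/2 as the number of tosses grows.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

With few tosses the printed frequency can be far from 1/2; with many tosses it
is usually close, although nothing in the simulation says how close is close
enough.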

Of course, the conditions mentioned in this example are too vague to serve as the 
basis for a scientific definition of probability. First, a “large number” of tosses of the 
coin is specified, but there is no definite indication of an actual number that would
be considered large enough. Second, it is stated that the coin should be tossed each
time “under similar conditions,” but these conditions are not described precisely. The 
conditions under which the coin is tossed must not be completely identical for each 
toss because the outcomes would then be the same, and there would be either all 
heads or all tails. In fact, a skilled person can toss a coin into the air repeatedly and 
catch it in such a way that a head is obtained on almost every toss. Hence, the tosses 
must not be completely controlled but must have some “random” features. 

Furthermore, it is stated that the relative frequency of heads should be
“approximately 1/2,” but no limit is specified for the permissible variation from 1/2. If a coin
were tossed 1,000,000 times, we would not expect to obtain exactly 500,000 heads. 
Indeed, we would be extremely surprised if we obtained exactly 500,000 heads. On 
the other hand, neither would we expect the number of heads to be very far from 
500,000. It would be desirable to be able to make a precise statement of the like- 
lihoods of the different possible numbers of heads, but these likelihoods would of 
necessity depend on the very concept of probability that we are trying to define. 
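
To see how small the chance of exactly 500,000 heads is, one can carry out the
calculation under a model in which the tosses are independent and each toss
yields a head with probability 1/2. That model is an assumption introduced here
for illustration, and it already presupposes the concept of probability, which is
precisely the circularity just noted. The function name prob_exact_heads is
hypothetical.

```python
import math


def prob_exact_heads(num_tosses, num_heads):
    """Probability of exactly num_heads heads in num_tosses tosses.

    ASSUMPTION: tosses are independent and each is a head with probability 1/2.
    """
    # Work with logarithms of factorials (via lgamma) to avoid huge integers.
    log_count = (math.lgamma(num_tosses + 1)
                 - math.lgamma(num_heads + 1)
                 - math.lgamma(num_tosses - num_heads + 1))
    return math.exp(log_count - num_tosses * math.log(2))


p_exact = prob_exact_heads(1_000_000, 500_000)
print(p_exact)  # roughly 0.0008
```

Under this model, exactly 500,000 heads would occur in fewer than one out of
every thousand such experiments, which agrees with the surprise described above.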

Another shortcoming of the frequency interpretation of probability is that it
applies only to a problem in which there can be, at least in principle, a large number of 
similar repetitions of a certain process. Many important problems are not of this type. 
For example, the frequency interpretation of probability cannot be applied directly 
to the probability that a specific acquaintance will get married within the next two 
years or to the probability that a particular medical research project will lead to the 
development of a new treatment for a certain disease within a specified period of time.


The Classical Interpretation of Probability 


The classical interpretation of probability is based on the concept of equally likely 
outcomes. For example, when a coin is tossed, there are two possible outcomes: a 
head or a tail. If it may be assumed that these outcomes are equally likely to occur, 
then they must have the same probability. Since the sum of the probabilities must 
be 1, both the probability of a head and the probability of a tail must be 1/2. More 
generally, if the outcome of some process must be one of n different outcomes, and 
if these n outcomes are equally likely to occur, then the probability of each outcome 
is 1/n. 
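
The rule that each of n equally likely outcomes has probability 1/n is easy to
state in code. The sketch below is a minimal illustration, assuming that the
caller has already decided that the listed outcomes are equally likely; the
function name classical_probabilities is hypothetical.

```python
from fractions import Fraction


def classical_probabilities(outcomes):
    """Assign probability 1/n to each of n distinct outcomes.

    ASSUMPTION: the outcomes are regarded as equally likely.
    """
    distinct = list(dict.fromkeys(outcomes))  # drop duplicates, keep order
    n = len(distinct)
    return {outcome: Fraction(1, n) for outcome in distinct}


coin = classical_probabilities(["head", "tail"])
die = classical_probabilities(range(1, 7))
print(coin["head"], die[3], sum(die.values()))
```

Exact fractions are used so that the probabilities sum to exactly 1, as the
argument in the text requires.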

Two basic difficulties arise when an attempt is made to develop a formal defi- 
nition of probability from the classical interpretation. First, the concept of equally 
likely outcomes is essentially based on the concept of probability that we are trying 
to define. The statement that two possible outcomes are equally likely to occur is the 
same as the statement that two outcomes have the same probability. Second, no sys- 
tematic method is given for assigning probabilities to outcomes that are not assumed 
to be equally likely. When a coin is tossed, or a well-balanced die is rolled, or a card is 
chosen from a well-shuffled deck of cards, the different possible outcomes can usually 
be regarded as equally likely because of the nature of the process. However, when the 
problem is to guess whether an acquaintance will get married or whether a research 
project will be successful, the possible outcomes would not typically be considered 
to be equally likely, and a different method is needed for assigning probabilities to 
these outcomes. 


The Subjective Interpretation of Probability 


According to the subjective, or personal, interpretation of probability, the probability 
that a person assigns to a possible outcome of some process represents her own
judgment of the likelihood that the outcome will be obtained. This judgment will be
based on each person’s beliefs and information about the process. Another person, 
who may have different beliefs or different information, may assign a different 
probability to the same outcome. For this reason, it is appropriate to speak of a 
certain person’s subjective probability of an outcome, rather than to speak of the 
true probability of that outcome. 

As an illustration of this interpretation, suppose that a coin is to be tossed once. 
A person with no special information about the coin or the way in which it is tossed 
might regard a head and a tail to be equally likely outcomes. That person would 
then assign a subjective probability of 1/2 to the possibility of obtaining a head. The 
person who is actually tossing the coin, however, might feel that a head is much 
more likely to be obtained than a tail. In order that people in general may be able 
to assign subjective probabilities to the outcomes, they must express the strength of 
their belief in numerical terms. Suppose, for example, that they regard the likelihood 
of obtaining a head to be the same as the likelihood of obtaining a red card when one 
card is chosen from a well-shuffled deck containing four red cards and one black card. 
Because those people would assign a probability of 4/5 to the possibility of obtaining 
a red card, they should also assign a probability of 4/5 to the possibility of obtaining 
a head when the coin is tossed. 

This subjective interpretation of probability can be formalized. In general, if 
people’s judgments of the relative likelihoods of various combinations of outcomes 
satisfy certain conditions of consistency, then it can be shown that their subjective 
probabilities of the different possible events can be uniquely determined. However, 
there are two difficulties with the subjective interpretation. First, the requirement 
that a person’s judgments of the relative likelihoods of an infinite number of events 
be completely consistent and free from contradictions does not seem to be humanly 
attainable, unless a person is simply willing to adopt a collection of judgments known 
to be consistent. Second, the subjective interpretation provides no “objective” basis 
for two or more scientists working together to reach a common evaluation of the 
state of knowledge in some scientific area of common interest. 

On the other hand, recognition of the subjective interpretation of probability 
has the salutary effect of emphasizing some of the subjective aspects of science. A 
particular scientist’s evaluation of the probability of some uncertain outcome must 
ultimately be that person’s own evaluation based on all the evidence available. This 
evaluation may well be based in part on the frequency interpretation of probability, 
since the scientist may take into account the relative frequency of occurrence of this 
outcome or similar outcomes in the past. It may also be based in part on the classical 
interpretation of probability, since the scientist may take into account the total num- 
ber of possible outcomes that are considered equally likely to occur. Nevertheless, 
the final assignment of numerical probabilities is the responsibility of the scientist 
herself. 

The subjective nature of science is also revealed in the actual problem that a 
particular scientist chooses to study from the class of problems that might have 
been chosen, in the experiments that are selected in carrying out this study, and 
in the conclusions drawn from the experimental data. The mathematical theory of 
probability and statistics can play an important part in these choices, decisions, and 
conclusions. 


Note: The Theory of Probability Does Not Depend on Interpretation. The math- 
ematical theory of probability is developed and presented in Chapters 1-6 of this 
book without regard to the controversy surrounding the different interpretations of
the term probability. This theory is correct and can be usefully applied, regardless of
which interpretation of probability is used in a particular problem. The theories and 
techniques that will be presented in this book have served as valuable guides and 
tools in almost all aspects of the design and analysis of effective experimentation. 


1.3 Experiments and Events 


Probability will be the way that we quantify how likely something is to occur (in 
the sense of one of the interpretations in Sec. 1.2). In this section, we give examples 
of the types of situations in which probability will be used. 


Types of Experiments 


The theory of probability pertains to the various possible outcomes that might be 
obtained and the possible events that might occur when an experiment is performed. 


Experiment and Event. An experiment is any process, real or hypothetical, in which 
the possible outcomes can be identified ahead of time. An event is a well-defined set 
of possible outcomes of the experiment. 


The breadth of this definition allows us to call almost any imaginable process an 
experiment whether or not its outcome will ever be known. The probability of each 
event will be our way of saying how likely it is that the outcome of the experiment is 
in the event. Not every set of possible outcomes will be called an event. We shall be 
more specific about which subsets count as events in Sec. 1.4. 

Probability will be most useful when applied to a real experiment in which the 
outcome is not known in advance, but there are many hypothetical experiments that 
provide useful tools for modeling real experiments. A common type of hypothetical 
experiment is repeating a well-defined task infinitely often under similar conditions. 
Some examples of experiments and specific events are given next. In each example, 
the words following “the probability that” describe the event of interest. 


1. In an experiment in which a coin is to be tossed 10 times, the experimenter might
want to determine the probability that at least four heads will be obtained. 


2. In an experiment in which a sample of 1000 transistors is to be selected from 
a large shipment of similar items and each selected item is to be inspected, a 
person might want to determine the probability that not more than one of the 
selected transistors will be defective. 


3. In an experiment in which the air temperature at a certain location is to be 
observed every day at noon for 90 successive days, a person might want to 
determine the probability that the average temperature during this period will 
be less than some specified value. 


4. From information relating to the life of Thomas Jefferson, a person might want 
to determine the probability that Jefferson was born in the year 1741. 


5. In evaluating an industrial research and development project at a certain time, 
a person might want to determine the probability that the project will result 
in the successful development of a new product within a specified number of 
months. 
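
As a small illustration of the first example in this list, the following Python
sketch enumerates every possible sequence of 10 tosses and picks out the event
that at least four heads are obtained. The probability computed at the end
rests on an assumption that is not part of the example itself, namely that all
sequences of tosses are equally likely; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Possible outcomes: every sequence of 10 tosses, each toss "H" or "T".
sample_space = list(product("HT", repeat=10))

# Event of interest: the outcomes in which at least four heads are obtained.
event = [outcome for outcome in sample_space if outcome.count("H") >= 4]

# ASSUMPTION (not from the text): all 2**10 sequences are equally likely.
probability = Fraction(len(event), len(sample_space))
print(len(sample_space), len(event), probability, float(probability))
```

Under the stated assumption, 848 of the 1024 possible sequences belong to the
event, so the probability is 53/64, or about 0.83.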


The Mathematical Theory of Probability


As was explained in Sec. 1.2, there is controversy in regard to the proper meaning 
and interpretation of some of the probabilities that are assigned to the outcomes 
of many experiments. However, once probabilities have been assigned to some 
simple outcomes in an experiment, there is complete agreement among all authorities 
that the mathematical theory of probability provides the appropriate methodology 
for the further study of these probabilities. Almost all work in the mathematical 
theory of probability, from the most elementary textbooks to the most advanced 
research, has been related to the following two problems: (i) methods for determining 
the probabilities of certain events from the specified probabilities of each possible 
outcome of an experiment and (ii) methods for revising the probabilities of events 
when additional relevant information is obtained. 

These methods are based on standard mathematical techniques. The purpose of 
the first six chapters of this book is to present these techniques, which, together, form 
the mathematical theory of probability. 


1.4 Set Theory 


This section develops the formal mathematical model for events, namely, the theory 
of sets. Several important concepts are introduced, namely, element, subset, empty 
set, intersection, union, complement, and disjoint sets. 


The Sample Space 


Sample Space. The collection of all possible outcomes of an experiment is called the 
sample space of the experiment. 


The sample space of an experiment can be thought of as a set, or collection, of 
different possible outcomes; and each outcome can be thought of as a point, or an 
element, in the sample space. Similarly, events can be thought of as subsets of the 
sample space. 


Rolling a Die. When a six-sided die is rolled, the sample space can be regarded as 
containing the six numbers 1, 2, 3, 4, 5, 6, each representing a possible side of the die 
that shows after the roll. Symbolically, we write 


S = {1, 2, 3, 4, 5, 6}. 


One event A is that an even number is obtained, and it can be represented as the 
subset A = {2, 4, 6}. The event B that a number greater than 2 is obtained is defined 
by the subset B = {3, 4, 5, 6}.
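
The sample space and events of this example can be written down directly as
sets in a programming language. The following Python sketch is one possible
representation; the helper name occurred is hypothetical and simply tests
whether an observed outcome is an element of an event.

```python
# Sample space for one roll of a six-sided die.
S = frozenset({1, 2, 3, 4, 5, 6})

# Event A: an even number is obtained.
A = frozenset(s for s in S if s % 2 == 0)

# Event B: a number greater than 2 is obtained.
B = frozenset(s for s in S if s > 2)


def occurred(event, outcome):
    """Return True if the observed outcome is an element of the event."""
    return outcome in event


print(sorted(A), sorted(B))
print(occurred(A, 4), occurred(B, 2))
```

If the die shows 4, then A has occurred; if it shows 2, then B has not.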


Because we can interpret outcomes as elements of a set and events as subsets 
of a set, the language and concepts of set theory provide a natural context for the 
development of probability theory. The basic ideas and notation of set theory will 
now be reviewed. 


Relations of Set Theory


Let S denote the sample space of some experiment. Then each possible outcome s 
of the experiment is said to be a member of the space S, or to belong to the space S. 
The statement that s is a member of S is denoted symbolically by the relation s ∈ S.

When an experiment has been performed and we say that some event E has 
occurred, we mean two equivalent things. One is that the outcome of the experiment 
satisfied the conditions that specified that event E. The other is that the outcome, 
considered as a point in the sample space, is an element of E. 

To be precise, we should say which sets of outcomes correspond to events as de- 
fined above. In many applications, such as Example 1.4.1, it will be clear which sets of 
outcomes should correspond to events. In other applications (such as Example 1.4.5 
coming up later), there are too many sets available to have them all be events. Ide- 
ally, we would like to have the largest possible collection of sets called events so that 
we have the broadest possible applicability of our probability calculations. However, 
when the sample space is too large (as in Example 1.4.5) the theory of probability 
simply will not extend to the collection of all subsets of the sample space. We would 
prefer not to dwell on this point for two reasons. First, a careful handling requires 
mathematical details that interfere with an initial understanding of the important 
concepts, and second, the practical implications for the results in this text are min- 
imal. In order to be mathematically correct without imposing an undue burden on 
the reader, we note the following. In order to be able to do all of the probability cal- 
culations that we might find interesting, there are three simple conditions that must 
be met by the collection of sets that we call events. In every problem that we see in 
this text, there exists a collection of sets that includes all the sets that we will need to 
discuss and that satisfies the three conditions, and the reader should assume that such 
a collection has been chosen as the events. For a sample space S with only finitely 
many outcomes, the collection of all subsets of S satisfies the conditions, as the reader 
can show in Exercise 12 in this section. 

The first of the three conditions can be stated immediately. 


Condition 1. The sample space S must be an event.


That is, we must include the sample space S in our collection of events. The other two 
conditions will appear later in this section because they require additional definitions. 
Condition 2 is on page 9, and Condition 3 is on page 10. 


Containment. It is said that a set A is contained in another set B if every element
of the set A also belongs to the set B. This relation between two events is expressed
symbolically by the expression A ⊂ B, which is the set-theoretic expression for saying
that A is a subset of B. Equivalently, if A ⊂ B, we may say that B contains A and may
write B ⊃ A.


For events, to say that A ⊂ B means that if A occurs then so does B.
The proof of the following result is straightforward and is omitted. 


Let A, B, and C be events. Then A ⊂ S. If A ⊂ B and B ⊂ A, then A = B. If A ⊂ B
and B ⊂ C, then A ⊂ C.


Rolling a Die. In Example 1.4.1, suppose that A is the event that an even number 
is obtained and C is the event that a number greater than 1 is obtained. Since 
A = {2, 4, 6} and C = {2, 3, 4, 5, 6}, it follows that A ⊂ C.
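
The definition of containment can be checked mechanically for the events in
this example. The Python sketch below is an illustration only; the function name
is_contained_in is hypothetical and follows the definition literally by testing
every element of the first set for membership in the second.

```python
S = frozenset(range(1, 7))        # sample space for one roll of a die
A = frozenset({2, 4, 6})          # an even number is obtained
C = frozenset({2, 3, 4, 5, 6})    # a number greater than 1 is obtained


def is_contained_in(x, y):
    """Return True if every element of x also belongs to y."""
    return all(element in y for element in x)


print(is_contained_in(A, C))   # A is contained in C
print(is_contained_in(C, A))   # C is not contained in A, since 3 is not in A
print(is_contained_in(A, S) and is_contained_in(C, S))  # both events lie in S
print(A <= C)                  # Python's built-in subset operator agrees
```

In terms of events, whenever an even number is rolled, a number greater than 1
has also been rolled, but not conversely.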


The Empty Set

Some events are impossible. For example, when a die is rolled, it
is impossible to obtain a negative number. Hence, the event that a negative number
will be obtained is defined by the subset of S that contains no outcomes.


Empty Set. The subset of S that contains no elements is called the empty set, or null
set, and it is denoted by the symbol $\emptyset$.


In terms of events, the empty set is any event that cannot occur.

Let A be an event. Then $\emptyset \subset A$.


Proof Let A be an arbitrary event. Since the empty set $\emptyset$ contains no points, it is
logically correct to say that every point belonging to $\emptyset$ also belongs to A, or $\emptyset \subset A$.


Finite and Infinite Sets Some sets contain only finitely many elements, while others 
have infinitely many elements. There are two sizes of infinite sets that we need to 
distinguish. 


Countable/Uncountable. An infinite set A is countable if there is a one-to-one correspondence
between the elements of A and the set of natural numbers {1, 2, 3, ...}. A
set is uncountable if it is neither finite nor countable. If we say that a set has at most
countably many elements, we mean that the set is either finite or countable.


Examples of countably infinite sets include the integers, the even integers, the odd 
integers, the prime numbers, and any infinite sequence. Each of these can be put 
in one-to-one correspondence with the natural numbers. For example, the following 
function f puts the integers in one-to-one correspondence with the natural numbers: 


$$f(n) = \begin{cases} \dfrac{n-1}{2} & \text{if } n \text{ is odd}, \\[4pt] -\dfrac{n}{2} & \text{if } n \text{ is even}. \end{cases}$$

Every infinite sequence of distinct items is a countable set, as its indexing puts it in 
one-to-one correspondence with the natural numbers. Examples of uncountable sets 
include the real numbers, the positive reals, the numbers in the interval [0, 1], and the 
set of all ordered pairs of real numbers. An argument to show that the real numbers 
are uncountable appears at the end of this section. Every subset of the integers has 
at most countably many elements. 
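
The correspondence between the natural numbers and the integers can be checked on an initial segment. The following is a minimal sketch, assuming the rule sends odd n to (n - 1)/2 and even n to -n/2; the names `k` and `first_values` are introduced here only for illustration.

```python
def f(n: int) -> int:
    """Map the natural number n (1, 2, 3, ...) to an integer."""
    if n % 2 == 1:          # n is odd
        return (n - 1) // 2
    return -(n // 2)        # n is even


# Images of the first 2k + 1 natural numbers.
k = 5
first_values = [f(n) for n in range(1, 2 * k + 2)]

# No integer is hit twice, and every integer from -k to k is hit.
assert len(set(first_values)) == len(first_values)
assert set(first_values) == set(range(-k, k + 1))
```

A finite check of this kind does not prove the correspondence, but it shows the pattern 0, -1, 1, -2, 2, ... that the proof would formalize.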


Operations of Set Theory 


Complement. The complement of a set A is defined to be the set that contains all
elements of the sample space S that do not belong to A. The notation for the
complement of A is $A^c$.


In terms of events, the event $A^c$ is the event that A does not occur.


Rolling a Die. In Example 1.4.1, suppose again that A is the event that an even number
is rolled; then $A^c = \{1, 3, 5\}$ is the event that an odd number is rolled.


We can now state the second condition that we require of the collection of events. 


Condition 2. If A is an event, then $A^c$ is also an event.


That is, for each set A of outcomes that we call an event, we must also call its
complement $A^c$ an event.

A generic version of the relationship between A and $A^c$ is sketched in Fig. 1.1.
A sketch of this type is called a Venn diagram.

Figure 1.1 The event $A^c$.

Figure 1.2 The set $A \cup B$.

Some properties of the complement are stated without proof in the next result. 


Let A be an event. Then
$$(A^c)^c = A, \qquad \emptyset^c = S, \qquad S^c = \emptyset.$$


The empty event $\emptyset$ is an event.


Union of Two Sets. If A and B are any two sets, the union of A and B is defined to be
the set containing all outcomes that belong to A alone, to B alone, or to both A and
B. The notation for the union of A and B is $A \cup B$.


The set $A \cup B$ is sketched in Fig. 1.2. In terms of events, $A \cup B$ is the event that either
A or B or both occur.
The union has the following properties whose proofs are left to the reader.


For all sets A and B,
$$A \cup B = B \cup A, \qquad A \cup A = A, \qquad A \cup A^c = S,$$
$$A \cup \emptyset = A, \qquad A \cup S = S.$$
Furthermore, if $A \subset B$, then $A \cup B = B$.


The concept of union extends to more than two sets. 


Union of Many Sets. The union of n sets $A_1, \ldots, A_n$ is defined to be the set that
contains all outcomes that belong to at least one of these n sets. The notation for this
union is either of the following:
$$A_1 \cup A_2 \cup \cdots \cup A_n \qquad \text{or} \qquad \bigcup_{i=1}^{n} A_i.$$
Similarly, the union of an infinite sequence of sets $A_1, A_2, \ldots$ is the set that contains
all outcomes that belong to at least one of the events in the sequence. The infinite
union is denoted by $\bigcup_{i=1}^{\infty} A_i$.

Figure 1.3 The set $A \cap B$.


In terms of events, the union of a collection of events is the event that at least 
one of the events in the collection occurs. 

We can now state the final condition that we require for the collection of sets 
that we call events. 


Condition 3. If $A_1, A_2, \ldots$ is a countable collection of events, then $\bigcup_{i=1}^{\infty} A_i$ is also an event.


In other words, if we choose to call each set of outcomes in some countable collection
an event, we are required to call their union an event also. We do not require that
the union of an arbitrary collection of events be an event. To be clear, let $I$ be an
arbitrary set that we use to index a general collection of events $\{A_i : i \in I\}$. The union
of the events in this collection is the set of outcomes that are in at least one of the
events in the collection. The notation for this union is $\bigcup_{i \in I} A_i$. We do not require
that $\bigcup_{i \in I} A_i$ be an event unless $I$ is countable.

Condition 3 refers to a countable collection of events. We can prove that the 
condition also applies to every finite collection of events. 


The union of a finite number of events $A_1, \ldots, A_n$ is an event.


Proof For each $m = n+1, n+2, \ldots$, define $A_m = \emptyset$. Because $\emptyset$ is an event, we now
have a countable collection $A_1, A_2, \ldots$ of events. It follows from Condition 3 that
$\bigcup_{m=1}^{\infty} A_m$ is an event. But it is easy to see that $\bigcup_{m=1}^{\infty} A_m = \bigcup_{m=1}^{n} A_m$.


The union of three events A, B, and C can be constructed either directly from the 
definition of $A \cup B \cup C$ or by first evaluating the union of any two of the events and
then forming the union of this combination of events and the third event. In other 
words, the following result is true. 


Associative Property. For every three events A, B, and C, the following associative 
relations are satisfied: 


$$A \cup B \cup C = (A \cup B) \cup C = A \cup (B \cup C).$$


Intersection of Two Sets. If A and B are any two sets, the intersection of A and B is
defined to be the set that contains all outcomes that belong both to A and to B. The
notation for the intersection of A and B is $A \cap B$.


The set $A \cap B$ is sketched in a Venn diagram in Fig. 1.3. In terms of events, $A \cap B$ is
the event that both A and B occur.

The proof of the first part of the next result follows from Exercise 3 in this section.
The rest of the proof is straightforward.


If A and B are events, then so is $A \cap B$. For all events A and B,
$$A \cap B = B \cap A, \qquad A \cap A = A, \qquad A \cap A^c = \emptyset,$$
$$A \cap \emptyset = \emptyset, \qquad A \cap S = A.$$
Furthermore, if $A \subset B$, then $A \cap B = A$.

Figure 1.4 Partition of S determined by three events $A_1$, $A_2$, $A_3$.


The concept of intersection extends to more than two sets. 


Intersection of Many Sets. The intersection of n sets $A_1, \ldots, A_n$ is defined to be the
set that contains the elements that are common to all these n sets. The notation for
this intersection is $A_1 \cap A_2 \cap \cdots \cap A_n$ or $\bigcap_{i=1}^{n} A_i$. Similar notations are used for the
intersection of an infinite sequence of sets or for the intersection of an arbitrary
collection of sets.


In terms of events, the intersection of a collection of events is the event that every 
event in the collection occurs. 

The following result concerning the intersection of three events is straightforward
to prove.


Associative Property. For every three events A, B, and C, the following associative
relations are satisfied:


$$A \cap B \cap C = (A \cap B) \cap C = A \cap (B \cap C).$$


Disjoint/Mutually Exclusive. It is said that two sets A and B are disjoint, or mutually
exclusive, if A and B have no outcomes in common, that is, if $A \cap B = \emptyset$. The sets
$A_1, \ldots, A_n$ or the sets $A_1, A_2, \ldots$ are disjoint if for every $i \ne j$, we have that $A_i$ and
$A_j$ are disjoint, that is, $A_i \cap A_j = \emptyset$ for all $i \ne j$. The events in an arbitrary collection
are disjoint if no two events in the collection have any outcomes in common.


In terms of events, A and B are disjoint if they cannot both occur. 

As an illustration of these concepts, a Venn diagram for three events $A_1$, $A_2$, and
$A_3$ is presented in Fig. 1.4. This diagram indicates that the various intersections of
$A_1$, $A_2$, and $A_3$ and their complements will partition the sample space S into eight
disjoint subsets.


Example 1.4.5


Tossing a Coin. Suppose that a coin is tossed three times. Then the sample space S
contains the following eight possible outcomes $s_1, \ldots, s_8$:

$s_1$: HHH,
$s_2$: THH,
$s_3$: HTH,
$s_4$: HHT,
$s_5$: HTT,
$s_6$: THT,
$s_7$: TTH,
$s_8$: TTT.


In this notation, H indicates a head and T indicates a tail. The outcome $s_3$, for
example, is the outcome in which a head is obtained on the first toss, a tail is obtained 
on the second toss, and a head is obtained on the third toss. 

To apply the concepts introduced in this section, we shall define four events as 
follows: Let A be the event that at least one head is obtained in the three tosses; let 
B be the event that a head is obtained on the second toss; let C be the event that a 
tail is obtained on the third toss; and let D be the event that no heads are obtained. 
Accordingly, 


$A = \{s_1, s_2, s_3, s_4, s_5, s_6, s_7\}$,
$B = \{s_1, s_2, s_4, s_6\}$,
$C = \{s_4, s_5, s_6, s_8\}$,
$D = \{s_8\}$.

Various relations among these events can be derived. Some of these relations
are $B \subset A$, $A^c = D$, $B \cap D = \emptyset$, $A \cup C = S$, $B \cap C = \{s_4, s_6\}$, $(B \cup C)^c = \{s_3, s_7\}$, and
$A \cap (B \cup C) = \{s_1, s_2, s_4, s_5, s_6\}$.
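
The relations in the coin-tossing example can be reproduced with ordinary set operations. The sketch below is only an illustration: it represents each outcome as a three-letter string such as "HTH" (a choice made here, not the text's notation) and defines a hypothetical helper `complement`.

```python
from itertools import product

# Sample space: all sequences of three tosses, e.g. "HTH".
S = {"".join(t) for t in product("HT", repeat=3)}

A = {s for s in S if "H" in s}        # at least one head
B = {s for s in S if s[1] == "H"}     # head on the second toss
C = {s for s in S if s[2] == "T"}     # tail on the third toss
D = {s for s in S if "H" not in s}    # no heads


def complement(E):
    """Complement relative to the sample space S."""
    return S - E


assert B <= A                                       # B is a subset of A
assert complement(A) == D
assert B & D == set()
assert A | C == S
assert B & C == {"HHT", "THT"}                      # outcomes s4 and s6
assert complement(B | C) == {"HTH", "TTH"}          # outcomes s3 and s7
assert A & (B | C) == {"HHH", "THH", "HHT", "HTT", "THT"}
```

Python's `<=`, `|`, `&`, and `-` on sets correspond to containment, union, intersection, and set difference, so each assertion mirrors one relation from the example.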


Demands for Utilities. A contractor is building an office complex and needs to plan 
for water and electricity demand (sizes of pipes, conduit, and wires). After consulting 
with prospective tenants and examining historical data, the contractor decides that 
the demand for electricity will range somewhere between 1 million and 150 million 
kilowatt-hours per day and water demand will be between 4 and 200 (in thousands 
of gallons per day). All combinations of electrical and water demand are considered 
possible. The shaded region in Fig. 1.5 shows the sample space for the experiment, 
consisting of learning the actual water and electricity demands for the office complex. 
We can express the sample space as the set of ordered pairs $\{(x, y) : 4 \le x \le 200,\ 1 \le y \le 150\}$,
where x stands for water demand in thousands of gallons per day and y
stands for the electric demand in millions of kilowatt-hours per day. The types of sets
that we want to call events include sets like


{water demand is at least 100} = $\{(x, y) : x \ge 100\}$, and
{electric demand is no more than 35} = $\{(x, y) : y \le 35\}$,


along with intersections, unions, and complements of such sets. This sample space
has infinitely many points. Indeed, the sample space is uncountable. There are many
more sets that are difficult to describe and which we will have no need to consider as
events.

Figure 1.5 (axes: Water, horizontal, marked at 4 and 200; Electric, vertical, marked at 150)

Figure 1.6 Partition of $A \cup B$ in Theorem 1.4.11.


Additional Properties of Sets The proof of the following useful result is left to 
Exercise 3 in this section. 


Theorem 1.4.9 De Morgan's Laws. For every two sets A and B,
$$(A \cup B)^c = A^c \cap B^c \qquad \text{and} \qquad (A \cap B)^c = A^c \cup B^c.$$


The generalization of Theorem 1.4.9 is the subject of Exercise 5 in this section. 
The proofs of the following distributive properties are left to Exercise 2 in this 
section. These properties also extend in natural ways to larger collections of events. 


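Theorem 1.4.10 Distributive Properties. For every three sets A, B, and C,
$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C) \qquad \text{and} \qquad A \cup (B \cap C) = (A \cup B) \cap (A \cup C).$$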


The following result is useful for computing probabilities of events that can be 
partitioned into smaller pieces. Its proof is left to Exercise 4 in this section, and is 
illuminated by Fig. 1.6. 


Theorem 1.4.11 Partitioning a Set. For every two sets A and B, $A \cap B$ and $A \cap B^c$ are disjoint and
$$A = (A \cap B) \cup (A \cap B^c).$$
In addition, B and $A \cap B^c$ are disjoint, and
$$A \cup B = B \cup (A \cap B^c).$$
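
On a small sample space these identities can be checked exhaustively. The sketch below is an illustration, not a proof: it takes S to be the six outcomes of a die roll and tests De Morgan's laws, the partitioning result, and the standard distributive identities $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$ and $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$. The helper names `subsets` and `comp` are introduced here.

```python
from itertools import combinations

S = frozenset(range(1, 7))  # outcomes of one roll of a die


def subsets(universe):
    """Return every subset of `universe` as a frozenset."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def comp(E):
    """Complement relative to S."""
    return S - E


events = subsets(S)  # 2**6 = 64 events

for A in events:
    for B in events:
        # De Morgan's laws
        assert comp(A | B) == comp(A) & comp(B)
        assert comp(A & B) == comp(A) | comp(B)
        # Partitioning a set
        assert (A & B).isdisjoint(A & comp(B))
        assert A == (A & B) | (A & comp(B))
        assert B.isdisjoint(A & comp(B))
        assert A | B == B | (A & comp(B))
        for C in events[:8]:  # a few choices of C keep the loop small
            # Distributive properties
            assert A & (B | C) == (A & B) | (A & C)
            assert A | (B & C) == (A | B) & (A | C)
```

Passing these checks on one finite space says nothing about arbitrary sets; the general proofs remain the subject of the exercises.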


Proof That the Real Numbers Are Uncountable 


We shall show that the real numbers in the interval [0, 1) are uncountable. Every
larger set is a fortiori uncountable. For each number $x \in [0, 1)$, define the sequence
$\{a_n(x)\}_{n=1}^{\infty}$ as follows. First, $a_1(x) = \lfloor 10x \rfloor$, where $\lfloor y \rfloor$ stands for the greatest integer
less than or equal to y (round nonintegers down to the closest integer below). Then
set $b_1(x) = 10x - a_1(x)$, which will again be in [0, 1). For $n > 1$, $a_n(x) = \lfloor 10 b_{n-1}(x) \rfloor$
and $b_n(x) = 10 b_{n-1}(x) - a_n(x)$. It is easy to see that the sequence $\{a_n(x)\}_{n=1}^{\infty}$ gives a
decimal expansion for x in the form
$$x = \sum_{n=1}^{\infty} a_n(x) 10^{-n}. \tag{1.4.1}$$

Figure 1.7 An array of a countable collection of sequences of digits with the diagonal underlined.
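
The digit construction can be carried out mechanically. The following sketch is one possible rendering, assuming exact rational inputs so that no rounding error enters; it treats $b_0(x) = x$ as the starting value, which is a convenience introduced here and not notation from the text.

```python
from fractions import Fraction
from math import floor


def digits(x: Fraction, count: int) -> list[int]:
    """Return a_1(x), ..., a_count(x) for x in [0, 1)."""
    if not 0 <= x < 1:
        raise ValueError("x must lie in [0, 1)")
    out = []
    b = x                      # plays the role of b_{n-1}(x), with b_0(x) = x
    for _ in range(count):
        a = floor(10 * b)      # a_n(x) = floor(10 * b_{n-1}(x))
        b = 10 * b - a         # b_n(x) = 10 * b_{n-1}(x) - a_n(x)
        out.append(a)
    return out


print(digits(Fraction(1, 8), 5))   # [1, 2, 5, 0, 0], since 1/8 = 0.125
print(digits(Fraction(1, 3), 5))   # [3, 3, 3, 3, 3], since 1/3 = 0.333...
```

Using `Fraction` rather than floating point keeps each remainder exact, so the digits agree with the recursion for as many steps as requested.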


By construction, each number of the form $x = k/10^m$ for some nonnegative
integers k and m will have $a_n(x) = 0$ for $n > m$. The numbers of the form $k/10^m$
are the only ones that have an alternate decimal expansion $x = \sum_{n=1}^{\infty} c_n(x) 10^{-n}$.
When k is not a multiple of 10, this alternate expansion satisfies $c_n(x) = a_n(x)$ for
$n = 1, \ldots, m-1$, $c_m(x) = a_m(x) - 1$, and $c_n(x) = 9$ for $n > m$. Let $C = \{0, 1, \ldots, 9\}^{\infty}$
stand for the set of all infinite sequences of digits. Let B denote the subset of C
consisting of those sequences that don't end in repeating 9's. Then we have just
constructed a function a from the interval [0, 1) onto B that is one-to-one and whose
inverse is given in (1.4.1). We now show that the set B is uncountable, hence [0, 1)
is uncountable. Take any countable subset of B and arrange the sequences into a
rectangular array with the kth sequence running across the kth row of the array for
$k = 1, 2, \ldots$. Figure 1.7 gives an example of part of such an array.

In Fig. 1.7, we have underlined the kth digit in the kth sequence for each k. This
portion of the array is called the diagonal of the array. We now show that there must
exist a sequence in B that is not part of this array. This will prove that the whole set
B cannot be put into such an array, and hence cannot be countable. Construct the
sequence $\{d_n\}_{n=1}^{\infty}$ as follows. For each n, let $d_n = 2$ if the nth digit in the nth sequence
is 1, and $d_n = 1$ otherwise. This sequence does not end in repeating 9's; hence, it is
in B. We conclude the proof by showing that $\{d_n\}_{n=1}^{\infty}$ does not appear anywhere in
the array. If the sequence did appear in the array, say, in the kth row, then its kth
element would be the kth diagonal element of the array. But we constructed the
sequence so that for every n (including n = k), its nth element never matched the
nth diagonal element. Hence, the sequence can't be in the kth row, no matter what
k is. The argument given here is essentially that of the nineteenth-century German
mathematician Georg Cantor.
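
The diagonal construction can be illustrated on a finite corner of such an array. This is only a sketch: the 4-by-4 `array` below is made-up data, and `diagonal_escape` is a name chosen here. A real array in the argument has infinitely many rows and columns, which no program can enumerate.

```python
def diagonal_escape(rows: list[list[int]]) -> list[int]:
    """Build d_1, ..., d_k from the first k digits of the first k rows.

    d_n = 2 if the nth digit of the nth row is 1, and d_n = 1 otherwise.
    """
    return [2 if rows[n][n] == 1 else 1 for n in range(len(rows))]


# A hypothetical 4-by-4 corner of an array of digit sequences.
array = [
    [1, 4, 1, 5],
    [3, 3, 3, 3],
    [0, 1, 1, 0],
    [9, 0, 2, 7],
]

d = diagonal_escape(array)

# d differs from row n in position n, so it equals none of the rows.
for n, row in enumerate(array):
    assert d[n] != row[n]
```

Because every digit of `d` is 1 or 2, the sequence it begins cannot end in repeating 9's, which is why the constructed sequence stays inside B.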
Summary


We will use set theory for the mathematical model of events. Outcomes of an experiment
are elements of some sample space S, and each event is a subset of S. Two
events both occur if the outcome is in the intersection of the two sets. At least one of
a collection of events occurs if the outcome is in the union of the sets. Two events cannot
both occur if the sets are disjoint. An event fails to occur if the outcome is in the
complement of the set. The empty set stands for every event that cannot possibly occur.
The collection of events is assumed to contain the sample space, the complement
of each event, and the union of each countable collection of events.


Exercises 


1. Suppose that $A \subset B$. Show that $B^c \subset A^c$.

2. Prove the distributive properties in Theorem 1.4.10. 
3. Prove De Morgan’s laws (Theorem 1.4.9). 

4. Prove Theorem 1.4.11. 

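5. For every collection of events $A_i$ ($i \in I$), show that
$$\left(\bigcup_{i \in I} A_i\right)^c = \bigcap_{i \in I} A_i^c \qquad \text{and} \qquad \left(\bigcap_{i \in I} A_i\right)^c = \bigcup_{i \in I} A_i^c.$$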

6. Suppose that one card is to be selected from a deck of 
20 cards that contains 10 red cards numbered from 1 to 
10 and 10 blue cards numbered from 1 to 10. Let A be 
the event that a card with an even number is selected, 
let B be the event that a blue card is selected, and let 
C be the event that a card with a number less than 5 is 
selected. Describe the sample space S and describe each 
of the following events both in words and as subsets of S:
a. $A \cap B \cap C$
b. $B \cap C^c$
c. $A \cup B \cup C$
d. $A \cap (B \cup C)$
e. $A^c \cap B^c \cap C^c$

7. Suppose that a number x is to be selected from the real
line S, and let A, B, and C be the events represented by the
following subsets of S, where the notation {x: - - -} denotes
the set containing every point x for which the property
presented following the colon is satisfied:
$A = \{x : 1 \le x \le 5\}$,
$B = \{x : 3 < x \le 7\}$,
$C = \{x : x \le 0\}$.
Describe each of the following events as a set of real
numbers:
a. $A^c$
b. $A \cup B$
c. $B \cap C^c$
d. $A^c \cap B^c \cap C^c$
e. $(A \cup B) \cap C$

8. A simplified model of the human blood-type system
has four blood types: A, B, AB, and O. There are two
antigens, anti-A and anti-B, that react with a person's
blood in different ways depending on the blood type. Anti-A
reacts with blood types A and AB, but not with B and
O. Anti-B reacts with blood types B and AB, but not with
A and O. Suppose that a person's blood is sampled and
tested with the two antigens. Let A be the event that the
blood reacts with anti-A, and let B be the event that it
reacts with anti-B. Classify the person's blood type using
the events A, B, and their complements.


9. Let S be a given sample space and let $A_1, A_2, \ldots$ be
an infinite sequence of events. For $n = 1, 2, \ldots$, let $B_n = \bigcup_{i=n}^{\infty} A_i$
and let $C_n = \bigcap_{i=n}^{\infty} A_i$.
a. Show that $B_1 \supset B_2 \supset \cdots$ and that $C_1 \subset C_2 \subset \cdots$.
b. Show that an outcome in S belongs to the event
$\bigcap_{n=1}^{\infty} B_n$ if and only if it belongs to an infinite number
of the events $A_1, A_2, \ldots$.
c. Show that an outcome in S belongs to the event
$\bigcup_{n=1}^{\infty} C_n$ if and only if it belongs to all the events
$A_1, A_2, \ldots$ except possibly a finite number of those
events.


10. Three six-sided dice are rolled. The six sides of each 
die are numbered 1-6. Let A be the event that the first 
die shows an even number, let B be the event that the 
second die shows an even number, and let C be the event 
that the third die shows an even number. Also, for each 
$i = 1, \ldots, 6$, let $A_i$ be the event that the first die shows the
number i, let $B_i$ be the event that the second die shows
the number i, and let $C_i$ be the event that the third die
shows the number i. Express each of the following events
in terms of the named events described above: 


a. The event that all three dice show even numbers 

b. The event that no die shows an even number 

c. The event that at least one die shows an odd number 

d. The event that at most two dice show odd numbers 

e. The event that the sum of the three dice is no greater
than 5


11. A power cell consists of two subcells, each of which
can provide from 0 to 5 volts, regardless of what the other
subcell provides. The power cell is functional if and only
if the sum of the two voltages of the subcells is at least 6 
volts. An experiment consists of measuring and recording 
the voltages of the two subcells. Let A be the event that 
the power cell is functional, let B be the event that two 
subcells have the same voltage, let C be the event that the 
first subcell has a strictly higher voltage than the second 
subcell, and let D be the event that the power cell is 
not functional but needs less than one additional volt to 
become functional. 


a. Define a sample space S for the experiment as a set 
of ordered pairs that makes it possible for you to 
express the four sets above as events. 

b. Express each of the events A, B, C, and D as sets of 
ordered pairs that are subsets of S. 


c. Express the following set in terms of A, B, C, and/or
D: $\{(x, y) : x = y \text{ and } x + y \le 5\}$.


d. Express the following event in terms of A, B, C,
and/or D: the event that the power cell is not functional
and the second subcell has a strictly higher
voltage than the first subcell.


12. Suppose that the sample space S of some experiment 
is finite. Show that the collection of all subsets of S satisfies 
the three conditions required to be called the collection of 
events. 


13. Let S be the sample space for some experiment. Show 
that the collection of subsets consisting solely of S and $\emptyset$
satisfies the three conditions required in order to be called 
the collection of events. Explain why this collection would 
not be very interesting in most real problems. 


14. Suppose that the sample space S of some experiment 
is countable. Suppose also that, for every outcome $s \in S$,
the subset {s} is an event. Show that every subset of S must 
be an event. Hint: Recall the three conditions required of 
the collection of subsets of S that we call events. 


1.5 The Definition of Probability 


We begin with the mathematical definition of probability and then present some 
useful results that follow easily from the definition. 


Axioms and Basic Theorems 


In this section, we shall present the mathematical, or axiomatic, definition of proba- 
bility. In a given experiment, it is necessary to assign to each event A in the sample 
space S a number Pr(A) that indicates the probability that A will occur. In order to 
satisfy the mathematical definition of probability, the number Pr(A) that is assigned 
must satisfy three specific axioms. These axioms ensure that the number Pr(A) will 
have certain properties that we intuitively expect a probability to have under each 
of the various interpretations described in Sec. 1.2. 
The first axiom states that the probability of every event must be nonnegative.


Axiom 1 For every event A, $\Pr(A) \ge 0$.


The second axiom states that if an event is certain to occur, then the probability
of that event is 1.


Axiom 2 $\Pr(S) = 1$.

Before stating Axiom 3, we shall discuss the probabilities of disjoint events. If two 
events are disjoint, it is natural to assume that the probability that one or the other 
will occur is the sum of their individual probabilities. In fact, it will be assumed that 
this additive property of probability is also true for every finite collection of disjoint 
events and even for every infinite sequence of disjoint events. If we assume that this 
additive property is true only for a finite number of disjoint events, we cannot then be 
certain that the property will be true for an infinite sequence of disjoint events as well. 
However, if we assume that the additive property is true for every infinite sequence
of disjoint events, then (as we shall prove) the property must also be true for every
finite number of disjoint events. These considerations lead to the third axiom.


Axiom 3 For every infinite sequence of disjoint events $A_1, A_2, \ldots$,
$$\Pr\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \Pr(A_i).$$


Rolling a Die. In Example 1.4.1, for each subset A of S = {1, 2, 3, 4, 5, 6}, let Pr(A) be 
the number of elements of A divided by 6. It is trivial to see that this satisfies the first 
two axioms. There are only finitely many distinct collections of nonempty disjoint 
events. It is not difficult to see that Axiom 3 is also satisfied by this example.


A Loaded Die. In Example 1.5.1, there are other choices for the probabilities of events. 
For example, if we believe that the die is loaded, we might believe that some sides 
have different probabilities of turning up. To be specific, suppose that we believe that 
6 is twice as likely to come up as each of the other five sides. We could set $p_i = 1/7$ for
$i = 1, 2, 3, 4, 5$ and $p_6 = 2/7$. Then, for each event A, define Pr(A) to be the sum of
all $p_i$ such that $i \in A$. For example, if A = {1, 3, 5}, then $\Pr(A) = p_1 + p_3 + p_5 = 3/7$.
It is not difficult to check that this also satisfies all three axioms.
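
The loaded-die assignment can be checked by brute force over all 64 events. The sketch below is an illustration under one simplification, stated plainly: on a finite sample space an infinite sequence of disjoint events can contain only finitely many nonempty sets, so the code tests additivity for pairs of disjoint events rather than Axiom 3 in its infinite form. The names `p`, `pr`, and `events` are introduced here.

```python
from fractions import Fraction
from itertools import combinations

S = frozenset(range(1, 7))
p = {i: Fraction(1, 7) for i in range(1, 6)}
p[6] = Fraction(2, 7)          # 6 is twice as likely as each other side


def pr(event) -> Fraction:
    """Pr(A): the sum of p_i over the outcomes i in A."""
    return sum((p[i] for i in event), Fraction(0))


events = [frozenset(c) for r in range(7) for c in combinations(sorted(S), r)]

assert all(pr(A) >= 0 for A in events)           # Axiom 1
assert pr(S) == 1                                # Axiom 2
# Additivity for every pair of disjoint events (a finite form of Axiom 3).
assert all(pr(A | B) == pr(A) + pr(B)
           for A in events for B in events if A.isdisjoint(B))

print(pr({1, 3, 5}))   # 3/7
```

Exact fractions are used so that the equality tests are not disturbed by floating-point rounding.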


We are now prepared to give the mathematical definition of probability. 


Probability. A probability measure, or simply a probability, on a sample space S is a 
specification of numbers Pr(A) for all events A that satisfy Axioms 1, 2, and 3. 


We shall now derive two important consequences of Axiom 3. First, we shall 
show that if an event is impossible, its probability must be 0. 


Theorem 1.5.1 $\Pr(\emptyset) = 0$.

Proof Consider the infinite sequence of events $A_1, A_2, \ldots$ such that $A_i = \emptyset$ for
$i = 1, 2, \ldots$. In other words, each of the events in the sequence is just the empty set
$\emptyset$. Then this sequence is a sequence of disjoint events, since $\emptyset \cap \emptyset = \emptyset$. Furthermore,
$\bigcup_{i=1}^{\infty} A_i = \emptyset$. Therefore, it follows from Axiom 3 that
$$\Pr(\emptyset) = \Pr\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \Pr(A_i) = \sum_{i=1}^{\infty} \Pr(\emptyset).$$
This equation states that when the number $\Pr(\emptyset)$ is added repeatedly in an infinite
series, the sum of that series is simply the number $\Pr(\emptyset)$. The only real number with
this property is zero.


We can now show that the additive property assumed in Axiom 3 for an infinite 
sequence of disjoint events is also true for every finite number of disjoint events. 


Theorem 1.5.2 For every finite sequence of n disjoint events $A_1, \ldots, A_n$,
$$\Pr\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} \Pr(A_i).$$


Proof Consider the infinite sequence of events $A_1, A_2, \ldots$, in which $A_1, \ldots, A_n$
are the n given disjoint events and $A_i = \emptyset$ for $i > n$. Then the events in this infinite
sequence are disjoint and $\bigcup_{i=1}^{\infty} A_i = \bigcup_{i=1}^{n} A_i$. Therefore, by Axiom 3,
$$\begin{aligned}
\Pr\left(\bigcup_{i=1}^{n} A_i\right) &= \Pr\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \Pr(A_i) \\
&= \sum_{i=1}^{n} \Pr(A_i) + \sum_{i=n+1}^{\infty} \Pr(A_i) \\
&= \sum_{i=1}^{n} \Pr(A_i) + 0 \\
&= \sum_{i=1}^{n} \Pr(A_i).
\end{aligned}$$

Figure 1.8 $B = A \cup (B \cap A^c)$ in the proof of Theorem 1.5.4.


Further Properties of Probability 


From the axioms and theorems just given, we shall now derive four other general 
properties of probability measures. Because of the fundamental nature of these four 
properties, they will be presented in the form of four theorems, each one of which is 
easily proved. 


Theorem 1.5.3 For every event A, $\Pr(A^c) = 1 - \Pr(A)$.

Proof Since A and $A^c$ are disjoint events and $A \cup A^c = S$, it follows from Theorem 1.5.2
that $\Pr(S) = \Pr(A) + \Pr(A^c)$. Since $\Pr(S) = 1$ by Axiom 2, then $\Pr(A^c) = 1 - \Pr(A)$.


Theorem 1.5.4 If $A \subset B$, then $\Pr(A) \le \Pr(B)$.

Proof As illustrated in Fig. 1.8, the event B may be treated as the union of the
two disjoint events A and $B \cap A^c$. Therefore, $\Pr(B) = \Pr(A) + \Pr(B \cap A^c)$. Since
$\Pr(B \cap A^c) \ge 0$, then $\Pr(B) \ge \Pr(A)$.


Theorem 1.5.5 For every event A, $0 \le \Pr(A) \le 1$.


Proof It is known from Axiom 1 that $\Pr(A) \ge 0$. Since $A \subset S$ for every event A,
Theorem 1.5.4 implies $\Pr(A) \le \Pr(S) = 1$, by Axiom 2.


Theorem 1.5.6 For every two events A and B,
$$\Pr(A \cap B^c) = \Pr(A) - \Pr(A \cap B).$$


Proof According to Theorem 1.4.11, the events $A \cap B^c$ and $A \cap B$ are disjoint and
$$A = (A \cap B) \cup (A \cap B^c).$$
It follows from Theorem 1.5.2 that
$$\Pr(A) = \Pr(A \cap B) + \Pr(A \cap B^c).$$
Subtract $\Pr(A \cap B)$ from both sides of this last equation to complete the proof.


Theorem 1.5.7 For every two events A and B,
$$\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B). \tag{1.5.1}$$


Proof From Theorem 1.4.11, we have
$$A \cup B = B \cup (A \cap B^c),$$
and the two events on the right side of this equation are disjoint. Hence, we have
$$\begin{aligned}
\Pr(A \cup B) &= \Pr(B) + \Pr(A \cap B^c) \\
&= \Pr(B) + \Pr(A) - \Pr(A \cap B),
\end{aligned}$$
where the first equation follows from Theorem 1.5.2, and the second follows from
Theorem 1.5.6.
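
The properties derived in this subsection can be spot-checked on a concrete probability. The sketch below is an illustration only, assuming a six-sided die in which 6 carries twice the weight of each other side; `weights`, `pr`, and `events` are names chosen here. It confirms the complement rule, monotonicity, the bounds, the difference rule, and the union rule for every pair of events.

```python
from fractions import Fraction
from itertools import combinations

S = frozenset(range(1, 7))
weights = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2}   # loaded die: total weight 7


def pr(event) -> Fraction:
    """Probability of an event as its total weight divided by 7."""
    return Fraction(sum(weights[i] for i in event), 7)


events = [frozenset(c) for r in range(7) for c in combinations(sorted(S), r)]

for A in events:
    assert pr(S - A) == 1 - pr(A)                 # complement rule
    assert 0 <= pr(A) <= 1                        # bounds
    for B in events:
        if A <= B:
            assert pr(A) <= pr(B)                 # monotonicity
        assert pr(A - B) == pr(A) - pr(A & B)     # A - B is A intersected with B^c
        assert pr(A | B) == pr(A) + pr(B) - pr(A & B)
```

Any other assignment satisfying the three axioms would pass the same checks, which is the point of deriving the properties from the axioms alone.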


Diagnosing Diseases. A patient arrives at a doctor’s office with a sore throat and low- 
grade fever. After an exam, the doctor decides that the patient has either a bacterial 
infection or a viral infection or both. The doctor decides that there is a probability of 
0.7 that the patient has a bacterial infection and a probability of 0.4 that the person 
has a viral infection. What is the probability that the patient has both infections? 

Let B be the event that the patient has a bacterial infection, and let V be the 
event that the patient has a viral infection. We are told Pr(B) = 0.7, that Pr(V) = 0.4, 
and that $S = B \cup V$. We are asked to find $\Pr(B \cap V)$. We will use Theorem 1.5.7, which
says that
$$\Pr(B \cup V) = \Pr(B) + \Pr(V) - \Pr(B \cap V). \tag{1.5.2}$$
Since $S = B \cup V$, the left-hand side of (1.5.2) is 1, while the first two terms on the
right-hand side are 0.7 and 0.4. The result is
$$1 = 0.7 + 0.4 - \Pr(B \cap V),$$
which leads to $\Pr(B \cap V) = 0.1$, the probability that the patient has both infections.


Demands for Utilities. Consider, once again, the contractor who needs to plan for 
water and electricity demands in Example 1.4.5. There are many possible choices 
for how to spread the probability around the sample space (pictured in Fig. 1.5 on 
page 12). One simple choice is to make the probability of an event E proportional to 
the area of E. The area of S (the sample space) is $(150 - 1) \times (200 - 4)$ = 29,204,
so Pr(E) equals the area of E divided by 29,204. For example, suppose that the 
contractor is interested in high demand. Let A be the set where water demand is 
at least 100, and let B be the event that electric demand is at least 115, and suppose 
that these values are considered high demand. These events are shaded with different 
patterns in Fig. 1.9. The area of A is $(150 - 1) \times (200 - 100)$ = 14,900, and the area
of B is $(150 - 115) \times (200 - 4)$ = 6,860. So,
$$\Pr(A) = \frac{14{,}900}{29{,}204} = 0.5102, \qquad \Pr(B) = \frac{6{,}860}{29{,}204} = 0.2349.$$
The two events intersect in the region denoted by $A \cap B$. The area of this region
is $(150 - 115) \times (200 - 100)$ = 3,500, so $\Pr(A \cap B)$ = 3,500/29,204 = 0.1198. If the
contractor wishes to compute the probability that at least one of the two demands
will be high, that probability is
$$\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B) = 0.5102 + 0.2349 - 0.1198 = 0.6253,$$
according to Theorem 1.5.7.

Figure 1.9 The two events of interest in utility demand sample space for Example 1.5.4.
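
The arithmetic in this example is easy to reproduce. The sketch below assumes, as the example does, that probability is proportional to area over the rectangle with water demand from 4 to 200 and electric demand from 1 to 150; the constant and function names are introduced here for illustration.

```python
# Bounds of the sample space from the example.
WATER_LO, WATER_HI = 4, 200        # thousands of gallons per day
ELEC_LO, ELEC_HI = 1, 150          # millions of kilowatt-hours per day


def rect_area(x_lo, x_hi, y_lo, y_hi):
    """Area of the rectangle [x_lo, x_hi] x [y_lo, y_hi]."""
    return (x_hi - x_lo) * (y_hi - y_lo)


area_S = rect_area(WATER_LO, WATER_HI, ELEC_LO, ELEC_HI)
pr_A = rect_area(100, WATER_HI, ELEC_LO, ELEC_HI) / area_S     # water >= 100
pr_B = rect_area(WATER_LO, WATER_HI, 115, ELEC_HI) / area_S    # electric >= 115
pr_AB = rect_area(100, WATER_HI, 115, ELEC_HI) / area_S        # both high
pr_union = pr_A + pr_B - pr_AB

print(area_S)   # 29204
print(round(pr_A, 4), round(pr_B, 4), round(pr_AB, 4), round(pr_union, 4))
# 0.5102 0.2349 0.1198 0.6253
```

Computing the union from unrounded values and rounding once at the end gives the same four-decimal answer as the text's sum of rounded terms in this case, though in general the two approaches can differ in the last digit.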


The proof of the following useful result is left to Exercise 13. 


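Theorem 1.5.8 Bonferroni Inequality. For all events $A_1, \ldots, A_n$,
$$\Pr\left(\bigcup_{i=1}^{n} A_i\right) \le \sum_{i=1}^{n} \Pr(A_i) \qquad \text{and} \qquad \Pr\left(\bigcap_{i=1}^{n} A_i\right) \ge 1 - \sum_{i=1}^{n} \Pr(A_i^c).$$
(The second inequality above is known as the Bonferroni inequality.)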
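
A small numerical instance may help in reading the two inequalities. The sketch below assumes a fair six-sided die and three events chosen here purely for illustration; it is a check of one case, not a proof.

```python
from fractions import Fraction

S = frozenset(range(1, 7))          # one roll of a fair die


def pr(event) -> Fraction:
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(S))


A = [frozenset({1, 2, 3, 4, 5}), frozenset({1, 2, 3, 4}), frozenset({2, 3, 4, 5, 6})]

union = frozenset().union(*A)
intersection = S.intersection(*A)

# First inequality: Pr(union) <= sum of Pr(A_i).
assert pr(union) <= sum(pr(E) for E in A)
# Second (Bonferroni) inequality: Pr(intersection) >= 1 - sum of Pr(A_i^c).
assert pr(intersection) >= 1 - sum(pr(S - E) for E in A)

print(pr(intersection), 1 - sum(pr(S - E) for E in A))   # 1/2 1/3
```

Here the lower bound 1/3 is strictly below the true value 1/2 because two of the complements overlap; when the complements are disjoint the bound is attained.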


Note: Probability Zero Does Not Mean Impossible. When an event has probability 
0, it does not mean that the event is impossible. In Example 1.5.4, there are many 
events with 0 probability, but they are not all impossible. For example, for every x, the 
event that water demand equals x corresponds to a line segment in Fig. 1.5. Since line 
segments have 0 area, the probability of every such line segment is 0, but the events 
are not all impossible. Indeed, if every event of the form {water demand equals x} 
were impossible, then water demand could not take any value at all. If $\epsilon > 0$, the
event


{water demand is between $x - \epsilon$ and $x + \epsilon$}


will have positive probability, but that probability will go to 0 as $\epsilon$ goes to 0.
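
The shrinking probability can be seen numerically. The sketch below assumes the area-based probability of Example 1.5.4 and a band that lies entirely inside the water range from 4 to 200; the function name `pr_water_near` is introduced here.

```python
AREA_S = (150 - 1) * (200 - 4)      # 29,204, the area of the sample space


def pr_water_near(x: float, eps: float) -> float:
    """Pr(x - eps <= water demand <= x + eps) under the area-based probability.

    Assumes the band [x - eps, x + eps] lies inside the water range [4, 200].
    """
    if not (4 <= x - eps and x + eps <= 200):
        raise ValueError("band must lie inside [4, 200]")
    return (2 * eps) * (150 - 1) / AREA_S


for eps in (10, 1, 0.1, 0.001):
    print(eps, pr_water_near(100, eps))

# With eps = 0 the band is a line segment of zero area.
assert pr_water_near(100, 0) == 0
```

The probability is proportional to the band's width, so it is positive for every positive half-width and reaches 0 only when the band collapses to a single line.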


Summary 


We have presented the mathematical definition of probability through the three 
axioms. The axioms require that every event have nonnegative probability, that the 
whole sample space have probability 1, and that the union of an infinite sequence 
of disjoint events have probability equal to the sum of their probabilities. Some 
important results to remember include the following: 


• If $A_1, \ldots, A_k$ are disjoint, $\Pr\left(\bigcup_{i=1}^{k} A_i\right) = \sum_{i=1}^{k} \Pr(A_i)$.
• $\Pr(A^c) = 1 - \Pr(A)$.
• $A \subset B$ implies that $\Pr(A) \le \Pr(B)$.
• $\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$.


It does not matter how the probabilities were determined. As long as they satisfy the 
three axioms, they must also satisfy the above relations as well as all of the results 
that we prove later in the text. 


Exercises 


1. One ball is to be selected from a box containing red, 
white, blue, yellow, and green balls. If the probability that 
the selected ball will be red is 1/5 and the probability that 
it will be white is 2/5, what is the probability that it will be 
blue, yellow, or green? 


2. A student selected from a class will be either a boy or 
a girl. If the probability that a boy will be selected is 0.3, 
what is the probability that a girl will be selected? 


3. Consider two events A and B such that Pr(A) = 1/3
and Pr(B) = 1/2. Determine the value of $\Pr(B \cap A^c)$ for
each of the following conditions: (a) A and B are disjoint;
(b) $A \subset B$; (c) $\Pr(A \cap B) = 1/8$.


4. If the probability that student A will fail a certain statis- 
tics examination is 0.5, the probability that student B will 
fail the examination is 0.2, and the probability that both 
student A and student B will fail the examination is 0.1, 
what is the probability that at least one of these two stu- 
dents will fail the examination? 


5. For the conditions of Exercise 4, what is the probability 
that neither student A nor student B will fail the examina- 
tion? 


6. For the conditions of Exercise 4, what is the probability 
that exactly one of the two students will fail the examina- 
tion? 


7. Consider two events A and B with Pr(A) = 0.4 and
Pr(B) = 0.7. Determine the maximum and minimum possible
values of $\Pr(A \cap B)$ and the conditions under which
each of these values is attained.


8. If 50 percent of the families in a certain city subscribe 
to the morning newspaper, 65 percent of the families sub- 
scribe to the afternoon newspaper, and 85 percent of the 
families subscribe to at least one of the two newspapers, 
what percentage of the families subscribe to both newspa- 
pers? 


9. Prove that for every two events A and B, the probability 
that exactly one of the two events will occur is given by the 
expression 


$\Pr(A) + \Pr(B) - 2\Pr(A \cap B)$.


10. For two arbitrary events A and B, prove that

$\Pr(A) = \Pr(A \cap B) + \Pr(A \cap B^c)$.


11. A point (x, y) is to be selected from the square S
containing all points (x, y) such that $0 \le x \le 1$ and $0 \le y \le 1$.
Suppose that the probability that the selected point will
belong to each specified subset of S is equal to the area of
that subset. Find the probability of each of the following
subsets: (a) the subset of points such that
$(x - \frac{1}{2})^2 + (y - \frac{1}{2})^2 \ge \frac{1}{4}$;
(b) the subset of points such that $\frac{1}{2} < x + y < \frac{3}{2}$;
(c) the subset of points such that $y \le 1 - x^2$; (d) the subset
of points such that x = y.


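12. Let $A_1, A_2, \ldots$ be an arbitrary infinite sequence of
events, and let $B_1, B_2, \ldots$ be another infinite sequence
of events defined as follows: $B_1 = A_1$, $B_2 = A_1^c \cap A_2$,
$B_3 = A_1^c \cap A_2^c \cap A_3$, $B_4 = A_1^c \cap A_2^c \cap A_3^c \cap A_4$, .... Prove that

$$\Pr\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} \Pr(B_i) \quad \text{for } n = 1, 2, \ldots,$$

and that

$$\Pr\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \Pr(B_i).$$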


13. Prove Theorem 1.5.8. Hint: Use Exercise 12. 


14. Consider, once again, the four blood types A, B, AB, 
and O described in Exercise 8 in Sec. 1.4 together with 
the two antigens anti-A and anti-B. Suppose that, for a 
given person, the probability of type O blood is 0.5, the 
probability of type A blood is 0.34, and the probability of 
type B blood is 0.12. 


a. Find the probability that each of the antigens will 
react with this person’s blood. 

b. Find the probability that both antigens will react with 
this person’s blood. 
1.6 Finite Sample Spaces


The simplest experiments in which to determine and derive probabilities are those 
that involve only finitely many possible outcomes. This section gives several ex- 
amples to illustrate the important concepts from Sec. 1.5 in finite sample spaces. 


Current Population Survey. Every month, the Census Bureau conducts a survey of 
the United States population in order to learn about labor-force characteristics. 
Several pieces of information are collected on each of about 50,000 households. 
One piece of information is whether or not someone in the household is actively 
looking for employment but currently not employed. Suppose that our experiment 
consists of selecting three households at random from the 50,000 that were surveyed 
in a particular month and obtaining access to the information recorded during the 
survey. (Due to the confidential nature of information obtained during the Current 
Population Survey, only researchers in the Census Bureau would be able to perform 
the experiment just described.) The outcomes that make up the sample space S for 
this experiment can be described as lists of three distinct numbers from 1 to
50,000. For example, (300, 1, 24602) is one such list where we have kept track of the
order in which the three households were selected. Clearly, there are only finitely
many such lists. We can assume that each list is equally likely to be chosen, but we
need to be able to count how many such lists there are. We shall learn a method for
counting the outcomes for this example in Sec. 1.7. ◄


Requirements of Probabilities 


In this section, we shall consider experiments for which there are only a finite number
of possible outcomes. In other words, we shall consider experiments for which the
sample space S contains only a finite number of points $s_1, \ldots, s_n$. In an experiment of
this type, a probability measure on S is specified by assigning a probability $p_i$ to each
point $s_i \in S$. The number $p_i$ is the probability that the outcome of the experiment
will be $s_i$ ($i = 1, \ldots, n$). In order to satisfy the axioms of probability, the numbers
$p_1, \ldots, p_n$ must satisfy the following two conditions:

$$p_i \ge 0 \quad \text{for } i = 1, \ldots, n$$

and

$$\sum_{i=1}^{n} p_i = 1.$$

The probability of each event A can then be found by adding the probabilities $p_i$ of
all outcomes $s_i$ that belong to A. This is the general version of Example 1.5.2.


Fiber Breaks. Consider an experiment in which five fibers having different lengths are 
subjected to a testing process to learn which fiber will break first. Suppose that the 
lengths of the five fibers are 1, 2, 3, 4, and 5 inches, respectively. Suppose also that 
the probability that any given fiber will be the first to break is proportional to the 
length of that fiber. We shall determine the probability that the length of the fiber 
that breaks first is not more than 3 inches. 

In this example, we shall let $s_i$ be the outcome in which the fiber whose length is
i inches breaks first ($i = 1, \ldots, 5$). Then $S = \{s_1, \ldots, s_5\}$ and $p_i = \alpha i$ for $i = 1, \ldots, 5$,
where $\alpha$ is a proportionality factor. It must be true that $p_1 + \cdots + p_5 = 1$, and we
know that $p_1 + \cdots + p_5 = 15\alpha$, so $\alpha = 1/15$. If A is the event that the length of the
fiber that breaks first is not more than 3 inches, then $A = \{s_1, s_2, s_3\}$. Therefore,

$$\Pr(A) = p_1 + p_2 + p_3 = \frac{1}{15} + \frac{2}{15} + \frac{3}{15} = \frac{2}{5}.$$ ◄
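
The fiber calculation can be reproduced with exact fractions. The sketch below is only an illustration; the variable names `lengths`, `alpha`, and `pr_A` are assumptions introduced here. It normalizes probabilities proportional to length, checks the two requirements on the $p_i$, and sums over the outcomes in A.

```python
from fractions import Fraction

lengths = [1, 2, 3, 4, 5]                 # fiber lengths in inches
alpha = Fraction(1, sum(lengths))         # proportionality factor, 1/15
p = {i: alpha * i for i in lengths}       # p_i = alpha * i

# The two requirements: nonnegative probabilities that sum to 1.
assert all(v >= 0 for v in p.values()) and sum(p.values()) == 1

A = [i for i in lengths if i <= 3]        # first break is at most 3 inches
pr_A = sum(p[i] for i in A)
print(pr_A)                               # 2/5
```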


Simple Sample Spaces 


A sample space S containing n outcomes $s_1, \ldots, s_n$ is called a simple sample space
if the probability assigned to each of the outcomes $s_1, \ldots, s_n$ is 1/n. If an event A in
this simple sample space contains exactly m outcomes, then

$$\Pr(A) = \frac{m}{n}.$$


Tossing Coins. Suppose that three fair coins are tossed simultaneously. We shall 
determine the probability of obtaining exactly two heads. 

Regardless of whether or not the three coins can be distinguished from each 
other by the experimenter, it is convenient for the purpose of describing the sample 
space to assume that the coins can be distinguished. We can then speak of the result 
for the first coin, the result for the second coin, and the result for the third coin; and 
the sample space will comprise the eight possible outcomes listed in Example 1.4.4 
on page 12. 

Furthermore, because of the assumption that the coins are fair, it is reasonable 
to assume that this sample space is simple and that the probability assigned to each 
of the eight outcomes is 1/8. As can be seen from the listing in Example 1.4.4, exactly 
two heads will be obtained in three of these outcomes. Therefore, the probability of 
obtaining exactly two heads is 3/8. ◄


It should be noted that if we had considered the only possible outcomes to be 
no heads, one head, two heads, and three heads, it would have been reasonable to 
assume that the sample space contained just these four outcomes. This sample space 
would not be simple because the outcomes would not be equally probable. 
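
To make the contrast concrete, the following sketch enumerates the eight distinguishable outcomes for three fair coins, counts those with exactly two heads, and then collapses the outcomes to the number of heads. The names `outcomes`, `pr_two_heads`, and `pr_heads` are illustrative, and the approach (brute-force enumeration with `itertools.product`) is just one way to carry out the count.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=3))      # 8 equally likely outcomes
n = len(outcomes)

two_heads = [o for o in outcomes if o.count("H") == 2]
pr_two_heads = Fraction(len(two_heads), n)

# Coarser sample space: only the number of heads is recorded.
head_counts = Counter(o.count("H") for o in outcomes)
pr_heads = {h: Fraction(c, n) for h, c in sorted(head_counts.items())}
print(pr_two_heads, pr_heads)
```

The four values in `pr_heads` are 1/8, 3/8, 3/8, and 1/8, so the four-outcome sample space is not simple.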


Genetics. Inherited traits in humans are determined by material in specific locations 
on chromosomes. Each normal human receives 23 chromosomes from each parent, 
and these chromosomes are naturally paired, with one chromosome in each pair 
coming from each parent. For the purposes of this text, it is safe to think of a gene 
as a portion of each chromosome in a pair. The genes, either one at a time or in 
combination, determine the inherited traits, such as blood type and hair color. The 
material in the two locations that make up a gene on the pair of chromosomes 
comes in forms called alleles. Each distinct combination of alleles (one on each 
chromosome) is called a genotype. 

Consider a gene with only two different alleles A and a. Suppose that both 
parents have genotype Aa, that is, each parent has allele A on one chromosome 
and allele a on the other. (We do not distinguish the same alleles in a different order 
as a different genotype. For example, aA would be the same genotype as Aa. But it 
can be convenient to distinguish the two chromosomes during intermediate steps in 
probability calculations, just as we distinguished the three coins in Example 1.6.3.) 
What are the possible genotypes of an offspring of these two parents? If all possible 
results of the parents contributing pairs of alleles are equally likely, what are the 
probabilities of the different genotypes? 

To begin, we shall distinguish which allele the offspring receives from each 
parent, since we are assuming that pairs of contributed alleles are equally likely. 
Afterward, we shall combine those results that produce the same genotype. The
possible contributions from the parents are:


              Mother
Father      A       a
   A        AA      Aa
   a        aA      aa


So, there are three possible genotypes AA, Aa, and aa for the offspring. Since we 
assumed that every combination was equally likely, the four cells in the table all 
have probability 1/4. Since two of the cells in the table combined into genotype Aa, 
that genotype has probability 1/2. The other two genotypes each have probability 
1/4, since they each correspond to only one cell in the table. ◄
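
The table-and-combine step can be sketched in a few lines. This is an illustration under the example's assumption that all four cells are equally likely; the string encoding of genotypes and the names `cells` and `probs` are choices made here, not part of the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

mother = "Aa"
father = "Aa"

# Each cell is (allele from father, allele from mother); all cells equally likely.
cells = list(product(father, mother))
# Sort the two alleles so that "aA" and "Aa" count as the same genotype.
genotypes = Counter("".join(sorted(pair)) for pair in cells)
probs = {g: Fraction(c, len(cells)) for g, c in genotypes.items()}
print(probs)
```

Replacing the `mother` and `father` strings with other genotypes repeats the same calculation for a different pair of parents.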


Rolling Two Dice. We shall now consider an experiment in which two balanced dice 
are rolled, and we shall calculate the probability of each of the possible values of the 
sum of the two numbers that may appear. 

Although the experimenter need not be able to distinguish the two dice from 
one another in order to observe the value of their sum, the specification of a simple 
sample space in this example will be facilitated if we assume that the two dice are 
distinguishable. If this assumption is made, each outcome in the sample space S can 
be represented as a pair of numbers (x, y), where x is the number that appears on the 
first die and y is the number that appears on the second die. Therefore, S comprises
the following 36 outcomes:


(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)


It is natural to assume that S is a simple sample space and that the probability of each 
of these outcomes is 1/36. 

Let $P_i$ denote the probability that the sum of the two numbers is i for i =
2, 3, ..., 12. The only outcome in S for which the sum is 2 is the outcome (1, 1).
Therefore, $P_2 = 1/36$. The sum will be 3 for either of the two outcomes (1, 2) and (2, 1).
Therefore, $P_3 = 2/36 = 1/18$. By continuing in this manner, we obtain the following
probability for each of the possible values of the sum:

$$\begin{aligned}
P_2 &= P_{12} = \frac{1}{36}, & P_5 &= P_9 = \frac{4}{36}, \\
P_3 &= P_{11} = \frac{2}{36}, & P_6 &= P_8 = \frac{5}{36}, \\
P_4 &= P_{10} = \frac{3}{36}, & P_7 &= \frac{6}{36}.
\end{aligned}$$ ◄
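
The "continuing in this manner" step amounts to tallying the 36 outcomes by their sum. A minimal sketch, assuming the simple sample space described above (the names `counts` and `P` are illustrative):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))      # 36 equally likely pairs (x, y)
counts = Counter(x + y for x, y in S)         # number of outcomes for each sum
P = {i: Fraction(counts[i], len(S)) for i in range(2, 13)}

for i, p in P.items():
    print(i, p)
```

`Fraction` reduces automatically, so for instance the sum 3 is reported as 1/18 rather than 2/36.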


Summary 
A simple sample space is a finite sample space S such that every outcome in S has the
same probability. If there are n outcomes in a simple sample space S, then each one 
must have probability 1/n. The probability of an event E in a simple sample space is 
the number of outcomes in E divided by n. In the next three sections, we will present 
some useful methods for counting numbers of outcomes in various events. 


Exercises 


1. If two balanced dice are rolled, what is the probability 
that the sum of the two numbers that appear will be odd? 


2. If two balanced dice are rolled, what is the probability 
that the sum of the two numbers that appear will be even? 


3. If two balanced dice are rolled, what is the probability 
that the difference between the two numbers that appear 
will be less than 3? 


4. A school contains students in grades 1, 2, 3, 4, 5, and 
6. Grades 2, 3, 4, 5, and 6 all contain the same number of 
students, but there are twice this number in grade 1. If a
student is selected at random from a list of all the students
in the school, what is the probability that she will be in 
grade 3? 


5. For the conditions of Exercise 4, what is the probabil- 
ity that the selected student will be in an odd-numbered 
grade? 


6. If three fair coins are tossed, what is the probability that 
all three faces will be the same? 


7. Consider the setup of Example 1.6.4 on page 23. This 
time, assume that two parents have genotypes Aa and aa. 
Find the possible genotypes for an offspring and find the 
probabilities for each genotype. Assume that all possi- 
ble results of the parents contributing pairs of alleles are 
equally likely. 


8. Consider an experiment in which a fair coin is tossed 
once and a balanced die is rolled once. 


a. Describe the sample space for this experiment. 


b. What is the probability that a head will be obtained 
on the coin and an odd number will be obtained on 
the die? 


1.7 Counting Methods 


In simple sample spaces, one way to calculate the probability of an event involves 
counting the number of outcomes in the event and the number of outcomes in 
the sample space. This section presents some common methods for counting the 
number of outcomes in a set. These methods rely on special structure that exists in 
many common experiments, namely, that each outcome consists of several parts 
and that it is relatively easy to count how many possibilities there are for each of
the parts.


We have seen that in a simple sample space S, the probability of an event A is the 
ratio of the number of outcomes in A to the total number of outcomes in S. In many 
experiments, the number of outcomes in S is so large that a complete listing of these 
outcomes is too expensive, too slow, or too likely to be incorrect to be useful. In such 
an experiment, it is convenient to have a method of determining the total number 
of outcomes in the space S and in various events in S without compiling a list of all 
these outcomes. In this section, some of these methods will be presented. 
Figure 1.10 Three cities with routes between them in Example 1.7.1.

Multiplication Rule 


Routes between Cities. Suppose that there are three different routes from city A to 
city B and five different routes from city B to city C. The cities and routes are depicted 
in Fig. 1.10, with the routes numbered from 1 to 8. We wish to count the number of 
different routes from A to C that pass through B. For example, one such route from 
Fig. 1.10 is 1 followed by 4, which we can denote (1, 4). Similarly, there are the routes 
(1, 5), (1, 6), ..., (3, 8). It is not difficult to see that the number of different routes
is 3 × 5 = 15. ◄
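
The route count can be checked by listing the pairs directly. The sketch below assumes, consistent with the pairs (1, 4) through (3, 8) named in the example, that routes 1 to 3 run from A to B and routes 4 to 8 run from B to C; the variable names are illustrative.

```python
from itertools import product

routes_ab = [1, 2, 3]            # routes from A to B (assumed numbering)
routes_bc = [4, 5, 6, 7, 8]      # routes from B to C (assumed numbering)

# Every route from A to C through B is a pair (route A->B, route B->C).
routes_ac = list(product(routes_ab, routes_bc))
print(len(routes_ac))            # 15
print(routes_ac[0], routes_ac[-1])
```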


Example 1.7.1 is a special case of a common form of experiment. 


Experiment in Two Parts. Consider an experiment that has the following two charac- 
teristics: 


i. The experiment is performed in two parts. 


ii. The first part of the experiment has m possible outcomes $x_1, \ldots, x_m$, and,
regardless of which one of these outcomes $x_i$ occurs, the second part of the
experiment has n possible outcomes $y_1, \ldots, y_n$.


Each outcome in the sample space S of such an experiment will therefore be a pair
having the form $(x_i, y_j)$, and S will be composed of the following pairs:

$$\begin{array}{cccc}
(x_1, y_1) & (x_1, y_2) & \cdots & (x_1, y_n) \\
(x_2, y_1) & (x_2, y_2) & \cdots & (x_2, y_n) \\
\vdots & \vdots & & \vdots \\
(x_m, y_1) & (x_m, y_2) & \cdots & (x_m, y_n).
\end{array}$$ ◄


Since each of the m rows in the array in Example 1.7.2 contains n pairs, the 
following result follows directly. 


Multiplication Rule for Two-Part Experiments. In an experiment of the type described
in Example 1.7.2, the sample space S contains exactly mn outcomes. ■


Figure 1.11 illustrates the multiplication rule for the case of n = 3 and m = 2 with a
tree diagram. Each end-node of the tree represents an outcome, which is the pair
consisting of the two parts whose names appear along the branch leading to the
end-node.


Rolling Two Dice. Suppose that two dice are rolled. Since there are six possible 
outcomes for each die, the number of possible outcomes for the experiment is 
6 × 6 = 36, as we saw in Example 1.6.5. ◄


The multiplication rule can be extended to experiments with more than two parts. 


Figure 1.11 Tree diagram in which end-nodes represent outcomes.


Multiplication Rule. Suppose that an experiment has k parts ($k \ge 2$), that the ith
part of the experiment can have $n_i$ possible outcomes ($i = 1, \ldots, k$), and that all
of the outcomes in each part can occur regardless of which specific outcomes have
occurred in the other parts. Then the sample space S of the experiment will contain
all vectors of the form $(u_1, \ldots, u_k)$, where $u_i$ is one of the $n_i$ possible outcomes of part
i ($i = 1, \ldots, k$). The total number of these vectors in S will be equal to the product
$n_1 n_2 \cdots n_k$. ■


Tossing Several Coins. Suppose that we toss six coins. Each outcome in S will consist
of a sequence of six heads and tails, such as HTTHHH. Since there are two possible
outcomes for each of the six coins, the total number of outcomes in S will be $2^6 = 64$.
If head and tail are considered equally likely for each coin, then S will be a simple
sample space. Since there is only one outcome in S with six heads and no tails, the
probability of obtaining heads on all six coins is 1/64. Since there are six outcomes
in S with one head and five tails, the probability of obtaining exactly one head is
6/64 = 3/32. ◄
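
The six-coin counts are small enough to verify by enumeration. This sketch assumes fair coins, as in the example, and uses illustrative names (`pr_all_heads`, `pr_one_head`).

```python
from fractions import Fraction
from itertools import product

k = 6
S = list(product("HT", repeat=k))             # all sequences of six H/T results
assert len(S) == 2 ** k                       # multiplication rule: 2 * 2 * ... * 2

pr_all_heads = Fraction(sum(1 for s in S if s.count("H") == 6), len(S))
pr_one_head = Fraction(sum(1 for s in S if s.count("H") == 1), len(S))
print(len(S), pr_all_heads, pr_one_head)      # 64 1/64 3/32
```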


Combination Lock. A standard combination lock has a dial with tick marks for 40 
numbers from 0 to 39. The combination consists of a sequence of three numbers that 
must be dialed in the correct order to open the lock. Each of the 40 numbers may 
appear in each of the three positions of the combination regardless of what the other 
two positions contain. It follows that there are $40^3 = 64{,}000$ possible combinations.
This number is supposed to be large enough to discourage would-be thieves from
trying every combination. ◄


Note: The Multiplication Rule Is Slightly More General. In the statements of The- 
orems 1.7.1 and 1.7.2, it is assumed that each possible outcome in each part of the 
experiment can occur regardless of what occurs in the other parts of the experiment. 
Technically, all that is necessary is that the number of possible outcomes for each 
part of the experiment not depend on what occurs on the other parts. The discussion 
of permutations below is an example of this situation. 


Permutations 


Sampling without Replacement. Consider an experiment in which a card is selected 
and removed from a deck of n different cards, a second card is then selected and 
removed from the remaining n — 1 cards, and finally a third card is selected from the 
remaining n — 2 cards. Each outcome consists of the three cards in the order selected. 
A process of this kind is called sampling without replacement, since a card that is 
drawn is not replaced in the deck before the next card is selected. In this experiment, 
any one of the n cards could be selected first. Once this card has been removed, any 
one of the other n — 1 cards could be selected second. Therefore, there are n(n — 1) 
possible outcomes for the first two selections. Finally, for every given outcome of
the first two selections, there are n − 2 other cards that could possibly be selected
third. Therefore, the total number of possible outcomes for all three selections is
n(n − 1)(n − 2). ◄


The situation in Example 1.7.6 can be generalized to any number of selections 
without replacement. 


Permutations. Suppose that a set has n elements. Suppose that an experiment consists 
of selecting k of the elements one at a time without replacement. Let each outcome 
consist of the k elements in the order selected. Each such outcome is called a
permutation of n elements taken k at a time. We denote the number of distinct such
permutations by the symbol $P_{n,k}$.


By arguing as in Example 1.7.6, we can figure out how many different permutations 
there are of n elements taken k at a time. The proof of the following theorem is simply 
to extend the reasoning in Example 1.7.6 to selecting k cards without replacement. 
The proof is left to the reader. 


Number of Permutations. The number of permutations of n elements taken k at a time
is $P_{n,k} = n(n-1)\cdots(n-k+1)$. ■


Current Population Survey. Theorem 1.7.3 allows us to count the number of points in
the sample space of Example 1.6.1. Each outcome in S consists of a permutation of
n = 50,000 elements taken k = 3 at a time. Hence, the sample space S in that example
consists of

$$50{,}000 \times 49{,}999 \times 49{,}998 = 1.25 \times 10^{14}$$

outcomes. ◄
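
A direct way to evaluate $P_{n,k}$ is the running product in Theorem 1.7.3. The sketch below uses a hypothetical helper `num_permutations`, checks it against brute-force enumeration for a small deck, and then evaluates the survey count exactly. The comparison with `math.perm` assumes Python 3.8 or later.

```python
import math
from itertools import permutations

def num_permutations(n, k):
    """P_{n,k} = n (n-1) ... (n-k+1), computed as a running product."""
    result = 1
    for j in range(k):
        result *= n - j
    return result

# Brute-force check on a small case: orderings of 3 cards from a 5-card deck.
assert num_permutations(5, 3) == len(list(permutations(range(5), 3))) == 60

cps = num_permutations(50_000, 3)
print(cps)                                    # 124992500100000
print(cps == math.perm(50_000, 3))            # True
```

The exact value 124,992,500,100,000 rounds to the $1.25 \times 10^{14}$ quoted in the example.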


When k = n, the number of possible permutations will be the number $P_{n,n}$ of
different permutations of all n cards. It is seen from the equation just derived that


$$P_{n,n} = n(n-1)\cdots 1 = n!.$$


The symbol n! is read n factorial. In general, the number of permutations of n differ- 
ent items is n!. 

The expression for $P_{n,k}$ can be rewritten in the following alternate form for
$k = 1, \ldots, n - 1$:

$$P_{n,k} = n(n-1)\cdots(n-k+1)\,\frac{(n-k)(n-k-1)\cdots 1}{(n-k)(n-k-1)\cdots 1} = \frac{n!}{(n-k)!}.$$

Here and elsewhere in the theory of probability, it is convenient to define 0! by the
relation

$$0! = 1.$$

With this definition, it follows that the relation $P_{n,k} = n!/(n-k)!$ will be correct for
the value k = n as well as for the values $k = 1, \ldots, n - 1$. To summarize:


Permutations. The number of distinct orderings of k items selected without replacement
from a collection of n different items ($0 \le k \le n$) is

$$P_{n,k} = \frac{n!}{(n-k)!}.$$ ■
Choosing Officers. Suppose that a club consists of 25 members and that a president
and a secretary are to be chosen from the membership. We shall determine the total
possible number of ways in which these two positions can be filled.

Since the positions can be filled by first choosing one of the 25 members to be
president and then choosing one of the remaining 24 members to be secretary, the
possible number of choices is $P_{25,2} = (25)(24) = 600$. ◄


Arranging Books. Suppose that six different books are to be arranged on a shelf. The 
number of possible permutations of the books is 6! = 720. ◄


Sampling with Replacement. Consider a box that contains n balls numbered 1, ..., n.
First, one ball is selected at random from the box and its number is noted. This ball
is then put back in the box and another ball is selected (it is possible that the same
ball will be selected again). As many balls as desired can be selected in this way.
This process is called sampling with replacement. It is assumed that each of the n
balls is equally likely to be selected at each stage and that all selections are made
independently of each other.

Suppose that a total of k selections are to be made, where k is a given positive
integer. Then the sample space S of this experiment will contain all vectors of the form
$(x_1, \ldots, x_k)$, where $x_i$ is the outcome of the ith selection ($i = 1, \ldots, k$). Since there
are n possible outcomes for each of the k selections, the total number of vectors in S
is $n^k$. Furthermore, from our assumptions it follows that S is a simple sample space.
Hence, the probability assigned to each vector in S is $1/n^k$. ◄


Obtaining Different Numbers. For the experiment in Example 1.7.10, we shall deter- 
mine the probability of the event E that each of the k balls that are selected will have 
a different number. 

If k > n, it is impossible for all the selected balls to have different numbers because
there are only n different numbers. Suppose, therefore, that $k \le n$. The number
of outcomes in the event E is the number of vectors for which all k components are
different. This equals $P_{n,k}$, since the first component $x_1$ of each vector can have n possible
values, the second component $x_2$ can then have any one of the other n − 1 values,
and so on. Since S is a simple sample space containing $n^k$ vectors, the probability p
that k different numbers will be selected is

$$p = \frac{P_{n,k}}{n^k} = \frac{n!}{(n-k)!\,n^k}.$$ ◄

Note: Using Two Different Methods in the Same Problem. Example 1.7.11 illus- 
trates a combination of techniques that might seem confusing at first. The method 
used to count the number of outcomes in the sample space was based on sampling 
with replacement, since the experiment allows repeat numbers in each outcome. The 
method used to count the number of outcomes in the event E was permutations (sam- 
pling without replacement) because E consists of those outcomes without repeats. It 
often happens that one needs to use different methods to count the numbers of out- 
comes in different subsets of the sample space. The birthday problem, which follows, 
is another example in which we need more than one counting method in the same 
problem. 
The Birthday Problem


In the following problem, which is often called the birthday problem, it is required to 
determine the probability p that at least two people in a group of k people will have 
the same birthday, that is, will have been born on the same day of the same month but 
not necessarily in the same year. For the solution presented here, we assume that the 
birthdays of the k people are unrelated (in particular, we assume that twins are not 
present) and that each of the 365 days of the year is equally likely to be the birthday 
of any person in the group. In particular, we ignore the fact that the birth rate actually 
varies during the year and we assume that anyone actually born on February 29 will 
consider his birthday to be another day, such as March 1. 

When these assumptions are made, this problem becomes similar to the one
in Example 1.7.11. Since there are 365 possible birthdays for each of k people, the
sample space S will contain $365^k$ outcomes, all of which will be equally probable. If
k > 365, there are not enough birthdays for every one to be different, and hence at
least two people must have the same birthday. So, we assume that $k \le 365$. Counting
the number of outcomes in which at least two birthdays are the same is tedious.
However, the number of outcomes in S for which all k birthdays will be different is
$P_{365,k}$, since the first person’s birthday could be any one of the 365 days, the second
person’s birthday could then be any of the other 364 days, and so on. Hence, the
probability that all k persons will have different birthdays is

$$\frac{P_{365,k}}{365^k}.$$

The probability p that at least two of the people will have the same birthday is
therefore

$$p = 1 - \frac{P_{365,k}}{365^k} = 1 - \frac{365!}{(365-k)!\,365^k}.$$


Numerical values of this probability p for various values of k are given in Table 1.1. 
These probabilities may seem surprisingly large to anyone who has not thought about 
them before. Many persons would guess that in order to obtain a value of p greater 
than 1/2, the number of people in the group would have to be about 100. However, 
according to Table 1.1, there would have to be only 23 people in the group. As a 
matter of fact, for k = 100 the value of p is 0.9999997. 


Table 1.1 The probability p that at least two people in a group of k people will
have the same birthday


 k      p        k      p

 5    0.027     25    0.569
10    0.117     30    0.706
15    0.253     40    0.891
20    0.411     50    0.970
22    0.476     60    0.994

The calculation in this example illustrates a common technique for solving
probability problems. If one wishes to compute the probability of some event A, it might
be more straightforward to calculate $\Pr(A^c)$ and then use the fact that $\Pr(A) =
1 - \Pr(A^c)$. This idea is particularly useful when the event A is of the form “at least
n things happen” where n is small compared to how many things could happen.


Stirling’s Formula

For large values of n, it is nearly impossible to compute n!. For $n \ge 70$, $n! > 10^{100}$
and cannot be represented on many scientific calculators. In most cases for which
n! is needed with a large value of n, one only needs the ratio of n! to another large
number $a_n$. A common example of this is $P_{n,k}$ with large n and not so large k, which
equals n!/(n − k)!. In such cases, we can notice that

$$\frac{n!}{a_n} = e^{\log(n!) - \log(a_n)}.$$

Compared to computing n!, it takes a much larger n before log(n!) becomes difficult
to represent. Furthermore, if we had a simple approximation $s_n$ to log(n!) such that
$\lim_{n\to\infty} |s_n - \log(n!)| = 0$, then the ratio of $n!/a_n$ to $e^{s_n}/a_n$ would be close to 1 for large
n. The following result, whose proof can be found in Feller (1968), provides such an
approximation.


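Stirling’s Formula. Let

$$s_n = \frac{1}{2}\log(2\pi) + \left(n + \frac{1}{2}\right)\log(n) - n.$$

Then $\lim_{n\to\infty} |s_n - \log(n!)| = 0$. Put another way,

$$\lim_{n\to\infty} \frac{(2\pi)^{1/2}\, n^{n+1/2}\, e^{-n}}{n!} = 1.$$ ■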


Approximating the Number of Permutations. Suppose that we want to compute $P_{70,20} =
70!/50!$. The approximation from Stirling’s formula is

$$\frac{70!}{50!} \approx \frac{(2\pi)^{1/2}\, 70^{70.5}\, e^{-70}}{(2\pi)^{1/2}\, 50^{50.5}\, e^{-50}} = 3.940 \times 10^{35}.$$

The exact calculation yields $3.938 \times 10^{35}$. The approximation and the exact calculation
differ by less than 1/10 of 1 percent. ◄
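
The comparison in this example can be reproduced on the log scale, as the discussion before Stirling's formula suggests. The sketch below is illustrative; `stirling_log_factorial` is a hypothetical helper name for $s_n$, and the exact value uses Python's arbitrary-precision integers.

```python
import math

def stirling_log_factorial(n):
    """s_n = (1/2) log(2 pi) + (n + 1/2) log(n) - n, an approximation to log(n!)."""
    return 0.5 * math.log(2 * math.pi) + (n + 0.5) * math.log(n) - n

# Approximate 70!/50! as exp(s_70 - s_50); no huge factorials are formed.
approx = math.exp(stirling_log_factorial(70) - stirling_log_factorial(50))
exact = math.factorial(70) // math.factorial(50)    # exact integer P_{70,20}
rel_error = abs(approx - exact) / exact
print(f"{approx:.3e} {exact:.3e} {rel_error:.2e}")
```

Working with differences of logarithms keeps every intermediate quantity well within floating-point range even when n! itself would overflow.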




Summary 


Suppose that the following conditions are met: 


• Each element of a set consists of k distinguishable parts $x_1, \ldots, x_k$.

• There are $n_1$ possibilities for the first part $x_1$.

• For each $i = 2, \ldots, k$ and each combination $(x_1, \ldots, x_{i-1})$ of the first i − 1 parts,
there are $n_i$ possibilities for the ith part $x_i$.


Under these conditions, there are $n_1 \cdots n_k$ elements of the set. The third condition
requires only that the number of possibilities for $x_i$ be $n_i$ no matter what the earlier
parts are. For example, for i = 2, it does not require that the same $n_2$ possibilities
be available for $x_2$ regardless of what $x_1$ is. It only requires that the number of
possibilities for $x_2$ be $n_2$ no matter what $x_1$ is. In this way, the general rule includes the
multiplication rule, the calculation of permutations, and sampling with replacement
as special cases. For permutations of m items k at a time, we have $n_i = m - i + 1$ for
$i = 1, \ldots, k$, and the $n_i$ possibilities for part i are just the $n_i$ items that have not yet
appeared in the first i − 1 parts. For sampling with replacement from m items, we
have $n_i = m$ for all i, and the m possibilities are the same for every part. In the next
section, we shall consider how to count elements of sets in which the parts of each
element are not distinguishable.


Exercises 


1. Each year starts on one of the seven days (Sunday 
through Saturday). Each year is either a leap year (i.e., 
it includes February 29) or not. How many different cal- 
endars are possible for a year? 


2. Three different classes contain 20, 18, and 25 students, 
respectively, and no student is a member of more than one 
class. If a team is to be composed of one student from each 
of these three classes, in how many different ways can the 
members of the team be chosen? 


3. In how many different ways can the five letters a, b, c, 
d, and e be arranged? 


4. If a man has six different sportshirts and four different 
pairs of slacks, how many different combinations can he 
wear? 


5. If four dice are rolled, what is the probability that each 
of the four numbers that appear will be different? 


6. If six dice are rolled, what is the probability that each 
of the six different numbers will appear exactly once? 


7. If 12 balls are thrown at random into 20 boxes, what 
is the probability that no box will receive more than one 
ball? 


8. An elevator in a building starts with five passengers 
and stops at seven floors. If every passenger is equally 
likely to get off at each floor and all the passengers leave 
independently of each other, what is the probability that 
no two passengers will get off at the same floor? 


9. Suppose that three runners from team A and three run- 
ners from team B participate in a race. If all six runners 
have equal ability and there are no ties, what is the prob- 
ability that the three runners from team A will finish first, 
second, and third, and the three runners from team B will 
finish fourth, fifth, and sixth? 


10. A box contains 100 balls, of which r are red. Suppose 
that the balls are drawn from the box one at a time, at ran- 
dom, without replacement. Determine (a) the probability 
that the first ball drawn will be red; (b) the probability that 
the 50th ball drawn will be red; and (c) the probability that 
the last ball drawn will be red. 


11. Let n and k be positive integers such that both n and
$n - k$ are large. Use Stirling’s formula to write as simple
an approximation as you can for $P_{n,k}$.


1.8 Combinatorial Methods 


Many problems of counting the number of outcomes in an event amount to
counting how many subsets of a certain size are contained in a fixed set. This section
gives examples of how to do such counting and where it can arise.


Combinations 


Example 1.8.1. Choosing Subsets. Consider the set {a, b, c, d} containing the four different letters.
We want to count the number of distinct subsets of size two. In this case, we can list
all of the subsets of size two:

{a, b}, {a, c}, {a, d}, {b, c}, {b, d}, and {c, d}.

We see that there are six distinct subsets of size two. This is different from counting
permutations because {a, b} and {b, a} are the same subset. ◀


For large sets, it would be tedious, if not impossible, to enumerate all of the 
subsets of a given size and count them as we did in Example 1.8.1. However, there 
is a connection between counting subsets and counting permutations that will allow 
us to derive the general formula for the number of subsets. 

Suppose that there is a set of n distinct elements from which it is desired to
choose a subset containing k elements ($1 \le k \le n$). We shall determine the number of
different subsets that can be chosen. In this problem, the arrangement of the elements
in a subset is irrelevant and each subset is treated as a unit.


Definition 1.8.1. Combinations. Consider a set with n elements. Each subset of size k chosen from this
set is called a combination of n elements taken k at a time. We denote the number of
distinct such combinations by the symbol $C_{n,k}$.


No two combinations will consist of exactly the same elements because two 
subsets with the same elements are the same subset. 

At the end of Example 1.8.1, we noted that two different permutations (a, b)
and (b, a) both correspond to the same combination or subset {a, b}. We can think of
permutations as being constructed in two steps. First, a combination of k elements is
chosen out of n, and second, those k elements are arranged in a specific order. There
are $C_{n,k}$ ways to choose the k elements out of n, and for each such choice there are
k! ways to arrange those k elements in different orders. Using the multiplication rule
from Sec. 1.7, we see that the number of permutations of n elements taken k at a time
is $P_{n,k} = C_{n,k}\, k!$; hence, we have the following.


Theorem 1.8.1. Combinations. The number of distinct subsets of size k that can be chosen from a set
of size n is

$$C_{n,k} = \frac{P_{n,k}}{k!} = \frac{n!}{k!\,(n-k)!}.$$

In Example 1.8.1, we see that $C_{4,2} = 4!/[2!\,2!] = 6$.
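
The relation between subsets and permutations can be checked by direct enumeration. The following is a minimal Python sketch, not part of the original text; the function name `num_combinations` is an illustrative assumption. It lists the subsets of size two of {a, b, c, d} and compares the count with the formula.

```python
from itertools import combinations, permutations
from math import factorial


def num_combinations(n, k):
    """C_{n,k} = n! / (k! (n - k)!), computed with exact integer arithmetic."""
    return factorial(n) // (factorial(k) * factorial(n - k))


letters = "abcd"
subsets = list(combinations(letters, 2))   # unordered: ('a', 'b') appears once
ordered = list(permutations(letters, 2))   # ordered: ('a', 'b') and ('b', 'a')

print(subsets)
print(len(subsets), num_combinations(4, 2))
# Each subset of size 2 gives rise to 2! different orderings.
print(len(ordered) == len(subsets) * factorial(2))
```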


Example 1.8.2. Selecting a Committee. Suppose that a committee composed of eight people is to be
selected from a group of 20 people. The number of different groups of people that
might be on the committee is

$$C_{20,8} = \frac{20!}{8!\,12!} = 125{,}970.$$
◀


Example 1.8.3. Choosing Jobs. Suppose that, in Example 1.8.2, the eight people in the committee
each get a different job to perform on the committee. The number of ways to choose
eight people out of 20 and assign them to the eight different jobs is the number of
permutations of 20 elements taken eight at a time, or

$$P_{20,8} = C_{20,8} \times 8! = 125{,}970 \times 8! = 5{,}079{,}110{,}400.$$
◀


Examples 1.8.2 and 1.8.3 illustrate the difference and relationship between com- 
binations and permutations. In Example 1.8.3, we count the same group of people in 
a different order as a different outcome, while in Example 1.8.2, we count the same 
group in different orders as the same outcome. The two numerical values differ by a 
factor of 8!, the number of ways to reorder each of the combinations in Example 1.8.2 
to get a permutation in Example 1.8.3. 
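
The factor of 8! relating the two counts can be confirmed with exact integer arithmetic. This is a minimal sketch, not part of the original text, using `math.comb` and `math.perm` from the Python standard library (available in Python 3.8 and later); the variable names are illustrative.

```python
from math import comb, factorial, perm

committees = comb(20, 8)    # unordered groups of eight people
assignments = perm(20, 8)   # eight people placed in eight distinct jobs

print(committees)
print(assignments)
# Every committee can be assigned to the eight jobs in 8! ways.
print(assignments == committees * factorial(8))
```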


Binomial Coefficients 


Definition 1.8.2. Binomial Coefficients. The number $C_{n,k}$ is also denoted by the symbol $\binom{n}{k}$. That is, for
$k = 0, 1, \ldots, n$,

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}. \qquad (1.8.1)$$

When this notation is used, this number is called a binomial coefficient.


The name binomial coefficient derives from the appearance of the symbol in the 
binomial theorem, whose proof is left as Exercise 20 in this section. 


Theorem 1.8.2. Binomial Theorem. For all numbers x and y and each positive integer n,

$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.$$


There are a couple of useful relations between binomial coefficients. 


Theorem 1.8.3. For all n,

$$\binom{n}{0} = \binom{n}{n} = 1.$$

For all n and all $k = 0, 1, \ldots, n$,

$$\binom{n}{k} = \binom{n}{n-k}.$$

Proof The first equation follows from the fact that 0! = 1. The second equation
follows from Eq. (1.8.1). The second equation can also be derived from the fact that
selecting k elements to form a subset is equivalent to selecting the remaining $n - k$
elements to form the complement of the subset. ■
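
A numerical spot check is no substitute for the proofs, but it can make the binomial theorem and the symmetry relation concrete. The following minimal Python sketch is not part of the original text; the function name `binomial_expansion` and the test values are illustrative assumptions. Integer inputs are used so that the comparison is exact.

```python
from math import comb


def binomial_expansion(x, y, n):
    """Evaluate the sum over k of C(n, k) * x^k * y^(n - k)."""
    return sum(comb(n, k) * x**k * y**(n - k) for k in range(n + 1))


# Binomial theorem for one choice of x, y, n (integers, so no rounding error).
print(binomial_expansion(2, 3, 5), (2 + 3) ** 5)

# Symmetry: choosing k elements is the same as choosing the n - k left out.
n = 12
print(all(comb(n, k) == comb(n, n - k) for k in range(n + 1)))

# Boundary values: C(n, 0) = C(n, n) = 1.
print(comb(n, 0), comb(n, n))
```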


It is sometimes convenient to use the expression “n choose k” for the value of
$C_{n,k}$. Thus, the same quantity is represented by the two different notations $C_{n,k}$ and
$\binom{n}{k}$, and we may refer to this quantity in three different ways: as the number of
combinations of n elements taken k at a time, as the binomial coefficient of n and
k, or simply as “n choose k.”


Example 1.8.4. Blood Types. In Example 1.6.4 on page 23, we defined genes, alleles, and genotypes.
The gene for human blood type consists of a pair of alleles chosen from the three
alleles commonly called O, A, and B. For example, two possible combinations of
alleles (called genotypes) to form a blood-type gene would be BB and AO. We will
not distinguish the same two alleles in different orders, so OA represents the same
genotype as AO. How many genotypes are there for blood type?

The answer could easily be found by counting, but it is an example of a more
general calculation. Suppose that a gene consists of a pair chosen from a set of
n different alleles. Assuming that we cannot distinguish the same pair in different
orders, there are n pairs where both alleles are the same, and there are $\binom{n}{2}$ pairs
where the two alleles are different. The total number of genotypes is

$$n + \binom{n}{2} = n + \frac{n(n-1)}{2} = \frac{n(n+1)}{2} = \binom{n+1}{2}.$$

For the case of blood type, we have n = 3, so there are

$$\binom{4}{2} = \frac{4 \times 3}{2} = 6$$

genotypes, as could easily be verified by counting. ◀
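
The count can indeed be verified by listing the genotypes. The sketch below is illustrative and not part of the original text; the function name `genotypes` is an assumption. It relies on `itertools.combinations_with_replacement`, which yields unordered pairs and allows both members of a pair to be the same allele.

```python
from itertools import combinations_with_replacement
from math import comb


def genotypes(alleles):
    """List unordered pairs of alleles, allowing both alleles to be the same."""
    return list(combinations_with_replacement(alleles, 2))


blood = genotypes(["O", "A", "B"])
print(blood)
print(len(blood), comb(3 + 1, 2))
```

The pair ('O', 'A') appears in the list once, and ('A', 'O') does not appear separately, which matches the convention that OA and AO are the same genotype.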


Note: Sampling with Replacement. The counting method described in Example 1.8.4
is a type of sampling with replacement that is different from the type
described in Example 1.7.10. In Example 1.7.10, we sampled with replacement, but
we distinguished between samples having the same balls in different orders. This
could be called ordered sampling with replacement. In Example 1.8.4, samples
containing the same genes in different orders were considered the same outcome. This
could be called unordered sampling with replacement. The general formula for the
number of unordered samples of size k with replacement from n elements is $\binom{n+k-1}{k}$,
and can be derived in Exercise 19. It is possible to have k larger than n when sampling
with replacement.


Example 1.8.5. Selecting Baked Goods. You go to a bakery to select some baked goods for a dinner
party. You need to choose a total of 12 items. The baker has seven different types
of items from which to choose, with lots of each type available. How many different
boxfuls of 12 items are possible for you to choose? Here we will not distinguish the
same collection of 12 items arranged in different orders in the box. This is an example
of unordered sampling with replacement because we can (indeed we must) choose
the same type of item more than once, but we are not distinguishing the same items
in different orders. There are $\binom{7+12-1}{12} = 18{,}564$ different boxfuls. ◀
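
The count of boxfuls can be reproduced both from the formula for unordered sampling with replacement and by brute-force enumeration. The following is a minimal Python sketch, not part of the original text; both function names are illustrative assumptions.

```python
from itertools import combinations_with_replacement
from math import comb


def unordered_samples_with_replacement(n, k):
    """Number of unordered samples of size k, with replacement, from n types."""
    return comb(n + k - 1, k)


def enumerate_samples(n, k):
    """Brute-force count of the same samples; practical only for small n and k."""
    return sum(1 for _ in combinations_with_replacement(range(n), k))


print(unordered_samples_with_replacement(7, 12))
print(enumerate_samples(7, 12))
# The sample size k may exceed the number of types n.
print(unordered_samples_with_replacement(3, 5))
```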


Example 1.8.5 raises an issue that can cause confusion if one does not carefully 
determine the elements of the sample space and carefully specify which outcomes 
(if any) are equally likely. The next example illustrates the issue in the context of 
Example 1.8.5. 


Example 1.8.6. Selecting Baked Goods. Imagine two different ways of choosing a boxful of 12 baked
goods selected from the seven different types available. In the first method, you
choose one item at random from the seven available. Then, without regard to what
item was chosen first, you choose the second item at random from the seven available.
Then you continue in this way choosing the next item at random from the seven
available without regard to what has already been chosen until you have chosen 12.
For this method of choosing, it is natural to let the outcomes be the possible sequences
of the 12 types of items chosen. The sample space would contain $7^{12} = 1.38 \times 10^{10}$
different outcomes that would be equally likely.

In the second method of choosing, the baker tells you that she has available 
18,564 different boxfuls freshly packed. You then select one at random. In this case, 
the sample space would consist of 18,564 different equally likely outcomes. 

In spite of the different sample spaces that arise in the two methods of choosing, 
there are some verbal descriptions that identify an event in both sample spaces. For 
example, both sample spaces contain an event that could be described as {all 12 items 
are of the same type} even though the outcomes are different types of mathematical 
objects in the two sample spaces. The probability that all 12 items are of the same 
type will actually be different depending on which method you use to choose the 
boxful. 

In the first method, seven of the equally likely outcomes contain 12 of the
same type of item. Hence, the probability that all 12 items are of the same type is
$7/7^{12} = 5.06 \times 10^{-10}$. In the second method, there are seven equally likely boxes
that contain 12 of the same type of item. Hence, the probability that all 12 items are
of the same type is $7/18{,}564 = 3.77 \times 10^{-4}$. Before one can compute the probability
for an event such as {all 12 items are of the same type}, one must be careful about
defining the experiment and its outcomes. ◀
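
The two probabilities differ by several orders of magnitude, which a short exact computation makes plain. This is a minimal Python sketch, not part of the original text; the variable names are illustrative, and `fractions.Fraction` is used only to keep the arithmetic exact before converting to floating point.

```python
from fractions import Fraction
from math import comb

types, box_size = 7, 12

# Method 1: ordered sequences of 12 independent choices, all equally likely.
p_sequences = Fraction(types, types**box_size)

# Method 2: one of the unordered boxfuls chosen at random, all equally likely.
p_boxfuls = Fraction(types, comb(types + box_size - 1, box_size))

print(float(p_sequences))
print(float(p_boxfuls))
print(float(p_boxfuls / p_sequences))  # how many times larger method 2's value is
```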


Arrangements of Elements of Two Distinct Types. When a set contains only
elements of two distinct types, a binomial coefficient can be used to represent the
number of different arrangements of all the elements in the set. Suppose, for
example, that k similar red balls and $n - k$ similar green balls are to be arranged in a
row. Since the red balls will occupy k positions in the row, each different arrangement
of the n balls corresponds to a different choice of the k positions occupied by the red
balls. Hence, the number of different arrangements of the n balls will be equal to
the number of different ways in which k positions can be selected for the red balls
from the n available positions. Since this number of ways is specified by the
binomial coefficient $\binom{n}{k}$, the number of different arrangements of the n balls is also $\binom{n}{k}$.
In other words, the number of different arrangements of n objects consisting of k
similar objects of one type and $n - k$ similar objects of a second type is $\binom{n}{k}$.


Example 1.8.7. Tossing a Coin. Suppose that a fair coin is to be tossed 10 times, and it is desired
to determine (a) the probability p of obtaining exactly three heads and (b) the
probability $p'$ of obtaining three or fewer heads.

(a) The total possible number of different sequences of 10 heads and tails is $2^{10}$,
and it may be assumed that each of these sequences is equally probable. The
number of these sequences that contain exactly three heads will be equal to
the number of different arrangements that can be formed with three heads and
seven tails. Here are some of those arrangements:

HHHTTTTTTT, HHTHTTTTTT, HHTTHTTTTT, TTHTHTHTTT, etc.

Each such arrangement is equivalent to a choice of where to put the 3 heads
among the 10 tosses, so there are $\binom{10}{3}$ such arrangements. The probability of
obtaining exactly three heads is then

$$p = \frac{\binom{10}{3}}{2^{10}} = 0.1172.$$

(b) Using the same reasoning as in part (a), the number of sequences in the sample
space that contain exactly k heads (k = 0, 1, 2, 3) is $\binom{10}{k}$. Hence, the probability
of obtaining three or fewer heads is

$$p' = \frac{\binom{10}{0} + \binom{10}{1} + \binom{10}{2} + \binom{10}{3}}{2^{10}}
= \frac{1 + 10 + 45 + 120}{2^{10}} = \frac{176}{2^{10}} = 0.1719.$$
◀
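
Because there are only $2^{10} = 1024$ sequences, both probabilities can be checked by listing every sequence. The following minimal Python sketch is not part of the original text; the variable names are illustrative assumptions.

```python
from itertools import product
from math import comb

n = 10
total = 2**n

# Counting with binomial coefficients.
p_exactly_three = comb(n, 3) / total
p_at_most_three = sum(comb(n, k) for k in range(4)) / total

# Brute-force check: enumerate all 2^10 sequences of H and T.
sequences = list(product("HT", repeat=n))
brute_exact = sum(seq.count("H") == 3 for seq in sequences) / total
brute_at_most = sum(seq.count("H") <= 3 for seq in sequences) / total

print(p_exactly_three, brute_exact)
print(p_at_most_three, brute_at_most)
```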


Note: Using Two Different Methods in the Same Problem. Part (a) of Example 1.8.7
is another example of using two different counting methods in the same
problem. Part (b) illustrates another general technique. In this part, we broke the
event of interest into several disjoint subsets and counted the numbers of outcomes
separately for each subset and then added the counts together to get the total. In
many problems, it can require several applications of the same or different counting
methods in order to count the number of outcomes in an event. The next example is
one in which the elements of an event are formed in two parts (multiplication rule),
but we need to perform separate combination calculations to determine the numbers
of outcomes for each part.


Example 1.8.8. Sampling without Replacement. Suppose that a class contains 15 boys and 30 girls,
and that 10 students are to be selected at random for a special assignment. We shall
determine the probability p that exactly three boys will be selected.

The number of different combinations of the 45 students that might be obtained
in the sample of 10 students is $\binom{45}{10}$, and the statement that the 10 students are selected
at random means that each of these $\binom{45}{10}$ possible combinations is equally probable.
Therefore, we must find the number of these combinations that contain exactly three
boys and seven girls.

When a combination of three boys and seven girls is formed, the number of
different combinations in which three boys can be selected from the 15 available boys
is $\binom{15}{3}$, and the number of different combinations in which seven girls can be selected
from the 30 available girls is $\binom{30}{7}$. Since each of these combinations of three boys
can be paired with each of the combinations of seven girls to form a distinct sample,
the number of combinations containing exactly three boys is $\binom{15}{3}\binom{30}{7}$. Therefore, the
desired probability is

$$p = \frac{\binom{15}{3}\binom{30}{7}}{\binom{45}{10}} = 0.2904.$$
◀
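
The same two-part count works for any number of boys in the sample. The sketch below is illustrative and not part of the original text; the function name `prob_boys` and its default arguments are assumptions chosen to match the class described in the example.

```python
from math import comb


def prob_boys(num_boys_selected, boys=15, girls=30, sample=10):
    """Probability that a random sample contains exactly the given number of boys."""
    # Multiplication rule: choose the boys, then choose the girls.
    favorable = comb(boys, num_boys_selected) * comb(girls, sample - num_boys_selected)
    return favorable / comb(boys + girls, sample)


p = prob_boys(3)
print(round(p, 4))

# The events "exactly b boys" for b = 0, ..., 10 are disjoint and exhaustive.
total = sum(prob_boys(b) for b in range(11))
print(total)
```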


Example 1.8.9. Playing Cards. Suppose that a deck of 52 cards containing four aces is shuffled
thoroughly and the cards are then distributed among four players so that each player
receives 13 cards. We shall determine the probability that each player will receive
one ace.

The number of possible different combinations of the four positions in the deck
occupied by the four aces is $\binom{52}{4}$, and it may be assumed that each of these $\binom{52}{4}$
combinations is equally probable. If each player is to receive one ace, then there
must be exactly one ace among the 13 cards that the first player will receive and one
ace among each of the remaining three groups of 13 cards that the other three players
will receive. In other words, there are 13 possible positions for the ace that the first
player is to receive, 13 other possible positions for the ace that the second player is to
receive, and so on. Therefore, among the $\binom{52}{4}$ possible combinations of the positions
for the four aces, exactly $13^4$ of these combinations will lead to the desired result.
Hence, the probability p that each player will receive one ace is

$$p = \frac{13^4}{\binom{52}{4}} = 0.1055.$$
◀


Ordered versus Unordered Samples. Several of the examples in this section and
the previous section involved counting the numbers of possible samples that could
arise using various sampling schemes. Sometimes we treated the same collection of
elements in different orders as different samples, and sometimes we treated the same
elements in different orders as the same sample. In general, how can one tell which
is the correct way to count in a given problem? Sometimes, the problem description
will make it clear which is needed. For example, if we are asked to find the probability
that the items in a sample arrive in a specified order, then we cannot even specify the
event of interest unless we treat different arrangements of the same items as different
outcomes. Examples 1.8.5 and 1.8.6 illustrate how different problem descriptions can
lead to very different calculations.

However, there are cases in which the problem description does not make it clear 
whether or not one must count the same elements in different orders as different 
outcomes. Indeed, there are some problems that can be solved correctly both ways. 
Example 1.8.9 is one such problem. In that problem, we needed to decide what we 
would call an outcome, and then we needed to count how many outcomes were in the 
whole sample space S and how many were in the event E of interest. In the solution 
presented in Example 1.8.9, we chose as our outcomes the positions in the 52-card 
deck that were occupied by the four aces. We did not count different arrangements 
of the four aces in those four positions as different outcomes when we counted the 
number of outcomes in S. Hence, when we calculated the number of outcomes in E, 
we also did not count the different arrangements of the four aces in the four possible 
positions as different outcomes. In general, this is the principle that should guide the 
choice of counting method. If we have the choice between whether or not to count 
the same elements in different orders as different outcomes, then we need to make 
our choice and be consistent throughout the problem. If we count the same elements 
in different orders as different outcomes when counting the outcomes in S, we must 
do the same when counting the elements of E. If we do not count them as different 
outcomes when counting S, we should not count them as different when counting E. 


Example 1.8.10. Playing Cards, Revisited. We shall solve the problem in Example 1.8.9 again, but this
time, we shall distinguish outcomes with the same cards in different orders. To go
to the extreme, let each outcome be a complete ordering of the 52 cards. So, there
are 52! possible outcomes. How many of these have one ace in each of the four sets
of 13 cards received by the four players? As before, there are $13^4$ ways to choose
the four positions for the four aces, one among each of the four sets of 13 cards. No
matter which of these sets of positions we choose, there are 4! ways to arrange the
four aces in these four positions. No matter how the aces are arranged, there are 48!
ways to arrange the remaining 48 cards in the 48 remaining positions. So, there are
$13^4 \times 4! \times 48!$ outcomes in the event of interest. We then calculate

$$p = \frac{13^4 \times 4! \times 48!}{52!} = 0.1055.$$
◀
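
The unordered count of Example 1.8.9 and the fully ordered count used here can be compared with exact rational arithmetic. This is a minimal Python sketch, not part of the original text; the variable names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb, factorial

# Unordered: outcomes are the sets of four deck positions holding the aces.
p_unordered = Fraction(13**4, comb(52, 4))

# Ordered: outcomes are complete orderings of the 52 cards.
p_ordered = Fraction(13**4 * factorial(4) * factorial(48), factorial(52))

print(p_unordered, p_ordered)
print(p_unordered == p_ordered)
print(round(float(p_unordered), 4))
```

The two fractions reduce to the same value because the factor 4! × 48! multiplies both the count for the event and the count for the whole sample space.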


In the following example, whether one counts the same items in different orders 
as different outcomes is allowed to depend on which events one wishes to use. 


Example 1.8.11. Lottery Tickets. In a lottery game, six numbers from 1 to 30 are drawn at random from
a bin without replacement, and each player buys a ticket with six different numbers
from 1 to 30. If all six numbers drawn match those on the player’s ticket, the player
wins. We assume that all possible draws are equally likely. One way to construct a
sample space for the experiment of drawing the winning combination is to consider
the possible sequences of draws. That is, each outcome consists of an ordered subset
of six numbers chosen from the 30 available numbers. There are $P_{30,6} = 30!/24!$ such
outcomes. With this sample space S, we can calculate probabilities for events such as

A = {the draw contains the numbers 1, 14, 15, 20, 23, and 27},
B = {one of the numbers drawn is 15}, and
C = {the first number drawn is less than 10}.

There is another natural sample space, which we shall denote $S'$, for this experiment.
It consists solely of the different combinations of six numbers drawn from the 30
available. There are $\binom{30}{6} = 30!/(6!\,24!)$ such outcomes. It also seems natural to consider
all of these outcomes equally likely. With this sample space, we can calculate the
probabilities of the events A and B above, but C is not a subset of the sample space
$S'$, so we cannot calculate its probability using this smaller sample space. When the
sample space for an experiment could naturally be constructed in more than one way,
one needs to choose based on the events whose probabilities one wants to compute. ◀


Example 1.8.11 raises the question of whether one will compute the same prob- 
abilities using two different sample spaces when the event, such as A or B, exists 
in both sample spaces. In the example, each outcome in the smaller sample space 
S’ corresponds to an event in the larger sample space S. Indeed, each outcome s’ 
in S’ corresponds to the event in S containing the 6! permutations of the single 
combination $s'$. For example, the event A in the example has only one outcome
$s' = (1, 14, 15, 20, 23, 27)$ in the sample space $S'$, while the corresponding event in
the sample space S has 6! permutations including

(1, 14, 15, 20, 23, 27), (14, 20, 27, 15, 23, 1), (27, 23, 20, 15, 14, 1), etc.

In the sample space S, the probability of the event A is

$$\Pr(A) = \frac{6!}{P_{30,6}} = \frac{6!\,24!}{30!} = \frac{1}{\binom{30}{6}}.$$

In the sample space $S'$, the event A has this same probability because it has only one
of the $\binom{30}{6}$ equally likely outcomes. The same reasoning applies to every outcome in
S’. Hence, if the same event can be expressed in both sample spaces S and S’, we 
will compute the same probability using either sample space. This is a special feature 
of examples like Example 1.8.11 in which each outcome in the smaller sample space 
corresponds to an event in the larger sample space with the same number of elements. 
There are examples in which this feature is not present, and one cannot treat both 
sample spaces as simple sample spaces. 
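
The agreement between the two sample spaces for the lottery events A and B can be checked exactly. The following is a minimal Python sketch, not part of the original text; the variable names and the way event B is counted in each space are illustrative assumptions spelled out in the comments.

```python
from fractions import Fraction
from math import comb, factorial, perm

# Event A: the draw contains six specified numbers.
p_a_ordered = Fraction(factorial(6), perm(30, 6))   # 6! orderings out of P_{30,6}
p_a_unordered = Fraction(1, comb(30, 6))            # one combination out of C(30, 6)

# Event B: one of the numbers drawn is 15.
# Ordered: pick which of the 6 draws is 15, then fill the other 5 draws,
# in order, from the remaining 29 numbers.
p_b_ordered = Fraction(6 * perm(29, 5), perm(30, 6))
# Unordered: 15 together with any 5 of the other 29 numbers.
p_b_unordered = Fraction(comb(29, 5), comb(30, 6))

print(p_a_ordered, p_a_unordered)
print(p_b_ordered, p_b_unordered)
```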


Example 1.8.12. Tossing Coins. An experiment consists of tossing a coin two times. If we want to
distinguish H followed by T from T followed by H, we should use the sample space
S = {HH, HT, TH, TT}, which might naturally be assumed a simple sample space.
On the other hand, we might be interested solely in the number of H’s tossed. In this
case, we might consider the smaller sample space $S' = \{0, 1, 2\}$ where each outcome
merely counts the number of H’s. The outcomes 0 and 2 in $S'$ each correspond to
a single outcome in S, but $1 \in S'$ corresponds to the event $\{HT, TH\} \subset S$ with two
outcomes. If we think of S as a simple sample space, then $S'$ will not be a simple
sample space, because the outcome 1 will have probability 1/2 while the other two
outcomes each have probability 1/4.

There are situations in which one would be justified in treating $S'$ as a simple
sample space and assigning each of its outcomes probability 1/3. One might do this
if one believed that the coin was not fair, but one had no idea how unfair it was or
which side was more likely to land up. In this case, S would not be a simple sample
space, because two of its outcomes would have probability 1/3 and the other two
would have probabilities that add up to 1/3. ◀

Example 1.8.6 is another case of two different sample spaces in which each
outcome in one sample space corresponds to a different number of outcomes in the
other space. See Exercise 12 in Sec. 1.9 for a more complete analysis of Example 1.8.6.


The Tennis Tournament 


We shall now present a difficult problem that has a simple and elegant solution. 
Suppose that n tennis players are entered in a tournament. In the first round, the 
players are paired one against another at random. The loser in each pair is eliminated 
from the tournament, and the winner in each pair continues into the second round. 
If the number of players n is odd, then one player is chosen at random before the 
pairings are made for the first round, and that player automatically continues into 
the second round. All the players in the second round are then paired at random. 
Again, the loser in each pair is eliminated, and the winner in each pair continues 
into the third round. If the number of players in the second round is odd, then one 
of these players is chosen at random before the others are paired, and that player 
automatically continues into the third round. The tournament continues in this way 
until only two players remain in the final round. They then play against each other, 
and the winner of this match is the winner of the tournament. We shall assume that 
all n players have equal ability, and we shall determine the probability p that two 
specific players A and B will ever play against each other during the tournament. 

We shall first determine the total number of matches that will be played during 
the tournament. After each match has been played, one player—the loser of that 
match—is eliminated from the tournament. The tournament ends when everyone 
has been eliminated from the tournament except the winner of the final match. Since 
exactly n — 1 players must be eliminated, it follows that exactly n — 1 matches must 
be played during the tournament. 

The number of possible pairs of players is $\binom{n}{2}$. Each of the two players in every
match is equally likely to win that match, and all initial pairings are made in a random
manner. Therefore, before the tournament begins, every possible pair of players is
equally likely to appear in each particular one of the $n - 1$ matches to be played
during the tournament. Accordingly, the probability that players A and B will meet
in some particular match that is specified in advance is $1/\binom{n}{2}$. If A and B do meet in
that particular match, one of them will lose and be eliminated. Therefore, these same
two players cannot meet in more than one match.

It follows from the preceding explanation that the probability p that players A
and B will meet at some time during the tournament is equal to the product of the
probability $1/\binom{n}{2}$ that they will meet in any particular specified match and the total
number $n - 1$ of different matches in which they might possibly meet. Hence,

$$p = \frac{n-1}{\binom{n}{2}} = \frac{2}{n}.$$
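
The result $p = 2/n$ can be compared with a Monte Carlo estimate. The sketch below is illustrative and not part of the original text; the function names, the choice of players 0 and 1 to stand for A and B, the number of trials, and the random seed are all assumptions. It simulates the tournament as described: random pairings in each round, a randomly chosen bye when the number of players is odd, and equally likely winners.

```python
import random


def players_meet(n, rng, a=0, b=1):
    """Simulate one tournament; return True if players a and b ever play each other."""
    remaining = list(range(n))
    while len(remaining) > 1:
        rng.shuffle(remaining)
        next_round = []
        # With an odd number of players, the last one after shuffling gets a bye.
        if len(remaining) % 2 == 1:
            next_round.append(remaining.pop())
        # Pair off the rest; the shuffle makes the pairing random.
        for i in range(0, len(remaining), 2):
            x, y = remaining[i], remaining[i + 1]
            if {x, y} == {a, b}:
                return True
            # Equal ability: each player wins with probability 1/2.
            next_round.append(x if rng.random() < 0.5 else y)
        remaining = next_round
    return False


def estimate_meeting_probability(n, trials=20000, seed=0):
    """Estimate the probability that two specific players meet, by simulation."""
    rng = random.Random(seed)
    hits = sum(players_meet(n, rng) for _ in range(trials))
    return hits / trials


for n in (3, 8, 13):
    print(n, estimate_meeting_probability(n), 2 / n)
```

The estimates should be close to $2/n$ but will not match it exactly, since they are subject to sampling error.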




Summary 


We showed that the number of size k subsets of a set of size n is $\binom{n}{k} = n!/[k!\,(n-k)!]$.
This turns out to be the number of possible samples of size k drawn without
replacement from a population of size n as well as the number of arrangements of n
items of two types with k of one type and $n - k$ of the other type. We also saw several
examples in which more than one counting technique was required at different points
in the same problem. Sometimes, more than one technique is required to count the
elements of a single set.


Exercises 


1. Two pollsters will canvass a neighborhood with 20 
houses. Each pollster will visit 10 of the houses. How many 
different assignments of pollsters to houses are possible? 


2. Which of the following two numbers is larger: $\binom{93}{30}$ or
$\binom{93}{31}$?


3. Which of the following two numbers is larger: $\binom{93}{30}$ or
$\binom{93}{63}$?


4. A box contains 24 light bulbs, of which four are defective.
If a person selects four bulbs from the box at random,
without replacement, what is the probability that all four
bulbs will be defective?


5. Prove that the following number is an integer:

$$\frac{4155 \times 4156 \times \cdots \times 4250 \times 4251}{2 \times 3 \times \cdots \times 96 \times 97}.$$


6. Suppose that n people are seated in a random manner
in a row of n theater seats. What is the probability that
two particular people A and B will be seated next to each
other?


7. If k people are seated in a random manner in a row 
containing n seats (n > k), what is the probability that the 
people will occupy k adjacent seats in the row? 


8. If k people are seated in a random manner in a circle 
containing n chairs (n > k), what is the probability that the 
people will occupy k adjacent chairs in the circle? 


9. If n people are seated in a random manner in a row 
containing 2n seats, what is the probability that no two 
people will occupy adjacent seats? 


10. A box contains 24 light bulbs, of which two are de- 
fective. If a person selects 10 bulbs at random, without 
replacement, what is the probability that both defective 
bulbs will be selected? 


11. Suppose that a committee of 12 people is selected in 
a random manner from a group of 100 people. Determine 
the probability that two particular people A and B will 
both be selected. 


12. Suppose that 35 people are divided in a random man- 
ner into two teams in such a way that one team contains 
10 people and the other team contains 25 people. What is 
the probability that two particular people A and B will be 
on the same team? 


13. A box contains 24 light bulbs of which four are de- 
fective. If one person selects 10 bulbs from the box in 
a random manner, and a second person then takes the 
remaining 14 bulbs, what is the probability that all four 
defective bulbs will be obtained by the same person? 


14. Prove that, for all positive integers n and k ($n \ge k$),

$$\binom{n}{k} + \binom{n}{k-1} = \binom{n+1}{k}.$$

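
15.

a. Prove that

$$\binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n} = 2^n.$$

b. Prove that

$$\binom{n}{0} - \binom{n}{1} + \binom{n}{2} - \binom{n}{3} + \cdots + (-1)^n \binom{n}{n} = 0.$$

Hint: Use the binomial theorem.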


16. The United States Senate contains two senators from 
each of the 50 states. (a) If a committee of eight senators 
is selected at random, what is the probability that it will 
contain at least one of the two senators from a certain 
specified state? (b) What is the probability that a group 
of 50 senators selected at random will contain one senator 
from each state? 


17. A deck of 52 cards contains four aces. If the cards 
are shuffled and distributed in a random manner to four 
players so that each player receives 13 cards, what is the 
probability that all four aces will be received by the same 
player? 


18. Suppose that 100 mathematics students are divided 
into five classes, each containing 20 students, and that 
awards are to be given to 10 of these students. If each 
student is equally likely to receive an award, what is the 
probability that exactly two students in each class will 
receive awards? 


19. A restaurant has n items on its menu. During a partic- 
ular day, k customers will arrive and each one will choose 
one item. The manager wants to count how many dif- 
ferent collections of customer choices are possible with- 
out regard to the order in which the choices are made. 
(For example, if k = 3 and $a_1, \ldots, a_n$ are the menu items,
then $a_1 a_3 a_1$ is not distinguished from $a_1 a_1 a_3$.) Prove that
the number of different collections of customer choices is
$\binom{n+k-1}{k}$. Hint: Assume that the menu items are $a_1, \ldots, a_n$.
Show that each collection of customer choices, arranged
with the $a_1$’s first, the $a_2$’s second, etc., can be identified
with a sequence of k zeros and $n - 1$ ones, where each 0
stands for a customer choice and each 1 indicates a point
in the sequence where the menu item number increases
by 1. For example, if k = 3 and n = 5, then $a_1 a_1 a_3$ becomes
0011011.


20. Prove the binomial theorem 1.8.2. Hint: You may use 
an induction argument. That is, first prove that the result 
is true if $n = 1$. Then, under the assumption that there is
$n_0$ such that the result is true for all $n \le n_0$, prove that it is
also true for $n = n_0 + 1$.


21. Return to the birthday problem on page 30. How 
many different sets of birthdays are available with k peo- 
ple and 365 days when we don’t distinguish the same 
birthdays in different orders? For example, if k = 3, we 
would count (Jan. 1, Mar. 3, Jan. 1) the same as (Jan. 1,
Jan. 1, Mar. 3). 


22. Let $n$ be a large even integer. Use Stirling’s formula
(Theorem 1.7.5) to find an approximation to the binomial
coefficient $\binom{n}{n/2}$. Compute the approximation with $n = 500$.


1.9 Multinomial Coefficients


We learn how to count the number of ways to partition a finite set into more than 
two disjoint subsets. This generalizes the binomial coefficients from Sec. 1.8. The 
generalization is useful when outcomes consist of several parts selected from a 
fixed number of distinct types. 


We begin with a fairly simple example that will illustrate the general ideas of this 
section. 


Example 1.9.1
Choosing Committees. Suppose that 20 members of an organization are to be divided
into three committees A, B, and C in such a way that each of the committees A and 
B is to have eight members and committee C is to have four members. We shall 
determine the number of different ways in which members can be assigned to these 
committees. Notice that each of the 20 members gets assigned to one and only one 
committee. 

One way to think of the assignments is to form committee A first by choosing its 
eight members and then split the remaining 12 members into committees B and C. 
Each of these operations is choosing a combination, and every choice of committee 
A can be paired with every one of the splits of the remaining 12 members into 
committees B and C. Hence, the number of assignments into three committees is 
the product of the numbers of combinations for the two parts of the assignment. 
Specifically, to form committee A, we must choose eight out of 20 members, and this
can be done in $\binom{20}{8}$ ways. Then to split the remaining 12 members into committees B
and C there are $\binom{12}{8}$ ways to do it. Here, the answer is

\[ \binom{20}{8}\binom{12}{8} = \frac{20!}{8!\,12!} \cdot \frac{12!}{8!\,4!} = \frac{20!}{8!\,8!\,4!} = 62{,}355{,}150. \]

Notice how the 12! that appears in the denominator of $\binom{20}{8}$ divides out with the 12!
that appears in the numerator of $\binom{12}{8}$. This fact is the key to the general formula that
we shall derive next.


In general, suppose that $n$ distinct elements are to be divided into $k$ different
groups ($k \ge 2$) in such a way that, for $j = 1, \ldots, k$, the $j$th group contains exactly
$n_j$ elements, where $n_1 + n_2 + \cdots + n_k = n$. It is desired to determine the number
of different ways in which the $n$ elements can be divided into the $k$ groups. The
$n_1$ elements in the first group can be selected from the $n$ available elements in $\binom{n}{n_1}$
different ways. After the $n_1$ elements in the first group have been selected, the $n_2$
elements in the second group can be selected from the remaining $n - n_1$ elements
in $\binom{n - n_1}{n_2}$ different ways. Hence, the total number of different ways of selecting the
elements for both the first group and the second group is $\binom{n}{n_1}\binom{n - n_1}{n_2}$. After the $n_1 + n_2$
elements in the first two groups have been selected, the number of different ways in
which the $n_3$ elements in the third group can be selected is $\binom{n - n_1 - n_2}{n_3}$. Hence, the total
number of different ways of selecting the elements for the first three groups is

\[ \binom{n}{n_1}\binom{n - n_1}{n_2}\binom{n - n_1 - n_2}{n_3}. \]

It follows from the preceding explanation that, for each $j = 1, \ldots, k - 2$, after
the first $j$ groups have been formed, the number of different ways in which the $n_{j+1}$
elements in the next group ($j + 1$) can be selected from the remaining $n - n_1 - \cdots - n_j$
elements is $\binom{n - n_1 - \cdots - n_j}{n_{j+1}}$. After the elements of group $k - 1$ have been selected,
the remaining $n_k$ elements must then form the last group. Hence, the total number
of different ways of dividing the $n$ elements into the $k$ groups is

\[ \binom{n}{n_1}\binom{n - n_1}{n_2}\binom{n - n_1 - n_2}{n_3} \cdots \binom{n - n_1 - \cdots - n_{k-2}}{n_{k-1}} = \frac{n!}{n_1!\, n_2! \cdots n_k!}, \]


where the last formula follows from writing the binomial coefficients in terms of 
factorials. 
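The derivation above can be checked numerically. The following Python sketch is an illustration only and is not part of the original text; the function names `multinomial_sequential` and `multinomial_factorial` are hypothetical. The first function forms the groups one at a time, as in the argument above, and the second evaluates the factorial formula directly.

```python
from math import comb, factorial, prod

def multinomial_sequential(sizes):
    """Count the divisions group by group, as a product of binomial coefficients."""
    remaining = sum(sizes)
    count = 1
    for size in sizes:
        count *= comb(remaining, size)   # choose this group from the elements left
        remaining -= size
    return count

def multinomial_factorial(sizes):
    """Closed form n! / (n_1! n_2! ... n_k!)."""
    return factorial(sum(sizes)) // prod(factorial(s) for s in sizes)

# The committee example: 20 members split into groups of 8, 8, and 4.
committee_sizes = [8, 8, 4]
print(multinomial_sequential(committee_sizes))
print(multinomial_factorial(committee_sizes))
```

Both calls should print the same count as the committee calculation, since the last binomial factor in the sequential product is always 1.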


Multinomial Coefficients. The number
\[ \frac{n!}{n_1!\, n_2! \cdots n_k!}, \quad \text{which we shall denote by } \binom{n}{n_1, n_2, \ldots, n_k}, \]
is called a multinomial coefficient.


The name multinomial coefficient derives from the appearance of the symbol in the 
multinomial theorem, whose proof is left as Exercise 11 in this section. 


Theorem 1.9.1
Multinomial Theorem. For all numbers $x_1, \ldots, x_k$ and each positive integer $n$,

\[ (x_1 + \cdots + x_k)^n = \sum \binom{n}{n_1, n_2, \ldots, n_k} x_1^{n_1} x_2^{n_2} \cdots x_k^{n_k}, \]

where the summation extends over all possible combinations of nonnegative integers
$n_1, \ldots, n_k$ such that $n_1 + n_2 + \cdots + n_k = n$.


A multinomial coefficient is a generalization of the binomial coefficient discussed 
in Sec. 1.8. For k = 2, the multinomial theorem is the same as the binomial theorem, 
and the multinomial coefficient becomes a binomial coefficient. In particular, 


\[ \binom{n}{k,\, n - k} = \binom{n}{k}. \]
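As a sanity check on the multinomial theorem (not a proof), the right side can be summed term by term for specific numbers. The sketch below is illustrative; the helper names and the test values $x = (2, -3, 5)$, $n = 4$ are assumptions chosen for this example only.

```python
from itertools import product
from math import comb, factorial, prod

def multinomial(counts):
    """n! / (n_1! ... n_k!) for the given group sizes."""
    return factorial(sum(counts)) // prod(factorial(c) for c in counts)

def multinomial_expansion(xs, n):
    """Right side of the multinomial theorem, summed over all (n_1, ..., n_k) with sum n."""
    k = len(xs)
    total = 0
    for counts in product(range(n + 1), repeat=k):
        if sum(counts) == n:
            total += multinomial(counts) * prod(x ** c for x, c in zip(xs, counts))
    return total

xs, n = (2, -3, 5), 4
print(multinomial_expansion(xs, n), sum(xs) ** n)   # the two values should agree

# For k = 2 the multinomial coefficient reduces to a binomial coefficient.
print(multinomial((3, 4)), comb(7, 3))
```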


Example 1.9.2
Choosing Committees. In Example 1.9.1, we see that the solution obtained there is the
same as the multinomial coefficient for which $n = 20$, $k = 3$, $n_1 = n_2 = 8$, and $n_3 = 4$,
namely,

\[ \binom{20}{8, 8, 4} = \frac{20!}{(8!)^2\, 4!} = 62{,}355{,}150. \]


Arrangements of Elements of More Than Two Distinct Types


Just as binomial coefficients can be used to represent the number of different
arrangements of the elements of a set containing elements of only two distinct types,
multinomial coefficients can be used to represent the number of different arrangements
of the elements of a set containing elements of $k$ different types ($k \ge 2$). Suppose,
for example, that $n$ balls of $k$ different colors are to be arranged in a row and that
there are $n_j$ balls of color $j$ ($j = 1, \ldots, k$), where $n_1 + n_2 + \cdots + n_k = n$. Then each
different arrangement of the $n$ balls corresponds to a different way of dividing the $n$
available positions in the row into a group of $n_1$ positions to be occupied by the balls
of color 1, a second group of $n_2$ positions to be occupied by the balls of color 2, and
so on. Hence, the total number of different possible arrangements of the $n$ balls must be

\[ \binom{n}{n_1, n_2, \ldots, n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!}. \]
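For a small row of balls, the arrangement count can be confirmed by listing every ordering. The sketch below is an illustration under assumed data: the row `"RRRGGB"` (three red, two green, one blue) is a hypothetical example, and the function names are not from the text.

```python
from itertools import permutations
from math import factorial, prod

def count_arrangements_brute_force(balls):
    """Count distinct orderings of a row of colored balls by listing them all."""
    return len(set(permutations(balls)))

def count_arrangements_formula(balls):
    """n! / (n_1! ... n_k!), where n_j is the number of balls of color j."""
    counts = [balls.count(color) for color in set(balls)]
    return factorial(len(balls)) // prod(factorial(c) for c in counts)

row = "RRRGGB"   # hypothetical example: 3 red, 2 green, 1 blue
print(count_arrangements_brute_force(row), count_arrangements_formula(row))
```

Brute-force listing is practical only for short rows, because the number of orderings examined grows like $n!$.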


Example 1.9.3
Rolling Dice. Suppose that 12 dice are to be rolled. We shall determine the probability
$p$ that each of the six different numbers will appear twice.

Each outcome in the sample space $S$ can be regarded as an ordered sequence
of 12 numbers, where the $i$th number in the sequence is the outcome of the $i$th roll.
Hence, there will be $6^{12}$ possible outcomes in $S$, and each of these outcomes can
be regarded as equally probable. The number of these outcomes that would contain
each of the six numbers 1, 2, ..., 6 exactly twice will be equal to the number of
different possible arrangements of these 12 elements. This number can be determined
by evaluating the multinomial coefficient for which $n = 12$, $k = 6$, and
$n_1 = n_2 = \cdots = n_6 = 2$. Hence, the number of such outcomes is

\[ \binom{12}{2, 2, 2, 2, 2, 2} = \frac{12!}{(2!)^6}, \]

and the required probability $p$ is

\[ p = \frac{12!}{2^6\, 6^{12}} \approx 0.0034. \]
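The arithmetic in this example is easy to reproduce. The following Python sketch, offered only as an illustration with hypothetical variable names, evaluates the count of favorable outcomes and the probability exactly before converting to a decimal.

```python
from fractions import Fraction
from math import factorial

favorable = factorial(12) // factorial(2) ** 6   # arrangements with each face twice
total = 6 ** 12                                  # all ordered outcomes of 12 rolls
p = Fraction(favorable, total)
print(favorable, total, float(p))
```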


Example 1.9.4
Playing Cards. A deck of 52 cards contains 13 hearts. Suppose that the cards are
shuffled and distributed among four players A, B, C, and D so that each player 
receives 13 cards. We shall determine the probability p that player A will receive 
six hearts, player B will receive four hearts, player C will receive two hearts, and 
player D will receive one heart. 

The total number $N$ of different ways in which the 52 cards can be distributed
among the four players so that each player receives 13 cards is

\[ N = \binom{52}{13, 13, 13, 13} = \frac{52!}{(13!)^4}. \]

It may be assumed that each of these ways is equally probable. We must now calculate
the number $M$ of ways of distributing the cards so that each player receives the
required number of hearts. The number of different ways in which the hearts can
be distributed to players A, B, C, and D so that the numbers of hearts they receive
are 6, 4, 2, and 1, respectively, is

\[ \binom{13}{6, 4, 2, 1} = \frac{13!}{6!\,4!\,2!\,1!}. \]

Also, the number of different ways in which the other 39 cards can then be distributed
to the four players so that each will have a total of 13 cards is

\[ \binom{39}{7, 9, 11, 12} = \frac{39!}{7!\,9!\,11!\,12!}. \]

Therefore,

\[ M = \frac{13!}{6!\,4!\,2!\,1!} \cdot \frac{39!}{7!\,9!\,11!\,12!}, \]

and the required probability $p$ is

\[ p = \frac{M}{N} = \frac{13!\,39!\,(13!)^4}{6!\,4!\,2!\,1!\,7!\,9!\,11!\,12!\,52!} = 0.00196. \]

There is another approach to this problem along the lines indicated in Example 1.8.9
on page 37. The number of possible different combinations of the 13 positions
in the deck occupied by the hearts is $\binom{52}{13}$. If player A is to receive six hearts,
there are $\binom{13}{6}$ possible combinations of the six positions these hearts occupy among
the 13 cards that A will receive. Similarly, if player B is to receive four hearts, there
are $\binom{13}{4}$ possible combinations of their positions among the 13 cards that B will
receive. There are $\binom{13}{2}$ possible combinations for player C, and there are $\binom{13}{1}$ possible
combinations for player D. Hence,

\[ p = \frac{\binom{13}{6}\binom{13}{4}\binom{13}{2}\binom{13}{1}}{\binom{52}{13}}, \]

which produces the same value as the one obtained by the first method of solution.
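That the two methods agree can be confirmed with exact rational arithmetic. The sketch below is illustrative only; the helper `multinomial` and the variable names are assumptions introduced for this check.

```python
from fractions import Fraction
from math import comb, factorial, prod

def multinomial(counts):
    """n! / (n_1! ... n_k!) for the given group sizes."""
    return factorial(sum(counts)) // prod(factorial(c) for c in counts)

# First method: deal the hearts and the non-hearts separately.
N = multinomial([13, 13, 13, 13])
M = multinomial([6, 4, 2, 1]) * multinomial([7, 9, 11, 12])
p_first = Fraction(M, N)

# Second method: choose the positions of the 13 hearts among the 52 positions.
p_second = Fraction(
    comb(13, 6) * comb(13, 4) * comb(13, 2) * comb(13, 1),
    comb(52, 13),
)

print(p_first == p_second, float(p_first))
```

Using `Fraction` avoids rounding error, so the equality test compares the two answers exactly rather than to a few decimal places.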
< 


Summary 


Multinomial coefficients generalize binomial coefficients. The coefficient $\binom{n}{n_1, \ldots, n_k}$ is
the number of ways to partition a set of $n$ items into distinguishable subsets of sizes
$n_1, \ldots, n_k$ where $n_1 + \cdots + n_k = n$. It is also the number of arrangements of $n$ items
of $k$ different types for which $n_i$ are of type $i$ for $i = 1, \ldots, k$. Example 1.9.4 illustrates
another important point to remember about computing probabilities: There might
be more than one correct method for computing the same probability.


Exercises 


1. Three pollsters will canvass a neighborhood with 21
houses. Each pollster will visit seven of the houses. How 
many different assignments of pollsters to houses are pos- 
sible? 


2. Suppose that 18 red beads, 12 yellow beads, eight blue 
beads, and 12 black beads are to be strung in a row. How 
many different arrangements of the colors can be formed? 


3. Suppose that two committees are to be formed in an
organization that has 300 members. If one committee is
to have five members and the other committee is to have
eight members, in how many different ways can these
committees be selected?


4. If the letters s, s, s, t, t, t, i, i, a, c are arranged in a
random order, what is the probability that they will spell 
the word “statistics”? 


5. Suppose that $n$ balanced dice are rolled. Determine the
probability that the number $j$ will appear exactly $n_j$ times
($j = 1, \ldots, 6$), where $n_1 + n_2 + \cdots + n_6 = n$.


6. If seven balanced dice are rolled, what is the probability
that each of the six different numbers will appear at least 
once? 


7. Suppose that a deck of 25 cards contains 12 red cards. 
Suppose also that the 25 cards are distributed in a random 
manner to three players A, B, and C in such a way that 
player A receives 10 cards, player B receives eight cards, 
and player C receives seven cards. Determine the proba- 
bility that player A will receive six red cards, player B will 
receive two red cards, and player C will receive four red 
cards. 


8. A deck of 52 cards contains 12 picture cards. If the 
52 cards are distributed in a random manner among four 
players in such a way that each player receives 13 cards, 
what is the probability that each player will receive three 
picture cards? 


9. Suppose that a deck of 52 cards contains 13 red cards, 
13 yellow cards, 13 blue cards, and 13 green cards. If the 
52 cards are distributed in a random manner among four 
players in such a way that each player receives 13 cards, 
what is the probability that each player will receive 13 
cards of the same color? 


10. Suppose that two boys named Davis, three boys 
named Jones, and four boys named Smith are seated at 
random in a row containing nine seats. What is the prob- 
ability that the Davis boys will occupy the first two seats 
in the row, the Jones boys will occupy the next three seats, 
and the Smith boys will occupy the last four seats? 


11. Prove the multinomial theorem 1.9.1. (You may wish 
to use the same hint as in Exercise 20 in Sec. 1.8.) 


12. Return to Example 1.8.6. Let $S$ be the larger sample
space (first method of choosing) and let $S'$ be the smaller
sample space (second method). For each element $s'$ of $S'$,
let $N(s')$ stand for the number of elements of $S$ that lead to
the same boxful $s'$ when the order of choosing is ignored.

a. For each $s' \in S'$, find a formula for $N(s')$. Hint: Let
$n_i$ stand for the number of items of type $i$ in $s'$ for
$i = 1, \ldots, 7$.

b. Verify that $\sum_{s' \in S'} N(s')$ equals the number of
outcomes in $S$.


1.10 The Probability of a Union of Events 


The axioms of probability tell us directly how to find the probability of the union 
of disjoint events. Theorem 1.5.7 showed how to find the probability for the union 
of two arbitrary events. This theorem is generalized to the union of an arbitrary
finite collection of events.


We shall now consider again an arbitrary sample space S that may contain either a 
finite number of outcomes or an infinite number, and we shall develop some further 
general properties of the various probabilities that might be specified for the events 
in $S$. In this section, we shall study in particular the probability of the union $\bigcup_{i=1}^{n} A_i$
of $n$ events $A_1, \ldots, A_n$.

If the events $A_1, \ldots, A_n$ are disjoint, we know that

\[ \Pr\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} \Pr(A_i). \]

Furthermore, for every two events $A_1$ and $A_2$, regardless of whether or not they are
disjoint, we know from Theorem 1.5.7 of Sec. 1.5 that

\[ \Pr(A_1 \cup A_2) = \Pr(A_1) + \Pr(A_2) - \Pr(A_1 \cap A_2). \]

In this section, we shall extend this result, first to three events and then to an arbitrary
finite number of events.


The Union of Three Events 


Theorem 1.10.1

For every three events $A_1$, $A_2$, and $A_3$,

\[
\begin{aligned}
\Pr(A_1 \cup A_2 \cup A_3) ={} & \Pr(A_1) + \Pr(A_2) + \Pr(A_3) \\
& - \left[\Pr(A_1 \cap A_2) + \Pr(A_2 \cap A_3) + \Pr(A_1 \cap A_3)\right] \\
& + \Pr(A_1 \cap A_2 \cap A_3).
\end{aligned}
\tag{1.10.1}
\]

Proof By the associative property of unions (Theorem 1.4.6), we can write

\[ A_1 \cup A_2 \cup A_3 = (A_1 \cup A_2) \cup A_3. \]

Apply Theorem 1.5.7 to the two events $A = A_1 \cup A_2$ and $B = A_3$ to obtain

\[ \Pr(A_1 \cup A_2 \cup A_3) = \Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B). \tag{1.10.2} \]

We next compute the three probabilities on the far right side of (1.10.2) and combine
them to get (1.10.1). First, apply Theorem 1.5.7 to the two events $A_1$ and $A_2$ to obtain

\[ \Pr(A) = \Pr(A_1) + \Pr(A_2) - \Pr(A_1 \cap A_2). \tag{1.10.3} \]

Next, use the first distributive property in Theorem 1.4.10 to write

\[ A \cap B = (A_1 \cup A_2) \cap A_3 = (A_1 \cap A_3) \cup (A_2 \cap A_3). \tag{1.10.4} \]

Apply Theorem 1.5.7 to the events on the far right side of (1.10.4) to obtain

\[ \Pr(A \cap B) = \Pr(A_1 \cap A_3) + \Pr(A_2 \cap A_3) - \Pr(A_1 \cap A_2 \cap A_3). \tag{1.10.5} \]

Substitute (1.10.3), $\Pr(B) = \Pr(A_3)$, and (1.10.5) into (1.10.2) to complete the proof.


Example 1.10.1


Student Enrollment. Among a group of 200 students, 137 students are enrolled in a 
mathematics class, 50 students are enrolled in a history class, and 124 students are 
enrolled in a music class. Furthermore, the number of students enrolled in both the 
mathematics and history classes is 33, the number enrolled in both the history and 
music classes is 29, and the number enrolled in both the mathematics and music 
classes is 92. Finally, the number of students enrolled in all three classes is 18. We 
shall determine the probability that a student selected at random from the group of 
200 students will be enrolled in at least one of the three classes. 

Let $A_1$ denote the event that the selected student is enrolled in the mathematics
class, let $A_2$ denote the event that he is enrolled in the history class, and let $A_3$
denote the event that he is enrolled in the music class. To solve the problem, we
must determine the value of $\Pr(A_1 \cup A_2 \cup A_3)$. From the given numbers,

\[ \Pr(A_1) = \frac{137}{200}, \quad \Pr(A_2) = \frac{50}{200}, \quad \Pr(A_3) = \frac{124}{200}, \]
\[ \Pr(A_1 \cap A_2) = \frac{33}{200}, \quad \Pr(A_2 \cap A_3) = \frac{29}{200}, \quad \Pr(A_1 \cap A_3) = \frac{92}{200}, \]
\[ \Pr(A_1 \cap A_2 \cap A_3) = \frac{18}{200}. \]

It follows from Eq. (1.10.1) that $\Pr(A_1 \cup A_2 \cup A_3) = 175/200 = 7/8$.
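The substitution into Eq. (1.10.1) amounts to a short calculation with the enrollment counts. The sketch below simply repeats that arithmetic; the variable names are hypothetical and chosen for this illustration.

```python
from fractions import Fraction

total = 200
single = [137, 50, 124]   # mathematics, history, music
pairs = [33, 29, 92]      # math & history, history & music, math & music
triple = 18               # all three classes

# Eq. (1.10.1), applied to counts instead of probabilities.
at_least_one = sum(single) - sum(pairs) + triple
print(at_least_one, Fraction(at_least_one, total))
```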


The Union of a Finite Number of Events 


A result similar to Theorem 1.10.1 holds for any finite number of events, as
shown by the following theorem.

Theorem 1.10.2

For every $n$ events $A_1, \ldots, A_n$,

\[
\begin{aligned}
\Pr\left(\bigcup_{i=1}^{n} A_i\right) ={} & \sum_{i=1}^{n} \Pr(A_i) - \sum_{i<j} \Pr(A_i \cap A_j) + \sum_{i<j<k} \Pr(A_i \cap A_j \cap A_k) \\
& - \sum_{i<j<k<l} \Pr(A_i \cap A_j \cap A_k \cap A_l) + \cdots \\
& + (-1)^{n+1} \Pr(A_1 \cap A_2 \cap \cdots \cap A_n).
\end{aligned}
\tag{1.10.6}
\]


Proof The proof proceeds by induction. In particular, we first establish that (1.10.6)
is true for $n = 1$ and $n = 2$. Next, we show that if there exists $m$ such that (1.10.6) is
true for all $n \le m$, then (1.10.6) is also true for $n = m + 1$. The case of $n = 1$ is trivial,
and the case of $n = 2$ is Theorem 1.5.7. To complete the proof, assume that (1.10.6)
is true for all $n \le m$. Let $A_1, \ldots, A_{m+1}$ be events. Define $A = \bigcup_{i=1}^{m} A_i$ and $B = A_{m+1}$.
Theorem 1.5.7 says that

\[ \Pr\left(\bigcup_{i=1}^{m+1} A_i\right) = \Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B). \tag{1.10.7} \]

We have assumed that $\Pr(A)$ equals (1.10.6) with $n = m$. We need to show that when
we add $\Pr(A)$ to $\Pr(B) - \Pr(A \cap B)$, we get (1.10.6) with $n = m + 1$. The difference
between (1.10.6) with $n = m + 1$ and $\Pr(A)$ is all of the terms in which one of the
subscripts ($i$, $j$, $k$, etc.) equals $m + 1$. Those terms are the following:

\[
\begin{aligned}
& \Pr(A_{m+1}) - \sum_{i=1}^{m} \Pr(A_i \cap A_{m+1}) + \sum_{i<j} \Pr(A_i \cap A_j \cap A_{m+1}) \\
& \quad - \sum_{i<j<k} \Pr(A_i \cap A_j \cap A_k \cap A_{m+1}) + \cdots \\
& \quad + (-1)^{m+2} \Pr(A_1 \cap A_2 \cap \cdots \cap A_m \cap A_{m+1}).
\end{aligned}
\tag{1.10.8}
\]

The first term in (1.10.8) is $\Pr(B) = \Pr(A_{m+1})$. All that remains is to show that
$-\Pr(A \cap B)$ equals all but the first term in (1.10.8).

Use the natural generalization of the distributive property (Theorem 1.4.10) to
write

\[ A \cap B = \left(\bigcup_{i=1}^{m} A_i\right) \cap A_{m+1} = \bigcup_{i=1}^{m} (A_i \cap A_{m+1}). \tag{1.10.9} \]

The union in (1.10.9) contains $m$ events, and hence we can apply (1.10.6) with $n = m$
and each $A_i$ replaced by $A_i \cap A_{m+1}$. The result is that $-\Pr(A \cap B)$ equals all but the
first term in (1.10.8).


The calculation in Theorem 1.10.2 can be outlined as follows: First, take the
sum of the probabilities of the $n$ individual events. Second, subtract the sum of the
probabilities of the intersections of all possible pairs of events; in this step, there
will be $\binom{n}{2}$ different pairs for which the probabilities are included. Third, add the
probabilities of the intersections of all possible groups of three of the events; there
will be $\binom{n}{3}$ intersections of this type. Fourth, subtract the sum of the probabilities
of the intersections of all possible groups of four of the events; there will be $\binom{n}{4}$
intersections of this type. Continue in this way until, finally, the probability of the
intersection of all $n$ events is either added or subtracted, depending on whether $n$ is
an odd number or an even number.
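The outline above translates directly into a short program when the sample space is finite and all outcomes are equally likely. The following Python sketch is one possible illustration, not the text's own method; the function names and the four events (multiples of 2, 3, 4, and 5 among the outcomes 1 to 12) are hypothetical choices made for this example.

```python
from fractions import Fraction
from itertools import combinations

def prob(event, sample_space):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

def union_probability(events, sample_space):
    """Evaluate the right side of Eq. (1.10.6) term by term."""
    n = len(events)
    total = Fraction(0)
    for r in range(1, n + 1):
        sign = 1 if r % 2 == 1 else -1          # add odd-sized groups, subtract even-sized
        for group in combinations(events, r):   # all groups of r of the n events
            total += sign * prob(frozenset.intersection(*group), sample_space)
    return total

# Hypothetical events on the outcomes 1..12 (not taken from the text).
S = frozenset(range(1, 13))
events = [frozenset(x for x in S if x % d == 0) for d in (2, 3, 4, 5)]
direct = prob(frozenset.union(*events), S)
print(union_probability(events, S), direct)
```

The number of terms grows like $2^n$, so computing the union directly is usually cheaper when the events are given explicitly; the formula is most useful when the intersection probabilities are easy to find, as in the matching problem.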


The Matching Problem


Suppose that all the cards in a deck of $n$ different cards are placed in a row, and that
the cards in another similar deck are then shuffled and placed in a row on top of the
cards in the original deck. It is desired to determine the probability $p_n$ that there
will be at least one match between the corresponding cards from the two decks. The
same problem can be expressed in various entertaining contexts. For example, we
could suppose that a person types $n$ letters, types the corresponding addresses on $n$
envelopes, and then places the $n$ letters in the $n$ envelopes in a random manner. It
could be desired to determine the probability $p_n$ that at least one letter will be placed
in the correct envelope. As another example, we could suppose that the photographs
of $n$ famous film actors are paired in a random manner with $n$ photographs of the
same actors taken when they were babies. It could then be desired to determine the
probability $p_n$ that the photograph of at least one actor will be paired correctly with
this actor’s own baby photograph.

Here we shall discuss this matching problem in the context of letters being placed
in envelopes. Thus, we shall let $A_i$ be the event that letter $i$ is placed in the correct
envelope ($i = 1, \ldots, n$), and we shall determine the value of $p_n = \Pr\left(\bigcup_{i=1}^{n} A_i\right)$ by
using Eq. (1.10.6). Since the letters are placed in the envelopes at random, the
probability $\Pr(A_i)$ that any particular letter will be placed in the correct envelope
is $1/n$. Therefore, the value of the first summation on the right side of Eq. (1.10.6) is

\[ \sum_{i=1}^{n} \Pr(A_i) = n \cdot \frac{1}{n} = 1. \]

Furthermore, since letter 1 could be placed in any one of $n$ envelopes and letter
2 could then be placed in any one of the other $n - 1$ envelopes, the probability
$\Pr(A_1 \cap A_2)$ that both letter 1 and letter 2 will be placed in the correct envelopes
is $1/[n(n - 1)]$. Similarly, the probability $\Pr(A_i \cap A_j)$ that any two specific letters $i$
and $j$ ($i \ne j$) will both be placed in the correct envelopes is $1/[n(n - 1)]$. Therefore,
the value of the second summation on the right side of Eq. (1.10.6) is

\[ \sum_{i<j} \Pr(A_i \cap A_j) = \binom{n}{2} \frac{1}{n(n - 1)} = \frac{1}{2!}. \]

By similar reasoning, it can be determined that the probability $\Pr(A_i \cap A_j \cap A_k)$
that any three specific letters $i$, $j$, and $k$ ($i < j < k$) will be placed in the correct
envelopes is $1/[n(n - 1)(n - 2)]$. Therefore, the value of the third summation is

\[ \sum_{i<j<k} \Pr(A_i \cap A_j \cap A_k) = \binom{n}{3} \frac{1}{n(n - 1)(n - 2)} = \frac{1}{3!}. \]

This procedure can be continued until it is found that the probability
$\Pr(A_1 \cap A_2 \cap \cdots \cap A_n)$ that all $n$ letters will be placed in the correct envelopes is $1/(n!)$. It now
follows from Eq. (1.10.6) that the probability $p_n$ that at least one letter will be placed
in the correct envelope is

\[ p_n = 1 - \frac{1}{2!} + \frac{1}{3!} - \frac{1}{4!} + \cdots + (-1)^{n+1} \frac{1}{n!}. \tag{1.10.10} \]

This probability has the following interesting features. As $n \to \infty$, the value of
$p_n$ approaches the following limit:

\[ \lim_{n \to \infty} p_n = 1 - \frac{1}{2!} + \frac{1}{3!} - \frac{1}{4!} + \cdots. \]

It is shown in books on elementary calculus that the sum of the infinite series on
the right side of this equation is $1 - (1/e)$, where $e = 2.71828\ldots$. Hence, $1 - (1/e) =
0.63212\ldots$. It follows that for a large value of $n$, the probability $p_n$ that at least one
letter will be placed in the correct envelope is approximately 0.63212.

The exact values of $p_n$, as given in Eq. (1.10.10), will form an oscillating sequence
as $n$ increases. As $n$ increases through the even integers 2, 4, 6, ..., the values of $p_n$
will increase toward the limiting value 0.63212; and as $n$ increases through the odd
integers 3, 5, 7, ..., the values of $p_n$ will decrease toward this same limiting value.

The values of $p_n$ converge to the limit very rapidly. In fact, for $n = 7$ the exact
value $p_7$ and the limiting value of $p_n$ agree to four decimal places. Hence, regardless
of whether seven letters are placed at random in seven envelopes or seven million
letters are placed at random in seven million envelopes, the probability that at least
one letter will be placed in the correct envelope is 0.6321.
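The behavior of $p_n$ described above can be observed directly for small $n$. The sketch below is an illustration only; the function names are hypothetical. It evaluates Eq. (1.10.10) exactly and compares it with a brute-force count over all $n!$ placements of letters in envelopes.

```python
from fractions import Fraction
from itertools import permutations
from math import e, factorial

def p_formula(n):
    """Eq. (1.10.10): 1 - 1/2! + 1/3! - ... + (-1)^(n+1) / n!."""
    return sum(Fraction((-1) ** (r + 1), factorial(r)) for r in range(1, n + 1))

def p_brute_force(n):
    """Fraction of the n! placements with at least one letter in its own envelope."""
    hits = sum(
        any(envelope == letter for letter, envelope in enumerate(perm))
        for perm in permutations(range(n))
    )
    return Fraction(hits, factorial(n))

for n in range(1, 8):
    assert p_formula(n) == p_brute_force(n)
    print(n, float(p_formula(n)))
print("limit:", 1 - 1 / e)
```

The brute-force count is feasible only for small $n$; it is included here as an independent check on the formula, not as a practical method.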




Summary 


We generalized the formula for the probability of the union of two arbitrary events
to the union of finitely many events. As an aside, there are cases in which it is
easier to compute $\Pr(A_1 \cup \cdots \cup A_n)$ as $1 - \Pr(A_1^c \cap \cdots \cap A_n^c)$ using the fact that

\[ (A_1 \cup \cdots \cup A_n)^c = A_1^c \cap \cdots \cap A_n^c. \]


Exercises 


1. Three players are each dealt, in a random manner, five 
cards from a deck containing 52 cards. Four of the 52 
cards are aces. Find the probability that at least one person 
receives exactly two aces in their five cards. 


2. In a certain city, three newspapers A, B, and C are 
published. Suppose that 60 percent of the families in the 
city subscribe to newspaper A, 40 percent of the families 
subscribe to newspaper B, and 30 percent subscribe to 
newspaper C. Suppose also that 20 percent of the families 
subscribe to both A and B, 10 percent subscribe to both 
A and C, 20 percent subscribe to both B and C, and 5 
percent subscribe to all three newspapers A, B, and C. 
What percentage of the families in the city subscribe to at 
least one of the three newspapers? 


3. For the conditions of Exercise 2, what percentage of 
the families in the city subscribe to exactly one of the three 
newspapers? 


4. Suppose that three compact discs are removed from 
their cases, and that after they have been played, they are 
put back into the three empty cases in a random manner. 
Determine the probability that at least one of the CDs
will be put back into its proper case.


5. Suppose that four guests check their hats when they
arrive at a restaurant, and that these hats are returned to
them in a random order when they leave. Determine the
probability that no guest will receive the proper hat.


6. A box contains 30 red balls, 30 white balls, and 30 blue 
balls. If 10 balls are selected at random, without replace- 
ment, what is the probability that at least one color will be 
missing from the selection? 


7. Suppose that a school band contains 10 students from 
the freshman class, 20 students from the sophomore class, 
30 students from the junior class, and 40 students from the 
senior class. If 15 students are selected at random from 
the band, what is the probability that at least one student 
will be selected from each of the four classes? Hint: First 
determine the probability that at least one of the four 
classes will not be represented in the selection. 


8. If n letters are placed at random in n envelopes, what 
is the probability that exactly n — 1 letters will be placed 
in the correct envelopes? 


9. Suppose that $n$ letters are placed at random in $n$ envelopes,
and let $q_n$ denote the probability that no letter is
placed in the correct envelope. For which of the following
four values of $n$ is $q_n$ largest: $n = 10$, $n = 21$, $n = 53$, or
$n = 300$?


10. If three letters are placed at random in three envelopes,
what is the probability that exactly one letter will
be placed in the correct envelope?


11. Suppose that 10 cards, of which five are red and five
are green, are placed at random in 10 envelopes, of which
five are red and five are green. Determine the probability
that exactly $x$ envelopes will contain a card with a matching
color ($x = 0, 1, \ldots, 10$).


12. Let $A_1, A_2, \ldots$ be an infinite sequence of events such
that $A_1 \subset A_2 \subset \cdots$. Prove that
\[ \Pr\left(\bigcup_{i=1}^{\infty} A_i\right) = \lim_{n \to \infty} \Pr(A_n). \]
Hint: Let the sequence $B_1, B_2, \ldots$ be defined as in Exercise 12
of Sec. 1.5, and show that
\[ \Pr\left(\bigcup_{i=1}^{\infty} A_i\right) = \lim_{n \to \infty} \Pr\left(\bigcup_{i=1}^{n} B_i\right) = \lim_{n \to \infty} \Pr(A_n). \]


13. Let $A_1, A_2, \ldots$ be an infinite sequence of events such
that $A_1 \supset A_2 \supset \cdots$. Prove that
\[ \Pr\left(\bigcap_{i=1}^{\infty} A_i\right) = \lim_{n \to \infty} \Pr(A_n). \]
Hint: Consider the sequence $A_1^c, A_2^c, \ldots$, and apply Exercise 12.

1.11 Statistical Swindles


This section presents some examples of how one can be misled by arguments that 
require one to ignore the calculus of probability. 


Misleading Use of Statistics 


The field of statistics has a poor image in the minds of many people because there is 
a widespread belief that statistical data and statistical analyses can easily be manip- 
ulated in an unscientific and unethical fashion in an effort to show that a particular 
conclusion or point of view is correct. We all have heard the sayings that “There 
are three kinds of lies: lies, damned lies, and statistics” (Mark Twain [1924, p. 246] 
says that this line has been attributed to Benjamin Disraeli) and that “you can prove 
anything with statistics.” 

One benefit of studying probability and statistics is that the knowledge we gain 
enables us to analyze statistical arguments that we read in newspapers, magazines, 
or elsewhere. We can then evaluate these arguments on their merits, rather than 
accepting them blindly. In this section, we shall describe three schemes that have been 
used to induce consumers to send money to the operators of the schemes in exchange 
for certain types of information. The first two schemes are not strictly statistical in 
nature, but they are strongly based on undertones of probability. 


Perfect Forecasts 


Suppose that one Monday morning you receive in the mail a letter from a firm 
with which you are not familiar, stating that the firm sells forecasts about the stock 
market for very high fees. To indicate the firm’s ability in forecasting, it predicts that a 
particular stock, or a particular portfolio of stocks, will rise in value during the coming 
week. You do not respond to this letter, but you do watch the stock market during the 
week and notice that the prediction was correct. On the following Monday morning 
you receive another letter from the same firm containing another prediction, this one 
specifying that a particular stock will drop in value during the coming week. Again 
the prediction proves to be correct. 


This routine continues for seven weeks. Every Monday morning you receive a 
prediction in the mail from the firm, and each of these seven predictions proves to 
be correct. On the eighth Monday morning, you receive another letter from the firm. 
This letter states that for a large fee the firm will provide another prediction, on 
the basis of which you can presumably make a large amount of money on the stock 
market. How should you respond to this letter? 

Since the firm has made seven successive correct predictions, it would seem that 
it must have some special information about the stock market and is not simply 
guessing. After all, the probability of correctly guessing the outcomes of seven 
successive tosses of a fair coin is only $(1/2)^7 \approx 0.008$. Hence, if the firm had only been
guessing each week, then the firm had a probability less than 0.01 of being correct 
seven weeks in a row. 

The fallacy here is that you may have seen only a relatively small number of the 
forecasts that the firm made during the seven-week period. Suppose, for example, 
that the firm started the entire process with a list of $2^7 = 128$ potential clients. On
the first Monday, the firm could send the forecast that a particular stock will rise in 
value to half of these clients and send the forecast that the same stock will drop in 
value to the other half. On the second Monday, the firm could continue writing to 
those 64 clients for whom the first forecast proved to be correct. It could again send 
a new forecast to half of those 64 clients and the opposite forecast to the other half. 
At the end of seven weeks, the firm (which usually consists of only one person and a 
computer) must necessarily have one client (and only one client) for whom all seven 
forecasts were correct. 

By following this procedure with several different groups of 128 clients, and 
starting new groups each week, the firm may be able to generate enough positive 
responses from clients for it to realize significant profits. 
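The halving scheme just described is simple enough to simulate. The following Python sketch is an illustration under stated assumptions: the function name `run_scheme` is hypothetical, and each week's market movement is modeled as a fair coin flip, which is an assumption made only for this example. Whatever the market does, exactly one of the 128 clients sees seven correct forecasts.

```python
import random

def run_scheme(num_clients=128, weeks=7, seed=0):
    """Simulate the mailing scheme; return the clients who saw only correct forecasts."""
    rng = random.Random(seed)
    survivors = list(range(num_clients))
    for _ in range(weeks):
        half = len(survivors) // 2
        told_rise, told_drop = survivors[:half], survivors[half:]
        stock_rose = rng.random() < 0.5   # assumption: the market move is a fair coin flip
        # Only the clients who received the correct forecast are written to again.
        survivors = told_rise if stock_rose else told_drop
    return survivors

for seed in range(5):
    print(seed, run_scheme(seed=seed))
```

The identity of the surviving client depends on the market, but the number of survivors does not, which is the point of the scheme.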


Guaranteed Winners 


There is another scheme that is somewhat related to the one just described but that is 
even more elegant because of its simplicity. In this scheme, a firm advertises that for 
a fixed fee, usually 10 or 20 dollars, it will send the client its forecast of the winner of 
any upcoming baseball game, football game, boxing match, or other sports event that 
the client might specify. Furthermore, the firm offers a money-back guarantee that 
this forecast will be correct; that is, if the team or person designated as the winner in 
the forecast does not actually turn out to be the winner, the firm will return the full 
fee to the client. 

How should you react to such an advertisement? At first glance, it would appear 
that the firm must have some special knowledge about these sports events, because 
otherwise it could not afford to guarantee its forecasts. Further reflection reveals, 
however, that the firm simply cannot lose, because its only expenses are those for 
advertising and postage. In effect, when this scheme is used, the firm holds the client’s 
fee until the winner has been decided. If the forecast was correct, the firm keeps the 
fee; otherwise, it simply returns the fee to the client. 

On the other hand, the client can very well lose. He presumably purchases the 
firm’s forecast because he desires to bet on the sports event. If the forecast proves to 
be wrong, the client will not have to pay any fee to the firm, but he will have lost any 
money that he bet on the predicted winner. 

Thus, when there are “guaranteed winners,” only the firm is guaranteed to win. 
In fact, the firm knows that it will be able to keep the fees from all the clients for 
whom the forecasts were correct. 

Improving Your Lottery Chances 


State lotteries have become very popular in America. People spend millions of 
dollars each week to purchase tickets with very small chances of winning medium 
to enormous prizes. With so much money being spent on lottery tickets, it should not 
be surprising that a few enterprising individuals have concocted schemes to cash in 
on the probabilistic naiveté of the ticket-buying public. There are now several books 
and videos available that claim to help lottery players improve their performance. 
People actually pay money for these items. Some of the advice is just common sense, 
but some of it is misleading and plays on subtle misconceptions about probability. 

For concreteness, suppose that we have a game in which there are 40 balls num- 
bered 1 to 40 and six are drawn without replacement to determine the winning 
combination. A ticket purchase requires the customer to choose six different numbers 
from 1 to 40 and pay a fee. This game has $\binom{40}{6}$ = 3,838,380 different winning 
combinations and the same number of possible tickets. One piece of advice often 
found in published lottery aids is not to choose the six numbers on your ticket too far 
apart. Many people tend to pick their six numbers uniformly spread out from 1 to 40, 
but the winning combination often has two consecutive numbers or at least two num- 
bers very close together. Some of these “advisors” recommend that, since it is more 
likely that there will be numbers close together, players should bunch some of their 
six numbers close together. Such advice might make sense in order to avoid choosing 
the same numbers as other players in a parimutuel game (i.e., a game in which all 
winners share the jackpot). But the idea that any strategy can improve your chances 
of winning is misleading. 

To see why this advice is misleading, let E be the event that the winning combination 
contains at least one pair of consecutive numbers. The reader can calculate 
Pr(E) in Exercise 13 in Sec. 1.12. For this example, Pr(E) = 0.577. So the lottery aids 
are correct that E has high probability. However, by claiming that choosing a ticket in 
E increases your chance of winning, they confuse the probability of the event E with 
the probability of each outcome in E. If you choose the ticket (5, 7, 14, 23, 24, 38), 
your probability of winning is only 1/3,838,380, just as it would be if you chose any 
other ticket. The fact that this ticket happens to be in E doesn't make your probability 
of winning equal to 0.577. The reason that Pr(E) is so big is that so many different 
combinations are in E. Each of those combinations still has probability 1/3,838,380 
of winning, and you only get one combination on each ticket. The fact that there are 
so many combinations in E does not make each one any more likely than anything 
else. 
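
The numbers quoted above can be checked numerically. The sketch below is not part of the original argument: it assumes the closed form that Exercise 13 in Sec. 1.12 asks the reader to derive, and the function names are ours. A brute-force count over a small game is included as a sanity check on that assumed formula.

```python
from itertools import combinations
from math import comb


def pr_consecutive(n: int, k: int) -> float:
    """Pr(at least one consecutive pair) when k of the numbers 1..n are drawn."""
    # Assumed closed form (Exercise 13): comb(n - k + 1, k) combinations
    # contain no pair of consecutive numbers.
    return 1 - comb(n - k + 1, k) / comb(n, k)


def pr_consecutive_brute(n: int, k: int) -> float:
    """Same probability by listing every combination (small n only)."""
    combos = list(combinations(range(1, n + 1), k))
    hits = sum(any(b - a == 1 for a, b in zip(c, c[1:])) for c in combos)
    return hits / len(combos)


total = comb(40, 6)            # number of possible tickets
pr_E = pr_consecutive(40, 6)   # probability of the event E
pr_ticket = 1 / total          # probability that any one given ticket wins
```

Here `pr_E` is large because E contains many tickets, while `pr_ticket` is the same for every ticket, whether or not it lies in E.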


1.12 Supplementary Exercises 


1. Suppose that a coin is tossed seven times. Let A denote 
the event that a head is obtained on the first toss, and let B 
denote the event that a head is obtained on the fifth toss. 
Are A and B disjoint? 


2. If A, B, and D are three events such that Pr(A ∪ B ∪ 
D) = 0.7, what is the value of Pr(A^c ∩ B^c ∩ D^c)? 


3. Suppose that a certain precinct contains 350 voters, of 
which 250 are Democrats and 100 are Republicans. If 30 
voters are chosen at random from the precinct, what is the 
probability that exactly 18 Democrats will be selected? 


4. Suppose that in a deck of 20 cards, each card has one 
of the numbers 1, 2, 3, 4, or 5 and there are four cards 
with each number. If 10 cards are chosen from the deck at 
random, without replacement, what is the probability that 
each of the numbers 1, 2, 3, 4, and 5 will appear exactly 
twice? 


5. Consider the contractor in Example 1.5.4 on page 19. 
He wishes to compute the probability that the total utility 
demand is high, meaning that the sum of water and elec- 
trical demand (in the units of Example 1.4.5) is at least 
215. Draw a picture of this event on a graph like Fig. 1.5 
or Fig. 1.9 and find its probability. 


6. Suppose that a box contains r red balls and w white 
balls. Suppose also that balls are drawn from the box one 
at a time, at random, without replacement. (a) What is the 
probability that all r red balls will be obtained before any 
white balls are obtained? (b) What is the probability that 
all r red balls will be obtained before two white balls are 
obtained? 


7. Suppose that a box contains r red balls, w white balls, 
and b blue balls. Suppose also that balls are drawn from 
the box one at a time, at random, without replacement. 
What is the probability that all r red balls will be obtained 
before any white balls are obtained? 


8. Suppose that 10 cards, of which seven are red and three 
are green, are put at random into 10 envelopes, of which 
seven are red and three are green, so that each envelope 
contains one card. Determine the probability that exactly 
k envelopes will contain a card with a matching color 
(k =0,1,..., 10). 


9. Suppose that 10 cards, of which five are red and five 
are green, are put at random into 10 envelopes, of which 
seven are red and three are green, so that each envelope 
contains one card. Determine the probability that exactly 
k envelopes will contain a card with a matching color 
(k = 0, 1, ..., 10). 


10. Suppose that the events A and B are disjoint. Under 
what conditions are A^c and B^c disjoint? 


11. Let A_1, A_2, and A_3 be three arbitrary events. Show that 
the probability that exactly one of these three events will 
occur is 

$$\Pr(A_1) + \Pr(A_2) + \Pr(A_3) - 2\Pr(A_1 \cap A_2) - 2\Pr(A_1 \cap A_3) - 2\Pr(A_2 \cap A_3) + 3\Pr(A_1 \cap A_2 \cap A_3).$$


12. Let A_1, ..., A_n be n arbitrary events. Show that the 
probability that exactly one of these n events will occur is 

$$\sum_{i=1}^{n} \Pr(A_i) - 2\sum_{i<j} \Pr(A_i \cap A_j) + 3\sum_{i<j<k} \Pr(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1}\, n \Pr(A_1 \cap A_2 \cap \cdots \cap A_n).$$


13. Consider a state lottery game in which each winning 
combination and each ticket consists of one set of k numbers 
chosen from the numbers 1 to n without replacement. 
We shall compute the probability that the winning combi- 
nation contains at least one pair of consecutive numbers. 


a. Prove that if n < 2k - 1, then every winning combination 
has at least one pair of consecutive numbers. 
For the rest of the problem, assume that n ≥ 2k - 1. 


b. Let i_1 < ··· < i_k be an arbitrary possible winning 
combination arranged in order from smallest to 
largest. For s = 1, ..., k, let j_s = i_s - (s - 1). That 
is, 

$$j_1 = i_1, \quad j_2 = i_2 - 1, \quad \ldots, \quad j_k = i_k - (k - 1).$$

Prove that (i_1, ..., i_k) contains at least one pair of 
consecutive numbers if and only if (j_1, ..., j_k) contains 
repeated numbers. 

c. Prove that 1 ≤ j_1 ≤ ··· ≤ j_k ≤ n - k + 1 and that the 
number of (j_1, ..., j_k) sets with no repeats is $\binom{n-k+1}{k}$. 


d. Find the probability that there is no pair of consecu- 
tive numbers in the winning combination. 

e. Find the probability of at least one pair of consecu- 
tive numbers in the winning combination. 


2.1 The Definition of Conditional Probability 


A major use of probability in statistical inference is the updating of probabilities 
when certain events are observed. The updated probability of event A after we 
learn that event B has occurred is the conditional probability of A given B. 


Example 2.1.1 Lottery Ticket. Consider a state lottery game in which six numbers are drawn without 
replacement from a bin containing the numbers 1-30. Each player tries to match the 
set of six numbers that will be drawn without regard to the order in which the numbers 
are drawn. Suppose that you hold a ticket in such a lottery with the numbers 1, 14, 
15, 20, 23, and 27. You turn on your television to watch the drawing but all you see is 
one number, 15, being drawn when the power suddenly goes off in your house. You 
don’t even know whether 15 was the first, last, or some in-between draw. However, 
now that you know that 15 appears in the winning draw, the probability that your 
ticket is a winner must be higher than it was before you saw the draw. How do you 
calculate the revised probability? ◄ 


Example 2.1.1 is typical of the following situation. An experiment is performed 
for which the sample space S is given (or can be constructed easily) and the proba- 
bilities are available for all of the events of interest. We then learn that some event B 
has occurred, and we want to know how the probability of another event A changes 
after we learn that B has occurred. In Example 2.1.1, the event that we have learned 
is B = {one of the numbers drawn is 15}. We are certainly interested in the probabil- 
ity of 


A = {the numbers 1, 14, 15, 20, 23, and 27 are drawn}, 


and possibly other events. 

If we know that the event B has occurred, then we know that the outcome of 
the experiment is one of those included in B. Hence, to evaluate the probability that 
A will occur, we must consider the set of those outcomes in B that also result in 
the occurrence of A. As sketched in Fig. 2.1, this set is precisely the set A ∩ B. It is 
therefore natural to calculate the revised probability of A according to the following 
definition. 

Figure 2.1 The outcomes in the event B that also belong to the event A. 
(Diagram labels: S, A ∩ B.) 


Definition 2.1.1 Conditional Probability. Suppose that we learn that an event B has occurred and that 
we wish to compute the probability of another event A taking into account that 
we know that B has occurred. The new probability of A is called the conditional 
probability of the event A given that the event B has occurred and is denoted Pr(A|B). 
If Pr(B) > 0, we compute this probability as 

$$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}. \tag{2.1.1}$$

The conditional probability Pr(A|B) is not defined if Pr(B) = 0. 


For convenience, the notation in Definition 2.1.1 is read simply as the conditional 
probability of A given B. Eq. (2.1.1) indicates that Pr(A|B) is computed as the 
proportion of the total probability Pr(B) that is represented by Pr(A ∩ B), intuitively 
the proportion of B that is also part of A. 


Example 2.1.2 Lottery Ticket. In Example 2.1.1, you learned that the event 

B = {one of the numbers drawn is 15} 

has occurred. You want to calculate the probability of the event A that your ticket 
is a winner. Both events A and B are expressible in the sample space that consists of 
the $\binom{30}{6}$ = 30!/(6!24!) possible combinations of 30 items taken six at a time, namely, 
the unordered draws of six numbers from 1-30. The event B consists of combinations 
that include 15. Since there are 29 remaining numbers from which to choose the other 
five in the winning draw, there are $\binom{29}{5}$ outcomes in B. It follows that 

$$\Pr(B) = \frac{\binom{29}{5}}{\binom{30}{6}} = \frac{29!\,24!\,6!}{30!\,5!\,24!} = 0.2.$$

The event A that your ticket is a winner consists of a single outcome that is also in B, 
so A ∩ B = A, and 

$$\Pr(A \cap B) = \Pr(A) = \frac{1}{\binom{30}{6}} = \frac{6!\,24!}{30!} = 1.68 \times 10^{-6}.$$

It follows that the conditional probability of A given B is 

$$\Pr(A \mid B) = \frac{6!\,24!/30!}{0.2} = 8.4 \times 10^{-6}.$$

This is five times as large as Pr(A) before you learned that B had occurred. ◄ 
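
The arithmetic in this example can be reproduced exactly with rational numbers. The following is a minimal sketch; the helper name `conditional` is ours and simply encodes Eq. (2.1.1), including its restriction to Pr(B) > 0.

```python
from fractions import Fraction
from math import comb


def conditional(pr_a_and_b: Fraction, pr_b: Fraction) -> Fraction:
    """Pr(A|B) = Pr(A and B) / Pr(B), defined only when Pr(B) > 0."""
    if pr_b <= 0:
        raise ValueError("Pr(A|B) is not defined when Pr(B) = 0")
    return pr_a_and_b / pr_b


outcomes = comb(30, 6)                   # unordered draws of six numbers from 1-30
pr_b = Fraction(comb(29, 5), outcomes)   # draws that contain the number 15
pr_a = Fraction(1, outcomes)             # your single ticket; A is a subset of B
pr_a_given_b = conditional(pr_a, pr_b)
```

Because A is contained in B, Pr(A ∩ B) = Pr(A), so the ratio `pr_a_given_b / pr_a` is just 1/Pr(B).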


Definition 2.1.1 for the conditional probability Pr(A|B) is worded in terms of 
the subjective interpretation of probability in Sec. 1.2. Eq. (2.1.1) also has a simple 
meaning in terms of the frequency interpretation of probability. According to the 
frequency interpretation, if an experimental process is repeated a large number of 
times, then the proportion of repetitions in which the event B will occur is approximately 
Pr(B) and the proportion of repetitions in which both the event A and the 
event B will occur is approximately Pr(A ∩ B). Therefore, among those repetitions 
in which the event B occurs, the proportion of repetitions in which the event A will 
also occur is approximately equal to 

$$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}.$$

Example 2.1.3 Rolling Dice. Suppose that two dice were rolled and it was observed that the sum T of 
the two numbers was odd. We shall determine the probability that T was less than 8. 

If we let A be the event that T < 8 and let B be the event that T is odd, then 
A ∩ B is the event that T is 3, 5, or 7. From the probabilities for two dice given at the 
end of Sec. 1.6, we can evaluate Pr(A ∩ B) and Pr(B) as follows: 

$$\Pr(A \cap B) = \frac{2}{36} + \frac{4}{36} + \frac{6}{36} = \frac{12}{36} = \frac{1}{3},$$

$$\Pr(B) = \frac{2}{36} + \frac{4}{36} + \frac{6}{36} + \frac{4}{36} + \frac{2}{36} = \frac{18}{36} = \frac{1}{2}.$$

Hence, 

$$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{2}{3}.$$ ◄ 
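
The same answer can be obtained by listing the 36 equally likely outcomes of the two dice. This is a small illustrative sketch; the helper `pr` and the variable names are ours.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes


def pr(event) -> Fraction:
    """Probability of an event given as a predicate on the sum T."""
    return Fraction(sum(event(a + b) for a, b in rolls), len(rolls))


pr_b = pr(lambda t: t % 2 == 1)                   # B: T is odd
pr_a_and_b = pr(lambda t: t < 8 and t % 2 == 1)   # A and B: T is 3, 5, or 7
pr_a_given_b = pr_a_and_b / pr_b
```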


Example 2.1.4 A Clinical Trial. It is very common for patients with episodes of depression to have 
a recurrence within two to three years. Prien et al. (1984) studied three treatments 
for depression: imipramine, lithium carbonate, and a combination. As is traditional 
in such studies (called clinical trials), there was also a group of patients who received 
a placebo. (A placebo is a treatment that is supposed to be neither helpful nor 
harmful. Some patients are given a placebo so that they will not know that they 
did not receive one of the other treatments. None of the other patients knew which 
treatment or placebo they received, either.) In this example, we shall consider 150 
patients who entered the study after an episode of depression that was classified 
as “unipolar” (meaning that there was no manic disorder). They were divided into 
the four groups (three treatments plus placebo) and followed to see how many had 
recurrences of depression. Table 2.1 summarizes the results. If a patient were selected 
at random from this study and it were found that the patient received the placebo 
treatment, what is the conditional probability that the patient had a relapse? 


Table 2.1 Results of the clinical depression study in Example 2.1.4 

                            Treatment group 
Response      Imipramine   Lithium   Combination   Placebo   Total 
Relapse           18          13          22          24       77 
No relapse        22          25          16          10       73 
Total             40          38          38          34      150 


Let B be the event that the patient received the placebo, and let A be the event that 
the patient had a relapse. We can calculate Pr(B) = 34/150 and Pr(A ∩ B) = 24/150 
directly from the table. Then Pr(A|B) = 24/34 = 0.706. On the other hand, if the 
randomly selected patient is found to have received lithium (call this event C) then 
Pr(C) = 38/150, Pr(A ∩ C) = 13/150, and Pr(A|C) = 13/38 = 0.342. Knowing which 
treatment a patient received seems to make a difference to the probability of relapse. 
In Chapter 10, we shall study methods for being more precise about how much of a 
difference it makes. ◄ 
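
The conditional probabilities for all four groups follow the same pattern and can be read off Table 2.1 mechanically. The sketch below copies the counts from the table; the dictionary layout and the function name are our own choices.

```python
from fractions import Fraction

# Counts copied from Table 2.1: (relapse, no relapse) for each treatment group.
table = {
    "imipramine": (18, 22),
    "lithium": (13, 25),
    "combination": (22, 16),
    "placebo": (24, 10),
}
n_patients = sum(r + n for r, n in table.values())


def pr_relapse_given(group: str) -> Fraction:
    """Pr(relapse | group) = Pr(relapse and group) / Pr(group)."""
    relapse, no_relapse = table[group]
    pr_group = Fraction(relapse + no_relapse, n_patients)
    pr_both = Fraction(relapse, n_patients)
    return pr_both / pr_group
```

For instance, `pr_relapse_given("placebo")` reduces to 24/34, since the common factor 1/150 cancels.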


Example 2.1.5 Rolling Dice Repeatedly. Suppose that two dice are to be rolled repeatedly and the 
sum T of the two numbers is to be observed for each roll. We shall determine the 
probability p that the value T = 7 will be observed before the value T = 8 is observed. 

The desired probability p could be calculated directly as follows: We could 
assume that the sample space S contains all sequences of outcomes that terminate as 
soon as either the sum T =7 or the sum T = 8 is obtained. Then we could find the 
sum of the probabilities of all the sequences that terminate when the value T = 7 is 
obtained. 

However, there is a simpler approach in this example. We can consider the simple 
experiment in which two dice are rolled. If we repeat the experiment until either the 
sum T =7 or the sum T = 8 is obtained, the effect is to restrict the outcome of the 
experiment to one of these two values. Hence, the problem can be restated as follows: 
Given that the outcome of the experiment is either T = 7 or T = 8, determine the 
probability p that the outcome is actually T =7. 

If we let A be the event that T = 7 and let B be the event that the value of T is 
either 7 or 8, then A ∩ B = A and 

$$p = \Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\Pr(A)}{\Pr(B)}.$$

From the probabilities for two dice given in Example 1.6.5, Pr(A) = 6/36 and 
Pr(B) = (6/36) + (5/36) = 11/36. Hence, p = 6/11. ◄ 
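
The two approaches described in this example can be compared numerically. In the sketch below, the direct approach is approximated by summing the probabilities of the sequences that end with a 7 after n rolls that are neither 7 nor 8; truncating the series at 200 terms is our own arbitrary choice, and the variable names are ours.

```python
from fractions import Fraction

pr_7 = Fraction(6, 36)
pr_8 = Fraction(5, 36)
pr_neither = 1 - pr_7 - pr_8    # the roll is neither 7 nor 8, so rolling continues

# Direct approach: n rolls that are neither 7 nor 8, followed by a 7.
partial_sum = sum(pr_neither ** n * pr_7 for n in range(200))

# Conditional approach from the text: Pr(T = 7 | T is 7 or 8).
p = pr_7 / (pr_7 + pr_8)
```

The truncated series agrees with 6/11 to within floating-point precision, which illustrates why the conditional argument is a legitimate shortcut.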


The Multiplication Rule for Conditional Probabilities 


In some experiments, certain conditional probabilities are relatively easy to assign 
directly. In these experiments, it is then possible to compute the probability that both 
of two events occur by applying the next result that follows directly from Eq. (2.1.1) 
and the analogous definition of Pr(B|A). 


Theorem 2.1.1 Multiplication Rule for Conditional Probabilities. Let A and B be events. If Pr(B) > 0, 
then 

$$\Pr(A \cap B) = \Pr(B) \Pr(A \mid B).$$

If Pr(A) > 0, then 

$$\Pr(A \cap B) = \Pr(A) \Pr(B \mid A).$$


Example 2.1.6 Selecting Two Balls. Suppose that two balls are to be selected at random, without 
replacement, from a box containing r red balls and b blue balls. We shall determine 
the probability p that the first ball will be red and the second ball will be blue. 

Let A be the event that the first ball is red, and let B be the event that the second 
ball is blue. Obviously, Pr(A) = r/(r + b). Furthermore, if the event A has occurred, 
then one red ball has been removed from the box on the first draw. Therefore, the 
probability of obtaining a blue ball on the second draw will be 

$$\Pr(B \mid A) = \frac{b}{r + b - 1}.$$

It follows that 

$$\Pr(A \cap B) = \frac{r}{r + b} \cdot \frac{b}{r + b - 1}.$$ ◄ 


The principle that has just been applied can be extended to any finite number of 
events, as stated in the following theorem. 


Theorem 2.1.2 Multiplication Rule for Conditional Probabilities. Suppose that A_1, A_2, ..., A_n are 
events such that Pr(A_1 ∩ A_2 ∩ ··· ∩ A_{n-1}) > 0. Then 

$$\Pr(A_1 \cap A_2 \cap \cdots \cap A_n) = \Pr(A_1) \Pr(A_2 \mid A_1) \Pr(A_3 \mid A_1 \cap A_2) \cdots \Pr(A_n \mid A_1 \cap A_2 \cap \cdots \cap A_{n-1}). \tag{2.1.2}$$

Proof The product of probabilities on the right side of Eq. (2.1.2) is equal to 

$$\Pr(A_1) \cdot \frac{\Pr(A_1 \cap A_2)}{\Pr(A_1)} \cdot \frac{\Pr(A_1 \cap A_2 \cap A_3)}{\Pr(A_1 \cap A_2)} \cdots \frac{\Pr(A_1 \cap A_2 \cap \cdots \cap A_n)}{\Pr(A_1 \cap A_2 \cap \cdots \cap A_{n-1})}.$$

Since Pr(A_1 ∩ A_2 ∩ ··· ∩ A_{n-1}) > 0, each of the denominators in this product must be 
positive. All of the terms in the product cancel each other except the final numerator 
Pr(A_1 ∩ A_2 ∩ ··· ∩ A_n), which is the left side of Eq. (2.1.2). ∎ 


Example 2.1.7 Selecting Four Balls. Suppose that four balls are selected one at a time, without 
replacement, from a box containing r red balls and b blue balls (r ≥ 2, b ≥ 2). We 
shall determine the probability of obtaining the sequence of outcomes red, blue, red, 
blue. 

If we let R_j denote the event that a red ball is obtained on the jth draw and let 
B_j denote the event that a blue ball is obtained on the jth draw (j = 1, ..., 4), then 

$$\Pr(R_1 \cap B_2 \cap R_3 \cap B_4) = \Pr(R_1) \Pr(B_2 \mid R_1) \Pr(R_3 \mid R_1 \cap B_2) \Pr(B_4 \mid R_1 \cap B_2 \cap R_3) = \frac{r}{r+b} \cdot \frac{b}{r+b-1} \cdot \frac{r-1}{r+b-2} \cdot \frac{b-1}{r+b-3}.$$ ◄ 
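
The chain of conditional probabilities in Theorem 2.1.2 translates directly into a loop over the draws. The following sketch is illustrative only: the function names and the encoding of a sequence as a string such as "RBRB" are our own, and the brute-force version is meant just as a check for small boxes.

```python
from fractions import Fraction
from itertools import permutations


def pr_sequence(r: int, b: int, colors: str) -> Fraction:
    """Chain the conditional probabilities for a color sequence such as 'RBRB'."""
    prob = Fraction(1)
    for color in colors:
        remaining = r if color == "R" else b
        prob *= Fraction(remaining, r + b)   # Pr(this color | earlier draws)
        if color == "R":
            r -= 1
        else:
            b -= 1
    return prob


def pr_sequence_brute(r: int, b: int, colors: str) -> Fraction:
    """Same probability by listing all equally likely ordered draws."""
    balls = "R" * r + "B" * b
    draws = list(permutations(range(len(balls)), len(colors)))
    hits = sum(all(balls[i] == c for i, c in zip(d, colors)) for d in draws)
    return Fraction(hits, len(draws))
```

For r = 3 and b = 2, both functions give 1/10 for the sequence red, blue, red, blue, in agreement with the product formula above.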


Note: Conditional Probabilities Behave Just Like Probabilities. In all of the sit- 
uations that we shall encounter in this text, every result that we can prove has a 
conditional version given an event B with Pr(B) > 0. Just replace all probabilities by 
conditional probabilities given B and replace all conditional probabilities given other 
events C by conditional probabilities given C ∩ B. For example, Theorem 1.5.3 says 
that Pr(A^c) = 1 - Pr(A). It is easy to prove that Pr(A^c|B) = 1 - Pr(A|B) if Pr(B) > 0. 
(See Exercises 11 and 12 in this section.) Another example is Theorem 2.1.3, which 
is a conditional version of the multiplication rule Theorem 2.1.2. Although a proof is 
given for Theorem 2.1.3, we shall not provide proofs of all such conditional theorems, 
because their proofs are generally very similar to the proofs of the unconditional 
versions. 


Theorem 2.1.3 Suppose that A_1, A_2, ..., A_n, B are events such that Pr(B) > 0 and 
Pr(A_1 ∩ A_2 ∩ ··· ∩ A_{n-1}|B) > 0. Then 

$$\Pr(A_1 \cap A_2 \cap \cdots \cap A_n \mid B) = \Pr(A_1 \mid B) \Pr(A_2 \mid A_1 \cap B) \cdots \Pr(A_n \mid A_1 \cap A_2 \cap \cdots \cap A_{n-1} \cap B). \tag{2.1.3}$$

Proof The product of probabilities on the right side of Eq. (2.1.3) is equal to 

$$\frac{\Pr(A_1 \cap B)}{\Pr(B)} \cdot \frac{\Pr(A_1 \cap A_2 \cap B)}{\Pr(A_1 \cap B)} \cdots \frac{\Pr(A_1 \cap A_2 \cap \cdots \cap A_n \cap B)}{\Pr(A_1 \cap A_2 \cap \cdots \cap A_{n-1} \cap B)}.$$

Since Pr(A_1 ∩ A_2 ∩ ··· ∩ A_{n-1}|B) > 0, each of the denominators in this product must 
be positive. All of the terms in the product cancel each other except the first denominator 
and the final numerator to yield Pr(A_1 ∩ A_2 ∩ ··· ∩ A_n ∩ B)/Pr(B), which is 
the left side of Eq. (2.1.3). ∎ 

Figure 2.2 The intersections of A with events B_1, ..., B_5 of a partition in 
the proof of Theorem 2.1.4. 


Conditional Probability and Partitions 


Theorem 1.4.11 shows how to calculate the probability of an event by partitioning 
the sample space into two events B and B^c. This result easily generalizes to larger 
partitions, and when combined with Theorem 2.1.1 it leads to a very powerful tool 
for calculating probabilities. 


Definition 2.1.2 Partition. Let S denote the sample space of some experiment, and consider k events 
B_1, ..., B_k in S such that B_1, ..., B_k are disjoint and $\bigcup_{i=1}^{k} B_i = S$. It is said that these 
events form a partition of S. 


Typically, the events that make up a partition are chosen so that an important 
source of uncertainty in the problem is reduced if we learn which event has occurred. 


Example 2.1.8 Selecting Bolts. Two boxes contain long bolts and short bolts. Suppose that one box 
contains 60 long bolts and 40 short bolts, and that the other box contains 10 long bolts 
and 20 short bolts. Suppose also that one box is selected at random and a bolt is then 
selected at random from that box. We would like to determine the probability that 
this bolt is long. ◄ 


Partitions can facilitate the calculations of probabilities of certain events. 


Theorem 2.1.4 Law of total probability. Suppose that the events B_1, ..., B_k form a partition of the 
space S and Pr(B_j) > 0 for j = 1, ..., k. Then, for every event A in S, 

$$\Pr(A) = \sum_{j=1}^{k} \Pr(B_j) \Pr(A \mid B_j). \tag{2.1.4}$$

Proof The events B_1 ∩ A, B_2 ∩ A, ..., B_k ∩ A will form a partition of A, as illustrated 
in Fig. 2.2. Hence, we can write 

$$A = (B_1 \cap A) \cup (B_2 \cap A) \cup \cdots \cup (B_k \cap A).$$

Furthermore, since the k events on the right side of this equation are disjoint, 

$$\Pr(A) = \sum_{j=1}^{k} \Pr(B_j \cap A).$$

Finally, if Pr(B_j) > 0 for j = 1, ..., k, then Pr(B_j ∩ A) = Pr(B_j) Pr(A|B_j) and it 
follows that Eq. (2.1.4) holds. ∎ 


Example 2.1.9 Selecting Bolts. In Example 2.1.8, let B_1 be the event that the first box (the one with 
60 long and 40 short bolts) is selected, let B_2 be the event that the second box (the 
one with 10 long and 20 short bolts) is selected, and let A be the event that a long 
bolt is selected. Then 

$$\Pr(A) = \Pr(B_1) \Pr(A \mid B_1) + \Pr(B_2) \Pr(A \mid B_2).$$

Since a box is selected at random, we know that Pr(B_1) = Pr(B_2) = 1/2. Furthermore, 
the probability of selecting a long bolt from the first box is Pr(A|B_1) = 
60/100 = 3/5, and the probability of selecting a long bolt from the second box is 
Pr(A|B_2) = 10/30 = 1/3. Hence, 

$$\Pr(A) = \frac{1}{2} \cdot \frac{3}{5} + \frac{1}{2} \cdot \frac{1}{3} = \frac{7}{15}.$$ ◄ 
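
Eq. (2.1.4) is a weighted sum, which makes it easy to express as a small function. The sketch below applies it to the two boxes of bolts; the function name `total_probability` and the list-based interface are our own assumptions, not notation from the text.

```python
from fractions import Fraction


def total_probability(prior, likelihood):
    """Pr(A) = sum over j of Pr(B_j) * Pr(A|B_j), for a partition B_1, ..., B_k."""
    if sum(prior) != 1:
        raise ValueError("the Pr(B_j) must sum to 1 for a partition")
    return sum(p * q for p, q in zip(prior, likelihood))


prior = [Fraction(1, 2), Fraction(1, 2)]             # Pr(B_1), Pr(B_2)
likelihood = [Fraction(60, 100), Fraction(10, 30)]   # Pr(A|B_1), Pr(A|B_2)
pr_long = total_probability(prior, likelihood)
```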

Example 2.1.10 Achieving a High Score. Suppose that a person plays a game in which his score must be 
one of the 50 numbers 1, 2, ..., 50 and that each of these 50 numbers is equally likely 
to be his score. The first time he plays the game, his score is X. He then continues to 
play the game until he obtains another score Y such that Y ≥ X. We will assume that, 
conditional on previous plays, the 50 scores remain equally likely on all subsequent 
plays. We shall determine the probability of the event A that Y = 50. 

For each i = 1, ..., 50, let B_i be the event that X = i. Conditional on B_i, the 
value of Y is equally likely to be any one of the numbers i, i + 1, ..., 50. Since each 
of these (51 - i) possible values for Y is equally likely, it follows that 

$$\Pr(A \mid B_i) = \Pr(Y = 50 \mid B_i) = \frac{1}{51 - i}.$$

Furthermore, since the probability of each of the 50 values of X is 1/50, it follows that 
Pr(B_i) = 1/50 for all i and 

$$\Pr(A) = \sum_{i=1}^{50} \frac{1}{50} \cdot \frac{1}{51 - i} = \frac{1}{50}\left(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{50}\right) = 0.0900.$$ ◄ 
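
The sum in this example is tedious by hand but immediate by machine. The following sketch evaluates it exactly; the names `pr_x` and `pr_y50_given_x` are ours.

```python
from fractions import Fraction

scores = range(1, 51)
pr_x = Fraction(1, 50)   # Pr(B_i) = Pr(X = i), the same for every i


def pr_y50_given_x(i: int) -> Fraction:
    """Pr(Y = 50 | X = i): Y is uniform on the 51 - i values i, ..., 50."""
    return Fraction(1, 51 - i)


# Law of total probability over the partition B_1, ..., B_50.
pr_a = sum(pr_x * pr_y50_given_x(i) for i in scores)
print(round(float(pr_a), 4))
```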


Note: Conditional Version of Law of Total Probability. The law of total probability 
has an analog conditional on another event C, namely, 

$$\Pr(A \mid C) = \sum_{j=1}^{k} \Pr(B_j \mid C) \Pr(A \mid B_j \cap C). \tag{2.1.5}$$


The reader can prove this in Exercise 17. 


Augmented Experiment. In some experiments, it may not be clear from the initial 
description of the experiment that a partition exists that will facilitate the calculation 
of probabilities. However, there are many such experiments in which such a partition 
exists if we imagine that the experiment has some additional structure. Consider the 
following modification of Examples 2.1.8 and 2.1.9. 

Example 2.1.11 Selecting Bolts. There is one box of bolts that contains some long and some short 
bolts. A manager is unable to open the box at present, so she asks her employees 
what the composition of the box is. One employee says that it contains 60 long bolts 
and 40 short bolts. Another says that it contains 10 long bolts and 20 short bolts. 
Unable to reconcile these opinions, the manager decides that each of the employees 
is correct with probability 1/2. Let B_1 be the event that the box contains 60 long and 
40 short bolts, and let B_2 be the event that the box contains 10 long and 20 short 
bolts. The probability that the first bolt selected is long is now calculated precisely as 
in Example 2.1.9. ◄ 


In Example 2.1.11, there is only one box of bolts, but we believe that it has one 
of two possible compositions. We let the events B_1 and B_2 determine the possible 
compositions. This type of situation is very common in experiments. 


Example 2.1.12 A Clinical Trial. Consider a clinical trial such as the study of treatments for depression 
in Example 2.1.4. As in many such trials, each patient has two possible outcomes, 
in this case relapse and no relapse. We shall refer to relapse as “failure” and no 
relapse as “success.” For now, we shall consider only patients in the imipramine 
treatment group. If we knew the effectiveness of imipramine, that is, the proportion 
p of successes among all patients who might receive the treatment, then we might 
model the patients in our study as having probability p of success. Unfortunately, we 
do not know p at the start of the trial. In analogy to the box of bolts with unknown 
composition in Example 2.1.11, we can imagine that the collection of all available 
patients (from which the 40 imipramine patients in this trial were selected) has two or 
more possible compositions. We can imagine that the composition of the collection of 
patients determines the proportion that will be success. For simplicity, in this example, 
we imagine that there are 11 different possible compositions of the collection of 
patients. In particular, we assume that the proportions of success for the 11 possible 
compositions are 0, 1/10, ..., 9/10, 1. (We shall be able to handle more realistic 
models for p in Chapter 3.) For example, if we knew that our patients were drawn 
from a collection with the proportion 3/10 of successes, we would be comfortable 
saying that the patients in our sample each have success probability p = 3/10. The 
value of p is an important source of uncertainty in this problem, and we shall partition 
the sample space by the possible values of p. For j = 1, ..., 11, let B_j be the event 
that our sample was drawn from a collection with proportion (j - 1)/10 of successes. 
We can also identify B_j as the event {p = (j - 1)/10}. 

Now, let E_1 be the event that the first patient in the imipramine group has a 
success. We defined each event B_j so that Pr(E_1|B_j) = (j - 1)/10. Suppose that, 
prior to starting the trial, we believe that Pr(B_j) = 1/11 for each j. It follows that 

$$\Pr(E_1) = \sum_{j=1}^{11} \frac{1}{11} \cdot \frac{j - 1}{10} = \frac{55}{110} = \frac{1}{2}, \tag{2.1.6}$$

where the second equality uses the fact that $\sum_{j=1}^{n} j = n(n + 1)/2$. ◄ 
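
Eq. (2.1.6) is another instance of the law of total probability, this time over the 11 possible values of p. A minimal sketch of the computation follows; the list names are ours.

```python
from fractions import Fraction

# B_j is the event {p = (j - 1)/10}, j = 1, ..., 11, each with prior 1/11.
p_values = [Fraction(j - 1, 10) for j in range(1, 12)]
prior = [Fraction(1, 11)] * len(p_values)

# Law of total probability: Pr(E_1) = sum over j of Pr(B_j) * Pr(E_1|B_j).
pr_e1 = sum(pr_bj * p for pr_bj, p in zip(prior, p_values))
```

Changing `prior` to a nonuniform list that still sums to 1 shows how Pr(E_1) depends on what we believe about p before the trial.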

The events B_1, B_2, ..., B_11 in Example 2.1.12 can be thought of in much the 
same way as the two events B_1 and B_2 that determine the mixture of long and short 
bolts in Example 2.1.11. There is only one box of bolts, but there is uncertainty about 
its composition. Similarly in Example 2.1.12, there is only one group of patients, 
but we believe that it has one of 11 possible compositions determined by the events 
B_1, B_2, ..., B_11. For these to be called events, they must be subsets of the sample space for the 
experiment in question. That will be the case in Example 2.1.12 if we imagine that 
the experiment consists not only of observing the numbers of successes and failures 
among the patients but also of potentially observing enough additional patients to 
be able to compute p, possibly at some time very far in the future. Similarly, in 
Example 2.1.11, the two events B_1 and B_2 are subsets of the sample space if we 
imagine that the experiment consists not only of observing one sample bolt but also 
of potentially observing the entire composition of the box. 

Throughout the remainder of this text, we shall implicitly assume that experi- 
ments are augmented to include outcomes that determine the values of quantities 
such as p. We shall not require that we ever get to observe the complete outcome of 
the experiment so as to tell us precisely what p is, but merely that there is an exper- 
iment that includes all of the events of interest to us, including those that determine 
quantities like p. 


Definition 2.1.3 Augmented Experiment. If desired, any experiment can be augmented to include the 
potential or hypothetical observation of as much additional information as we would 
find useful to help us calculate any probabilities that we desire. 


Definition 2.1.3 is worded somewhat vaguely because it is intended to cover a 
wide variety of cases. Here is an explicit application to Example 2.1.12. 


Example 2.1.13 A Clinical Trial. In Example 2.1.12, we could explicitly assume that there exists an 
infinite sequence of patients who could be treated with imipramine even though 
we will observe only finitely many of them. We could let the sample space consist 
of infinite sequences of the two symbols S and F such as (S, S, F, S, F, F, F, ...). 
Here S in coordinate i means that the ith patient is a success, and F stands for 
failure. So, the event E_1 in Example 2.1.12 is the event that the first coordinate 
is S. The example sequence above is then in the event E_1. To accommodate our 
interpretation of p as the proportion of successes, we can assume that, for every 
such sequence, the proportion of S's among the first n coordinates gets close to one 
of the numbers 0, 1/10, ..., 9/10, 1 as n increases. In this way, p is explicitly the limit 
of the proportion of successes we would observe if we could find a way to observe 
indefinitely. In Example 2.1.12, B_2 is the event consisting of all the outcomes in which 
the limit of the proportion of S's equals 1/10, B_3 is the set of outcomes in which 
the limit is 2/10, etc. Also, we observe only the first 40 coordinates of the infinite 
sequence, but we still behave as if p exists and could be determined if only we could 
observe forever. ◄ 


In the remainder of the text, there will be many experiments that we assume 
are augmented. In such cases, we will mention which quantities (such as p in Exam- 
ple 2.1.13) would be determined by the augmented part of the experiment even if we 
do not explicitly mention that the experiment is augmented. 


The Game of Craps 


We shall conclude this section by discussing a popular gambling game called craps. 
One version of this game is played as follows: A player rolls two dice, and the sum 
of the two numbers that appear is observed. If the sum on the first roll is 7 or 11, 
the player wins the game immediately. If the sum on the first roll is 2, 3, or 12, the 
player loses the game immediately. If the sum on the first roll is 4, 5, 6, 8, 9, or 10, 
then the two dice are rolled again and again until the sum is either 7 or the original 
value. If the original value is obtained a second time before 7 is obtained, then the 
player wins. If the sum 7 is obtained before the original value is obtained a second 
time, then the player loses. 

We shall now compute the probability Pr(W), where W is the event that the 
player will win. Let the sample space S consist of all possible sequences of sums from 
the rolls of dice that might occur in a game. For example, some of the elements of S are 
(4, 7), (11), (4, 3, 4), (12), (10, 8, 2, 12, 6, 7), etc. We see that (11) ∈ W but (4, 7) ∈ W^c, 
etc. We begin by noticing that whether or not an outcome is in W depends in a crucial 
way on the first roll. For this reason, it makes sense to partition W according to the 
sum on the first roll. Let B_i be the event that the first roll is i for i = 2, ..., 12. 

Theorem 2.1.4 tells us that Pr(W) = Σ_{i=2}^{12} Pr(B_i) Pr(W|B_i). Since Pr(B_i) for each 
i was computed in Example 1.6.5, we need to determine Pr(W|B_i) for each i. We 
begin with i = 2. Because the player loses if the first roll is 2, we have Pr(W|B_2) = 0. 
Similarly, Pr(W|B_3) = 0 = Pr(W|B_12). Also, Pr(W|B_7) = 1 because the player wins if 
the first roll is 7. Similarly, Pr(W|B_11) = 1. 

For each first roll i ∈ {4, 5, 6, 8, 9, 10}, Pr(W|B_i) is the probability that, in a 
sequence of dice rolls, the sum i will be obtained before the sum 7 is obtained. As 
described in Example 2.1.5, this probability is the same as the probability of obtaining 
the sum i when the sum must be either i or 7. Hence, 

\[
\Pr(W \mid B_i) = \frac{\Pr(B_i)}{\Pr(B_i \cup B_7)}.
\]

We compute the necessary values here: 

\[
\begin{aligned}
\Pr(W \mid B_4) &= \frac{3/36}{3/36 + 6/36} = \frac{1}{3}, &
\Pr(W \mid B_5) &= \frac{4/36}{4/36 + 6/36} = \frac{2}{5}, \\
\Pr(W \mid B_6) &= \frac{5/36}{5/36 + 6/36} = \frac{5}{11}, &
\Pr(W \mid B_8) &= \frac{5/36}{5/36 + 6/36} = \frac{5}{11}, \\
\Pr(W \mid B_9) &= \frac{4/36}{4/36 + 6/36} = \frac{2}{5}, &
\Pr(W \mid B_{10}) &= \frac{3/36}{3/36 + 6/36} = \frac{1}{3}.
\end{aligned}
\]


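Finally, we compute the sum Σ_{i=2}^{12} Pr(B_i) Pr(W|B_i): 

\[
\begin{aligned}
\sum_{i=2}^{12} \Pr(B_i)\Pr(W \mid B_i)
 &= 0 + 0 + \frac{3}{36}\cdot\frac{1}{3} + \frac{4}{36}\cdot\frac{2}{5} + \frac{5}{36}\cdot\frac{5}{11} + \frac{6}{36} \\
 &\quad + \frac{5}{36}\cdot\frac{5}{11} + \frac{4}{36}\cdot\frac{2}{5} + \frac{3}{36}\cdot\frac{1}{3} + \frac{2}{36} + 0
 = \frac{2928}{5940} = 0.493.
\end{aligned}
\]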


Thus, the probability of winning in the game of craps is slightly less than 1/2. 
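
The calculation above can be checked with exact rational arithmetic. The following Python sketch is an added illustration rather than part of the original derivation; the function names pr_sum, pr_win_given_first, and pr_win are our own, and the code assumes two balanced six-sided dice and the rules stated above.

```python
from fractions import Fraction


def pr_sum(total: int) -> Fraction:
    """Pr(B_total): probability that two balanced dice sum to `total` (2..12)."""
    ways = 6 - abs(total - 7)  # number of (die 1, die 2) pairs giving this sum
    return Fraction(ways, 36)


def pr_win_given_first(i: int) -> Fraction:
    """Pr(W | B_i) for the version of craps described in the text."""
    if i in (7, 11):
        return Fraction(1)
    if i in (2, 3, 12):
        return Fraction(0)
    # For i in {4, 5, 6, 8, 9, 10}: the player wins if i appears before 7.
    return pr_sum(i) / (pr_sum(i) + pr_sum(7))


def pr_win() -> Fraction:
    """Law of total probability over the partition B_2, ..., B_12."""
    return sum((pr_sum(i) * pr_win_given_first(i) for i in range(2, 13)),
               Fraction(0))


print(pr_win(), float(pr_win()))
```

The exact result is 2928/5940 in lowest terms, which rounds to the value 0.493 obtained above.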




Summary 


The revised probability of an event A after learning that event B (with Pr(B) > 0) 
has occurred is the conditional probability of A given B, denoted by Pr(A|B) and 
computed as Pr(A ∩ B)/Pr(B). Often it is easy to assess a conditional probability, 
such as Pr(A|B), directly. In such a case, we can use the multiplication rule for 
conditional probabilities to compute Pr(A ∩ B) = Pr(B) Pr(A|B). All probability results 
have versions conditional on an event B with Pr(B) > 0: Just change all probabilities 
so that they are conditional on B in addition to anything else they were already 
conditional on. For example, the multiplication rule for conditional probabilities 
becomes Pr(A_1 ∩ A_2|B) = Pr(A_1|B) Pr(A_2|A_1 ∩ B). A partition is a collection of disjoint 
events whose union is the whole sample space. To be most useful, a partition is cho- 
sen so that an important source of uncertainty is reduced if we learn which one of 
the partition events occurs. If the conditional probability of an event A is available 
given each event in a partition, the law of total probability tells how to combine these 
conditional probabilities to get Pr(A). 


Exercises 


1. If A ⊂ B with Pr(B) > 0, what is the value of Pr(A|B)? 


2. If A and B are disjoint events and Pr(B) > 0, what is 
the value of Pr(A|B)? 


3. If S is the sample space of an experiment and A is any 
event in that space, what is the value of Pr(A|S)? 


4. Each time a shopper purchases a tube of toothpaste, 
he chooses either brand A or brand B. Suppose that for 
each purchase after the first, the probability is 1/3 that he 
will choose the same brand that he chose on his preceding 
purchase and the probability is 2/3 that he will switch 
brands. If he is equally likely to choose either brand A 
or brand B on his first purchase, what is the probability 
that both his first and second purchases will be brand A 
and both his third and fourth purchases will be brand B? 


5. A box contains r red balls and b blue balls. One ball 
is selected at random and its color is observed. The ball 
is then returned to the box and k additional balls of the 
same color are also put into the box. A second ball is then 
selected at random, its color is observed, and it is returned 
to the box together with k additional balls of the same 
color. Each time another ball is selected, the process is 
repeated. If four balls are selected, what is the probability 
that the first three balls will be red and the fourth ball will 
be blue? 


6. A box contains three cards. One card is red on both 
sides, one card is green on both sides, and one card is red 
on one side and green on the other. One card is selected 
from the box at random, and the color on one side is 
observed. If this side is green, what is the probability that 
the other side of the card is also green? 


7. Consider again the conditions of Exercise 2 of Sec. 1.10. 
If a family selected at random from the city subscribes to 
newspaper A, what is the probability that the family also 
subscribes to newspaper B? 


8. Consider again the conditions of Exercise 2 of Sec. 1.10. 
If a family selected at random from the city subscribes to 
at least one of the three newspapers A, B, and C, what is 
the probability that the family subscribes to newspaper A? 


9. Suppose that a box contains one blue card and four red 
cards, which are labeled A, B, C, and D. Suppose also that 
two of these five cards are selected at random, without 
replacement. 


a. If it is known that card A has been selected, what is 
the probability that both cards are red? 


b. If it is known that at least one red card has been 
selected, what is the probability that both cards are 
red? 


10. Consider the following version of the game of craps: 
The player rolls two dice. If the sum on the first roll is 
7 or 11, the player wins the game immediately. If the 
sum on the first roll is 2, 3, or 12, the player loses the 
game immediately. However, if the sum on the first roll 
is 4, 5, 6, 8, 9, or 10, then the two dice are rolled again and 
again until the sum is either 7 or 11 or the original value. If 
the original value is obtained a second time before either 
7 or 11 is obtained, then the player wins. If either 7 or 11 
is obtained before the original value is obtained a second 
time, then the player loses. Determine the probability that 
the player will win this game. 


11. For any two events A and B with Pr(B) > 0, prove that 
Pr(A^c|B) = 1 − Pr(A|B). 


12. For any three events A, B, and D, such that Pr(D) > 0, 
prove that Pr(A ∪ B|D) = Pr(A|D) + Pr(B|D) − Pr(A ∩ B|D). 


13. A box contains three coins with a head on each side, 
four coins with a tail on each side, and two fair coins. If 
one of these nine coins is selected at random and tossed 
once, what is the probability that a head will be obtained? 


14. A machine produces defective parts with three differ- 
ent probabilities depending on its state of repair. If the 
machine is in good working order, it produces defective 
parts with probability 0.02. If it is wearing down, it produces 
defective parts with probability 0.1. If it needs maintenance, 
it produces defective parts with probability 0.3. 
The probability that the machine is in good working order 
is 0.8, the probability that it is wearing down is 0.1, and the 
probability that it needs maintenance is 0.1. Compute the 
probability that a randomly selected part will be defective. 


15. The percentages of voters classed as Liberals in three 
different election districts are divided as follows: in the 
first district, 21 percent; in the second district, 45 percent; 
and in the third district, 75 percent. If a district is selected 
at random and a voter is selected at random from that 
district, what is the probability that she will be a Liberal? 


16. Consider again the shopper described in Exercise 4. 
On each purchase, the probability that he will choose the 
same brand of toothpaste that he chose on his preceding 
purchase is 1/3, and the probability that he will switch 
brands is 2/3. Suppose that on his first purchase the probability 
that he will choose brand A is 1/4 and the probability 
that he will choose brand B is 3/4. What is the probability 
that his second purchase will be brand B? 


17. Prove the conditional version of the law of total prob- 
ability (2.1.5). 


2.2 Independent Events 


If learning that B has occurred does not change the probability of A, then we say 
that A and B are independent. There are many cases in which events A and B 
are not independent, but they would be independent if we learned that some other 
event C had occurred. In this case, A and B are conditionally independent given C. 


Example 2.2.1  Tossing Coins. Suppose that a fair coin is tossed twice. The experiment has four 
outcomes, HH, HT, TH, and TT, that tell us how the coin landed on each of the 
two tosses. We can assume that this sample space is simple so that each outcome has 
probability 1/4. Suppose that we are interested in the second toss. In particular, we 
want to calculate the probability of the event A = {H on second toss}. We see that A = 
{HH, TH}, so that Pr(A) = 2/4 = 1/2. If we learn that the first coin landed T, we might 
wish to compute the conditional probability Pr(A|B) where B = {T on first toss}. 
Using the definition of conditional probability, we easily compute 

\[
\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{1/4}{1/2} = \frac{1}{2},
\]

because A ∩ B = {TH} has probability 1/4. We see that Pr(A|B) = Pr(A); hence, we 
don't change the probability of A even after we learn that B has occurred. ◄ 


Definition of Independence 


The conditional probability of the event A given that the event B has occurred is 
the revised probability of A after we learn that B has occurred. It might be the case, 
however, that no revision is necessary to the probability of A even after we learn that 
B occurs. This is precisely what happened in Example 2.2.1. In this case, we say that 
A and B are independent events. As another example, if we toss a coin and then roll 
a die, we could let A be the event that the die shows 3 and let B be the event that the 
coin lands with heads up. If the tossing of the coin is done in isolation of the rolling 
of the die, we might be quite comfortable assigning Pr(A|B) = Pr(A) = 1/6. In this 
case, we say that A and B are independent events. 

In general, if Pr(B) > 0, the equation Pr(A|B) = Pr(A) can be rewritten as 
Pr(A ∩ B)/Pr(B) = Pr(A). If we multiply both sides of this last equation by Pr(B), we obtain 
the equation Pr(A ∩ B) = Pr(A) Pr(B). In order to avoid the condition Pr(B) > 0, the 
mathematical definition of the independence of two events is stated as follows: 


Definition 2.2.1  Independent Events. Two events A and B are independent if 

\[
\Pr(A \cap B) = \Pr(A)\Pr(B).
\]


Suppose that Pr(A) > 0 and Pr(B) > 0. Then it follows easily from the definitions 
of independence and conditional probability that A and B are independent if and only 
if Pr(A|B) = Pr(A) and Pr(B|A) = Pr(B). 


Independence of Two Events 


If two events A and B are considered to be independent because the events are 
physically unrelated, and if the probabilities Pr(A) and Pr(B) are known, then the 
definition can be used to assign a value to Pr(A ∩ B). 


Example 2.2.2  Machine Operation. Suppose that two machines 1 and 2 in a factory are operated 
independently of each other. Let A be the event that machine 1 will become inoperative 
during a given 8-hour period, let B be the event that machine 2 will become inoperative 
during the same period, and suppose that Pr(A) = 1/3 and Pr(B) = 1/4. We shall 
determine the probability that at least one of the machines will become inoperative 
during the given period. 

The probability Pr(A ∩ B) that both machines will become inoperative during 
the period is 

\[
\Pr(A \cap B) = \Pr(A)\Pr(B) = \left(\frac{1}{3}\right)\left(\frac{1}{4}\right) = \frac{1}{12}.
\]

Therefore, the probability Pr(A ∪ B) that at least one of the machines will become 
inoperative during the period is 

\[
\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B) = \frac{1}{3} + \frac{1}{4} - \frac{1}{12} = \frac{1}{2}.
\]
◄ 


The next example shows that two events A and B, which are physically related, 
can, nevertheless, satisfy the definition of independence. 


Example 2.2.3  Rolling a Die. Suppose that a balanced die is rolled. Let A be the event that an even 
number is obtained, and let B be the event that one of the numbers 1, 2, 3, or 4 is 
obtained. We shall show that the events A and B are independent. 

In this example, Pr(A) = 1/2 and Pr(B) = 2/3. Furthermore, since A ∩ B is the 
event that either the number 2 or the number 4 is obtained, Pr(A ∩ B) = 1/3. Hence, 
Pr(A ∩ B) = Pr(A) Pr(B). It follows that the events A and B are independent events, 
even though the occurrence of each event depends on the same roll of a die. ◄ 
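
In a simple sample space, the product condition for independence can be verified by direct counting. The Python sketch below is an added illustration; the helper names pr and independent are our own, and the comparison event {1, 2, 3} is a hypothetical choice that is not taken from the text. The last lines repeat the arithmetic for two independently operated machines with failure probabilities 1/3 and 1/4.

```python
from fractions import Fraction


def pr(event: set, sample_space: set) -> Fraction:
    """Probability of `event` when all outcomes in `sample_space` are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


def independent(a: set, b: set, sample_space: set) -> bool:
    """Check the product condition Pr(A ∩ B) = Pr(A) Pr(B)."""
    return pr(a & b, sample_space) == pr(a, sample_space) * pr(b, sample_space)


die = set(range(1, 7))
A = {2, 4, 6}        # an even number is obtained
B = {1, 2, 3, 4}     # one of the numbers 1, 2, 3, or 4 is obtained

print(independent(A, B, die))          # the pair of events from the die example
print(independent(A, {1, 2, 3}, die))  # hypothetical event, for contrast

# Two machines assumed independent, with Pr(A) = 1/3 and Pr(B) = 1/4.
pr_a, pr_b = Fraction(1, 3), Fraction(1, 4)
pr_both = pr_a * pr_b
pr_at_least_one = pr_a + pr_b - pr_both
print(pr_both, pr_at_least_one)
```

Replacing B by {1, 2, 3} breaks the product condition, since only one of those three outcomes is even.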


The independence of the events A and B in Example 2.2.3 can also be interpreted 
as follows: Suppose that a person must bet on whether the number obtained on the 
die will be even or odd, that is, on whether or not the event A will occur. Since three 
of the possible outcomes of the roll are even and the other three are odd, the person 
will typically have no preference between betting on an even number and betting on 
an odd number. 

Suppose also that after the die has been rolled, but before the person has learned 
the outcome and before she has decided whether to bet on an even outcome or on an 
odd outcome, she is informed that the actual outcome was one of the numbers 1, 2, 3, 
or 4, i.e., that the event B has occurred. The person now knows that the outcome was 
1, 2, 3, or 4. However, since two of these numbers are even and two are odd, the 
person will typically still have no preference between betting on an even number 
and betting on an odd number. In other words, the information that the event B has 
occurred is of no help to the person who is trying to decide whether or not the event 
A has occurred. 


Independence of Complements  In the foregoing discussion of independent events, 
we stated that if A and B are independent, then the occurrence or nonoccurrence of 
A should not be related to the occurrence or nonoccurrence of B. Hence, if A and 
B satisfy the mathematical definition of independent events, then it should also be 
true that A and B^c are independent events, that A^c and B are independent events, 
and that A^c and B^c are independent events. One of these results is established in the 
next theorem. 


Theorem 2.2.1  If two events A and B are independent, then the events A and B^c are also 
independent. 


Proof  Theorem 1.5.6 says that 

\[
\Pr(A \cap B^c) = \Pr(A) - \Pr(A \cap B).
\]

Furthermore, since A and B are independent events, Pr(A ∩ B) = Pr(A) Pr(B). It 
now follows that 

\[
\Pr(A \cap B^c) = \Pr(A) - \Pr(A)\Pr(B) = \Pr(A)[1 - \Pr(B)] = \Pr(A)\Pr(B^c).
\]

Therefore, the events A and B^c are independent. ∎ 


The proof of the analogous result for the events A^c and B is similar, and the proof 
for the events A^c and B^c is required in Exercise 2 at the end of this section. 


Independence of Several Events 


The definition of independent events can be extended to any number of events, 
A_1, ..., A_k. Intuitively, if learning that some of these events do or do not occur does 
not change our probabilities for any events that depend only on the remaining events, 
we would say that all k events are independent. The mathematical definition is the 
following analog to Definition 2.2.1. 


Definition 2.2.2  (Mutually) Independent Events. The k events A_1, ..., A_k are independent (or mutually 
independent) if, for every subset A_{i_1}, ..., A_{i_j} of j of these events (j = 2, 3, ..., k), 

\[
\Pr(A_{i_1} \cap \cdots \cap A_{i_j}) = \Pr(A_{i_1}) \cdots \Pr(A_{i_j}).
\]


As an example, in order for three events A, B, and C to be independent, the following 
four relations must be satisfied: 

\[
\begin{aligned}
\Pr(A \cap B) &= \Pr(A)\Pr(B), \\
\Pr(A \cap C) &= \Pr(A)\Pr(C), \\
\Pr(B \cap C) &= \Pr(B)\Pr(C),
\end{aligned}
\tag{2.2.1}
\]

and 

\[
\Pr(A \cap B \cap C) = \Pr(A)\Pr(B)\Pr(C). \tag{2.2.2}
\]

It is possible that Eq. (2.2.2) will be satisfied, but one or more of the three relations 
(2.2.1) will not be satisfied. On the other hand, as is shown in the next example, 
it is also possible that each of the three relations (2.2.1) will be satisfied but Eq. (2.2.2) 
will not be satisfied. 


Example 2.2.4  Pairwise Independence. Suppose that a fair coin is tossed twice so that the sample 
space S = {HH, HT, TH, TT} is simple. Define the following three events: 

    A = {H on first toss} = {HH, HT}, 
    B = {H on second toss} = {HH, TH}, and 
    C = {Both tosses the same} = {HH, TT}. 

Then A ∩ B = A ∩ C = B ∩ C = A ∩ B ∩ C = {HH}. Hence, 

\[
\Pr(A) = \Pr(B) = \Pr(C) = 1/2
\]

and 

\[
\Pr(A \cap B) = \Pr(A \cap C) = \Pr(B \cap C) = \Pr(A \cap B \cap C) = 1/4.
\]

It follows that each of the three relations of Eq. (2.2.1) is satisfied but Eq. (2.2.2) is 
not satisfied. These results can be summarized by saying that the events A, B, and C 
are pairwise independent, but all three events are not independent. ◄ 
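
The difference between pairwise and mutual independence can be made concrete by checking the product rule over every subset of the events. The following Python sketch is an added illustration for the two-toss sample space; the function names pairwise_independent and mutually_independent are our own.

```python
from fractions import Fraction
from itertools import combinations
from math import prod

S = {"HH", "HT", "TH", "TT"}  # simple sample space for two tosses of a fair coin


def pr(event: set) -> Fraction:
    """Probability of an event in the simple sample space S."""
    return Fraction(len(event & S), len(S))


def pairwise_independent(events) -> bool:
    """Product rule for every pair of events only."""
    return all(pr(a & b) == pr(a) * pr(b) for a, b in combinations(events, 2))


def mutually_independent(events) -> bool:
    """Product rule for every subset of size j = 2, ..., k."""
    for j in range(2, len(events) + 1):
        for subset in combinations(events, j):
            joint = set.intersection(*subset)
            if pr(joint) != prod(pr(e) for e in subset):
                return False
    return True


A = {"HH", "HT"}  # H on first toss
B = {"HH", "TH"}  # H on second toss
C = {"HH", "TT"}  # both tosses the same

print(pairwise_independent([A, B, C]))
print(mutually_independent([A, B, C]))
```

The pairwise check succeeds while the full check fails, because the triple intersection has probability 1/4 rather than 1/8.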


We shall now present some examples that will illustrate the power and scope of 
the concept of independence in the solution of probability problems. 


Example 2.2.5  Inspecting Items. Suppose that a machine produces a defective item with probability 
p (0 < p < 1) and produces a nondefective item with probability 1 − p. Suppose 
further that six items produced by the machine are selected at random and inspected, 
and that the results (defective or nondefective) for these six items are independent. 
We shall determine the probability that exactly two of the six items are defective. 

It can be assumed that the sample space S contains all possible arrangements 
of six items, each one of which might be either defective or nondefective. For j = 
1, ..., 6, we shall let D_j denote the event that the jth item in the sample is defective 
so that D_j^c is the event that this item is nondefective. Since the outcomes for the six 
different items are independent, the probability of obtaining any particular sequence 
of defective and nondefective items will simply be the product of the individual 
probabilities for the items. For example, 

\[
\begin{aligned}
\Pr(D_1^c \cap D_2 \cap D_3^c \cap D_4^c \cap D_5 \cap D_6^c)
 &= \Pr(D_1^c)\Pr(D_2)\Pr(D_3^c)\Pr(D_4^c)\Pr(D_5)\Pr(D_6^c) \\
 &= (1-p)\,p\,(1-p)(1-p)\,p\,(1-p) = p^2(1-p)^4.
\end{aligned}
\]

It can be seen that the probability of any other particular sequence in S containing 
two defective items and four nondefective items will also be p^2(1 − p)^4. Hence, the 
probability that there will be exactly two defectives in the sample of six items can be 
found by multiplying the probability p^2(1 − p)^4 of any particular sequence containing 
two defectives by the possible number of such sequences. Since there are $\binom{6}{2}$ distinct 
arrangements of two defective items and four nondefective items, the probability of 
obtaining exactly two defectives is 

\[
\binom{6}{2} p^2 (1-p)^4.
\]
◄ 


Example 2.2.6  Obtaining a Defective Item. For the conditions of Example 2.2.5, we shall now determine 
the probability that at least one of the six items in the sample will be defective. 

Since the outcomes for the different items are independent, the probability that 
all six items will be nondefective is (1 − p)^6. Therefore, the probability that at least 
one item will be defective is 1 − (1 − p)^6. ◄ 
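
The counting argument in the two preceding examples can be checked by summing the product probabilities over every sequence of six items. The Python sketch below is an added illustration; the function names are our own, and the value p = 0.1 is a hypothetical choice made only so that the functions can be evaluated.

```python
from itertools import product
from math import comb


def pr_exactly(k: int, n: int, p: float) -> float:
    """Closed form: (n choose k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def pr_exactly_by_enumeration(k: int, n: int, p: float) -> float:
    """Same probability, summing over every defective/nondefective sequence."""
    total = 0.0
    for seq in product((True, False), repeat=n):  # True marks a defective item
        if sum(seq) == k:
            prob = 1.0
            for defective in seq:
                # Independence: multiply the individual item probabilities.
                prob *= p if defective else 1 - p
            total += prob
    return total


def pr_at_least_one(n: int, p: float) -> float:
    """Complement of the event that all n items are nondefective."""
    return 1 - (1 - p) ** n


p = 0.1  # hypothetical defect probability, not a value from the text
print(pr_exactly(2, 6, p), pr_exactly_by_enumeration(2, 6, p))
print(pr_at_least_one(6, p))
```

The two ways of computing the probability of exactly two defectives agree up to floating-point rounding, which is what the multiplication-by-the-number-of-arrangements argument asserts.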


Example 2.2.7  Tossing a Coin Until a Head Appears. Suppose that a fair coin is tossed until a head 
appears for the first time, and assume that the outcomes of the tosses are independent. 
We shall determine the probability p_n that exactly n tosses will be required. 

The desired probability is equal to the probability of obtaining n − 1 tails in 
succession and then obtaining a head on the next toss. Since the outcomes of the 
tosses are independent, the probability of this particular sequence of n outcomes is 
p_n = (1/2)^n. 

The probability that a head will be obtained sooner or later (or, equivalently, 
that tails will not be obtained forever) is 

\[
\sum_{n=1}^{\infty} p_n = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \cdots = 1.
\]

Since the sum of the probabilities p_n is 1, it follows that the probability of obtaining 
an infinite sequence of tails without ever obtaining a head must be 0. ◄ 
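
The convergence of this series can be seen from its partial sums, which leave a remainder equal to the probability that the first N tosses are all tails. The short Python sketch below is an added illustration using exact fractions; the names p_n and partial_sum are our own.

```python
from fractions import Fraction


def p_n(n: int) -> Fraction:
    """Probability that the first head appears on toss n of a fair coin."""
    return Fraction(1, 2) ** n


def partial_sum(N: int) -> Fraction:
    """Probability that a head appears within the first N tosses."""
    return sum((p_n(n) for n in range(1, N + 1)), Fraction(0))


for N in (1, 5, 10, 20):
    # The remainder 1 - partial_sum(N) is the probability of N tails in a row.
    print(N, partial_sum(N), 1 - partial_sum(N))
```

Each remainder equals (1/2)^N, which shrinks to 0 as N grows.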


Example 2.2.8  Inspecting Items One at a Time. Consider again a machine that produces a defective 
item with probability p and produces a nondefective item with probability 1 − p. 
Suppose that items produced by the machine are selected at random and inspected 
one at a time until exactly five defective items have been obtained. We shall determine 
the probability p_n that exactly n items (n ≥ 5) must be selected to obtain the 
five defectives. 

The fifth defective item will be the nth item that is inspected if and only if there 
are exactly four defectives among the first n − 1 items and then the nth item is 
defective. By reasoning similar to that given in Example 2.2.5, it can be shown that 
the probability of obtaining exactly four defectives and n − 5 nondefectives among 
the first n − 1 items is $\binom{n-1}{4} p^4 (1-p)^{n-5}$. The probability that the nth item will be 
defective is p. Since the first event refers to outcomes for only the first n − 1 items 
and the second event refers to the outcome for only the nth item, these two events 
are independent. Therefore, the probability that both events will occur is equal to 
the product of their probabilities. It follows that 

\[
p_n = \binom{n-1}{4} p^5 (1-p)^{n-5}.
\]
◄ 
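
As a sanity check on this formula, the probabilities p_n for n = 5, 6, ... should add up to 1 whenever p > 0. The Python sketch below is an added illustration; the function name, the optional parameter r (the required number of defectives, 5 in the text), and the value p = 0.3 are our own assumptions.

```python
from math import comb


def pr_rth_defective_on(n: int, p: float, r: int = 5) -> float:
    """Probability that the r-th defective item is the n-th item inspected.

    Exactly r - 1 defectives among the first n - 1 items, then a defective.
    """
    if n < r:
        return 0.0
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)


p = 0.3  # hypothetical defect probability, not a value from the text
print(pr_rth_defective_on(5, p))   # all of the first five items are defective
print(sum(pr_rth_defective_on(n, p) for n in range(5, 401)))  # truncated sum
```

The truncated sum is numerically indistinguishable from 1 for this choice of p, because the tail beyond n = 400 is negligible.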


Example 2.2.9  People v. Collins. Finkelstein and Levin (1990) describe a criminal case whose verdict 
was overturned by the Supreme Court of California in part due to a probability cal- 
culation involving both conditional probability and independence. The case, People 
v. Collins, 68 Cal. 2d 319, 438 P.2d 33 (1968), involved a purse snatching in which wit- 
nesses claimed to see a young woman with blond hair in a ponytail fleeing from the 
scene in a yellow car driven by a black man with a beard. A couple meeting the de- 
scription was arrested a few days after the crime, but no physical evidence was found. 
A mathematician calculated the probability that a randomly selected couple would 
possess the described characteristics as about 8.3 × 10^−8, or 1 in 12 million. Faced 
with such overwhelming odds and no physical evidence, the jury decided that the 
defendants must have been the only such couple and convicted them. The Supreme 
Court thought that a more useful probability should have been calculated. Based 
on the testimony of the witnesses, there was a couple that met the above descrip- 
tion. Given that there was already one couple who met the description, what is the 
conditional probability that there was also a second couple such as the defendants? 

Let p be the probability that a randomly selected couple from a population of n 
couples has certain characteristics. Let A be the event that at least one couple in the 
population has the characteristics, and let B be the event that at least two couples 
have the characteristics. What we seek is Pr(B|A). Since B ⊂ A, it follows that 

\[
\Pr(B \mid A) = \frac{\Pr(B \cap A)}{\Pr(A)} = \frac{\Pr(B)}{\Pr(A)}.
\]

We shall calculate Pr(B) and Pr(A) by breaking each event into more manageable 
pieces. Suppose that we number the n couples in the population from 1 to n. Let A_i 
be the event that couple number i has the characteristics in question for i = 1, ..., n, 
and let C be the event that exactly one couple has the characteristics. Then 

\[
\begin{aligned}
A &= (A_1^c \cap A_2^c \cap \cdots \cap A_n^c)^c, \\
C &= (A_1 \cap A_2^c \cap \cdots \cap A_n^c) \cup (A_1^c \cap A_2 \cap A_3^c \cap \cdots \cap A_n^c) \cup \cdots \cup (A_1^c \cap \cdots \cap A_{n-1}^c \cap A_n), \\
B &= A \cap C^c.
\end{aligned}
\]

Assuming that the n couples are mutually independent, Pr(A^c) = (1 − p)^n, and 
Pr(A) = 1 − (1 − p)^n. The n events whose union is C are disjoint and each one has 
probability p(1 − p)^{n−1}, so Pr(C) = np(1 − p)^{n−1}. Since A = B ∪ C with B and C 
disjoint, we have 

\[
\Pr(B) = \Pr(A) - \Pr(C) = 1 - (1-p)^n - np(1-p)^{n-1}.
\]

So, 

\[
\Pr(B \mid A) = \frac{1 - (1-p)^n - np(1-p)^{n-1}}{1 - (1-p)^n}. \tag{2.2.3}
\]

The Supreme Court of California reasoned that, since the crime occurred in a 
heavily populated area, n would be in the millions. For example, with p = 8.3 × 10^−8 
and n = 8,000,000, the value of (2.2.3) is 0.2966. Such a probability suggests that there 
is a reasonable chance that there was another couple meeting the same description 
as the witnesses provided. Of course, the court did not know how large n was, but the 
fact that (2.2.3) could easily be so large was grounds enough to rule that reasonable 
doubt remained as to the guilt of the defendants. ◄ 
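
The conditional probability that at least two couples have the characteristics, given that at least one does, is easy to evaluate numerically. The Python sketch below is an added illustration; the function name pr_second_couple is our own, and the use of log1p is a numerical convenience rather than anything prescribed by the text. We assume that the quoted value 0.2966 corresponds to taking p to be exactly 1/12,000,000 (the "1 in 12 million" figure); the rounded value 8.3 × 10^−8 appears to give a slightly smaller result, roughly 0.2955.

```python
import math


def pr_second_couple(p: float, n: int) -> float:
    """Pr(at least two couples | at least one couple), for n independent couples."""
    log_q = math.log1p(-p)                      # log(1 - p), accurate for tiny p
    pr_none = math.exp(n * log_q)               # (1 - p)^n
    pr_exactly_one = n * p * math.exp((n - 1) * log_q)  # n p (1 - p)^(n - 1)
    pr_at_least_one = 1.0 - pr_none
    return (pr_at_least_one - pr_exactly_one) / pr_at_least_one


n = 8_000_000
print(pr_second_couple(1 / 12_000_000, n))  # p taken as exactly 1 in 12 million
print(pr_second_couple(8.3e-8, n))          # p rounded as printed in the text
```

Either way, the probability is close to 0.3, so the qualitative conclusion about reasonable doubt does not depend on the rounding.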


Independence and Conditional Probability Two events A and B with positive 
probability are independent if and only if Pr(A|B) = Pr(A). Similar results hold for 
larger collections of independent events. The following theorem, for example, is 
straightforward to prove based on the definition of independence. 


Theorem 2.2.2  Let A_1, ..., A_k be events such that Pr(A_1 ∩ ... ∩ A_k) > 0. Then A_1, ..., A_k are 
independent if and only if, for every two disjoint subsets {i_1, ..., i_m} and {j_1, ..., j_ℓ} 
of {1, ..., k}, we have 

\[
\Pr(A_{i_1} \cap \cdots \cap A_{i_m} \mid A_{j_1} \cap \cdots \cap A_{j_\ell}) = \Pr(A_{i_1} \cap \cdots \cap A_{i_m}).
\]
∎ 

Theorem 2.2.2 says that k events are independent if and only if learning that 
some of the events occur does not change the probability that any combination of 
the other events occurs. 


The Meaning of Independence We have given a mathematical definition of inde- 
pendent events in Definition 2.2.1. We have also given some interpretations for what 
it means for events to be independent. The most instructive interpretation is the one 
based on conditional probability. If learning that B occurs does not change the prob- 
ability of A, then A and B are independent. In simple examples such as tossing what 
we believe to be a fair coin, we would generally not expect to change our minds 
about what is likely to happen on later flips after we observe earlier flips; hence, we 
declare the events that concern different flips to be independent. However, consider 
a situation similar to Example 2.2.5 in which items produced by a machine are in- 
spected to see whether or not they are defective. In Example 2.2.5, we declared that 
the different items were independent and that each item had probability p of being 
defective. This might make sense if we were confident that we knew how well the 
machine was performing. But if we were unsure of how the machine was performing, 
we could easily imagine changing our mind about the probability that the 10th 
item is defective depending on how many of the first nine items are defective. To be 
specific, suppose that we begin by thinking that the probability is 0.08 that an item 
will be defective. If we observe one or zero defective items in the first nine, we might 
not make much revision to the probability that the 10th item is defective. On the 
other hand, if we observe eight or nine defectives in the first nine items, we might be 
uncomfortable keeping the probability at 0.08 that the 10th item will be defective. In 
summary, when deciding whether to model events as independent, try to answer the 
following question: “If I were to learn that some of these events occurred, would I 
change the probabilities of any of the others?” If we feel that we already know ev- 
erything that we could learn from these events about how likely the others should be, 
we can safely model them as independent. If, on the other hand, we feel that learning 
some of these events could change our minds about how likely some of the others 
are, then we should be more careful about determining the conditional probabilities 
and not model the events as independent. 


Mutually Exclusive Events and Mutually Independent Events Two similar-sound- 
ing definitions have appeared earlier in this text. Definition 1.4.10 defines mutually 
exclusive events, and Definition 2.2.2 defines mutually independent events. It is 
almost never the case that the same set of events satisfies both definitions. The reason 
is that if events are disjoint (mutually exclusive), then learning that one occurs means 
that the others definitely did not occur. Hence, learning that one occurs would change 
the probabilities for all the others to 0, unless the others already had probability 0. 
Indeed, this suggests the only condition in which the two definitions would both apply 
to the same collection of events. The proof of the following result is left to Exercise 24 
in this section. 


Theorem 2.2.3  Let n > 1 and let A_1, ..., A_n be events that are mutually exclusive. The events are 
also mutually independent if and only if all the events except possibly one of them 
have probability 0. ∎ 


Conditionally Independent Events 


Conditional probability and independence combine into one of the most versatile 
models of data collection. The idea is that, in many circumstances, we are unwilling 
to say that certain events are independent because we believe that learning some of 
them will provide information about how likely the others are to occur. But if we 
knew the frequency with which such events would occur, we might then be willing 
to assume that they are independent. This model can be illustrated using one of the 
examples from earlier in this section. 


Example 2.2.10

Inspecting Items. Consider again the situation in Example 2.2.5. This time, however,
suppose that we believe that we would change our minds about the probabilities
of later items being defective were we to learn that certain numbers of early items
were defective. Suppose that we think of the number p from Example 2.2.5 as the
proportion of defective items that we would expect to see if we were to inspect a very
large sample of items. If we knew this proportion p, and if we were to sample only a
few, say, six or 10 items now, we might feel confident maintaining that the probability
of a later item being defective remains p even after we inspect some of the earlier
items. On the other hand, if we are not sure what would be the proportion of defective
items in a large sample, we might not feel confident keeping the probability the same
as we continue to inspect.

To be precise, suppose that we treat the proportion p of defective items as
unknown and that we are dealing with an augmented experiment as described in
Definition 2.1.3. For simplicity, suppose that p can take one of two values, either 0.01
or 0.4, the first corresponding to normal operation and the second corresponding to
a need for maintenance. Let $B_1$ be the event that p = 0.01, and let $B_2$ be the event
that p = 0.4. If we knew that $B_1$ had occurred, then we would proceed under the
assumption that the events $D_1, D_2, \ldots$ were independent with $\Pr(D_i \mid B_1) = 0.01$ for
all i. For example, we could do the same calculations as in Examples 2.2.5 and 2.2.8
with p = 0.01. Let A be the event that we observe exactly two defectives in a random
sample of six items. Then $\Pr(A \mid B_1) = \binom{6}{2} 0.01^2\, 0.99^4 = 1.44 \times 10^{-3}$. Similarly, if we
knew that $B_2$ had occurred, then we would assume that $D_1, D_2, \ldots$ were independent
with $\Pr(D_i \mid B_2) = 0.4$. In this case, $\Pr(A \mid B_2) = \binom{6}{2} 0.4^2\, 0.6^4 = 0.311$.
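
The two conditional probabilities in this example can be checked numerically. The
following is a minimal Python sketch, assuming only the counting argument of
Example 2.2.5 (the number of orderings times the probability of each ordering). The
function name `prob_exactly` is hypothetical and is introduced here for illustration.

```python
from math import comb


def prob_exactly(k, n, p):
    """Probability of exactly k defectives among n items that are
    independent given the proportion p, each defective with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Exactly two defectives in six items under each value of p.
pr_a_given_b1 = prob_exactly(2, 6, 0.01)  # normal operation, p = 0.01
pr_a_given_b2 = prob_exactly(2, 6, 0.4)   # maintenance needed, p = 0.4

print(f"{pr_a_given_b1:.3e}")
print(f"{pr_a_given_b2:.3f}")
```

The same function applies to any other sample size or count, which is convenient
if p is later allowed to take more than two values.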


In Example 2.2.10, there is no reason that p must be required to assume at most 
two different values. We could easily allow p to take a third value or a fourth value, 
etc. Indeed, in Chapter 3 we shall learn how to handle the case in which every number 
between 0 and 1 is a possible value of p. The point of the simple example is to illustrate 
the concept of assuming that events are independent conditional on another event, 
such as $B_1$ or $B_2$ in the example.

The formal concept illustrated in Example 2.2.10 is the following: 


Definition 2.2.3

Conditional Independence. We say that events $A_1, \ldots, A_k$ are conditionally
independent given B if, for every subcollection $A_{i_1}, \ldots, A_{i_j}$ of j of these events
($j = 2, 3, \ldots, k$),

$$\Pr(A_{i_1} \cap \cdots \cap A_{i_j} \mid B) = \Pr(A_{i_1} \mid B) \cdots \Pr(A_{i_j} \mid B).$$


Definition 2.2.3 is identical to Definition 2.2.2 for independent events with the
modification that all probabilities in the definition are now conditional on B. As a note,
even if we assume that events $A_1, \ldots, A_k$ are conditionally independent given B, it
is not necessary that they be conditionally independent given $B^c$. In Example 2.2.10,
the events $D_1, D_2, \ldots$ were conditionally independent given both $B_1$ and $B_2 = B_1^c$,
which is the typical situation. Exercise 16 in Sec. 2.3 is an example in which events are
conditionally independent given one event B but are not conditionally independent
given the complement $B^c$.

Recall that two events $A_1$ and $A_2$ (with $\Pr(A_1) > 0$) are independent if and only
if $\Pr(A_2 \mid A_1) = \Pr(A_2)$. A similar result holds for conditionally independent events.


Theorem 2.2.4

Suppose that $A_1$, $A_2$, and B are events such that $\Pr(A_1 \cap B) > 0$. Then $A_1$ and $A_2$ are
conditionally independent given B if and only if $\Pr(A_2 \mid A_1 \cap B) = \Pr(A_2 \mid B)$.


This is another example of the claim we made earlier that every result we can prove 
has an analog conditional on an event B. The reader can prove this theorem in 
Exercise 22. 
The Collector’s Problem


Suppose that n balls are thrown in a random manner into r boxes ($r \le n$). We shall
assume that the n throws are independent and that each of the r boxes is equally 
likely to receive any given ball. The problem is to determine the probability p that 
every box will receive at least one ball. This problem can be reformulated in terms of 
a collector’s problem as follows: Suppose that each package of bubble gum contains 
the picture of a baseball player, that the pictures of r different players are used, that 
the picture of each player is equally likely to be placed in any given package of gum, 
and that pictures are placed in different packages independently of each other. The 
problem now is to determine the probability p that a person who buys n packages of 
gum ($n \ge r$) will obtain a complete set of r different pictures.

For $i = 1, \ldots, r$, let $A_i$ denote the event that the picture of player i is missing
from all n packages. Then $\bigcup_{i=1}^{r} A_i$ is the event that the picture of at least one player
is missing. We shall find $\Pr\left(\bigcup_{i=1}^{r} A_i\right)$ by applying Eq. (1.10.6).

Since the picture of each of the r players is equally likely to be placed in any 
particular package, the probability that the picture of player i will not be obtained in 
any particular package is (r — 1)/r. Since the packages are filled independently, the 
probability that the picture of player i will not be obtained in any of the n packages 
is $[(r - 1)/r]^n$. Hence,

$$\Pr(A_i) = \left(\frac{r-1}{r}\right)^n \quad \text{for } i = 1, \ldots, r.$$


Now consider any two players i and j. The probability that neither the picture of 
player i nor the picture of player j will be obtained in any particular package is 
(r — 2)/r. Therefore, the probability that neither picture will be obtained in any of 
the n packages is $[(r - 2)/r]^n$. Thus,

$$\Pr(A_i \cap A_j) = \left(\frac{r-2}{r}\right)^n.$$


If we next consider any three players i, j, and k, we find that

$$\Pr(A_i \cap A_j \cap A_k) = \left(\frac{r-3}{r}\right)^n.$$


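By continuing in this way, we finally arrive at the probability $\Pr(A_1 \cap A_2 \cap \cdots \cap A_r)$
that the pictures of all r players are missing from the n packages. Of course, this
probability is 0. Therefore, by Eq. (1.10.6) of Sec. 1.10,

$$\Pr\left(\bigcup_{i=1}^{r} A_i\right) = r\left(\frac{r-1}{r}\right)^n - \binom{r}{2}\left(\frac{r-2}{r}\right)^n + \cdots + (-1)^r \binom{r}{r-1}\left(\frac{1}{r}\right)^n = \sum_{j=1}^{r-1} (-1)^{j+1} \binom{r}{j}\left(1 - \frac{j}{r}\right)^n.$$

Since the probability p of obtaining a complete set of r different pictures is equal to
$1 - \Pr\left(\bigcup_{i=1}^{r} A_i\right)$, it follows from the foregoing derivation that p can be written in the
form

$$p = \sum_{j=0}^{r-1} (-1)^{j} \binom{r}{j}\left(1 - \frac{j}{r}\right)^n.$$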
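
The closed form for p can be evaluated directly and, for small cases, compared with
a brute-force enumeration of all $r^n$ equally likely assignments of pictures to packages.
The sketch below is illustrative only; the function names `complete_set_probability`
and `brute_force` are hypothetical, and exact rational arithmetic is used so that the
two answers can be compared without rounding.

```python
from fractions import Fraction
from itertools import product
from math import comb


def complete_set_probability(n, r):
    """p = sum_{j=0}^{r-1} (-1)^j C(r, j) (1 - j/r)^n, as an exact fraction."""
    return sum((-1) ** j * comb(r, j) * Fraction(r - j, r) ** n for j in range(r))


def brute_force(n, r):
    """Count the assignments of n packages to r players that use every player."""
    hits = sum(1 for outcome in product(range(r), repeat=n) if len(set(outcome)) == r)
    return Fraction(hits, r**n)


# Small case that is easy to enumerate: n = 5 packages, r = 3 players.
print(complete_set_probability(5, 3), brute_force(5, 3))

# A larger case where enumeration would be slow but the formula is immediate.
print(float(complete_set_probability(12, 6)))
```

The enumeration grows like $r^n$, so it is only a check on the formula for small n and r,
not a practical way to compute p.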
Summary


A collection of events is independent if and only if learning that some of them occur 
does not change the probabilities that any combination of the rest of them occurs. 
Equivalently, a collection of events is independent if and only if the probability of the 
intersection of every subcollection is the product of the individual probabilities. The 
concept of independence has a version conditional on another event. A collection 
of events is independent conditional on B if and only if the conditional probability 
of the intersection of every subcollection given B is the product of the individual 
conditional probabilities given B. Equivalently, a collection of events is conditionally 
independent given B if and only if learning that some of them (and B) occur does 
not change the conditional probabilities given B that any combination of the rest of 
them occur. The full power of conditional independence will become more apparent 
after we introduce Bayes’ theorem in the next section. 


Exercises 


1. If A and B are independent events and Pr(B) < 1, what 
is the value of $\Pr(A^c \mid B^c)$?


2. Assuming that A and B are independent events, prove 
that the events $A^c$ and $B^c$ are also independent.


3. Suppose that A is an event such that Pr(A) = 0 and that 
B is any other event. Prove that A and B are independent 
events. 


4. Suppose that a person rolls two balanced dice three 
times in succession. Determine the probability that on 
each of the three rolls, the sum of the two numbers that 
appear will be 7. 


5. Suppose that the probability that the control system 
used in a spaceship will malfunction on a given flight is 
0.001. Suppose further that a duplicate, but completely in- 
dependent, control system is also installed in the spaceship 
to take control in case the first system malfunctions. De- 
termine the probability that the spaceship will be under 
the control of either the original system or the duplicate 
system on a given flight. 


6. Suppose that 10,000 tickets are sold in one lottery and 
5000 tickets are sold in another lottery. If a person owns 
100 tickets in each lottery, what is the probability that she 
will win at least one first prize? 


7. Two students A and B are both registered for a certain 
course. Assume that student A attends class 80 percent of 
the time, student B attends class 60 percent of the time, 
and the absences of the two students are independent. 


a. What is the probability that at least one of the two 
students will be in class on a given day? 


b. If at least one of the two students is in class on a given
day, what is the probability that A is in class that day? 


8. If three balanced dice are rolled, what is the probability
that all three numbers will be the same? 


9. Consider an experiment in which a fair coin is tossed 
until a head is obtained for the first time. If this experiment 
is performed three times, what is the probability that ex- 
actly the same number of tosses will be required for each 
of the three performances? 


10. The probability that any child in a certain family will 
have blue eyes is 1/4, and this feature is inherited indepen- 
dently by different children in the family. If there are five 
children in the family and it is known that at least one of 
these children has blue eyes, what is the probability that 
at least three of the children have blue eyes? 


11. Consider the family with five children described in 
Exercise 10. 


a. If it is known that the youngest child in the family has
blue eyes, what is the probability that at least three 
of the children have blue eyes? 


b. Explain why the answer in part (a) is different from 
the answer in Exercise 10. 


12. Suppose that A, B, and C are three independent 
events such that Pr(A) = 1/4, Pr(B) = 1/3, and Pr(C) = 
1/2. (a) Determine the probability that none of these three 
events will occur. (b) Determine the probability that ex- 
actly one of these three events will occur. 


13. Suppose that the probability that any particle emitted 
by a radioactive material will penetrate a certain shield 
is 0.01. If 10 particles are emitted, what is the probability 
that exactly one of the particles will penetrate the shield? 
14. Consider again the conditions of Exercise 13. If 10
particles are emitted, what is the probability that at least 
one of the particles will penetrate the shield? 


15. Consider again the conditions of Exercise 13. How 
many particles must be emitted in order for the probability 
to be at least 0.8 that at least one particle will penetrate 
the shield? 


16. In the World Series of baseball, two teams A and B 
play a sequence of games against each other, and the first 
team that wins a total of four games becomes the winner 
of the World Series. If the probability that team A will 
win any particular game against team B is 1/3, what is the 
probability that team A will win the World Series? 


17. Two boys A and B throw a ball at a target. Suppose 
that the probability that boy A will hit the target on any 
throw is 1/3 and the probability that boy B will hit the 
target on any throw is 1/4. Suppose also that boy A throws 
first and the two boys take turns throwing. Determine the 
probability that the target will be hit for the first time on 
the third throw of boy A. 


18. For the conditions of Exercise 17, determine the prob- 
ability that boy A will hit the target before boy B does. 


19. A box contains 20 red balls, 30 white balls, and 50 
blue balls. Suppose that 10 balls are selected at random 
one at a time, with replacement; that is, each selected ball 
is replaced in the box before the next selection is made. 
Determine the probability that at least one color will be 
missing from the 10 selected balls. 


20. Suppose that $A_1, \ldots, A_k$ form a sequence of k independent
events. Let $B_1, \ldots, B_k$ be another sequence of k
events such that for each value of j ($j = 1, \ldots, k$), either
$B_j = A_j$ or $B_j = A_j^c$. Prove that $B_1, \ldots, B_k$ are also independent
events. Hint: Use an induction argument based
on the number of events $B_j$ for which $B_j = A_j^c$.


21. Prove Theorem 2.2.2 on page 71. Hint: The “only if” 
direction is direct from the definition of independence on 
page 68. For the “if” direction, use induction on the value 
of j in the definition of independence. Let $m = j - 1$ and
let $\ell = 1$ with $j_1 = i_j$.


22. Prove Theorem 2.2.4 on page 73. 


23. A programmer is about to attempt to compile a series
of 11 similar programs. Let $A_i$ be the event that the
ith program compiles successfully for $i = 1, \ldots, 11$. When
the programming task is easy, the programmer expects
that 80 percent of programs should compile. When the
programming task is difficult, she expects that only 40 percent
of the programs will compile. Let B be the event that
the programming task was easy. The programmer believes
that the events $A_1, \ldots, A_{11}$ are conditionally independent
given B and given $B^c$.

a. Compute the probability that exactly 8 out of 11
programs will compile given B.


b. Compute the probability that exactly 8 out of 11
programs will compile given $B^c$.


24. Prove Theorem 2.2.3 on page 72. 


2.3 Bayes’ Theorem 


Suppose that we are interested in which of several disjoint events $B_1, \ldots, B_k$ will
occur and that we will get to observe some other event A. If $\Pr(A \mid B_i)$ is available
for each i, then Bayes’ theorem is a useful formula for computing the conditional
probabilities of the $B_i$ events given A.


We begin with a typical example. 


Example 2.3.1

Test for a Disease. Suppose that you are walking down the street and notice that the
Department of Public Health is giving a free medical test for a certain disease. The
test is 90 percent reliable in the following sense: If a person has the disease, there is a
probability of 0.9 that the test will give a positive response; whereas, if a person does
not have the disease, there is a probability of only 0.1 that the test will give a positive
response.

Data indicate that your chances of having the disease are only 1 in 10,000.
However, since the test costs you nothing, and is fast and harmless, you decide to
stop and take the test. A few days later you learn that you had a positive response to
the test. Now, what is the probability that you have the disease?


The last question in Example 2.3.1 is a prototype of the question for which Bayes’
theorem was designed. We have at least two disjoint events (“you have the disease”
and “you do not have the disease”) about which we are uncertain, and we learn a
piece of information (the result of the test) that tells us something about the uncertain
events. Then we need to know how to revise the probabilities of the events in the light
of the information we learned.

We now present the general structure in which Bayes’ theorem operates before
returning to the example.


Statement, Proof, and Examples of Bayes’ Theorem


Example 2.3.2

Selecting Bolts. Consider again the situation in Example 2.1.8, in which a bolt is
selected at random from one of two boxes. Suppose that we cannot tell without
making a further effort from which of the two boxes the one bolt is being selected. For
example, the boxes may be identical in appearance or somebody else may actually
select the box, but we only get to see the bolt. Prior to selecting the bolt, it was
equally likely that each of the two boxes would be selected. However, if we learn that
event A has occurred, that is, a long bolt was selected, we can compute the conditional
probabilities of the two boxes given A. To remind the reader, $B_1$ is the event that the
box is selected containing 60 long bolts and 40 short bolts, while $B_2$ is the event that
the box is selected containing 10 long bolts and 20 short bolts. In Example 2.1.9, we
computed $\Pr(A) = 7/15$, $\Pr(A \mid B_1) = 3/5$, $\Pr(A \mid B_2) = 1/3$, and $\Pr(B_1) = \Pr(B_2) = 1/2$.
So, for example,

$$\Pr(B_1 \mid A) = \frac{\Pr(A \cap B_1)}{\Pr(A)} = \frac{\Pr(B_1)\Pr(A \mid B_1)}{\Pr(A)} = \frac{\frac{1}{2} \times \frac{3}{5}}{\frac{7}{15}} = \frac{9}{14}.$$

Since the first box has a higher proportion of long bolts than the second box, it seems
reasonable that the probability of $B_1$ should rise after we learn that a long bolt was
selected. It must be that $\Pr(B_2 \mid A) = 5/14$ since one or the other box had to be selected.


In Example 2.3.2, we started with uncertainty about which of two boxes would 
be chosen and then we observed a long bolt drawn from the chosen box. Because the 
two boxes have different chances of having a long bolt drawn, the observation of a 
long bolt changed the probabilities of each of the two boxes having been chosen. The 
precise calculation of how the probabilities change is the purpose of Bayes’ theorem. 


Bayes’ theorem. Let the events $B_1, \ldots, B_k$ form a partition of the space S such that
$\Pr(B_j) > 0$ for $j = 1, \ldots, k$, and let A be an event such that $\Pr(A) > 0$. Then, for
$i = 1, \ldots, k$,

$$\Pr(B_i \mid A) = \frac{\Pr(B_i)\Pr(A \mid B_i)}{\sum_{j=1}^{k} \Pr(B_j)\Pr(A \mid B_j)}. \qquad (2.3.1)$$


Proof By the definition of conditional probability,

$$\Pr(B_i \mid A) = \frac{\Pr(B_i \cap A)}{\Pr(A)}.$$

The numerator on the right side of Eq. (2.3.1) is equal to $\Pr(B_i \cap A)$ by Theorem 2.1.1.
The denominator is equal to $\Pr(A)$ according to Theorem 2.1.4.
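
Eq. (2.3.1) translates almost line by line into code. The following Python sketch is
one possible rendering, not a standard library routine: the function name
`bayes_posterior` and its argument names are assumptions made for illustration. It is
applied to the bolt-selection numbers, with two equally likely boxes and long-bolt
probabilities 3/5 and 1/3.

```python
from fractions import Fraction


def bayes_posterior(priors, likelihoods):
    """Return Pr(B_i | A) for each element B_i of a finite partition.

    priors[i] is Pr(B_i); likelihoods[i] is Pr(A | B_i).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # Pr(B_i and A)
    pr_a = sum(joint)  # denominator of Eq. (2.3.1), i.e. Pr(A)
    if pr_a == 0:
        raise ValueError("Pr(A) must be positive")
    return [j / pr_a for j in joint]


box_posterior = bayes_posterior(
    [Fraction(1, 2), Fraction(1, 2)],  # Pr(B_1), Pr(B_2)
    [Fraction(3, 5), Fraction(1, 3)],  # Pr(A | B_1), Pr(A | B_2)
)
print(box_posterior)  # [Fraction(9, 14), Fraction(5, 14)]
```

Using `Fraction` keeps the results exact, so the posterior probabilities can be seen
to sum to 1 without rounding error.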
Example 2.3.3

Test for a Disease. Let us return to the example with which we began this section.
We have just received word that we have tested positive for a disease. The test was
90 percent reliable in the sense that we described in Example 2.3.1. We want to know
the probability that we have the disease after we learn that the result of the test is
positive. Some readers may feel that this probability should be about 0.9. However,
this feeling completely ignores the small probability of 0.0001 that you had the disease
before taking the test. We shall let $B_1$ denote the event that you have the disease, and
let $B_2$ denote the event that you do not have the disease. The events $B_1$ and $B_2$ form
a partition. Also, let A denote the event that the response to the test is positive.
The event A is information we will learn that tells us something about the partition
elements. Then, by Bayes’ theorem,

$$\Pr(B_1 \mid A) = \frac{\Pr(A \mid B_1)\Pr(B_1)}{\Pr(A \mid B_1)\Pr(B_1) + \Pr(A \mid B_2)\Pr(B_2)} = \frac{(0.9)(0.0001)}{(0.9)(0.0001) + (0.1)(0.9999)} = 0.00090.$$


Thus, the conditional probability that you have the disease given the test result 
is approximately only 1 in 1000. Of course, this conditional probability is approxi- 
mately 9 times as great as the probability was before you were tested, but even the 
conditional probability is quite small. 

Another way to explain this result is as follows: Only one person in every 10,000 
actually has the disease, but the test gives a positive response for approximately one 
person in every 10. Hence, the number of positive responses is approximately 1000 
times the number of persons who actually have the disease. In other words, out of 
every 1000 persons for whom the test gives a positive response, only one person 
actually has the disease. This example illustrates not only the use of Bayes’ theorem 
but also the importance of taking into account all of the information available in a 
problem.
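
The counting explanation can be made concrete with expected counts in a
hypothetical population. The population size of 10,000,000 below is an assumption
chosen only so that every count is a whole number; the prevalence and the two
test probabilities are those of the example.

```python
# Hypothetical population size, chosen so that all expected counts are integers.
population = 10_000_000

sick = population // 10_000           # 1 in 10,000 has the disease
healthy = population - sick

true_positives = sick * 9 // 10       # 90 percent of the sick test positive
false_positives = healthy // 10       # 10 percent of the healthy test positive
all_positives = true_positives + false_positives

# Share of positive responses that come from people who have the disease.
share_sick_among_positives = true_positives / all_positives

print(true_positives, false_positives)
print(round(share_sick_among_positives, 5))
print(round(all_positives / sick, 1))  # positives per person with the disease
```

The last line shows why the posterior probability is so small: positive responses
outnumber people with the disease by a factor of about 1000.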


Example 2.3.4

Identifying the Source of a Defective Item. Three different machines $M_1$, $M_2$, and $M_3$
were used for producing a large batch of similar manufactured items. Suppose that
20 percent of the items were produced by machine $M_1$, 30 percent by machine $M_2$,
and 50 percent by machine $M_3$. Suppose further that 1 percent of the items produced
by machine $M_1$ are defective, that 2 percent of the items produced by machine $M_2$
are defective, and that 3 percent of the items produced by machine $M_3$ are defective.
Finally, suppose that one item is selected at random from the entire batch and it is
found to be defective. We shall determine the probability that this item was produced
by machine $M_2$.

Let $B_i$ be the event that the selected item was produced by machine $M_i$ (i =
1, 2, 3), and let A be the event that the selected item is defective. We must evaluate
the conditional probability $\Pr(B_2 \mid A)$.

The probability $\Pr(B_i)$ that an item selected at random from the entire batch was
produced by machine $M_i$ is as follows, for i = 1, 2, 3:

$$\Pr(B_1) = 0.2, \quad \Pr(B_2) = 0.3, \quad \Pr(B_3) = 0.5.$$

Furthermore, the probability $\Pr(A \mid B_i)$ that an item produced by machine $M_i$ will be
defective is

$$\Pr(A \mid B_1) = 0.01, \quad \Pr(A \mid B_2) = 0.02, \quad \Pr(A \mid B_3) = 0.03.$$

It now follows from Bayes’ theorem that

$$\Pr(B_2 \mid A) = \frac{\Pr(B_2)\Pr(A \mid B_2)}{\sum_{j=1}^{3} \Pr(B_j)\Pr(A \mid B_j)} = \frac{(0.3)(0.02)}{(0.2)(0.01) + (0.3)(0.02) + (0.5)(0.03)} = 0.26.$$


Example 2.3.5

Identifying Genotypes. Consider a gene that has two alleles (see Example 1.6.4 on
page 23) A and a. Suppose that the gene exhibits itself through a trait (such as 
hair color or blood type) with two versions. We call A dominant and a recessive 
if individuals with genotypes AA and Aa have the same version of the trait and 
the individuals with genotype aa have the other version. The two versions of the 
trait are called phenotypes. We shall call the phenotype exhibited by individuals 
with genotypes AA and Aa the dominant trait, and the other trait will be called the 
recessive trait. In population genetics studies, it is common to have information on the 
phenotypes of individuals, but it is rather difficult to determine genotypes. However, 
some information about genotypes can be obtained by observing phenotypes of 
parents and children. 

Assume that the allele A is dominant, that individuals mate independently of
genotype, and that the genotypes AA, Aa, and aa occur in the population with
probabilities 1/4, 1/2, and 1/4, respectively. We are going to observe an individual whose
parents are not available, and we shall observe the phenotype of this individual. Let
E be the event that the observed individual has the dominant trait. We would like
to revise our opinion of the possible genotypes of the parents. There are six possible
genotype combinations, $B_1, \ldots, B_6$, for the parents prior to making any observations,
and these are listed in Table 2.2.

The probabilities of the $B_i$ were computed using the assumption that the parents
mated independently of genotype. For example, $B_3$ occurs if the father is AA and the
mother is aa (probability 1/16) or if the father is aa and the mother is AA (probability
1/16). The values of $\Pr(E \mid B_i)$ were computed assuming that the two available alleles
are passed from parents to children with probability 1/2 each and independently for
the two parents. For example, given $B_4$, the event E occurs if and only if the child
does not get two a’s. The probability of getting a from both parents given $B_4$ is 1/4,
so $\Pr(E \mid B_4) = 3/4$.

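Now we shall compute $\Pr(B_1 \mid E)$ and $\Pr(B_5 \mid E)$. We leave the other calculations
to the reader. The denominator of Bayes’ theorem is the same for both calculations,
namely,

$$\Pr(E) = \sum_{i=1}^{6} \Pr(B_i)\Pr(E \mid B_i) = \frac{1}{16} \times 1 + \frac{1}{4} \times 1 + \frac{1}{8} \times 1 + \frac{1}{4} \times \frac{3}{4} + \frac{1}{4} \times \frac{1}{2} + \frac{1}{16} \times 0 = \frac{3}{4}.$$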

Table 2.2 Parental genotypes for Example 2.3.5

Parental genotypes      (AA, AA)  (AA, Aa)  (AA, aa)  (Aa, Aa)  (Aa, aa)  (aa, aa)
Name of event           $B_1$     $B_2$     $B_3$     $B_4$     $B_5$     $B_6$
Probability of $B_i$    1/16      1/4       1/8       1/4       1/4       1/16
$\Pr(E \mid B_i)$       1         1         1         3/4       1/2       0
Applying Bayes’ theorem, we get

$$\Pr(B_1 \mid E) = \frac{\frac{1}{16} \times 1}{\frac{3}{4}} = \frac{1}{12}, \qquad \Pr(B_5 \mid E) = \frac{\frac{1}{4} \times \frac{1}{2}}{\frac{3}{4}} = \frac{1}{6}.$$
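
The entries of Table 2.2 and the posterior probabilities can be reproduced by
enumeration. The sketch below is illustrative: it assumes the genotype probabilities
1/4, 1/2, 1/4, independent mating, and that each parent passes either of its two alleles
with probability 1/2. The names `genotype_prob`, `prob_dominant`, and the dictionary
keys are hypothetical choices made here, not notation from the text.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

genotype_prob = {"AA": Fraction(1, 4), "Aa": Fraction(1, 2), "aa": Fraction(1, 4)}
order = {"AA": 0, "Aa": 1, "aa": 2}  # used to treat (father, mother) as unordered


def prob_dominant(father, mother):
    """Pr(child shows the dominant trait), enumerating one allele per parent."""
    pairs = list(product(father, mother))  # four equally likely allele pairs
    return Fraction(sum(1 for f, m in pairs if "A" in (f, m)), len(pairs))


prior = defaultdict(Fraction)  # Pr(B_i), keyed by the unordered genotype pair
likelihood = {}                # Pr(E | B_i)
for father, mother in product(genotype_prob, repeat=2):
    key = tuple(sorted((father, mother), key=order.get))
    prior[key] += genotype_prob[father] * genotype_prob[mother]
    likelihood[key] = prob_dominant(father, mother)

pr_e = sum(prior[k] * likelihood[k] for k in prior)  # law of total probability
posterior = {k: prior[k] * likelihood[k] / pr_e for k in prior}

for k in prior:
    print(k, prior[k], likelihood[k], posterior[k])
```

The loop prints all six posterior probabilities, including the four that the text
leaves to the reader.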


Note: Conditional Version of Bayes’ Theorem. There is also a version of Bayes’
theorem conditional on an event C:

$$\Pr(B_i \mid A \cap C) = \frac{\Pr(B_i \mid C)\Pr(A \mid B_i \cap C)}{\sum_{j=1}^{k} \Pr(B_j \mid C)\Pr(A \mid B_j \cap C)}. \qquad (2.3.2)$$


Prior and Posterior Probabilities 


In Example 2.3.4, a probability like $\Pr(B_2)$ is often called the prior probability that
the selected item will have been produced by machine $M_2$, because $\Pr(B_2)$ is the
probability of this event before the item is selected and before it is known whether
the selected item is defective or nondefective. A probability like $\Pr(B_2 \mid A)$ is then
called the posterior probability that the selected item was produced by machine $M_2$,
because it is the probability of this event after it is known that the selected item is
defective.

Thus, in Example 2.3.4, the prior probability that the selected item will have been
produced by machine $M_2$ is 0.3. After an item has been selected and has been found
to be defective, the posterior probability that the item was produced by machine
$M_2$ is 0.26. Since this posterior probability is smaller than the prior probability that
the item was produced by machine $M_2$, the posterior probability that the item was
produced by one of the other machines must be larger than the prior probability that
it was produced by one of those machines (see Exercises 1 and 2 at the end of this
section).
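
A side-by-side listing of prior and posterior probabilities for all three machines
makes this comparison explicit. The sketch below uses the production shares and
defect rates of Example 2.3.4; the variable names are illustrative assumptions.

```python
from fractions import Fraction

machines = ["M1", "M2", "M3"]
prior = [Fraction(20, 100), Fraction(30, 100), Fraction(50, 100)]
defect_rate = [Fraction(1, 100), Fraction(2, 100), Fraction(3, 100)]

joint = [p * d for p, d in zip(prior, defect_rate)]  # Pr(B_i and A)
pr_defective = sum(joint)                            # Pr(A)
posterior = [j / pr_defective for j in joint]        # Pr(B_i | A)

for name, p, q in zip(machines, prior, posterior):
    print(f"{name}: prior {float(p):.2f}, posterior {float(q):.4f}")

# Probability that the item came from a machine other than M2.
others_prior = prior[0] + prior[2]
others_posterior = posterior[0] + posterior[2]
print(float(others_prior), round(float(others_posterior), 4))
```

Note that the statement concerns the other two machines taken together. In this
example the posterior probability of $M_3$ rises while that of $M_1$ falls, but their sum
rises from 0.7 to 17/23.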


Computation of Posterior Probabilities in More Than One Stage 


Suppose that a box contains one fair coin and one coin with a head on each side. 
Suppose also that one coin is selected at random and that when it is tossed, a head is 
obtained. We shall determine the probability that the coin is the fair coin. 

Let $B_1$ be the event that the coin is fair, let $B_2$ be the event that the coin has two
heads, and let $H_1$ be the event that a head is obtained when the coin is tossed. Then,
by Bayes’ theorem,

$$\Pr(B_1 \mid H_1) = \frac{\Pr(B_1)\Pr(H_1 \mid B_1)}{\Pr(B_1)\Pr(H_1 \mid B_1) + \Pr(B_2)\Pr(H_1 \mid B_2)} = \frac{(1/2)(1/2)}{(1/2)(1/2) + (1/2)(1)} = \frac{1}{3}. \qquad (2.3.3)$$

Thus, after the first toss, the posterior probability that the coin is fair is 1/3.

Now suppose that the same coin is tossed again and we assume that the two
tosses are conditionally independent given both $B_1$ and $B_2$. Suppose that another
head is obtained. There are two ways of determining the new value of the posterior
probability that the coin is fair.

The first way is to return to the beginning of the experiment and assume again
that the prior probabilities are $\Pr(B_1) = \Pr(B_2) = 1/2$. We shall let $H_1 \cap H_2$ denote the
event in which heads are obtained on two tosses of the coin, and we shall calculate the
posterior probability $\Pr(B_1 \mid H_1 \cap H_2)$ that the coin is fair after we have observed the
event $H_1 \cap H_2$. The assumption that the tosses are conditionally independent given
$B_1$ means that $\Pr(H_1 \cap H_2 \mid B_1) = 1/2 \times 1/2 = 1/4$. By Bayes’ theorem,

$$\Pr(B_1 \mid H_1 \cap H_2) = \frac{\Pr(B_1)\Pr(H_1 \cap H_2 \mid B_1)}{\Pr(B_1)\Pr(H_1 \cap H_2 \mid B_1) + \Pr(B_2)\Pr(H_1 \cap H_2 \mid B_2)} = \frac{(1/2)(1/4)}{(1/2)(1/4) + (1/2)(1)} = \frac{1}{5}. \qquad (2.3.4)$$

The second way of determining this same posterior probability is to use the
conditional version of Bayes’ theorem (2.3.2) given the event $H_1$. Given $H_1$, the
conditional probability of $B_1$ is 1/3, and the conditional probability of $B_2$ is therefore
2/3. These conditional probabilities can now serve as the prior probabilities for the
next stage of the experiment, in which the coin is tossed a second time. Thus, we
can apply (2.3.2) with $C = H_1$, $\Pr(B_1 \mid H_1) = 1/3$, and $\Pr(B_2 \mid H_1) = 2/3$. We can then
compute the posterior probability $\Pr(B_1 \mid H_1 \cap H_2)$ that the coin is fair after we have
observed a head on the second toss and a head on the first toss. We shall need
$\Pr(H_2 \mid B_1 \cap H_1)$, which equals $\Pr(H_2 \mid B_1) = 1/2$ by Theorem 2.2.4 since $H_1$ and $H_2$ are
conditionally independent given $B_1$. Since the coin is two-headed when $B_2$ occurs,
$\Pr(H_2 \mid B_2 \cap H_1) = 1$. So we obtain

$$\Pr(B_1 \mid H_1 \cap H_2) = \frac{\Pr(B_1 \mid H_1)\Pr(H_2 \mid B_1 \cap H_1)}{\Pr(B_1 \mid H_1)\Pr(H_2 \mid B_1 \cap H_1) + \Pr(B_2 \mid H_1)\Pr(H_2 \mid B_2 \cap H_1)} = \frac{(1/3)(1/2)}{(1/3)(1/2) + (2/3)(1)} = \frac{1}{5}. \qquad (2.3.5)$$

The posterior probability of the event $B_1$ obtained in the second way is the same
as that obtained in the first way. We can make the following general statement: If an
experiment is carried out in more than one stage, then the posterior probability of
every event can also be calculated in more than one stage. After each stage has been
carried out, the posterior probability calculated for the event after that stage serves
as the prior probability for the next stage. The reader should look back at (2.3.2)
to see that this interpretation is precisely what the conditional version of Bayes’
theorem says. The example we have been doing with coin tossing is typical of many
applications of Bayes’ theorem and its conditional version because we are assuming
that the observable events are conditionally independent given each element of the
partition $B_1, \ldots, B_k$ (in this case, k = 2). The conditional independence makes the
probability of $H_i$ (head on ith toss) given $B_1$ (or given $B_2$) the same whether or not
we also condition on earlier tosses (see Theorem 2.2.4).
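
The agreement between the one-stage and two-stage calculations for the coin can be
checked in a few lines. This is a minimal sketch under the stated assumption that
the tosses are conditionally independent given the type of coin; the helper name
`update` is hypothetical.

```python
from fractions import Fraction


def update(prior, likelihood):
    """One application of Bayes' theorem over a finite partition."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


prior = [Fraction(1, 2), Fraction(1, 2)]  # Pr(fair), Pr(two-headed)
head = [Fraction(1, 2), Fraction(1)]      # Pr(head | fair), Pr(head | two-headed)

# First way: treat "heads on both tosses" as a single observation.
# Conditional independence gives Pr(H1 and H2 | B) = Pr(H1 | B) * Pr(H2 | B).
one_shot = update(prior, [h * h for h in head])

# Second way: update after each toss, reusing the posterior as the next prior.
after_first = update(prior, head)
after_second = update(after_first, head)

print(after_first[0], one_shot[0], after_second[0])  # 1/3 1/5 1/5
```

Reusing `head` as the likelihood in the second update is exactly where conditional
independence enters: the probability of a head on the second toss, given the type of
coin, does not depend on the first toss.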
Conditionally Independent Events


The calculations that led to (2.3.3) and (2.3.5) together with Example 2.2.10 illustrate
simple cases of a very powerful statistical model for observable events. It is very
common to encounter a sequence of events that we believe are similar in that they
all have the same probability of occurring. It is also common that the order in which
the events are labeled does not affect the probabilities that we assign. However,
we often believe that these events are not independent, because, if we were to
observe some of them, we would change our minds about the probability of the
ones we had not observed depending on how many of the observed events occur.
For example, in the coin-tossing calculation leading up to Eq. (2.3.3), before any
tosses occur, the probability of $H_2$ is the same as the probability of $H_1$, namely, the
denominator of (2.3.3), 3/4, as Theorem 2.1.4 says. However, after observing that
the event $H_1$ occurs, the probability of $H_2$ is $\Pr(H_2 \mid H_1)$, which is the denominator of
(2.3.5), 5/6, as computed by the conditional version of the law of total probability
(2.1.5). Even though we might treat the coin tosses as independent conditional
on the coin being fair, and we might treat them as independent conditional on
the coin being two-headed (in which case we know what will happen every time
anyway), we cannot treat them as independent without the conditioning information.
The conditioning information removes an important source of uncertainty from
the problem, so we partition the sample space accordingly. Now we can use the
conditional independence of the tosses to calculate joint probabilities of various
combinations of events conditionally on the partition events. Finally, we can combine
these probabilities using Theorem 2.1.4 and (2.1.5). Two more examples will help to
illustrate these ideas.


Example 2.3.6

Learning about a Proportion. In Example 2.2.10 on page 72, a machine produced
defective parts in one of two proportions, p = 0.01 or p = 0.4. Suppose that the prior
probability that p = 0.01 is 0.9. After sampling six parts at random, suppose that we
observe two defectives. What is the posterior probability that p = 0.01?

Let $B_1 = \{p = 0.01\}$ and $B_2 = \{p = 0.4\}$ as in Example 2.2.10. Let A be the event
that two defectives occur in a random sample of size six. The prior probability of
$B_1$ is 0.9, and the prior probability of $B_2$ is 0.1. We already computed $\Pr(A \mid B_1) =
1.44 \times 10^{-3}$ and $\Pr(A \mid B_2) = 0.311$ in Example 2.2.10. Bayes’ theorem tells us that

$$\Pr(B_1 \mid A) = \frac{0.9 \times 1.44 \times 10^{-3}}{0.9 \times 1.44 \times 10^{-3} + 0.1 \times 0.311} = 0.04.$$

Even though we thought originally that $B_1$ had probability as high as 0.9, after we
learned that there were two defective items in a sample as small as six, we changed
our minds dramatically and now we believe that $B_1$ has probability as small as 0.04.
The reason for this major change is that the event A that occurred has much higher
probability if $B_2$ is true than if $B_1$ is true.
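
To see how strongly the posterior probability of normal operation depends on the
number of defectives observed, the calculation can be repeated for several counts.
The sketch below assumes the same prior (0.9 for p = 0.01) and the same sample
size of six; the function names `likelihood` and `posterior_normal` are hypothetical.

```python
from math import comb


def likelihood(k, n, p):
    """Pr(exactly k defectives in n items | proportion p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior_normal(k, n=6, prior_normal=0.9):
    """Posterior probability that p = 0.01 after seeing k defectives in n items."""
    joint_normal = prior_normal * likelihood(k, n, 0.01)
    joint_maintenance = (1 - prior_normal) * likelihood(k, n, 0.4)
    return joint_normal / (joint_normal + joint_maintenance)


for k in range(4):
    print(k, round(posterior_normal(k), 4))
```

With no defectives the posterior probability of p = 0.01 rises above its prior value
of 0.9, while two defectives are already enough to reduce it to about 0.04.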


Example 2.3.7

A Clinical Trial. Consider the same clinical trial described in Examples 2.1.12 and
2.1.13. Let $E_i$ be the event that the ith patient has success as her outcome. Recall
that $B_j$ is the event that $p = (j - 1)/10$ for $j = 1, \ldots, 11$, where p is the proportion
of successes among all possible patients. If we knew which $B_j$ occurred, we would
say that $E_1, E_2, \ldots$ were independent. That is, we are willing to model the patients
as conditionally independent given each event $B_j$, and we set $\Pr(E_i \mid B_j) = (j - 1)/10$
for all i, j. We shall still assume that $\Pr(B_j) = 1/11$ for all j prior to the start of the
trial. We are now in position to express what we learn about p by computing posterior
probabilities for the $B_j$ events after each patient finishes the trial.

For example, consider the first patient. We calculated $\Pr(E_1) = 1/2$ in (2.1.6). If
$E_1$ occurs, we apply Bayes’ theorem to get

$$\Pr(B_j \mid E_1) = \frac{\Pr(E_1 \mid B_j)\Pr(B_j)}{1/2} = \frac{2(j - 1)}{10 \times 11} = \frac{j - 1}{55}. \qquad (2.3.6)$$

After observing one success, the posterior probabilities of large values of p are higher
than their prior probabilities and the posterior probabilities of low values of p are
lower than their prior probabilities as we would expect. For example, $\Pr(B_1 \mid E_1) = 0$,
because p = 0 is ruled out after one success. Also, $\Pr(B_2 \mid E_1) = 0.0182$, which is much
smaller than its prior value 0.0909, and $\Pr(B_{11} \mid E_1) = 0.1818$, which is larger than its
prior value 0.0909.
Figure 2.3 The posterior probabilities of partition 
elements after 40 patients in Example 2.3.7. 


We could check how the posterior probabilities behave after each patient is 
observed. However, we shall skip ahead to the point at which all 40 patients in the 
imipramine column of Table 2.1 have been observed. Let $A$ stand for the observed 
event that 22 of them are successes and 18 are failures. We can use the same reasoning 
as in Example 2.2.5 to compute $\Pr(A|B_j)$. There are $\binom{40}{22}$ possible sequences of 40 
patients with 22 successes, and, conditional on $B_j$, the probability of each sequence 
is $([j-1]/10)^{22}(1 - [j-1]/10)^{18}$. 

So, 

\[
\Pr(A|B_j) = \binom{40}{22}([j-1]/10)^{22}(1 - [j-1]/10)^{18}, \tag{2.3.7}
\]

for each $j$. Then Bayes’ theorem tells us that 

\[
\Pr(B_j|A) = \frac{\frac{1}{11}\binom{40}{22}([j-1]/10)^{22}(1 - [j-1]/10)^{18}}{\sum_{i=1}^{11}\frac{1}{11}\binom{40}{22}([i-1]/10)^{22}(1 - [i-1]/10)^{18}}.
\]

Figure 2.3 shows the posterior probabilities of the 11 partition elements after observing 
$A$. Notice that the probabilities of $B_6$ and $B_7$ are the highest, 0.42. This corresponds 
to the fact that the proportion of successes in the observed sample is 22/40 = 0.55, 
halfway between $(6 - 1)/10$ and $(7 - 1)/10$. 

We can also compute the probability that the next patient will be a success both 
before the trial and after the 40 patients. Before the trial, $\Pr(E_{41}) = \Pr(E_1)$, which 
equals 1/2, as computed in (2.1.6). After observing the 40 patients, we can compute 
$\Pr(E_{41}|A)$ using the conditional version of the law of total probability, (2.1.5): 

\[
\Pr(E_{41}|A) = \sum_{j=1}^{11} \Pr(E_{41}|B_j \cap A)\Pr(B_j|A). \tag{2.3.8}
\]

Using the values of $\Pr(B_j|A)$ in Fig. 2.3 and the fact that $\Pr(E_{41}|B_j \cap A) = \Pr(E_{41}|B_j) 
= (j - 1)/10$ (conditional independence of the $E_i$ given the $B_j$), we compute (2.3.8) 
to be 0.5476. This is also very close to the observed frequency of success. 


The calculation at the end of Example 2.3.7 is typical of what happens after ob- 
serving many conditionally independent events with the same conditional probability 
of occurrence. The conditional probability of the next event given those that were 
observed tends to be close to the observed frequency of occurrence among the ob- 
served events. Indeed, when there is substantial data, the choice of prior probabilities 
becomes far less important. 
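
The posterior and predictive calculations of Example 2.3.7 can be reproduced numerically. The sketch below assumes the uniform prior Pr(B_j) = 1/11 and the observed event of 22 successes and 18 failures; the names `posterior`, `p_values`, and `predictive` are illustrative and not from the text.

```python
from math import comb


def posterior(priors, likelihoods):
    """Return Pr(B_j | A) for every j via Bayes' theorem."""
    joint = [pr * lik for pr, lik in zip(priors, likelihoods)]
    total = sum(joint)  # Pr(A), by the law of total probability
    return [x / total for x in joint]


# Value of p under each of B_1, ..., B_11.
p_values = [(j - 1) / 10 for j in range(1, 12)]
uniform_prior = [1 / 11] * 11

# Pr(A | B_j) as in (2.3.7): 22 successes and 18 failures among 40 patients.
likelihoods = [comb(40, 22) * p**22 * (1 - p) ** 18 for p in p_values]

post = posterior(uniform_prior, likelihoods)

# Pr(E_41 | A) as in (2.3.8), using Pr(E_41 | B_j and A) = (j - 1)/10.
predictive = sum(p * q for p, q in zip(p_values, post))

print([round(q, 4) for q in post])
print(round(predictive, 4))
```

The two largest posterior probabilities should belong to the partition elements with p = 0.5 and p = 0.6, and the predictive probability should agree with the value 0.5476 reported in the example.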
Figure 2.4 The posterior probabilities of partition 
elements after 40 patients in Example 2.3.8. The X 
characters mark the values of the posterior probabilities 
calculated in Example 2.3.7. 


The Effect of Prior Probabilities. Consider the same clinical trial as in Example 2.3.7. 
This time, suppose that a different researcher has a different prior opinion about the 
value of p, the probability of success. This researcher believes the following prior 
probabilities: 


Event         B_1   B_2   B_3   B_4   B_5   B_6   B_7   B_8   B_9   B_10  B_11
p             0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
Prior prob.   0.00  0.19  0.19  0.17  0.14  0.11  0.09  0.06  0.04  0.01  0.00


We can recalculate the posterior probabilities using Bayes’ theorem, and we get 
the values pictured in Fig. 2.4. To aid comparison, the posterior probabilities from 
Example 2.3.7 are also plotted in Fig. 2.4 using the symbol X. One can see how 
close the two sets of posterior probabilities are despite the large differences between 
the prior probabilities. If there had been fewer patients observed, there would have 
been larger differences between the two sets of posterior probabilites because the 
observed events would have provided less information. (See Exercise 12 in this 
section.) 
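
The comparison of the two priors can be sketched in Python as follows. The function name `posterior_after` and its parameters are assumptions made for illustration; the second prior is the one tabulated in this example, and the data are the 22 successes and 18 failures of Example 2.3.7.

```python
from math import comb


def posterior_after(prior, successes, failures):
    """Posterior over B_1, ..., B_11 given counts of successes and failures."""
    n = successes + failures
    p_values = [(j - 1) / 10 for j in range(1, 12)]
    # Pr(B_j) * Pr(A | B_j), with Pr(A | B_j) of the same form as (2.3.7).
    joint = [
        pr * comb(n, successes) * p**successes * (1 - p) ** failures
        for pr, p in zip(prior, p_values)
    ]
    total = sum(joint)
    return [x / total for x in joint]


uniform_prior = [1 / 11] * 11  # prior of Example 2.3.7
second_prior = [0.00, 0.19, 0.19, 0.17, 0.14, 0.11,
                0.09, 0.06, 0.04, 0.01, 0.00]  # prior of Example 2.3.8

post_uniform = posterior_after(uniform_prior, 22, 18)
post_second = posterior_after(second_prior, 22, 18)

for j, (u, s) in enumerate(zip(post_uniform, post_second), start=1):
    print(f"B_{j}: {u:.4f}  {s:.4f}")

# Largest disagreement between the two posteriors.
max_gap = max(abs(u - s) for u, s in zip(post_uniform, post_second))
print(round(max_gap, 4))
```

Changing the counts passed to `posterior_after` shows how the disagreement between the two posteriors depends on the amount of data.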


Summary 


Bayes’ theorem tells us how to compute the conditional probability of each event in a 
partition given an observed event A. A major use of partitions is to divide the sample 
space into small enough pieces so that a collection of events of interest become 
conditionally independent given each event in the partition. 


Exercises 


1. Suppose that $k$ events $B_1, \ldots, B_k$ form a partition of 
the sample space $S$. For $i = 1, \ldots, k$, let $\Pr(B_i)$ denote the 
prior probability of $B_i$. Also, for each event $A$ such that 
$\Pr(A) > 0$, let $\Pr(B_i|A)$ denote the posterior probability 
of $B_i$ given that the event $A$ has occurred. Prove that if 
$\Pr(B_1|A) < \Pr(B_1)$, then $\Pr(B_i|A) > \Pr(B_i)$ for at least one 
value of $i$ ($i = 2, \ldots, k$). 


2. Consider again the conditions of Example 2.3.4 in this 
section, in which an item was selected at random from 
a batch of manufactured items and was found to be defective. 
For which values of $i$ ($i = 1, 2, 3$) is the posterior 
probability that the item was produced by machine $M_i$ 
larger than the prior probability that the item was produced 
by machine $M_i$? 


3. Suppose that in Example 2.3.4 in this section, the item 
selected at random from the entire lot is found to be nondefective. 
Determine the posterior probability that it was 
produced by machine $M_2$. 


4. A new test has been devised for detecting a particular 
type of cancer. If the test is applied to a person who has this 
type of cancer, the probability that the person will have a 
positive reaction is 0.95 and the probability that the person 
will have a negative reaction is 0.05. If the test is applied to 
a person who does not have this type of cancer, the prob- 
ability that the person will have a positive reaction is 0.05 
and the probability that the person will have a negative re- 
action is 0.95. Suppose that in the general population, one 
person out of every 100,000 people has this type of can- 
cer. If a person selected at random has a positive reaction 
to the test, what is the probability that he has this type of 
cancer? 


5. In a certain city, 30 percent of the people are Conser- 
vatives, 50 percent are Liberals, and 20 percent are Inde- 
pendents. Records show that in a particular election, 65 
percent of the Conservatives voted, 82 percent of the Lib- 
erals voted, and 50 percent of the Independents voted. If 
a person in the city is selected at random and it is learned 
that she did not vote in the last election, what is the prob- 
ability that she is a Liberal? 


6. Suppose that when a machine is adjusted properly, 50 
percent of the items produced by it are of high quality 
and the other 50 percent are of medium quality. Suppose, 
however, that the machine is improperly adjusted during 
10 percent of the time and that, under these conditions, 25 
percent of the items produced by it are of high quality and 
75 percent are of medium quality. 


a. Suppose that five items produced by the machine at 
a certain time are selected at random and inspected. 
If four of these items are of high quality and one item 
is of medium quality, what is the probability that the 
machine was adjusted properly at that time? 


b. Suppose that one additional item, which was pro- 
duced by the machine at the same time as the other 
five items, is selected and found to be of medium 
quality. What is the new posterior probability that 
the machine was adjusted properly? 


7. Suppose that a box contains five coins and that for 
each coin there is a different probability that a head will 
be obtained when the coin is tossed. Let $p_i$ denote the 
probability of a head when the $i$th coin is tossed ($i = 1, \ldots, 5$), 
and suppose that $p_1 = 0$, $p_2 = 1/4$, $p_3 = 1/2$, 
$p_4 = 3/4$, and $p_5 = 1$. 


a. Suppose that one coin is selected at random from the 
box and when it is tossed once, a head is obtained. 
What is the posterior probability that the ith coin was 
selected (i =1,..., 5)? 

b. If the same coin were tossed again, what would be 
the probability of obtaining another head? 


c. If a tail had been obtained on the first toss of the 
selected coin and the same coin were tossed again, 
what would be the probability of obtaining a head 
on the second toss? 


8. Consider again the box containing the five different 
coins described in Exercise 7. Suppose that one coin is 
selected at random from the box and is tossed repeatedly 
until a head is obtained. 


a. If the first head is obtained on the fourth toss, what 
is the posterior probability that the ith coin was se- 
lected (i =1,...,5)? 

b. If we continue to toss the same coin until another 


head is obtained, what is the probability that exactly 
three additional tosses will be required? 


9. Consider again the conditions of Exercise 14 in Sec. 2.1. 
Suppose that several parts will be observed and that the 
different parts are conditionally independent given each 
of the three states of repair of the machine. If seven parts 
are observed and exactly one is defective, compute the 
posterior probabilities of the three states of repair. 


10. Consider again the conditions of Example 2.3.5, in 
which the phenotype of an individual was observed and 
found to be the dominant trait. For which values of i 
($i = 1, \ldots, 6$) is the posterior probability that the parents 
have the genotypes of event $B_i$ smaller than the prior 
probability that the parents have the genotypes of event $B_i$? 


11. Suppose that in Example 2.3.5 the observed individual 
has the recessive trait. Determine the posterior probabil- 
ity that the parents have the genotypes of event By. 


12. In the clinical trial in Examples 2.3.7 and 2.3.8, sup- 
pose that we have only observed the first five patients and 
three of the five had been successes. Use the two different 
sets of prior probabilities from Examples 2.3.7 and 2.3.8 
to calculate two sets of posterior probabilities. Are these 
two sets of posterior probabilities as close to each other 
as were the two in Examples 2.3.7 and 2.3.8? Why or why 
not? 


13. Suppose that a box contains one fair coin and one coin 
with a head on each side. Suppose that a coin is drawn at 
random from this box and that we begin to flip the coin. 
In Eqs. (2.3.4) and (2.3.5), we computed the conditional 




probability that the coin was fair given that the first two 
flips both produce heads. 


a. Suppose that the coin is flipped a third time and 
another head is obtained. Compute the probability 
that the coin is fair given that all three flips produced 
heads. 


b. Suppose that the coin is flipped a fourth time and the 
result is tails. Compute the posterior probability that 
the coin is fair. 


14. Consider again the conditions of Exercise 23 in Sec. 
2.2. Assume that Pr(B) = 0.4. Let A be the event that ex- 
actly 8 out of 11 programs compiled. Compute the condi- 
tional probability of B given A. 


15. Use the prior probabilities in Example 2.3.8 for the 
events $B_1, \ldots, B_{11}$. Let $E_1$ be the event that the first patient 
is a success. Compute the probability of $E_1$ and explain 
why it is so much less than the value computed in 
Example 2.3.7. 


16. Consider a machine that produces items in sequence. 
Under normal operating conditions, the items are 


independent with probability 0.01 of being defective. 
However, it is possible for the machine to develop a 
“memory” in the following sense: After each defective 
item, and independent of anything that happened earlier, 
the probability that the next item is defective is 2/5. Af- 
ter each nondefective item, and independent of anything 
that happened earlier, the probability that the next item 
is defective is 1/165. 

Assume that the machine is either operating normally 
for the whole time we observe or has a memory for the 
whole time that we observe. Let B be the event that the 
machine is operating normally, and assume that Pr(B) = 
2/3. Let D; be the event that the ith item inspected is 
defective. Assume that $D_1$ is independent of $B$. 


a. Prove that $\Pr(D_i) = 0.01$ for all $i$. Hint: Use induction. 

b. Assume that we observe the first six items and the 
event that occurs is $E = D_1^c \cap D_2^c \cap D_3 \cap D_4 \cap D_5^c \cap D_6^c$. 
That is, the third and fourth items are defective, 
but the other four are not. Compute $\Pr(B|E)$. 


* 2.4 The Gambler’s Ruin Problem 


Consider two gamblers with finite resources who repeatedly play the same game 
against each other. Using the tools of conditional probability, we can calculate the 
probability that each of the gamblers will eventually lose all of his money to the 
opponent. 


Statement of the Problem 


Suppose that two gamblers A and B are playing a game against each other. Let p 
be a given number (0 < p <1), and suppose that on each play of the game, the 
probability that gambler A will win one dollar from gambler B is p and the probability 
that gambler B will win one dollar from gambler A is 1 — p. Suppose also that the 
initial fortune of gambler A is $i$ dollars and the initial fortune of gambler B is $k - i$ 
dollars, where $i$ and $k - i$ are given positive integers. Thus, the total fortune of the 
two gamblers is $k$ dollars. Finally, suppose that the gamblers play the game repeatedly 
and independently until the fortune of one of them has been reduced to 0 dollars. 
Another way to think about this problem is that B is a casino and A is a gambler who 
is determined to quit as soon as he wins $k - i$ dollars from the casino or when he goes 
broke, whichever comes first. 

We shall now consider this game from the point of view of gambler A. His initial 
fortune is $i$ dollars and on each play of the game his fortune will either increase by one 
dollar with a probability of p or decrease by one dollar with a probability of 1 — p. 
If p > 1/2, the game is favorable to him; if p < 1/2, the game is unfavorable to him; 
and if p = 1/2, the game is equally favorable to both gamblers. The game ends either 
when the fortune of gambler A reaches k dollars, in which case gambler B will have 
no money left, or when the fortune of gambler A reaches 0 dollars. The problem is to 
determine the probability that the fortune of gambler A will reach k dollars before 
it reaches 0 dollars. Because one of the gamblers will have no money left at the end 
of the game, this problem is called the Gambler's Ruin problem. 
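
Before any exact calculation, the game as stated can be explored by simulation. The following is a minimal Monte Carlo sketch under the assumptions in the statement of the problem (independent plays, one dollar per play); the function names, the number of trials, and the seed are illustrative choices, not part of the text.

```python
import random


def simulate_game(i, k, p, rng):
    """Play until A's fortune hits 0 or k; return True if it reaches k first."""
    fortune = i
    while 0 < fortune < k:
        # A wins one dollar with probability p, otherwise loses one dollar.
        fortune += 1 if rng.random() < p else -1
    return fortune == k


def estimate_win_probability(i, k, p, trials=20000, seed=0):
    """Estimate the probability that A's fortune reaches k before 0."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    wins = sum(simulate_game(i, k, p, rng) for _ in range(trials))
    return wins / trials


# A starts with 3 dollars out of a total of 10, in a fair game.
print(estimate_win_probability(3, 10, 0.5))
```

The estimate is subject to sampling error, so it should be read as approximate; increasing `trials` narrows the error at the cost of running time.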


Solution of the Problem 


We shall continue to assume that the total fortune of the gamblers A and B is $k$ dollars, 
and we shall let $a_i$ denote the probability that the fortune of gambler A will reach $k$ 
dollars before it reaches 0 dollars, given that his initial fortune is $i$ dollars. We assume 
that the game is the same each time it is played and the plays are independent of each 
other. It follows that, after each play, the Gambler’s Ruin problem essentially starts 
over with the only change being that the initial fortunes of the two gamblers have 
changed. In particular, for each $j = 0, \ldots, k$, each time that we observe a sequence 
of plays that lead to gambler A’s fortune being $j$ dollars, the conditional probability, 
given such a sequence, that gambler A wins is $a_j$. If gambler A’s fortune ever reaches 
0, then gambler A is ruined, hence $a_0 = 0$. Similarly, if his fortune ever reaches $k$, 
then gambler A has won, hence $a_k = 1$. We shall now determine the value of $a_i$ for 
$i = 1, \ldots, k - 1$. 

Let $A_1$ denote the event that gambler A wins one dollar on the first play of the 
game, let $B_1$ denote the event that gambler A loses one dollar on the first play of the 
game, and let $W$ denote the event that the fortune of gambler A ultimately reaches 
$k$ dollars before it reaches 0 dollars. Then 

\[
\begin{aligned}
\Pr(W) &= \Pr(A_1)\Pr(W|A_1) + \Pr(B_1)\Pr(W|B_1) \\
       &= p\Pr(W|A_1) + (1 - p)\Pr(W|B_1).
\end{aligned} \tag{2.4.1}
\]

Since the initial fortune of gambler A is $i$ dollars ($i = 1, \ldots, k - 1$), then $\Pr(W) = a_i$. 
Furthermore, if gambler A wins one dollar on the first play of the game, then his 
fortune becomes $i + 1$ dollars and the conditional probability $\Pr(W|A_1)$ that his 
fortune will ultimately reach $k$ dollars is therefore $a_{i+1}$. If A loses one dollar on the 
first play of the game, then his fortune becomes $i - 1$ dollars and the conditional 
probability $\Pr(W|B_1)$ that his fortune will ultimately reach $k$ dollars is therefore $a_{i-1}$. 
Hence, by Eq. (2.4.1), 

\[
a_i = p a_{i+1} + (1 - p) a_{i-1}. \tag{2.4.2}
\]


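We shall let $i = 1, \ldots, k - 1$ in Eq. (2.4.2). Then, since $a_0 = 0$ and $a_k = 1$, we 
obtain the following $k - 1$ equations: 

\[
\begin{aligned}
a_1 &= p a_2, \\
a_2 &= p a_3 + (1 - p) a_1, \\
a_3 &= p a_4 + (1 - p) a_2, \\
    &\;\;\vdots \\
a_{k-2} &= p a_{k-1} + (1 - p) a_{k-3}, \\
a_{k-1} &= p + (1 - p) a_{k-2}.
\end{aligned} \tag{2.4.3}
\]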


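If the value of $a_i$ on the left side of the $i$th equation is rewritten in the form $p a_i + 
(1 - p) a_i$ and some elementary algebra is performed, then these $k - 1$ equations can 
be rewritten as follows: 

\[
\begin{aligned}
a_2 - a_1 &= \frac{1-p}{p}\, a_1, \\
a_3 - a_2 &= \frac{1-p}{p}(a_2 - a_1) = \left(\frac{1-p}{p}\right)^{2} a_1, \\
a_4 - a_3 &= \frac{1-p}{p}(a_3 - a_2) = \left(\frac{1-p}{p}\right)^{3} a_1, \\
          &\;\;\vdots \\
a_{k-1} - a_{k-2} &= \frac{1-p}{p}(a_{k-2} - a_{k-3}) = \left(\frac{1-p}{p}\right)^{k-2} a_1, \\
1 - a_{k-1} &= \frac{1-p}{p}(a_{k-1} - a_{k-2}) = \left(\frac{1-p}{p}\right)^{k-1} a_1.
\end{aligned} \tag{2.4.4}
\]

By equating the sum of the left sides of these $k - 1$ equations with the sum of the 
right sides, we obtain the relation 

\[
1 - a_1 = a_1 \sum_{i=1}^{k-1} \left(\frac{1-p}{p}\right)^{i}. \tag{2.4.5}
\]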


Solution for a Fair Game   Suppose first that $p = 1/2$. Then $(1 - p)/p = 1$, and it 
follows from Eq. (2.4.5) that $1 - a_1 = (k - 1)a_1$, from which $a_1 = 1/k$. In turn, it follows 
from the first equation in (2.4.4) that $a_2 = 2/k$, it follows from the second equation in 
(2.4.4) that $a_3 = 3/k$, and so on. In this way, we obtain the following complete solution 
when $p = 1/2$: 

\[
a_i = \frac{i}{k} \quad \text{for } i = 1, \ldots, k - 1. \tag{2.4.6}
\]


The Probability of Winning in a Fair Game. Suppose that p = 1/2, in which case the 
game is equally favorable to both gamblers; and suppose that the initial fortune of 
gambler A is 98 dollars and the initial fortune of gambler B is just two dollars. In 
this example, i = 98 and k = 100. Therefore, it follows from Eq. (2.4.6) that there 
is a probability of 0.98 that gambler A will win two dollars from gambler B before 
gambler B wins 98 dollars from gambler A. 


Solution for an Unfair Game   Suppose now that $p \neq 1/2$. Then Eq. (2.4.5) can be 
rewritten in the form 

\[
1 - a_1 = a_1 \frac{\left(\frac{1-p}{p}\right)^{k} - \left(\frac{1-p}{p}\right)}{\left(\frac{1-p}{p}\right) - 1}. \tag{2.4.7}
\]

Hence, 

\[
a_1 = \frac{\left(\frac{1-p}{p}\right) - 1}{\left(\frac{1-p}{p}\right)^{k} - 1}. \tag{2.4.8}
\]

Each of the other values of $a_i$ for $i = 2, \ldots, k - 1$ can now be determined in turn 
from the equations in (2.4.4). In this way, we obtain the following complete solution: 

\[
a_i = \frac{\left(\frac{1-p}{p}\right)^{i} - 1}{\left(\frac{1-p}{p}\right)^{k} - 1} \quad \text{for } i = 1, \ldots, k - 1. \tag{2.4.9}
\]


The Probability of Winning in an Unfavorable Game. Suppose that $p = 0.4$, in which 
case the probability that gambler A will win one dollar on any given play is smaller 
than the probability that he will lose one dollar. Suppose also that the initial fortune 
of gambler A is 99 dollars and the initial fortune of gambler B is just one dollar. We 
shall determine the probability that gambler A will win one dollar from gambler B 
before gambler B wins 99 dollars from gambler A. 

In this example, the required probability $a_i$ is given by Eq. (2.4.9), in which 
$(1 - p)/p = 3/2$, $i = 99$, and $k = 100$. Therefore, 

\[
a_i = \frac{(3/2)^{99} - 1}{(3/2)^{100} - 1} \approx \frac{1}{3/2} = \frac{2}{3}.
\]

Hence, although the probability that gambler A will win one dollar on any given play 
is only 0.4, the probability that he will win one dollar before he loses 99 dollars is 
approximately 2/3. 
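
The closed-form solutions (2.4.6) and (2.4.9) translate directly into a short function. The sketch below uses the illustrative name `win_probability`; it evaluates the two worked examples and also checks numerically that the formula satisfies the recurrence (2.4.2) for one arbitrary choice of k and p.

```python
def win_probability(i, k, p):
    """Probability that A's fortune reaches k before 0, starting from i."""
    if p == 0.5:
        return i / k  # fair game, Eq. (2.4.6)
    r = (1 - p) / p
    return (r**i - 1) / (r**k - 1)  # unfair game, Eq. (2.4.9)


# Example 2.4.1: fair game, i = 98, k = 100.
print(win_probability(98, 100, 0.5))

# Example 2.4.2: unfavorable game with p = 0.4, i = 99, k = 100.
print(round(win_probability(99, 100, 0.4), 4))

# Check a_i = p * a_{i+1} + (1 - p) * a_{i-1} for k = 10 and p = 0.4.
k, p = 10, 0.4
residuals = [
    win_probability(i, k, p)
    - p * win_probability(i + 1, k, p)
    - (1 - p) * win_probability(i - 1, k, p)
    for i in range(1, k)
]
print(max(abs(x) for x in residuals))
```

The formula also returns 0 for i = 0 and 1 for i = k, matching the boundary values used in the derivation; the last printed number should be zero up to floating-point rounding.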


Summary 


We considered a gambler and an opponent who each start with finite amounts of 
money. The two then play a sequence of games against each other until one of them 
runs out of money. We were able to calculate the probability that each of them would 
be the first to run out as a function of the probability of winning the game and of how 
much money each has at the start. 


Exercises 


1. Consider the unfavorable game in Example 2.4.2. This 
time, suppose that the initial fortune of gambler A is i 
dollars with i < 98. Suppose that the initial fortune of 
gambler B is 100 —i dollars. Show that the probability 
is greater than 1/2 that gambler A loses i dollars before 
winning 100 — i dollars. 


2. Consider the following three different possible condi- 
tions in the gambler’s ruin problem: 
a. The initial fortune of gambler A is two dollars, and 
the initial fortune of gambler B is one dollar. 
b. The initial fortune of gambler A is 20 dollars, and the 
initial fortune of gambler B is 10 dollars. 


c. The initial fortune of gambler A is 200 dollars, and 
the initial fortune of gambler B is 100 dollars. 


Suppose that p = 1/2. For which of these three condi- 
tions is there the greatest probability that gambler A will 
win the initial fortune of gambler B before he loses his 
own initial fortune? 


3. Consider again the three different conditions (a), (b), 
and (c) given in Exercise 2, but suppose now that p < 1/2. 
For which of these three conditions is there the greatest 
probability that gambler A will win the initial fortune of 
gambler B before he loses his own initial fortune? 


4. Consider again the three different conditions (a), (b), 
and (c) given in Exercise 2, but suppose now that p > 1/2. 
For which of these three conditions is there the greatest 
probability that gambler A will win the initial fortune of 
gambler B before he loses his own initial fortune? 
5. Suppose that on each play of a certain game, a person is 
equally likely to win one dollar or lose one dollar. Suppose 
also that the person’s goal is to win two dollars by playing 
this game. How large an initial fortune must the person 
have in order for the probability to be at least 0.99 that she 
will achieve her goal before she loses her initial fortune? 


6. Suppose that on each play of a certain game, a person 
will either win one dollar with probability 2/3 or lose one 
dollar with probability 1/3. Suppose also that the person’s 
goal is to win two dollars by playing this game. How large 
an initial fortune must the person have in order for the 
probability to be at least 0.99 that he will achieve his goal 
before he loses his initial fortune? 


7. Suppose that on each play of a certain game, a person 
will either win one dollar with probability 1/3 or lose one 
dollar with probability 2/3. Suppose also that the person’s 
goal is to win two dollars by playing this game. Show that 
no matter how large the person’s initial fortune might be, 


the probability that she will achieve her goal before she 
loses her initial fortune is less than 1/4. 


8. Suppose that the probability of a head on any toss of 
a certain coin is p (0 < p <1), and suppose that the coin 
is tossed repeatedly. Let $X_n$ denote the total number of 
heads that have been obtained on the first $n$ tosses, and 
let $Y_n = n - X_n$ denote the total number of tails on the 
first $n$ tosses. Suppose that the tosses are stopped as soon 
as a number $n$ is reached such that either $X_n = Y_n + 3$ or 
$Y_n = X_n + 3$. Determine the probability that $X_n = Y_n + 3$ 
when the tosses are stopped. 


9. Suppose that a certain box A contains five balls and an- 
other box B contains 10 balls. One of these two boxes is 
selected at random, and one ball from the selected box is 
transferred to the other box. If this process of selecting a 
box at random and transferring one ball from that box to 
the other box is repeated indefinitely, what is the probabil- 
ity that box A will become empty before box B becomes 
empty? 


2.5 Supplementary Exercises 


1. Suppose that A, B, and D are any three events such that 
$\Pr(A|D) > \Pr(B|D)$ and $\Pr(A|D^c) > \Pr(B|D^c)$. Prove that 
$\Pr(A) > \Pr(B)$. 


2. Suppose that a fair coin is tossed repeatedly and inde- 
pendently until both a head and a tail have appeared at 
least once. (a) Describe the sample space of this experi- 
ment. (b) What is the probability that exactly three tosses 
will be required? 


3. Suppose that A and B are events such that Pr(A) = 
1/3, Pr(B) = 1/5, and Pr(A|B) + Pr(B|A) = 2/3. Evaluate 
$\Pr(A^c \cup B^c)$. 


4. Suppose that A and B are independent events such that 
Pr(A) = 1/3 and Pr(B) > 0. What is the value of 
$\Pr(A \cup B^c|B)$? 


5. Suppose that in 10 rolls of a balanced die, the number 6 
appeared exactly three times. What is the probability that 
the first three rolls each yielded the number 6? 


6. Suppose that A, B, and D are events such that A and 
B are independent, $\Pr(A \cap B \cap D) = 0.04$, $\Pr(D|A \cap B) = 0.25$, 
and $\Pr(B) = 4\Pr(A)$. Evaluate $\Pr(A \cup B)$. 


7. Suppose that the events A, B, and C are mutually independent. 
Under what conditions are $A^c$, $B^c$, and $C^c$ 
mutually independent? 


8. Suppose that the events A and B are disjoint and that 
each has positive probability. Are A and B independent? 


9. Suppose that A, B, and C are three events such that A 
and B are disjoint, A and C are independent, and B and 


C are independent. Suppose also that $4\Pr(A) = 2\Pr(B) = 
\Pr(C) > 0$ and $\Pr(A \cup B \cup C) = 5\Pr(A)$. Determine the 
value of $\Pr(A)$. 


10. Suppose that each of two dice is loaded so that when 
either die is rolled, the probability that the number k will 
appear is 0.1 for k = 1, 2, 5, or 6 and is 0.3 for k = 3 or 4. If 
the two loaded dice are rolled independently, what is the 
probability that the sum of the two numbers that appear 
will be 7? 


11. Suppose that there is a probability of 1/50 that you 
will win a certain game. If you play the game 50 times, 
independently, what is the probability that you will win at 
least once? 


12. Suppose that a balanced die is rolled three times, and 
let $X_i$ denote the number that appears on the $i$th roll 
($i = 1, 2, 3$). Evaluate $\Pr(X_1 > X_2 > X_3)$. 


13. Three students A, B, and C are enrolled in the same 
class. Suppose that A attends class 30 percent of the time, 
B attends class 50 percent of the time, and C attends 
class 80 percent of the time. If these students attend class 
independently of each other, what is (a) the probability 
that at least one of them will be in class on a particular 
day and (b) the probability that exactly one of them will 
be in class on a particular day? 


14. Consider the World Series of baseball, as described in 
Exercise 16 of Sec. 2.2. If there is probability p that team 
A will win any particular game, what is the probability 


that it will be necessary to play seven games in order to 
determine the winner of the Series? 


15. Suppose that three red balls and three white balls are 
thrown at random into three boxes and that all throws 
are independent. What is the probability that each box 
contains one red ball and one white ball? 


16. If five balls are thrown at random into n boxes, and all 
throws are independent, what is the probability that no 
box contains more than two balls? 


17. Bus tickets in a certain city contain four numbers, U, 
V, W, and X. Each of these numbers is equally likely to 
be any of the 10 digits 0, 1,..., 9, and the four numbers 
are chosen independently. A bus rider is said to be lucky if 
U+V=W +X. What proportion of the riders are lucky? 


18. A certain group has eight members. In January, three 
members are selected at random to serve on a commit- 
tee. In February, four members are selected at random 
and independently of the first selection to serve on an- 
other committee. In March, five members are selected at 
random and independently of the previous two selections 
to serve on a third committee. Determine the probability 
that each of the eight members serves on at least one of 
the three committees. 


19. For the conditions of Exercise 18, determine the prob- 
ability that two particular members A and B will serve 
together on at least one of the three committees. 


20. Suppose that two players A and B take turns rolling a 
pair of balanced dice and that the winner is the first player 
who obtains the sum of 7 on a given roll of the two dice. 
If A rolls first, what is the probability that B will win? 


21. Three players A, B, and C take turns tossing a fair 
coin. Suppose that A tosses the coin first, B tosses second, 
and C tosses third; and suppose that this cycle is repeated 
indefinitely until someone wins by being the first player 
to obtain a head. Determine the probability that each of 
three players will win. 


22. Suppose that a balanced die is rolled repeatedly until 
the same number appears on two successive rolls, and let 
X denote the number of rolls that are required. Determine 
the value of Pr(X = x), for x =2,3,.... 


23. Suppose that 80 percent of all statisticians are shy, 
whereas only 15 percent of all economists are shy. Suppose 
also that 90 percent of the people at a large gathering are 
economists and the other 10 percent are statisticians. If 
you meet a shy person at random at the gathering, what is 
the probability that the person is a statistician? 


24. Dreamboat cars are produced at three different fac- 
tories A, B, and C. Factory A produces 20 percent of the 
total output of Dreamboats, B produces 50 percent, and 
C produces 30 percent. However, 5 percent of the cars 
produced at A are lemons, 2 percent of those produced 
at B are lemons, and 10 percent of those produced at C 
are lemons. If you buy a Dreamboat and it turns out to be 
a lemon, what is the probability that it was produced at 
factory A? 


25. Suppose that 30 percent of the bottles produced in 
a certain plant are defective. If a bottle is defective, the 
probability is 0.9 that an inspector will notice it and re- 
move it from the filling line. If a bottle is not defective, 
the probability is 0.2 that the inspector will think that it is 
defective and remove it from the filling line. 


a. If a bottle is removed from the filling line, what is the 
probability that it is defective? 


b. If a customer buys a bottle that has not been removed 
from the filling line, what is the probability that it is 
defective? 


26. Suppose that a fair coin is tossed until a head is ob- 
tained and that this entire experiment is then performed 
independently a second time. What is the probability that 
the second experiment requires more tosses than the first 
experiment? 


27. Suppose that a family has exactly n children (n > 2). 
Assume that the probability that any child will be a girl 
is 1/2 and that all births are independent. Given that the 
family has at least one girl, determine the probability that 
the family has at least one boy. 


28. Suppose that a fair coin is tossed independently n 
times. Determine the probability of obtaining exactly n — 
1 heads, given (a) that at least n — 2 heads are obtained 
and (b) that heads are obtained on the first n — 2 tosses. 


29. Suppose that 13 cards are selected at random from a 
regular deck of 52 playing cards. 


a. If it is known that at least one ace has been selected, 
what is the probability that at least two aces have 
been selected? 


b. If it is known that the ace of hearts has been selected, 
what is the probability that at least two aces have 
been selected? 


30. Suppose that $n$ letters are placed at random in $n$ envelopes, 
as in the matching problem of Sec. 1.10, and let $q_n$ 
denote the probability that no letter is placed in the correct 
envelope. Show that the probability that exactly one 
letter is placed in the correct envelope is $q_{n-1}$. 


31. Consider again the conditions of Exercise 30. Show 
that the probability that exactly two letters are placed in 
the correct envelopes is $(1/2)q_{n-2}$. 


32. Consider again the conditions of Exercise 7 of Sec. 2.2. 
If exactly one of the two students A and B is in class on a 
given day, what is the probability that it is A? 


33. Consider again the conditions of Exercise 2 of Sec. 
1.10. If a family selected at random from the city 
subscribes to exactly one of the three newspapers A, B, 
and C, what is the probability that it is A? 


34. Three prisoners A, B, and C on death row know that 
exactly two of them are going to be executed, but they do 
not know which two. Prisoner A knows that the jailer will 
not tell him whether or not he is going to be executed. He 
therefore asks the jailer to tell him the name of one pris- 
oner other than A himself who will be executed. The jailer 
responds that B will be executed. Upon receiving this re- 
sponse, Prisoner A reasons as follows: Before he spoke to 
the jailer, the probability was 2/3 that he would be one of 
the two prisoners executed. After speaking to the jailer, 
he knows that either he or prisoner C will be the other 
one to be executed. Hence, the probability that he will be 
executed is now only 1/2. Thus, merely by asking the jailer 
his question, the prisoner reduced the probability that he 
would be executed from 2/3 to 1/2, because he could go 
through exactly this same reasoning regardless of which 
answer the jailer gave. Discuss what is wrong with prisoner 
A’s reasoning. 


35. Suppose that each of two gamblers A and B has an 
initial fortune of 50 dollars, and that there is probability 
p that gambler A will win on any single play of a game 
against gambler B. Also, suppose either that one gambler 
can win one dollar from the other on each play of the game 
or that they can double the stakes and one can win two 
dollars from the other on each play of the game. Under 
which of these two conditions does A have the greater 
probability of winning the initial fortune of B before losing 
her own for each of the following conditions: (a) p < 1/2; 
(b) p > 1/2; (c) p = 1/2? 


36. A sequence of n job candidates is prepared to interview 
for a job. We would like to hire the best candidate, 
but we have no information to distinguish the candidates 
before we interview them. We assume that the best candidate 
is equally likely to be each of the n candidates in the 
sequence before the interviews start. After the interviews 
start, we are able to rank those candidates we have seen, 
but we have no information about where the remaining 
candidates rank relative to those we have seen. After each 
interview, it is required that either we hire the current candidate 
immediately and stop the interviews, or we must let 
the current candidate go and we never can call them back. 
We choose to interview as follows: We select a number 
0 ≤ r < n and we interview the first r candidates without 
any intention of hiring them. Starting with the next candidate 
r + 1, we continue interviewing until the current 
candidate is the best we have seen so far. We then stop 
and hire the current candidate. If none of the candidates 
from r + 1 to n is the best, we just hire candidate n. We 
would like to compute the probability that we hire the best 
candidate and we would like to choose r to make this probability 
as large as possible. Let A be the event that we hire 
the best candidate, and let $B_i$ be the event that the best 
candidate is in position i in the sequence of interviews. 

a. Let i > r. Find the probability that the candidate who 
is relatively the best among the first i interviewed 
appears in the first r interviews. 

b. Prove that $\Pr(A \mid B_i) = 0$ for i ≤ r and 
$\Pr(A \mid B_i) = r/(i - 1)$ for i > r. 

c. For fixed r, let $p_r$ be the probability of A using that 
value of r. Prove that $p_r = (r/n) \sum_{i=r+1}^{n} (i - 1)^{-1}$. 

d. Let $q_r = p_r - p_{r-1}$ for r = 1, ..., n - 1, and prove 
that $q_r$ is a strictly decreasing function of r. 

e. Show that a value of r that maximizes $p_r$ is the last r 
such that $q_r > 0$. (Hint: Write $p_r = p_0 + q_1 + \cdots + q_r$ 
for r > 0.) 

f. For n = 10, find the value of r that maximizes $p_r$, and 
find the corresponding $p_r$ value. 


Chapter 3 

RANDOM VARIABLES 
AND DISTRIBUTIONS 

3.1 Random Variables and Discrete Distributions 
3.2 Continuous Distributions 
3.3 The Cumulative Distribution Function 
3.4 Bivariate Distributions 
3.5 Marginal Distributions 
3.6 Conditional Distributions 
3.7 Multivariate Distributions 
3.8 Functions of a Random Variable 
3.9 Functions of Two or More Random Variables 
3.10 Markov Chains 
3.11 Supplementary Exercises 


3.1 Random Variables and Discrete Distributions 


A random variable is a real-valued function defined on a sample space. Random 
variables are the main tools used for modeling unknown quantities in statistical 
analyses. For each random variable X and each set C of real numbers, we could 
calculate the probability that X takes its value in C. The collection of all of these 
probabilities is the distribution of X. There are two major classes of distributions 
and random variables: discrete (this section) and continuous (Sec. 3.2). Discrete 
distributions are those that assign positive probability to at most countably many 
different values. A discrete distribution can be characterized by its probability 
function (p.f.), which specifies the probability that the random variable takes each 
of the different possible values. A random variable with a discrete distribution will 
be called a discrete random variable. 


Definition of a Random Variable 


Example 3.1.1 Tossing a Coin. Consider an experiment in which a fair coin is tossed 
10 times. In this experiment, the sample space S can be regarded as the set of outcomes 
consisting of the $2^{10}$ different sequences of 10 heads and/or tails that are possible. 
We might be interested in the number of heads in the observed outcome. We can let X 
stand for the real-valued function defined on S that counts the number of heads in each 
outcome. For example, if s is the sequence HHTTTHTTTH, then X(s) = 4. For each possible 
sequence s consisting of 10 heads and/or tails, the value X(s) equals the number of 
heads in the sequence. The possible values for the function X are 0, 1, ..., 10. ◀ 


Definition 3.1.1 Random Variable. Let S be the sample space for an experiment. A 
real-valued function that is defined on S is called a random variable. 


For example, in Example 3.1.1, the number X of heads in the 10 tosses is a random 
variable. Another random variable in that example is Y = 10 - X, the number of 
tails. 


Example 3.1.2 Measuring a Person’s Height. Consider an experiment in which a person 
is selected at random from some population and her height in inches is measured. This 
height is a random variable. ◀ 


Example 3.1.3 Demands for Utilities. Consider the contractor in Example 1.5.4 on page 19 
who is concerned about the demands for water and electricity in a new office complex. The 
sample space was pictured in Fig. 1.5 on page 12, and it consists of a collection of 
points of the form (x, y), where x is the demand for water and y is the demand 
for electricity. That is, each point s ∈ S is a pair s = (x, y). One random variable 
that is of interest in this problem is the demand for water. This can be expressed 
as X(s) = x when s = (x, y). The possible values of X are the numbers in the interval 
[4, 200]. Another interesting random variable is Y, equal to the electricity demand, 
which can be expressed as Y(s) = y when s = (x, y). The possible values of Y are the 
numbers in the interval [1, 150]. A third possible random variable Z is an indicator of 
whether or not at least one demand is high. Let A and B be the two events described 
in Example 1.5.4. That is, A is the event that water demand is at least 100, and B is 
the event that electric demand is at least 115. Define 

\[ Z(s) = \begin{cases} 1 & \text{if } s \in A \cup B, \\ 0 & \text{if } s \notin A \cup B. \end{cases} \]

The possible values of Z are the numbers 0 and 1. The event A ∪ B is indicated in 
Fig. 3.1. ◀ 


Figure 3.1 The event that at least one utility demand is high in Example 3.1.3. 
(Axes: Water, horizontal; Electric, vertical.) 


The Distribution of a Random Variable 


When a probability measure has been specified on the sample space of an experiment, 
we can determine probabilities associated with the possible values of each random 
variable X. Let C be a subset of the real line such that {X ∈ C} is an event, and let 
Pr(X ∈ C) denote the probability that the value of X will belong to the subset C. 
Then Pr(X ∈ C) is equal to the probability that the outcome s of the experiment will 
be such that X(s) ∈ C. In symbols, 

\[ \Pr(X \in C) = \Pr(\{s : X(s) \in C\}). \tag{3.1.1} \]


Definition 3.1.2 Distribution. Let X be a random variable. The distribution of X is the 
collection of all probabilities of the form Pr(X ∈ C) for all sets C of real numbers 
such that {X ∈ C} is an event. 


It is a straightforward consequence of the definition of the distribution of X that 
this distribution is itself a probability measure on the set of real numbers. The set 
{X ∈ C} will be an event for every set C of real numbers that most readers will be 
able to imagine. 


Figure 3.2 The event that water demand is between 50 and 175 in Example 3.1.5. 
(Axes: Water, horizontal; Electric, vertical.) 


Example 3.1.4 Tossing a Coin. Consider again an experiment in which a fair coin is 
tossed 10 times, and let X be the number of heads that are obtained. In this experiment, 
the possible values of X are 0, 1, 2, ..., 10. For each x, Pr(X = x) is the sum of the 
probabilities of all of the outcomes in the event {X = x}. Because the coin is fair, 
each outcome has the same probability $1/2^{10}$, and we need only count how many 
outcomes s have X(s) = x. We know that X(s) = x if and only if exactly x of the 10 tosses 
are H. Hence, the number of outcomes s with X(s) = x is the same as the number of subsets 
of size x (to be the heads) that can be chosen from the 10 tosses, namely, 
$\binom{10}{x}$, according to Definitions 1.8.1 and 1.8.2. Hence, 

\[ \Pr(X = x) = \binom{10}{x} \frac{1}{2^{10}} \quad \text{for } x = 0, 1, 2, \ldots, 10. \]
◀ 
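
The counting argument in the coin-tossing example can be checked by brute force. The following Python sketch lists all $2^{10}$ sequences, counts heads in each, and compares the resulting probabilities with the closed form. The function name `heads_pf_by_enumeration` is introduced here only for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb


def heads_pf_by_enumeration(n_tosses):
    """Return {x: Pr(X = x)} by listing all 2**n_tosses equally likely sequences."""
    counts = [0] * (n_tosses + 1)
    for s in product("HT", repeat=n_tosses):
        counts[s.count("H")] += 1  # X(s) = number of heads in the sequence s
    total = 2 ** n_tosses
    return {x: Fraction(c, total) for x, c in enumerate(counts)}


pf = heads_pf_by_enumeration(10)

# The enumeration agrees with C(10, x) / 2**10 for every possible value x.
assert all(pf[x] == Fraction(comb(10, x), 2 ** 10) for x in range(11))
print(pf[4])  # Pr(X = 4) as an exact fraction
```

Exact fractions are used so that the comparison with the formula does not depend on floating-point rounding.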


Example 3.1.5 Demands for Utilities. In Example 1.5.4, we actually calculated some 
features of the distributions of the three random variables X, Y, and Z defined in 
Example 3.1.3. For example, the event A, defined as the event that water demand is at 
least 100, can be expressed as A = {X ≥ 100}, and Pr(A) = 0.5102. This means that 
Pr(X ≥ 100) = 0.5102. The distribution of X consists of all probabilities of the form 
Pr(X ∈ C) for all sets C such that {X ∈ C} is an event. These can all be calculated in 
a manner similar to the calculation of Pr(A) in Example 1.5.4. In particular, if C is a 
subinterval of the interval [4, 200], then 

\[ \Pr(X \in C) = \frac{(150 - 1) \times (\text{length of interval } C)}{29{,}204}. \tag{3.1.2} \]

For example, if C is the interval [50, 175], then its length is 125, and Pr(X ∈ C) = 
149 × 125/29,204 = 0.6378. The subset of the sample space whose probability was 
just calculated is drawn in Fig. 3.2. ◀ 
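
The arithmetic in Eq. (3.1.2) is easy to reproduce. The sketch below assumes only the constants that appear in the example (the height 150 - 1 and the total area 29,204). The function name `prob_water_in_interval` is hypothetical.

```python
def prob_water_in_interval(lo, hi):
    """Pr(lo <= X <= hi) from Eq. (3.1.2), for a subinterval [lo, hi] of [4, 200]."""
    if not (4 <= lo <= hi <= 200):
        raise ValueError("the interval must lie inside [4, 200]")
    return (150 - 1) * (hi - lo) / 29204


print(round(prob_water_in_interval(50, 175), 4))   # 0.6378
print(round(prob_water_in_interval(100, 200), 4))  # 0.5102
```

The second line reproduces Pr(A) for the event that water demand is at least 100.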


The general definition of distribution in Definition 3.1.2 is awkward, and it will 
be useful to find alternative ways to specify the distributions of random variables. In 
the remainder of this section, we shall introduce a few such alternatives. 


Discrete Distributions 


Discrete Distribution/Random Variable. We say that a random variable X has a discrete 
distribution or that X is a discrete random variable if X can take only a finite number 
k of different values $x_1, \ldots, x_k$ or, at most, an infinite sequence of different 
values $x_1, x_2, \ldots$. 


Random variables that can take every value in an interval are said to have continuous 
distributions and are discussed in Sec. 3.2. 


Figure 3.3 An example of a p.f. 


Probability Function/p.f./Support. If a random variable X has a discrete distribution, 
the probability function (abbreviated p.f.) of X is defined as the function f such that 
for every real number x, 


f(x) =Pr(X =x). 
The closure of the set {x : f(x) > 0} is called the support of (the distribution of) X. 


Some authors refer to the probability function as the probability mass function, or 
p.m.f. We will not use that term again in this text. 


Example 3.1.6 Demands for Utilities. The random variable Z in Example 3.1.3 equals 1 if 
at least one of the utility demands is high, and Z = 0 if neither demand is high. Since 
Z takes only two different values, it has a discrete distribution. Note that 
{s : Z(s) = 1} = A ∪ B, where A and B are defined in Example 1.5.4. We calculated 
Pr(A ∪ B) = 0.65253 in Example 1.5.4. If Z has p.f. f, then 

\[ f(z) = \begin{cases} 0.65253 & \text{if } z = 1, \\ 0.34747 & \text{if } z = 0, \\ 0 & \text{otherwise.} \end{cases} \]

The support of Z is the set {0, 1}, which has only two elements. ◀ 


Example 3.1.7 Tossing a Coin. The random variable X in Example 3.1.4 has only 11 
different possible values. Its p.f. f is given at the end of that example for the 
values x = 0, ..., 10 that constitute the support of X; f(x) = 0 for all other values 
of x. ◀ 


Here are some simple facts about probability functions. 

Theorem 3.1.1 Let X be a discrete random variable with p.f. f. If x is not one of the 
possible values of X, then f(x) = 0. Also, if the sequence $x_1, x_2, \ldots$ includes 
all the possible values of X, then $\sum_{i=1}^{\infty} f(x_i) = 1$. ■ 

A typical p.f. is sketched in Fig. 3.3, in which each vertical segment represents 
the value of f(x) corresponding to a possible value x. The sum of the heights of the 
vertical segments in Fig. 3.3 must be 1. 

Theorem 3.1.2 shows that the p.f. of a discrete random variable characterizes its 
distribution, and it allows us to dispense with the general definition of distribution 
when we are discussing discrete random variables. 


Theorem 3.1.2 If X has a discrete distribution, the probability of each subset C of the 
real line can be determined from the relation 

\[ \Pr(X \in C) = \sum_{x_i \in C} f(x_i). \]
■ 
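
As an illustration of how a p.f. determines the probability of an arbitrary set, the following Python sketch sums f over the possible values that fall in C. It uses the p.f. of the number of heads in 10 fair tosses. Representing C by a predicate, and the helper name `prob_in_set`, are choices made here for illustration only.

```python
from math import comb, isclose


def prob_in_set(pf, in_C):
    """Pr(X in C) for a discrete X: sum f(x) over the possible values x lying in C.

    pf   -- dict mapping each possible value x to f(x)
    in_C -- function returning True when a real number belongs to C
    """
    return sum(p for x, p in pf.items() if in_C(x))


# p.f. of the number of heads in 10 tosses of a fair coin
f = {x: comb(10, x) / 2 ** 10 for x in range(11)}

assert isclose(sum(f.values()), 1.0)  # the values of a p.f. sum to 1

p_even = prob_in_set(f, lambda x: x % 2 == 0)          # C = even integers
p_between = prob_in_set(f, lambda x: 3.5 <= x <= 6.2)  # only x = 4, 5, 6 contribute
```

The second call shows that C need not consist of possible values of X: points of C where f is 0 simply contribute nothing to the sum.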


Some random variables have distributions that appear so frequently that the 
distributions are given names. The random variable Z in Example 3.1.6 is one such. 


Bernoulli Distribution/Random Variable. A random variable Z that takes only two 
values 0 and 1 with Pr(Z = 1) = p has the Bernoulli distribution with parameter p. 
We also say that Z is a Bernoulli random variable with parameter p. 


The Z in Example 3.1.6 has the Bernoulli distribution with parameter 0.65253. It 
is easy to see that the name of each Bernoulli distribution is enough to allow us to 
compute the p.f., which, in turn, allows us to characterize its distribution. 

We conclude this section with illustrations of two additional families of discrete 
distributions that arise often enough to have names. 


Uniform Distributions on Integers 


Example 3.1.8 Daily Numbers. A popular state lottery game requires participants to 
select a three-digit number (leading 0s allowed). Then three balls, each with one digit, 
are chosen at random from well-mixed bowls. The sample space here consists of all 
triples $(i_1, i_2, i_3)$ where $i_j \in \{0, \ldots, 9\}$ for j = 1, 2, 3. If 
$s = (i_1, i_2, i_3)$, define $X(s) = 100 i_1 + 10 i_2 + i_3$. For example, 
X(0, 1, 5) = 15. It is easy to check that Pr(X = x) = 0.001 for each 
integer x ∈ {0, 1, ..., 999}. ◀ 
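
The claim that every three-digit number has probability 0.001 can be verified by listing the sample space. The sketch below assumes, as the phrase "well-mixed bowls" suggests, that all 1000 triples are equally likely. The variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = list(product(range(10), repeat=3))  # sample space: all triples of digits


def X(s):
    """The daily number determined by the triple s = (i1, i2, i3)."""
    return 100 * s[0] + 10 * s[1] + s[2]


counts = Counter(X(s) for s in outcomes)
pf = {x: Fraction(c, len(outcomes)) for x, c in counts.items()}

# Each value 0, ..., 999 arises from exactly one triple.
assert all(pf[x] == Fraction(1, 1000) for x in range(1000))
```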


Uniform Distribution on Integers. Let a < b be integers. Suppose that the value of a 
random variable X is equally likely to be each of the integers a, ..., b. Then we say 
that X has the uniform distribution on the integers a, ..., b. 


The X in Example 3.1.8 has the uniform distribution on the integers 0, 1,..., 999. 
A uniform distribution on a set of k integers has probability 1/k on each integer. 
If b > a, there are b - a + 1 integers from a to b including a and b. The next result 
follows immediately from what we have just seen, and it illustrates how the name of 
the distribution characterizes the distribution. 


Theorem 3.1.3 If X has the uniform distribution on the integers a, ..., b, the p.f. of X is 

\[ f(x) = \begin{cases} \dfrac{1}{b - a + 1} & \text{for } x = a, \ldots, b, \\ 0 & \text{otherwise.} \end{cases} \]
■ 


The uniform distribution on the integers a, ..., b represents the outcome of an 
experiment that is often described by saying that one of the integers a, ..., b is chosen 
at random. In this context, the phrase “at random” means that each of the b - a + 1 
integers is equally likely to be chosen. In this same sense, it is not possible to choose 
an integer at random from the set of all positive integers, because it is not possible 
to assign the same probability to every one of the positive integers and still make the 
sum of these probabilities equal to 1. In other words, a uniform distribution cannot 
be assigned to an infinite sequence of possible values, but such a distribution can be 
assigned to any finite sequence. 
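
The impossibility claim can be checked with a short calculation. The following is a sketch in which c denotes a hypothetical common value of the p.f., a symbol introduced here only for illustration. If every positive integer had the same probability c, then 

\[ \sum_{x=1}^{\infty} f(x) = \sum_{x=1}^{\infty} c = \begin{cases} 0 & \text{if } c = 0, \\ \infty & \text{if } c > 0, \end{cases} \]

so the sum can never equal 1. For a finite set of k possible values, the same reasoning gives kc = 1, that is, c = 1/k. 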


Note: Random Variables Can Have the Same Distribution without Being the 
Same Random Variable. Consider two consecutive daily number draws as in Example 3.1.8. 
The sample space consists of all 6-tuples $(i_1, \ldots, i_6)$, where the first 
three coordinates are the numbers drawn on the first day and the last three are the 
numbers drawn on the second day (all in the order drawn). If $s = (i_1, \ldots, i_6)$, let 
$X_1(s) = 100 i_1 + 10 i_2 + i_3$ and let $X_2(s) = 100 i_4 + 10 i_5 + i_6$. It is easy 
to see that $X_1$ and $X_2$ are different functions of s and are not the same random 
variable. Indeed, there is only a small probability that they will take the same value. 
But they have the same distribution because they assume the same values with the same 
probabilities. If a businessman has 1000 customers numbered 0, ..., 999, and he selects 
one at random and records the number Y, the distribution of Y will be the same as the 
distribution of $X_1$ and of $X_2$, but Y is not like $X_1$ or $X_2$ in any other way. 
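
The distinction between "same distribution" and "same random variable" can be made concrete by enumeration. The sketch below assumes that all $10^6$ 6-tuples are equally likely. It tabulates the distributions of the two daily numbers and counts how often they coincide. The variable names are illustrative.

```python
from collections import Counter
from itertools import product

dist_x1, dist_x2 = Counter(), Counter()
n_equal = 0
n_total = 0

for s in product(range(10), repeat=6):      # all 10**6 equally likely outcomes
    x1 = 100 * s[0] + 10 * s[1] + s[2]      # first day's number
    x2 = 100 * s[3] + 10 * s[4] + s[5]      # second day's number
    dist_x1[x1] += 1
    dist_x2[x2] += 1
    n_equal += (x1 == x2)
    n_total += 1

same_distribution = dist_x1 == dist_x2      # identical counts for every value
prob_equal = n_equal / n_total              # Pr(X1 = X2)
```

Under the stated assumption, the two tables of counts are identical, while the two numbers agree on only 1000 of the 1,000,000 outcomes.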


Binomial Distributions 


Example 3.1.9 Defective Parts. Consider again Example 2.2.5 from page 69. In that 
example, a machine produces a defective item with probability p (0 < p < 1) and produces 
a nondefective item with probability 1 - p. We assumed that the events that the different 
items were defective were mutually independent. Suppose that the experiment consists 
of examining n of these items. Each outcome of this experiment will consist of 
a list of which items are defective and which are not, in the order examined. For 
example, we can let 0 stand for a nondefective item and 1 stand for a defective item. 
Then each outcome is a string of n digits, each of which is 0 or 1. To be specific, if, 
say, n = 6, then some of the possible outcomes are 


010010, 100100, 000011, 110000, 100001, 000000, etc. (3.1.3) 


We will let X denote the number of these items that are defective. Then the random 
variable X will have a discrete distribution, and the possible values of X will be 
0, 1, 2, ..., n. For example, the first four outcomes listed in Eq. (3.1.3) all have 
X(s) = 2. The last outcome listed has X(s) = 0. ◀ 


Example 3.1.9 is a generalization of Example 2.2.5 with n items inspected rather 
than just six, and rewritten in the notation of random variables. For x = 0, 1, ..., n, 
the probability of obtaining each particular ordered sequence of n items containing 
exactly x defectives and n - x nondefectives is $p^x (1 - p)^{n-x}$, just as it was in 
Example 2.2.5. Since there are $\binom{n}{x}$ different ordered sequences of this type, 
it follows that 

\[ \Pr(X = x) = \binom{n}{x} p^x (1 - p)^{n-x}. \]

Therefore, the p.f. of X will be as follows: 

\[ f(x) = \begin{cases} \dbinom{n}{x} p^x (1 - p)^{n-x} & \text{for } x = 0, 1, \ldots, n, \\ 0 & \text{otherwise.} \end{cases} \tag{3.1.4} \]


Binomial Distribution/Random Variable. The discrete distribution represented by the 
p.f. in (3.1.4) is called the binomial distribution with parameters n and p. A random 
variable with this distribution is said to be a binomial random variable with 
parameters n and p. 


The reader should be able to verify that the random variable X in Example 3.1.4, 
the number of heads in a sequence of 10 independent tosses of a fair coin, has the 
binomial distribution with parameters 10 and 1/2. 

Since the name of each binomial distribution is sufficient to construct its p.f., it 
follows that the name is enough to identify the distribution. The name of each distri- 
bution includes the two parameters. The binomial distributions are very important in 
probability and statistics and will be discussed further in later chapters of this book. 

A short table of values of certain binomial distributions is given at the end 
of this book. It can be found from this table, for example, that if X has the 
binomial distribution with parameters n = 10 and p = 0.2, then Pr(X = 5) = 0.0264 and 
Pr(X ≥ 5) = 0.0328. 
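
Values like those quoted from the table can be recomputed directly from the binomial p.f. The following Python sketch evaluates the p.f. for n = 10 and p = 0.2 and sums the upper tail starting at 5. The function name `binomial_pf` is introduced here for illustration.

```python
from math import comb


def binomial_pf(x, n, p):
    """p.f. of the binomial distribution with parameters n and p, evaluated at x."""
    if isinstance(x, int) and 0 <= x <= n:
        return comb(n, x) * p ** x * (1 - p) ** (n - x)
    return 0.0  # f(x) = 0 for every x that is not a possible value


n, p = 10, 0.2
pr_eq_5 = binomial_pf(5, n, p)
pr_ge_5 = sum(binomial_pf(x, n, p) for x in range(5, n + 1))  # Pr(X >= 5)

print(round(pr_eq_5, 4), round(pr_ge_5, 4))  # 0.0264 0.0328
```

The tail sum that rounds to 0.0328 includes the term for x = 5.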

As another example, suppose that a clinical trial is being run. Suppose that the 
probability that a patient recovers from her symptoms during the trial is p and that 
the probability is 1 - p that the patient does not recover. Let Y denote the number of 
patients who recover out of n independent patients in the trial. Then the distribution 
of Y is also binomial with parameters n and p. Indeed, consider a general experiment 
that consists of observing n independent repetitions (trials) with only two possible 
results for each trial. For convenience, call the two possible results “success” and 
“failure.” Then the distribution of the number of trials that result in success will be 
binomial with parameters n and p, where p is the probability of success on each trial. 


Note: Names of Distributions. In this section, we gave names to several families 
of distributions. The name of each distribution includes any numerical parameters 
that are part of the definition. For example, the random variable X in Example 3.1.4 
has the binomial distribution with parameters 10 and 1/2. It is a correct statement to 
say that X has a binomial distribution or that X has a discrete distribution, but such 
statements are only partial descriptions of the distribution of X. Such statements 
are not sufficient to name the distribution of X, and hence they are not sufficient as 
answers to the question “What is the distribution of X?” The same considerations 
apply to all of the named distributions that we introduce elsewhere in the book. When 
attempting to specify the distribution of a random variable by giving its name, one 
must give the full name, including the values of any parameters. Only the full name 
is sufficient for determining the distribution. 


Summary 


A random variable is a real-valued function defined on a sample space. The 
distribution of a random variable X is the collection of all probabilities Pr(X ∈ C) for 
all subsets C of the real numbers such that {X ∈ C} is an event. A random variable X is 
discrete if there are at most countably many possible values for X. In this case, the 
distribution of X can be characterized by the probability function (p.f.) of X, namely, 
f(x) = Pr(X = x) for x in the set of possible values. Some distributions are so famous 
that they have names. One collection of such named distributions is the collection of 
uniform distributions on finite sets of integers. A more famous collection is the 
collection of binomial distributions whose parameters are n and p, where n is a positive 
integer and 0 < p < 1, having p.f. (3.1.4). The binomial distribution with parameters 
n = 1 and p is also called the Bernoulli distribution with parameter p. The names of 
these distributions also characterize the distributions. 

Exercises 


1. Suppose that a random variable X has the uniform dis- 
tribution on the integers 10, ..., 20. Find the probability 
that X is even. 


2. Suppose that a random variable X has a discrete 
distribution with the following p.f.: 

\[ f(x) = \begin{cases} cx & \text{for } x = 1, \ldots, 5, \\ 0 & \text{otherwise.} \end{cases} \]

Determine the value of the constant c. 


3. Suppose that two balanced dice are rolled, and let X 
denote the absolute value of the difference between the 
two numbers that appear. Determine and sketch the p.f. 
of X. 


4. Suppose that a fair coin is tossed 10 times indepen- 
dently. Determine the p.f. of the number of heads that will 
be obtained. 


5. Suppose that a box contains seven red balls and three 
blue balls. If five balls are selected at random, without 
replacement, determine the p.f. of the number of red balls 
that will be obtained. 


6. Suppose that a random variable X has the binomial 
distribution with parameters n = 15 and p = 0.5. Find 
Pr(X < 6). 


7. Suppose that a random variable X has the binomial 
distribution with parameters n = 8 and p = 0.7. Find 
Pr(X > 5) by using the table given at the end of this book. Hint: 
Use the fact that Pr(X > 5) = Pr(Y < 3), where Y has the 
binomial distribution with parameters n = 8 and p = 0.3. 


8. If 10 percent of the balls in a certain box are red, and 
if 20 balls are selected from the box at random, with re- 
placement, what is the probability that more than three 
red balls will be obtained? 


9. Suppose that a random variable X has a discrete 
distribution with the following p.f.: 

\[ f(x) = \begin{cases} \dfrac{c}{2^x} & \text{for } x = 0, 1, 2, \ldots, \\ 0 & \text{otherwise.} \end{cases} \]

Find the value of the constant c. 


10. A civil engineer is studying a left-turn lane that is 
long enough to hold seven cars. Let X be the number 
of cars in the lane at the end of a randomly chosen red 
light. The engineer believes that the probability that X = x 
is proportional to (x + 1)(8 - x) for x = 0, ..., 7 (the 
possible values of X). 


a. Find the p.f. of X. 
b. Find the probability that X will be at least 5. 


11. Show that there does not exist any number c such that 
the following function would be a p.f.: 

\[ f(x) = \begin{cases} \dfrac{c}{x} & \text{for } x = 1, 2, \ldots, \\ 0 & \text{otherwise.} \end{cases} \]


3.2 Continuous Distributions 


Next, we focus on random variables that can assume every value in an interval 
(bounded or unbounded). If a random variable X has associated with it a function 
f such that the integral of f over each interval gives the probability that X is in the 
interval, then we call f the probability density function (p.d.f.) of X and we say 
that X has a continuous distribution. 


The Probability Density Function 


Example 3.2.1 Demands for Utilities. In Example 3.1.5, we determined the distribution 
of the demand for water, X. From Fig. 3.2, we see that the smallest possible value of 
X is 4 and the largest is 200. For each interval $C = [c_0, c_1] \subset [4, 200]$, 
Eq. (3.1.2) says that 

\[ \Pr(c_0 \le X \le c_1) = \frac{149(c_1 - c_0)}{29204} = \frac{c_1 - c_0}{196} = \int_{c_0}^{c_1} \frac{1}{196}\,dx. \]

So, if we define 

\[ f(x) = \begin{cases} \dfrac{1}{196} & \text{if } 4 \le x \le 200, \\ 0 & \text{otherwise,} \end{cases} \tag{3.2.1} \]

we have that 

\[ \Pr(c_0 \le X \le c_1) = \int_{c_0}^{c_1} f(x)\,dx. \tag{3.2.2} \]

Because we defined f(x) to be 0 for x outside of the interval [4, 200], we see that Eq. 
(3.2.2) holds for all $c_0 \le c_1$, even if $c_0 = -\infty$ and/or $c_1 = \infty$. ◀ 
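
To see Eq. (3.2.2) at work numerically, the following Python sketch integrates the p.d.f. of the water demand with a simple midpoint rule and compares the result with the closed form $(c_1 - c_0)/196$. The midpoint rule, the step count, and the function names are choices made here for illustration, and only finite endpoints are handled.

```python
def f(x):
    """p.d.f. (3.2.1) of the water demand X."""
    return 1 / 196 if 4 <= x <= 200 else 0.0


def prob_interval(c0, c1, steps=100_000):
    """Approximate the integral of f over [c0, c1] using the midpoint rule."""
    width = (c1 - c0) / steps
    return sum(f(c0 + (k + 0.5) * width) for k in range(steps)) * width


approx = prob_interval(50, 175)
exact = (175 - 50) / 196
print(round(approx, 4), round(exact, 4))  # both round to 0.6378
```

Because f is 0 outside [4, 200], integrating over a wider interval such as [-100, 300] gives approximately 1, in line with the remark that Eq. (3.2.2) holds for all endpoints.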


The water demand X in Example 3.2.1 is an example of the following. 


Definition 3.2.1 Continuous Distribution/Random Variable. We say that a random variable 
X has a continuous distribution or that X is a continuous random variable if there 
exists a nonnegative function f, defined on the real line, such that for every interval 
of real numbers (bounded or unbounded), the probability that X takes a value in the 
interval is the integral of f over the interval. 


For example, in the situation described in Definition 3.2.1, for each bounded closed 
interval [a, b], 

\[ \Pr(a \le X \le b) = \int_a^b f(x)\,dx. \tag{3.2.3} \]

Similarly, $\Pr(X \ge a) = \int_a^{\infty} f(x)\,dx$ and 
$\Pr(X \le b) = \int_{-\infty}^{b} f(x)\,dx$. 

We see that the function f characterizes the distribution of a continuous ran- 
dom variable in much the same way that the probability function characterizes the 
distribution of a discrete random variable. For this reason, the function f plays an 
important role, and hence we give it a name. 


Definition 3.2.2 Probability Density Function/p.d.f./Support. If X has a continuous 
distribution, the function f described in Definition 3.2.1 is called the probability 
density function (abbreviated p.d.f.) of X. The closure of the set {x : f(x) > 0} is 
called the support of (the distribution of) X. 


Example 3.2.1 demonstrates that the water demand X has p.d.f. given by (3.2.1). 
Every p.d.f. f must satisfy the following two requirements: 


\[ f(x) \ge 0 \quad \text{for all } x, \tag{3.2.4} \]

and 

\[ \int_{-\infty}^{\infty} f(x)\,dx = 1. \tag{3.2.5} \]


A typical p.d.f. is sketched in Fig. 3.4. In that figure, the total area under the curve 
must be 1, and the value of Pr(a ≤ X ≤ b) is equal to the area of the shaded region. 


Figure 3.4 An example of a p.d.f. 


Note: Continuous Distributions Assign Probability 0 to Individual Values. The 
integral in Eq. (3.2.3) also equals Pr(a < X ≤ b) as well as Pr(a < X < b) and 
Pr(a ≤ X < b). Hence, it follows from the definition of continuous distributions that, 
if X has a continuous distribution, Pr(X = a) = 0 for each number a. As we noted on 
page 20, the fact that Pr(X = a) = 0 does not imply that X = a is impossible. If it did, 
all values of X would be impossible and X couldn’t assume any value. What happens 
is that the probability in the distribution of X is spread so thinly that we can only see 
it on sets like nondegenerate intervals. It is much the same as the fact that lines have 
0 area in two dimensions, but that does not mean that lines are not there. The two 
vertical lines indicated under the curve in Fig. 3.4 have 0 area, and this signifies that 
Pr(X = a) = Pr(X = b) = 0. However, for each ε > 0 and each a such that f(a) > 0, 
Pr(a - ε < X < a + ε) ≈ 2εf(a) > 0. 
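
The approximation at the end of the note can be motivated as follows. This is a sketch that assumes, in addition to what is stated above, that f is continuous at a and that ε is small: 

\[ \Pr(a - \epsilon < X < a + \epsilon) = \int_{a-\epsilon}^{a+\epsilon} f(x)\,dx \approx f(a) \int_{a-\epsilon}^{a+\epsilon} dx = 2\epsilon f(a). \]

For the p.d.f. in Eq. (3.2.1), the approximation is exact whenever the interval (a - ε, a + ε) lies inside [4, 200], since the integral is then 2ε/196. 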


Nonuniqueness of the p.d.f. 


If a random variable X has a continuous distribution, then Pr(X = x) = 0 for every 
individual value x. Because of this property, the values of each p.d.f. can be changed 
at a finite number of points, or even at certain infinite sequences of points, without 
changing the value of the integral of the p.d.f. over any subset A. In other words, 
the values of the p.d.f. of a random variable X can be changed arbitrarily at many 
points without affecting any probabilities involving X, that is, without affecting the 
probability distribution of X. At exactly which sets of points we can change a p.d.f. 
depends on subtle features of the definition of the Riemann integral. We shall not 
deal with this issue in this text, and we shall only contemplate changes to p.d.f.’s at 
finitely many points. 

To the extent just described, the p.d.f. of a random variable is not unique. In many 
problems, however, there will be one version of the p.d.f. that is more natural than 
any other because for this version the p.d.f. will, wherever possible, be continuous on 
the real line. For example, the p.d.f. sketched in Fig. 3.4 is a continuous function over 
the entire real line. This p.d.f. could be changed arbitrarily at a few points without 
affecting the probability distribution that it represents, but these changes would 
introduce discontinuities into the p.d.f. without introducing any apparent advantages. 

Throughout most of this book, we shall adopt the following practice: If a random 
variable X has a continuous distribution, we shall give only one version of the p.d.f. 
of X and we shall refer to that version as the p.d.f. of X, just as though it had been 
uniquely determined. It should be remembered, however, that there is some freedom 
in the selection of the particular version of the p.d.f. that is used to represent each 
continuous distribution. The most common place where such freedom will arise is 
in cases like Eq. (3.2.1) where the p.d.f. is required to have discontinuities. Without 
making the function f any less continuous, we could have defined the p.d.f. in that 
example so that f(4) = f(200) = 0 instead of f(4) = f (200) = 1/196. Both of these 
choices lead to the same calculations of all probabilities associated with X, and they 
are both equally valid. Because the support of a continuous distribution is the closure 
of the set where the p.d.f. is strictly positive, it can be shown that the support is unique. 
A sensible approach would then be to choose the version of the p.d.f. that was strictly 
positive on the support whenever possible. 

The reader should note that “continuous distribution” is not the name of a 
distribution, just as “discrete distribution” is not the name of a distribution. There are 
many distributions that are discrete and many that are continuous. Some distributions 
of each type have names that we either have introduced or will introduce later. 

We shall now present several examples of continuous distributions and their 
p.d.f.’s. 


Uniform Distributions on Intervals 


Example 3.2.2. Temperature Forecasts. Television weather forecasters announce high and low 
temperature forecasts as integer numbers of degrees. These forecasts, however, are the 
results of very sophisticated weather models that provide more precise forecasts that 
the television personalities round to the nearest integer for simplicity. Suppose that 
the forecaster announces a high temperature of y. If we wanted to know what 
temperature X the weather models actually produced, it might be safe to assume that X 
was equally likely to be any number in the interval from y − 1/2 to y + 1/2. 


The distribution of X in Example 3.2.2 is a special case of the following. 


Definition 3.2.3. Uniform Distribution on an Interval. Let a and b be two given real numbers 
such that a < b. Let X be a random variable such that it is known that a ≤ X ≤ b and, for 
every subinterval of [a, b], the probability that X will belong to that subinterval is 
proportional to the length of that subinterval. We then say that the random variable 
X has the uniform distribution on the interval [a, b]. 


A random variable X with the uniform distribution on the interval [a, b] represents 
the outcome of an experiment that is often described by saying that a point is chosen 
at random from the interval [a, b]. In this context, the phrase “at random” means 
that the point is just as likely to be chosen from any particular part of the interval as 
from any other part of the same length. 


Theorem 3.2.1. Uniform Distribution p.d.f. If X has the uniform distribution on an interval 
[a, b], then the p.d.f. of X is 

\[
f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b, \\ 0 & \text{otherwise.} \end{cases} \tag{3.2.6}
\]


Proof X must take a value in the interval [a, b]. Hence, the p.d.f. f(x) of X must 
be 0 outside of [a, b]. Furthermore, since any particular subinterval of [a, b] having 
a given length is as likely to contain X as is any other subinterval having the same 
length, regardless of the location of the particular subinterval in [a, b], it follows that 
f(x) must be constant throughout [a, b], and that interval is then the support of the 
distribution. Also, 

\[
\int_{-\infty}^{\infty} f(x)\,dx = \int_a^b f(x)\,dx = 1. \tag{3.2.7}
\]

Therefore, the constant value of f(x) throughout [a, b] must be 1/(b − a), and the 
p.d.f. of X must be (3.2.6). 
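The two facts used in the proof can be checked numerically. The following is a minimal sketch, not part of the original text: the helper names `uniform_pdf` and `integrate` are hypothetical, the midpoint rule is just one convenient way to approximate an integral, and the interval [4, 200] is borrowed from Example 3.2.1.

```python
def uniform_pdf(x, a, b):
    """p.d.f. (3.2.6): 1/(b - a) on [a, b] and 0 elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def integrate(f, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


a, b = 4.0, 200.0  # interval from Example 3.2.1 (illustrative choice)

# The p.d.f. should integrate to 1 over its support, as in Eq. (3.2.7).
total = integrate(lambda x: uniform_pdf(x, a, b), a, b)

# Two subintervals of equal length 49 should receive equal probability 49/196.
p_first = integrate(lambda x: uniform_pdf(x, a, b), 10.0, 59.0)
p_second = integrate(lambda x: uniform_pdf(x, a, b), 120.0, 169.0)

print(total, p_first, p_second)
```

Both subintervals have length 49, so each should receive probability 49/196 = 1/4 regardless of where it sits inside [4, 200].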
Figure 3.5 The p.d.f. for the uniform distribution on the interval [a, b]. 


The p.d.f. (3.2.6) is sketched in Fig. 3.5. As an example, the random variable X (demand 
for water) in Example 3.2.1 has the uniform distribution on the interval [4, 200]. 


Note: Density Is Not Probability. The reader should note that the p.d.f. in (3.2.6) can 
be greater than 1, particularly if b — a < 1. Indeed, p.d.f.’s can be unbounded, as we 
shall see in Example 3.2.6. The p.d.f. of X, f(x), itself does not equal the probability 
that X is near x. The integral of f over values near x gives the probability that X is 
near x, and the integral is never greater than 1. 

It is seen from Eq. (3.2.6) that the p.d.f. representing a uniform distribution on 
a given interval is constant over that interval, and the constant value of the p.d.f. 
is the reciprocal of the length of the interval. It is not possible to define a uniform 
distribution over an unbounded interval, because the length of such an interval is 
infinite. 

Consider again the uniform distribution on the interval [a, b]. Since the proba- 
bility is 0 that one of the endpoints a or b will be chosen, it is irrelevant whether the 
distribution is regarded as a uniform distribution on the closed interval a ≤ x ≤ b, or 
as a uniform distribution on the open interval a < x < b, or as a uniform distribution 
on the half-open and half-closed interval (a, b] in which one endpoint is included and 
the other endpoint is excluded. 

For example, if a random variable X has the uniform distribution on the interval 
[−1, 4], then the p.d.f. of X is 

\[
f(x) = \begin{cases} 1/5 & \text{for } -1 \le x \le 4, \\ 0 & \text{otherwise.} \end{cases}
\]

Furthermore, 

\[
\Pr(0 \le X \le 2) = \int_0^2 f(x)\,dx = \frac{2}{5}.
\]

Notice that we defined the p.d.f. of X to be strictly positive on the closed interval 
[−1, 4] and 0 outside of this closed interval. It would have been just as sensible to 
define the p.d.f. to be strictly positive on the open interval (−1, 4) and 0 outside of this 
open interval. The probability distribution would be the same either way, including 
the calculation of Pr(0 ≤ X ≤ 2) that we just performed. After this, when there are 
several equally sensible choices for how to define a p.d.f., we will simply choose one 
of them without making any note of the other choices. 


Other Continuous Distributions 


Example 3.2.3. Incompletely Specified p.d.f. Suppose that the p.d.f. of a certain random 
variable X has the following form: 

\[
f(x) = \begin{cases} cx & \text{for } 0 < x < 4, \\ 0 & \text{otherwise,} \end{cases}
\]

where c is a given constant. We shall determine the value of c. 
For every p.d.f., it must be true that $\int_{-\infty}^{\infty} f(x)\,dx = 1$. Therefore, in this example, 

\[
\int_0^4 cx\,dx = 8c = 1.
\]

Hence, c = 1/8. 


Note: Calculating Normalizing Constants. The calculation in Example 3.2.3 illus- 
trates an important point that simplifies many statistical results. The p.d.f. of X was 
specified without explicitly giving the value of the constant c. However, we were able 
to figure out what was the value of c by using the fact that the integral of a p.d.f. must 
be 1. It will often happen, especially in Chapter 8 where we find sampling 
distributions of summaries of observed data, that we can determine the p.d.f. of a random 
variable except for a constant factor. That constant factor must be the unique value 
such that the integral of the p.d.f. is 1, even if we cannot calculate it directly. 


Example 3.2.4. Calculating Probabilities from a p.d.f. Suppose that the p.d.f. of X is as in 
Example 3.2.3, namely, 

\[
f(x) = \begin{cases} \dfrac{x}{8} & \text{for } 0 < x < 4, \\ 0 & \text{otherwise.} \end{cases}
\]

We shall now determine the values of Pr(1 ≤ X ≤ 2) and Pr(X > 2). Apply Eq. (3.2.3) 
to get 

\[
\Pr(1 \le X \le 2) = \int_1^2 \frac{1}{8}x\,dx = \frac{3}{16}
\]

and 

\[
\Pr(X > 2) = \int_2^4 \frac{1}{8}x\,dx = \frac{3}{4}.
\]
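The normalizing constant and the two probabilities can be reproduced with exact rational arithmetic. This is a small sketch under the assumption that the p.d.f. is proportional to x on (0, 4); the helper `integral_of_x` is a hypothetical name that simply encodes the antiderivative x²/2.

```python
from fractions import Fraction


def integral_of_x(lo, hi):
    """Exact integral of g(x) = x over [lo, hi], namely (hi^2 - lo^2) / 2."""
    lo, hi = Fraction(lo), Fraction(hi)
    return (hi * hi - lo * lo) / 2


# The constant c must make c * (integral of x over (0, 4)) equal to 1.
c = 1 / integral_of_x(0, 4)

# Probabilities are integrals of the normalized p.d.f. f(x) = c * x.
p_1_to_2 = c * integral_of_x(1, 2)
p_above_2 = c * integral_of_x(2, 4)

print(c, p_1_to_2, p_above_2)  # prints: 1/8 3/16 3/4
```

Using `Fraction` rather than floating point keeps the results in the same form as the hand calculation.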


Example 3.2.5. Unbounded Random Variables. It is often convenient and useful to represent a 
continuous distribution by a p.d.f. that is positive over an unbounded interval of the real 
line. For example, in a practical problem, the voltage X in a certain electrical system 
might be a random variable with a continuous distribution that can be approximately 
represented by the p.d.f. 

\[
f(x) = \begin{cases} 0 & \text{for } x \le 0, \\ \dfrac{1}{(1+x)^2} & \text{for } x > 0. \end{cases} \tag{3.2.8}
\]

It can be verified that the properties (3.2.4) and (3.2.5) required of all p.d.f.'s are 
satisfied by f(x). 

Even though the voltage X may actually be bounded in the real situation, the 
p.d.f. (3.2.8) may provide a good approximation for the distribution of X over its full 
range of values. For example, suppose that it is known that the maximum possible 
value of X is 1000, in which case Pr(X > 1000) = 0. When the p.d.f. (3.2.8) is used, 
we compute Pr(X > 1000) = 0.001. If (3.2.8) adequately represents the variability 
of X over the interval (0, 1000), then it may be more convenient to use the p.d.f. 
(3.2.8) than a p.d.f. that is similar to (3.2.8) for x ≤ 1000, except for a new normalizing 
constant, and is 0 for x > 1000. This can be especially true if we do not know for sure 
that the maximum voltage is only 1000. 
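The tail probability quoted for the voltage can be checked directly. The sketch below assumes the closed form 1 − 1/(1 + x) for the integral of (3.2.8) from 0 to x, which follows from elementary calculus; the function names and the midpoint-rule helper are hypothetical and serve only as a cross-check.

```python
def voltage_pdf(x):
    """p.d.f. (3.2.8): 1/(1 + x)^2 for x > 0 and 0 otherwise."""
    return 0.0 if x <= 0 else 1.0 / (1.0 + x) ** 2


def voltage_prob_at_most(x):
    """Pr(X <= x), using the antiderivative -1/(1 + y) of the p.d.f."""
    return 0.0 if x <= 0 else 1.0 - 1.0 / (1.0 + x)


def integrate(f, lo, hi, n=200_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


# Tail probability beyond the assumed physical maximum of 1000.
tail = 1.0 - voltage_prob_at_most(1000.0)

# Numerical cross-check of the closed form on (0, 1000).
numeric = integrate(voltage_pdf, 0.0, 1000.0)

print(tail, numeric, voltage_prob_at_most(1000.0))
```

The tail works out to 1/1001, which rounds to the 0.001 used in the example.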


Example 3.2.6. Unbounded p.d.f.'s. Since a value of a p.d.f. is a probability density, rather 
than a probability, such a value can be larger than 1. In fact, the values of the following 
p.d.f. are unbounded in the neighborhood of x = 0: 

\[
f(x) = \begin{cases} \dfrac{2}{3}x^{-1/3} & \text{for } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases} \tag{3.2.9}
\]

It can be verified that even though the p.d.f. (3.2.9) is unbounded, it satisfies the 
properties (3.2.4) and (3.2.5) required of a p.d.f. 


Mixed Distributions 


Most distributions that are encountered in practical problems are either discrete or 
continuous. We shall show, however, that it may sometimes be necessary to consider a 
distribution that is a mixture of a discrete distribution and a continuous distribution. 


Example 3.2.7. Truncated Voltage. Suppose that in the electrical system considered in 
Example 3.2.5, the voltage X is to be measured by a voltmeter that will record the actual 
value of X if X ≤ 3 but will simply record the value 3 if X > 3. If we let Y denote the value 
recorded by the voltmeter, then the distribution of Y can be derived as follows. 
First, Pr(Y = 3) = Pr(X ≥ 3) = 1/4. Since the single value Y = 3 has probability 
1/4, it follows that Pr(0 < Y < 3) = 3/4. Furthermore, since Y = X for 0 < X < 3, this 
probability 3/4 for Y is distributed over the interval (0, 3) according to the same p.d.f. 
(3.2.8) as that of X over the same interval. Thus, the distribution of Y is specified by 
the combination of a p.d.f. over the interval (0, 3) and a positive probability at the 
point Y = 3. 
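A short simulation makes the mixed nature of Y visible. This is only an illustrative sketch: it draws X by inverting the closed form Pr(X ≤ x) = 1 − 1/(1 + x) for x > 0, a sampling device that is an assumption of this sketch rather than something introduced in this section, and the function names are hypothetical.

```python
import random


def voltage_from_uniform(p):
    """Solve p = 1 - 1/(1 + x) for x, valid for 0 <= p < 1."""
    return p / (1.0 - p)


def simulate_recorded_voltage(n, seed=0):
    """Draw n voltages X and return the recorded values Y = min(X, 3)."""
    rng = random.Random(seed)
    return [min(voltage_from_uniform(rng.random()), 3.0) for _ in range(n)]


samples = simulate_recorded_voltage(200_000)

# The point mass at 3: should be close to Pr(X >= 3) = 1/4.
frac_at_3 = sum(1 for y in samples if y == 3.0) / len(samples)

# The continuous part: Pr(Y <= 1) = Pr(X <= 1) = 1/2.
frac_at_most_1 = sum(1 for y in samples if y <= 1.0) / len(samples)

print(frac_at_3, frac_at_most_1)
```

About a quarter of the simulated readings land exactly on 3, while the rest spread over (0, 3) as the p.d.f. (3.2.8) prescribes.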


Summary 


A continuous distribution is characterized by its probability density function (p.d.f.). 
A nonnegative function f is the p.d.f. of the distribution of X if, for every interval 
[a, b], Pr(a ≤ X ≤ b) = $\int_a^b f(x)\,dx$. Continuous random variables satisfy Pr(X = x) = 
0 for every value x. If the p.d.f. of a distribution is constant on an interval [a, b] and 
is 0 off the interval, we say that the distribution is uniform on the interval [a, b]. 


Exercises 


1. Let X be a random variable with the p.d.f. specified in 
Example 3.2.6. Compute Pr(X ≤ 8/27). 


2. Suppose that the p.d.f. of a random variable X is as 
follows: 

\[
f(x) = \begin{cases} \dfrac{4}{3}(1 - x^3) & \text{for } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases}
\]

Sketch this p.d.f. and determine the values of the following 
probabilities: a. Pr(X < 1/2) b. Pr(1/4 < X < 3/4) 
c. Pr(X > 1/3). 


3. Suppose that the p.d.f. of a random variable X is as 
follows: 

\[
f(x) = \begin{cases} \dfrac{1}{36}(9 - x^2) & \text{for } -3 \le x \le 3, \\ 0 & \text{otherwise.} \end{cases}
\]

Sketch this p.d.f. and determine the values of the following 
probabilities: a. Pr(X < 0) b. Pr(−1 ≤ X ≤ 1) 
c. Pr(X > 2). 


4. Suppose that the p.d.f. of a random variable X is as 
follows: 

\[
f(x) = \begin{cases} cx^2 & \text{for } 1 \le x \le 2, \\ 0 & \text{otherwise.} \end{cases}
\]

a. Find the value of the constant c and sketch the p.d.f. 
b. Find the value of Pr(X > 3/2). 


5. Suppose that the p.d.f. of a random variable X is as 
follows: 


pey=| # for 0<x <4, 


0 otherwise. 


a. Find the value of t such that Pr(X ≤ t) = 1/4. 
b. Find the value of t such that Pr(X ≥ t) = 1/2. 


6. Let X be a random variable for which the p.d.f. is as 
given in Exercise 5. After the value of X has been ob- 
served, let Y be the integer closest to X. Find the p.f. of 
the random variable Y. 


7. Suppose that a random variable X has the uniform 
distribution on the interval [—2, 8]. Find the p.d.f. of X and 
the value of Pr(0 < X <7). 


8. Suppose that the p.d.f. of a random variable X is as 
follows: 

\[
f(x) = \begin{cases} c e^{-2x} & \text{for } x > 0, \\ 0 & \text{otherwise.} \end{cases}
\]

a. Find the value of the constant c and sketch the p.d.f. 
b. Find the value of Pr(1 < X < 2). 


9. Show that there does not exist any number c such that 
the following function f(x) would be a p.d.f.: 


poy =| OF for x > 0, 

0 otherwise. 

10. Suppose that the p.d.f. of a random variable X is as 
follows: 

for0 <x <1, 


pon em 
0 


otherwise. 


a. Find the value of the constant c and sketch the p.d.f. 
b. Find the value of Pr(X < 1/2). 


11. Show that there does not exist any number c such that 
the following function f(x) would be a p.d.f.: 


£ for0<x <1, 
fx) =} * ; 

0 otherwise. 
12. In Example 3.1.3 on page 94, determine the distri- 
bution of the random variable Y, the electricity demand. 
Also, find Pr(Y < 50). 


13. An ice cream seller takes 20 gallons of ice cream in 
her truck each day. Let X stand for the number of gallons 
that she sells. The probability is 0.1 that X = 20. If she 
doesn’t sell all 20 gallons, the distribution of X follows a 
continuous distribution with a p.d.f. of the form 

\[
f(x) = \begin{cases} cx & \text{for } 0 < x < 20, \\ 0 & \text{otherwise,} \end{cases}
\]

where c is a constant that makes Pr(X < 20) = 0.9. Find the 
constant c so that Pr(X < 20) = 0.9 as described above. 


3.3. The Cumulative Distribution Function 


Although a discrete distribution is characterized by its p.f. and a continuous 
distribution is characterized by its p.d.f., every distribution has a common 
characterization through its (cumulative) distribution function (c.d.f.). The inverse of 
the c.d.f. is called the quantile function, and it is useful for indicating where the 
probability is located in a distribution. 
Example 3.3.1. Voltage. Consider again the voltage X from Example 3.2.5. The distribution 
of X is characterized by the p.d.f. in Eq. (3.2.8). An alternative characterization that is 
more directly related to probabilities associated with X is obtained from the following 
function: 

\[
\begin{aligned}
F(x) = \Pr(X \le x) = \int_{-\infty}^{x} f(y)\,dy
 &= \begin{cases} 0 & \text{for } x \le 0, \\ \displaystyle\int_0^x \frac{dy}{(1+y)^2} & \text{for } x > 0, \end{cases} \\
 &= \begin{cases} 0 & \text{for } x \le 0, \\ 1 - \dfrac{1}{1+x} & \text{for } x > 0. \end{cases}
\end{aligned} \tag{3.3.1}
\]

So, for example, Pr(X ≤ 3) = F(3) = 3/4. 


Definition and Basic Properties 


Definition 3.3.1. (Cumulative) Distribution Function. The distribution function or 
cumulative distribution function (abbreviated c.d.f.) F of a random variable X is the 
function 

\[
F(x) = \Pr(X \le x) \quad \text{for } -\infty < x < \infty. \tag{3.3.2}
\]


It should be emphasized that the cumulative distribution function is defined as above 
for every random variable X, regardless of whether the distribution of X is discrete, 
continuous, or mixed. For the continuous random variable in Example 3.3.1, the c.d.f. 
was calculated in Eq. (3.3.1). Here is a discrete example: 


Example 3.3.2. Bernoulli c.d.f. Let X have the Bernoulli distribution with parameter p 
defined in Definition 3.1.5. Then Pr(X = 0) = 1 − p and Pr(X = 1) = p. Let F be the c.d.f. 
of X. It is easy to see that F(x) = 0 for x < 0 because X ≥ 0 for sure. Similarly, F(x) = 1 
for x ≥ 1 because X ≤ 1 for sure. For 0 ≤ x < 1, Pr(X ≤ x) = Pr(X = 0) = 1 − p because 
0 is the only possible value of X that is in the interval (−∞, x]. In summary, 

\[
F(x) = \begin{cases} 0 & \text{for } x < 0, \\ 1 - p & \text{for } 0 \le x < 1, \\ 1 & \text{for } x \ge 1. \end{cases}
\]


We shall soon see (Theorem 3.3.2) that the c.d.f. allows calculation of all interval 
probabilities; hence, it characterizes the distribution of a random variable. It follows 
from Eq. (3.3.2) that the c.d.f. of each random variable X is a function F defined on 
the real line. The value of F at every point x must be a number F(x) in the interval 
[0, 1] because F(x) is the probability of the event {X ≤ x}. Furthermore, it follows 
from Eq. (3.3.2) that the c.d.f. of every random variable X must have the following 
three properties. 


Property 3.3.1. Nondecreasing. The function F(x) is nondecreasing as x increases; that is, 
if x_1 < x_2, then F(x_1) ≤ F(x_2). 


Proof If x_1 < x_2, then the event {X ≤ x_1} is a subset of the event {X ≤ x_2}. Hence, 
Pr{X ≤ x_1} ≤ Pr{X ≤ x_2} according to Theorem 1.5.4. 


An example of a c.d.f. is sketched in Fig. 3.6. It is shown in that figure that 0 ≤ 
F(x) ≤ 1 over the entire real line. Also, F(x) is always nondecreasing as x increases, 
although F(x) is constant over the interval x_1 ≤ x ≤ x_2 and for x ≥ x_4. 


Property 3.3.2. Limits at ±∞. $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$. 


Proof As in the proof of Property 3.3.1, note that {X ≤ x_1} ⊂ {X ≤ x_2} whenever x_1 < 
x_2. The fact that Pr(X ≤ x) approaches 0 as x → −∞ now follows from Exercise 13 in 
Section 1.10. Similarly, the fact that Pr(X ≤ x) approaches 1 as x → ∞ follows from 
Exercise 12 in Sec. 1.10. 


Figure 3.6 An example of a c.d.f. 


The limiting values specified in Property 3.3.2 are indicated in Fig. 3.6. In this 
figure, the value of F(x) actually becomes 1 at x = x_4 and then remains 1 for x > x_4. 
Hence, it may be concluded that Pr(X ≤ x_4) = 1 and Pr(X > x_4) = 0. On the other 
hand, according to the sketch in Fig. 3.6, the value of F(x) approaches 0 as x → −∞, 
but does not actually become 0 at any finite point x. Therefore, for every finite value 
of x, no matter how small, Pr(X ≤ x) > 0. 

A c.d.f. need not be continuous. In fact, the value of F(x) may jump at any 
finite or countable number of points. In Fig. 3.6, for instance, such jumps or points 
of discontinuity occur where x = x_1 and x = x_3. For each fixed value x, we shall let 
F(x^-) denote the limit of the values of F(y) as y approaches x from the left, that is, 
as y approaches x through values smaller than x. In symbols, 

\[
F(x^-) = \lim_{\substack{y \to x \\ y < x}} F(y).
\]

Similarly, we shall define F(x^+) as the limit of the values of F(y) as y approaches x 
from the right. Thus, 

\[
F(x^+) = \lim_{\substack{y \to x \\ y > x}} F(y).
\]

If the c.d.f. is continuous at a given point x, then F(x^-) = F(x^+) = F(x) at that point. 


Property 3.3.3. Continuity from the Right. A c.d.f. is always continuous from the right; 
that is, F(x) = F(x^+) at every point x. 


Proof Let y_1 > y_2 > ··· be a sequence of numbers that are decreasing such that 
$\lim_{n \to \infty} y_n = x$. Then the event {X ≤ x} is the intersection of all the events {X ≤ y_n} 
for n = 1, 2, .... Hence, by Exercise 13 of Sec. 1.10, 

\[
F(x) = \Pr(X \le x) = \lim_{n \to \infty} \Pr(X \le y_n) = F(x^+).
\]

It follows from Property 3.3.3 that at every point x at which a jump occurs, 

\[
F(x^+) = F(x) \quad \text{and} \quad F(x^-) < F(x).
\]


110 


Chapter 3 Random Variables and Distributions 


In Fig. 3.6 this property is illustrated by the fact that, at the points of discontinuity 
x = x_1 and x = x_3, the value of F(x_1) is taken as z_1, and the value of F(x_3) is taken as 
z_3. 


Determining Probabilities from the Distribution Function 


Example 3.3.3. Voltage. In Example 3.3.1, suppose that we want to know the probability 
that X lies in the interval [2, 4]. That is, we want Pr(2 ≤ X ≤ 4). The c.d.f. allows us to 
compute Pr(X ≤ 4) and Pr(X ≤ 2). These are related to the probability that we want as 
follows: Let A = {2 < X ≤ 4}, B = {X ≤ 2}, and C = {X ≤ 4}. Because X has a continuous 
distribution, Pr(A) is the same as the probability that we desire. We see that A ∪ B = 
C, and it is clear that A and B are disjoint. Hence, Pr(A) + Pr(B) = Pr(C). It follows 
that 

\[
\Pr(A) = \Pr(C) - \Pr(B) = F(4) - F(2) = \frac{4}{5} - \frac{2}{3} = \frac{2}{15}.
\]
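The subtraction F(4) − F(2) can be evaluated exactly from the c.d.f. in Eq. (3.3.1). The following is a minimal sketch with hypothetical function names; it uses Python's `fractions` module so that the result appears as a fraction rather than a rounded decimal.

```python
from fractions import Fraction


def voltage_cdf(x):
    """c.d.f. (3.3.1): F(x) = 0 for x <= 0 and 1 - 1/(1 + x) for x > 0."""
    x = Fraction(x)
    if x <= 0:
        return Fraction(0)
    return 1 - 1 / (1 + x)


def prob_half_open(cdf, x1, x2):
    """Pr(x1 < X <= x2) computed as F(x2) - F(x1), for x1 < x2."""
    return cdf(x2) - cdf(x1)


print(voltage_cdf(4), voltage_cdf(2), prob_half_open(voltage_cdf, 2, 4))
# prints: 4/5 2/3 2/15
```

Because X has a continuous distribution, the same value serves for Pr(2 ≤ X ≤ 4).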


The type of reasoning used in Example 3.3.3 can be extended to find the prob- 
ability that an arbitrary random variable X will lie in any specified interval of the 
real line from the c.d.f. We shall derive this probability for four different types of 
intervals. 


Theorem 3.3.1. For every value x, 

\[
\Pr(X > x) = 1 - F(x). \tag{3.3.3}
\]

Proof The events {X > x} and {X ≤ x} are disjoint, and their union is the whole 
sample space S whose probability is 1. Hence, Pr(X > x) + Pr(X ≤ x) = 1. Now, 
Eq. (3.3.3) follows from Eq. (3.3.2). 


Theorem 3.3.2. For all values x_1 and x_2 such that x_1 < x_2, 

\[
\Pr(x_1 < X \le x_2) = F(x_2) - F(x_1). \tag{3.3.4}
\]

Proof Let A = {x_1 < X ≤ x_2}, B = {X ≤ x_1}, and C = {X ≤ x_2}. As in Example 3.3.3, 
A and B are disjoint, and their union is C, so 

\[
\Pr(x_1 < X \le x_2) + \Pr(X \le x_1) = \Pr(X \le x_2).
\]

Subtracting Pr(X ≤ x_1) from both sides of this equation and applying Eq. (3.3.2) 
yields Eq. (3.3.4). 


For example, if the c.d.f. of X is as sketched in Fig. 3.6, then it follows from 
Theorems 3.3.1 and 3.3.2 that Pr(X > x_2) = 1 − z_1 and Pr(x_2 < X ≤ x_3) = z_3 − z_1. Also, 
since F(x) is constant over the interval x_1 ≤ x ≤ x_2, then Pr(x_1 < X ≤ x_2) = 0. 

It is important to distinguish carefully between the strict inequalities and the 
weak inequalities that appear in all of the preceding relations and also in the next 
theorem. If there is a jump in F(x) at a given value x, then the values of Pr(X ≤ x) 
and Pr(X < x) will be different. 


Theorem 3.3.3. For each value x, 

\[
\Pr(X < x) = F(x^-). \tag{3.3.5}
\]

Proof Let y_1 < y_2 < ··· be an increasing sequence of numbers such that 
$\lim_{n \to \infty} y_n = x$. Then it can be shown that 

\[
\{X < x\} = \bigcup_{n=1}^{\infty} \{X \le y_n\}.
\]

Therefore, it follows from Exercise 12 of Sec. 1.10 that 

\[
\Pr(X < x) = \lim_{n \to \infty} \Pr(X \le y_n) = \lim_{n \to \infty} F(y_n) = F(x^-).
\]

For example, for the c.d.f. sketched in Fig. 3.6, Pr(X < x_3) = z_2 and Pr(X < x_4) 
= 1. 

Finally, we shall show that for every value x, Pr(X = x) is equal to the amount 
of the jump that occurs in F at the point x. If F is continuous at the point x, that is, 
if there is no jump in F at x, then Pr(X = x) = 0. 


Theorem 3.3.4. For every value x, 

\[
\Pr(X = x) = F(x) - F(x^-). \tag{3.3.6}
\]

Proof It is always true that Pr(X = x) = Pr(X ≤ x) − Pr(X < x). The relation (3.3.6) 
follows from the fact that Pr(X ≤ x) = F(x) at every point and from Theorem 3.3.3. 


In Fig. 3.6, for example, Pr(X = x_1) = z_1 − z_0, Pr(X = x_3) = z_3 − z_2, and the 
probability of every other individual value of X is 0. 
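The jump formula can be tried on the recorded voltage Y of Example 3.2.7, whose c.d.f. has a single jump at 3. The sketch below is illustrative only: the c.d.f. of Y is written out from that example's description, and the left limit F(y^-) is approximated by evaluating F slightly to the left of y, which is a numerical shortcut rather than a true limit. All names are hypothetical.

```python
def recorded_voltage_cdf(y):
    """c.d.f. of Y = min(X, 3), where X has the c.d.f. 1 - 1/(1 + x) for x > 0."""
    if y <= 0:
        return 0.0
    if y < 3:
        return 1.0 - 1.0 / (1.0 + y)
    return 1.0  # all remaining probability sits at the point 3


def point_probability(cdf, x, eps=1e-9):
    """Approximate Pr(X = x) = F(x) - F(x^-) by using F(x - eps) for F(x^-)."""
    return cdf(x) - cdf(x - eps)


jump_at_3 = point_probability(recorded_voltage_cdf, 3.0)  # close to 1/4
jump_at_1 = point_probability(recorded_voltage_cdf, 1.0)  # close to 0

# Theorem 3.3.1 for comparison: Pr(Y > 2) = 1 - F(2) = 1/3.
above_2 = 1.0 - recorded_voltage_cdf(2.0)

print(jump_at_3, jump_at_1, above_2)
```

The jump of about 1/4 at y = 3 matches Pr(Y = 3), while the c.d.f. is continuous at y = 1, so that point carries no probability.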


The c.d.f. of a Discrete Distribution 


From the definition and properties of a c.d.f. F(x), it follows that if a < b and 
if Pr(a < X < b) = 0, then F(x) will be constant and horizontal over the interval 
a < x < b. Furthermore, as we have just seen, at every point x such that Pr(X = x) > 0, 
the c.d.f. will jump by the amount Pr(X = x). 

Suppose that X has a discrete distribution with the p.f. f(x). Together, the 
properties of a c.d.f. imply that F(x) must have the following form: F(x) will have a jump 
of magnitude f(x_i) at each possible value x_i of X, and F(x) will be constant between 
every pair of successive jumps. The distribution of a discrete random variable X can 
be represented equally well by either the p.f. or the c.d.f. of X. 


The c.d.f. of a Continuous Distribution 


Theorem 3.3.5. Let X have a continuous distribution, and let f(x) and F(x) denote its 
p.d.f. and the c.d.f., respectively. Then F is continuous at every x, 

\[
F(x) = \int_{-\infty}^{x} f(t)\,dt, \tag{3.3.7}
\]

and 

\[
\frac{dF(x)}{dx} = f(x), \tag{3.3.8}
\]

at all x such that f is continuous. 
Proof Since the probability of each individual point x is 0, the c.d.f. F(x) will have 
no jumps. Hence, F(x) will be a continuous function over the entire real line. 

By definition, F(x) = Pr(X ≤ x). Since f is the p.d.f. of X, we have from the 
definition of p.d.f. that Pr(X ≤ x) is the right-hand side of Eq. (3.3.7). 

It follows from Eq. (3.3.7) and the relation between integrals and derivatives 
(the fundamental theorem of calculus) that, for every x at which f is continuous, 
Eq. (3.3.8) holds. 


Thus, the c.d.f. of a continuous random variable X can be obtained from the p.d.f. 
and vice versa. Eq. (3.3.7) is how we found the c.d.f. in Example 3.3.1. Notice that 
the derivative of the F in Example 3.3.1 is 

\[
F'(x) = \begin{cases} 0 & \text{for } x < 0, \\ \dfrac{1}{(1+x)^2} & \text{for } x > 0, \end{cases}
\]

and F′ does not exist at x = 0. This verifies Eq. (3.3.8) for Example 3.3.1. Here, we 
have used the popular shorthand notation F′(x) for the derivative of F at the point x. 
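The relation between the c.d.f. and the p.d.f. of the voltage can also be checked numerically. In this sketch the derivative of F is approximated by a central difference, which is an assumption of the illustration (any reasonable numerical derivative would do), and the function names are hypothetical.

```python
def voltage_cdf(x):
    """c.d.f. (3.3.1): 0 for x <= 0 and 1 - 1/(1 + x) for x > 0."""
    return 0.0 if x <= 0 else 1.0 - 1.0 / (1.0 + x)


def voltage_pdf(x):
    """p.d.f. (3.2.8): 0 for x <= 0 and 1/(1 + x)^2 for x > 0."""
    return 0.0 if x <= 0 else 1.0 / (1.0 + x) ** 2


def central_difference(F, x, h=1e-6):
    """Approximate F'(x) by (F(x + h) - F(x - h)) / (2h)."""
    return (F(x + h) - F(x - h)) / (2.0 * h)


# Points away from x = 0, where F is not differentiable.
points = [-2.0, 0.5, 1.0, 3.0, 10.0]
max_gap = max(
    abs(central_difference(voltage_cdf, x) - voltage_pdf(x)) for x in points
)

print(max_gap)
```

The point x = 0 is left out on purpose, since the one-sided slopes of F there are 0 and 1 and no derivative exists.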


Example 3.3.4. Calculating a p.d.f. from a c.d.f. Let the c.d.f. of a random variable be 

\[
F(x) = \begin{cases} 0 & \text{for } x < 0, \\ x^{2/3} & \text{for } 0 \le x \le 1, \\ 1 & \text{for } x > 1. \end{cases}
\]

This function clearly satisfies the three properties required of every c.d.f., as given 
earlier in this section. Furthermore, since this c.d.f. is continuous over the entire real 
line and is differentiable at every point except x = 0 and x = 1, the distribution of X 
is continuous. Therefore, the p.d.f. of X can be found at every point other than x = 0 
and x = 1 by the relation (3.3.8). The value of f(x) at the points x = 0 and x = 1 can 
be assigned arbitrarily. When the derivative F′(x) is calculated, it is found that f(x) 
is as given by Eq. (3.2.9) in Example 3.2.6. Conversely, if the p.d.f. of X is given by 
Eq. (3.2.9), then by using Eq. (3.3.7) it is found that F(x) is as given in this example. 


The Quantile Function 


Example 3.3.5. Fair Bets. Suppose that X is the amount of rain that will fall tomorrow, 
and X has c.d.f. F. Suppose that we want to place an even-money bet on X as follows: If 
X ≤ x_0, we win one dollar and if X > x_0 we lose one dollar. In order to make this bet 
fair, we need Pr(X ≤ x_0) = Pr(X > x_0) = 1/2. We could search through all of the real 
numbers x trying to find one such that F(x) = 1/2, and then we would let x_0 equal the 
value we found. If F is a one-to-one function, then F has an inverse F^{-1} and 
x_0 = F^{-1}(1/2). 


The value x_0 that we seek in Example 3.3.5 is called the 0.5 quantile of X or the 
50th percentile of X because 50% of the distribution of X is at or below x_0. 


Definition 3.3.2. Quantiles/Percentiles. Let X be a random variable with c.d.f. F. For each 
p strictly between 0 and 1, define F^{-1}(p) to be the smallest value x such that F(x) ≥ p. 
Then F^{-1}(p) is called the p quantile of X or the 100p percentile of X. The function 
F^{-1} defined here on the open interval (0, 1) is called the quantile function of X. 


Figure 3.7 The p.d.f. of the change in value of a portfolio with lower 1% indicated. 


Example 3.3.6. Standardized Test Scores. Many universities in the United States rely on 
standardized test scores as part of their admissions process. Thousands of people take 
these tests each time that they are offered. Each examinee’s score is compared to the 
collection of scores of all examinees to see where it fits in the overall ranking. For 
example, if 83% of all test scores are at or below your score, your test report will say 
that you scored at the 83rd percentile. 


The notation F^{-1}(p) in Definition 3.3.2 deserves some justification. Suppose first 
that the c.d.f. F of X is continuous and one-to-one over the whole set of possible 
values of X. Then the inverse F^{-1} of F exists, and for each 0 < p < 1, there is one 
and only one x such that F(x) = p. That x is F^{-1}(p). Definition 3.3.2 extends the 
concept of inverse function to nondecreasing functions (such as c.d.f.'s) that may be 
neither one-to-one nor continuous. 


Quantiles of Continuous Distributions When the c.d.f. of a random variable X is 
continuous and one-to-one over the whole set of possible values of X, the inverse 
F^{-1} of F exists and equals the quantile function of X. 


Example 3.3.7. Value at Risk. The manager of an investment portfolio is interested in how 
much money the portfolio might lose over a fixed time horizon. Let X be the change 
in value of the given portfolio over a period of one month. Suppose that X has 
the p.d.f. in Fig. 3.7. The manager computes a quantity known in the world of risk 
management as Value at Risk (denoted by VaR). To be specific, let Y = −X stand 
for the loss incurred by the portfolio over the one month. The manager wants to 
have a level of confidence about how large Y might be. In this example, the manager 
specifies a probability level, such as 0.99, and then finds y_0, the 0.99 quantile of Y. The 
manager is now 99% sure that Y ≤ y_0, and y_0 is called the VaR. If X has a continuous 
distribution, then it is easy to see that y_0 is closely related to the 0.01 quantile of 
the distribution of X. The 0.01 quantile x_0 has the property that Pr(X ≤ x_0) = 0.01. 
But Pr(X ≤ x_0) = Pr(Y ≥ −x_0) = 1 − Pr(Y < −x_0). Hence, −x_0 is a 0.99 quantile of 
Y. For the p.d.f. in Fig. 3.7, we see that x_0 = −4.14, as the shaded region indicates. 
Then y_0 = 4.14 is VaR for one month at probability level 0.99. 


[Plot for Figure 3.7: vertical axis "Density", horizontal axis "Change in value".] 


Figure 3.8 The c.d.f. of a uniform distribution indicating how to solve for a quantile. 
Example 3.3.8. Uniform Distribution on an Interval. Let X have the uniform distribution 
on the interval [a, b]. The c.d.f. of X is 

\[
F(x) = \Pr(X \le x) = \begin{cases} 0 & \text{if } x \le a, \\ \displaystyle\int_a^x \frac{1}{b-a}\,du & \text{if } a < x \le b, \\ 1 & \text{if } x > b. \end{cases}
\]

The integral above equals (x − a)/(b − a). So, F(x) = (x − a)/(b − a) for all a < x < b, 
which is a strictly increasing function over the entire interval of possible values of X. 
The inverse of this function is the quantile function of X, which we obtain by setting 
F(x) equal to p and solving for x: 

\[
\begin{aligned}
\frac{x-a}{b-a} &= p, \\
x - a &= p(b-a), \\
x &= a + p(b-a) = pb + (1-p)a.
\end{aligned}
\]

Figure 3.8 illustrates how the calculation of a quantile relates to the c.d.f. 
The quantile function of X is F^{-1}(p) = pb + (1 − p)a for 0 < p < 1. In particular, 
F^{-1}(1/2) = (b + a)/2. 
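The quantile formula for the uniform distribution is easy to exercise in code. The sketch below is illustrative: the function names are hypothetical, and the interval [−1, 4] is an arbitrary choice reused from the earlier uniform example.

```python
def uniform_cdf(x, a, b):
    """c.d.f. of the uniform distribution on [a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def uniform_quantile(p, a, b):
    """Quantile function p*b + (1 - p)*a, defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return p * b + (1.0 - p) * a


a, b = -1.0, 4.0  # illustrative interval
median = uniform_quantile(0.5, a, b)  # equals (a + b) / 2

# Applying the c.d.f. to a quantile should return the original p.
round_trip = [uniform_cdf(uniform_quantile(p, a, b), a, b) for p in (0.1, 0.25, 0.9)]

print(median, round_trip)
```

The round trip works here because this c.d.f. is continuous and strictly increasing on (a, b), so the quantile function is an ordinary inverse.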


Note: Quantiles, Like c.d.f.’s, Depend on the Distribution Only. Any two random 
variables with the same distribution have the same quantile function. When we refer 
to a quantile of X, we mean a quantile of the distribution of X. 


Quantiles of Discrete Distributions It is convenient to be able to calculate quantiles 
for discrete distributions as well. The quantile function of Definition 3.3.2 exists for all 
distributions whether discrete, continuous, or otherwise. For example, in Fig. 3.6, let 
z_0 < p ≤ z_1. Then the smallest x such that F(x) ≥ p is x_1. For every value of x < x_1, 
we have F(x) < z_0 < p and F(x_1) = z_1. Notice that F(x) = z_1 for all x between x_1 
and x_2, but since x_1 is the smallest of all those numbers, x_1 is the p quantile. Because 
distribution functions are continuous from the right, the smallest x such that F(x) ≥ p 
exists for all 0 < p < 1. For p = 1, there is no guarantee that such an x will exist. For 
example, in Fig. 3.6, F(x_4) = 1, but in Example 3.3.1, F(x) < 1 for all x. For p = 0, 
there is never a smallest x such that F(x) = 0 because $\lim_{x \to -\infty} F(x) = 0$. That is, if 
F(x_0) = 0, then F(x) = 0 for all x < x_0. For these reasons, we never talk about the 0 
or 1 quantiles. 
Table 3.1 Quantile function for Example 3.3.9 

    p                     F^{-1}(p) 
    (0, 0.1681]           0 
    (0.1681, 0.5283]      1 
    (0.5283, 0.8370]      2 
    (0.8370, 0.9693]      3 
    (0.9693, 0.9977]      4 
    (0.9977, 1)           5 


Example 3.3.9. Quantiles of a Binomial Distribution. Let X have the binomial distribution 
with parameters 5 and 0.3. The binomial table in the back of the book has the p.f. f of X, 
which we reproduce here together with the c.d.f. F: 

    x       0        1        2        3        4        5 
    f(x)    0.1681   0.3602   0.3087   0.1323   0.0284   0.0024 
    F(x)    0.1681   0.5283   0.8370   0.9693   0.9977   1 

(A little rounding error occurred in the p.f.) So, for example, the 0.5 quantile of this 
distribution is 1, which is also the 0.25 quantile and the 0.20 quantile. The entire 
quantile function is in Table 3.1. So, the 90th percentile is 3, which is also the 95th 
percentile, etc. 
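The "smallest x such that F(x) ≥ p" rule translates directly into a short search over the possible values. The following sketch computes the binomial p.f. from the standard formula instead of reading it from the table, so its c.d.f. values differ from the tabulated ones in the fourth or fifth decimal place; the function names are hypothetical.

```python
from math import comb


def binomial_pf(x, n, p):
    """p.f. of the binomial distribution with parameters n and p."""
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)


def binomial_quantile(q, n, p):
    """Smallest x in {0, ..., n} whose c.d.f. value is at least q, for 0 < q < 1."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    cumulative = 0.0
    for x in range(n + 1):
        cumulative += binomial_pf(x, n, p)  # running value of F(x)
        if cumulative >= q:
            return x
    return n  # guards against rounding when q is extremely close to 1


levels = [0.20, 0.25, 0.50, 0.90, 0.95]
quantiles = [binomial_quantile(q, 5, 0.3) for q in levels]
print(quantiles)  # prints: [1, 1, 1, 3, 3]
```

Several probability levels map to the same quantile because the c.d.f. of a discrete distribution is flat between its jumps.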


Certain quantiles have special names. 


Median/Quartiles. The 1/2 quantile or the 50th percentile of a distribution is called its 
median. The 1/4 quantile or 25th percentile is called the lower quartile. The 3/4 quantile or 
75th percentile is called the upper quartile. 


Note: The Median Is Special. The median of a distribution is one of several special 
features that people like to use when summarizing the distribution of a random 
variable. We shall discuss summaries of distributions in more detail in Chapter 4. Because 
the median is such a popular summary, we need to note that there are several 
different but similar “definitions” of median. Recall that the 1/2 quantile is the smallest 
number x such that F(x) ≥ 1/2. For some distributions, usually discrete distributions, 
there will be an interval of numbers [x_1, x_2) such that for all x ∈ [x_1, x_2), F(x) = 1/2. 
In such cases, it is common to refer to all such x (including x_2) as medians of the 
distribution. (See Definition 4.5.1.) Another popular convention is to call (x_1 + x_2)/2 
the median. This last is probably the most common convention. Readers should 
be aware that, whenever they encounter a median, it might be any one of the things 
that we just discussed. Fortunately, they all mean nearly the same thing, namely that 
the number divides the distribution in half as closely as is possible. 
Example 3.3.10. Uniform Distribution on Integers. Let X have the uniform distribution on 
the integers 1, 2, 3, 4. (See Definition 3.1.6.) The c.d.f. of X is 

\[
F(x) = \begin{cases} 0 & \text{if } x < 1, \\ 1/4 & \text{if } 1 \le x < 2, \\ 1/2 & \text{if } 2 \le x < 3, \\ 3/4 & \text{if } 3 \le x < 4, \\ 1 & \text{if } x \ge 4. \end{cases}
\]

The 1/2 quantile is 2, but every number in the interval [2, 3] might be called a median. 
The most popular choice would be 2.5. 


One advantage to describing a distribution by the quantile function rather than 
by the c.d.f. is that quantile functions are easier to display in tabular form for multiple 
distributions. The reason is that the domain of the quantile function is always the 
interval (0, 1) no matter what the possible values of X are. Quantiles are also useful 
for summarizing distributions in terms of where the probability is. For example, if 
one wishes to say where the middle half of a distribution is, one can say that it lies 
between the 0.25 quantile and the 0.75 quantile. In Sec. 8.5, we shall see how to use 
quantiles to help provide estimates of unknown quantities after observing data. 

In Exercise 19, you can show how to recover the c.d.f. from the quantile function. 
Hence, the quantile function is an alternative way to characterize a distribution. 


Summary 


The c.d.f. F of a random variable X is $F(x) = \Pr(X \le x)$ for all real x. This function 
is continuous from the right. If we let $F(x^-)$ equal the limit of F(y) as y approaches 
x from below, then $F(x) - F(x^-) = \Pr(X = x)$. A continuous distribution has a 
continuous c.d.f. and $F'(x) = f(x)$, the p.d.f. of the distribution, for all x at which 
F is differentiable. A discrete distribution has a c.d.f. that is constant between the 
possible values and jumps by f(x) at each possible value x. The quantile function 
$F^{-1}(p)$ is equal to the smallest x such that $F(x) \ge p$ for $0 < p < 1$. 


Exercises 


1. Suppose that a random variable X has the Bernoulli 
distribution with parameter p=0.7. (See Definition 
3.1.5.) Sketch the c.d.f. of X. 


2. Suppose that a random variable X can take only the 
values −2, 0, 1, and 4, and that the probabilities of these 
values are as follows: Pr(X = −2) = 0.4, Pr(X = 0) = 0.1, 
Pr(X = 1) = 0.3, and Pr(X = 4) = 0.2. Sketch the c.d.f. of X. 


3. Suppose that a coin is tossed repeatedly until a head is 
obtained for the first time, and let X denote the number 
of tosses that are required. Sketch the c.d.f. of X. 


4. Suppose that the c.d.f. F of a random variable X is as 
sketched in Fig. 3.9. Find each of the following probabilities: 

a. $\Pr(X = -1)$ 
b. $\Pr(X < 0)$ 
c. $\Pr(X \le 0)$ 
d. $\Pr(X = 1)$ 
e. $\Pr(0 < X \le 3)$ 
f. $\Pr(0 < X < 3)$ 
g. $\Pr(0 \le X \le 3)$ 
h. $\Pr(1 < X \le 2)$ 
i. $\Pr(1 \le X \le 2)$ 
j. $\Pr(X > 5)$ 
k. $\Pr(X \ge 5)$ 
l. $\Pr(3 \le X \le 4)$ 


5. Suppose that the c.d.f. of a random variable X is as follows: 

\[
F(x) = \begin{cases}
0 & \text{for } x \le 0, \\
\frac{1}{9}x^2 & \text{for } 0 < x \le 3, \\
1 & \text{for } x > 3.
\end{cases}
\]

Find and sketch the p.d.f. of X. 


6. Suppose that the c.d.f. of a random variable X is as follows: 

\[
F(x) = \begin{cases}
e^{x-3} & \text{for } x \le 3, \\
1 & \text{for } x > 3.
\end{cases}
\]

Find and sketch the p.d.f. of X. 


7. Suppose, as in Exercise 7 of Sec. 3.2, that a random 
variable X has the uniform distribution on the interval 
[−2, 8]. Find and sketch the c.d.f. of X. 


8. Suppose that a point in the xy-plane is chosen at random 
from the interior of a circle for which the equation is 
$x^2 + y^2 = 1$; and suppose that the probability that the point 
will belong to each region inside the circle is proportional 
to the area of that region. Let Z denote a random variable 
representing the distance from the center of the circle to 
the point. Find and sketch the c.d.f. of Z. 


9. Suppose that X has the uniform distribution on the 
interval [0, 5] and that the random variable Y is defined 
by $Y = 0$ if $X \le 1$, $Y = 5$ if $X \ge 3$, and $Y = X$ otherwise. 
Sketch the c.d.f. of Y. 


10. For the c.d.f. in Example 3.3.4, find the quantile function. 


11. For the c.d.f. in Exercise 5, find the quantile function. 


12. For the c.d.f. in Exercise 6, find the quantile function. 


13. Suppose that a broker believes that the change in 
value X of a particular investment over the next two 
months has the uniform distribution on the interval [−12, 
24]. Find the value at risk VaR for two months at proba- 
bility level 0.95. 


14. Find the quartiles and the median of the binomial 
distribution with parameters n = 10 and p = 0.2. 




15. Suppose that X has the p.d.f. 

\[
f(x) = \begin{cases}
2x & \text{if } 0 < x < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Find and sketch the c.d.f. of X. 


16. Find the quantile function for the distribution in Ex- 
ample 3.3.1. 


17. Prove that the quantile function $F^{-1}$ of a general random 
variable X has the following three properties that are 
analogous to properties of the c.d.f.: 

a. $F^{-1}$ is a nondecreasing function of p for $0 < p < 1$. 

b. Let $x_0 = \lim_{p \to 0,\, p > 0} F^{-1}(p)$ and $x_1 = \lim_{p \to 1,\, p < 1} F^{-1}(p)$. 
Then $x_0$ equals the greatest lower bound on the set 
of numbers c such that $\Pr(X \le c) > 0$, and $x_1$ equals 
the least upper bound on the set of numbers d such 
that $\Pr(X \ge d) > 0$. 

c. $F^{-1}$ is continuous from the left; that is, $F^{-1}(p) = F^{-1}(p^-)$ 
for all $0 < p < 1$. 


18. Let X be a random variable with quantile function 
$F^{-1}$. Assume the following three conditions: (i) $F^{-1}(p) = c$ 
for all p in the interval $(p_0, p_1)$, (ii) either $p_0 = 0$ or 
$F^{-1}(p_0) < c$, and (iii) either $p_1 = 1$ or $F^{-1}(p) > c$ for $p > p_1$. 
Prove that $\Pr(X = c) = p_1 - p_0$. 


19. Let X be a random variable with c.d.f. F and quantile 
function $F^{-1}$. Let $x_0$ and $x_1$ be as defined in Exercise 17. 
(Note that $x_0 = -\infty$ and/or $x_1 = \infty$ are possible.) Prove 
that for all x in the open interval $(x_0, x_1)$, F(x) is the largest 
p such that $F^{-1}(p) \le x$. 


20. In Exercise 13 of Sec. 3.2, draw a sketch of the c.d.f. F 
of X and find F(10). 


Figure 3.9 The c.d.f. for Exercise 4. 


3.4 Bivariate Distributions 


We generalize the concept of distribution of a random variable to the joint distri- 
bution of two random variables. In doing so, we introduce the joint p.f. for two 
discrete random variables, the joint p.d.f. for two continuous random variables, 
and the joint c.d.f. for any two random variables. We also introduce a joint hybrid 
of p.f. and p.d.f. for the case of one discrete random variable and one continuous 
random variable. 


Demands for Utilities. In Example 3.1.5, we found the distribution of the random 
variable X that represented the demand for water. But there is another random 
variable, Y, the demand for electricity, that is also of interest. When discussing 
two random variables at once, it is often convenient to put them together into an 
ordered pair, (X, Y). As early as Example 1.5.4 on page 19, we actually calculated 
some probabilities associated with the pair (X, Y). In that example, we defined two 
events A and B that we now can express as $A = \{X \ge 115\}$ and $B = \{Y \ge 110\}$. In 
Example 1.5.4, we computed $\Pr(A \cap B)$ and $\Pr(A \cup B)$. We can express $A \cap B$ and 
$A \cup B$ as events involving the pair (X, Y). For example, define the set of ordered 
pairs $C = \{(x, y): x \ge 115 \text{ and } y \ge 110\}$ so that the event $\{(X, Y) \in C\} = A \cap B$. 
That is, the event that the pair of random variables lies in the set C is the same 
as the intersection of the two events A and B. In Example 1.5.4, we computed 
$\Pr(A \cap B) = 0.1198$. So, we can now assert that $\Pr((X, Y) \in C) = 0.1198$. 


Joint/Bivariate Distribution. Let X and Y be random variables. The joint distribution 
or bivariate distribution of X and Y is the collection of all probabilities of the form 
Pr[(X, Y) ∈ C] for all sets C of pairs of real numbers such that {(X, Y) ∈ C} is an event. 


It is a straightforward consequence of the definition of the joint distribution of X and 
Y that this joint distribution is itself a probability measure on the set of ordered pairs 
of real numbers. The set {(X, Y) ∈ C} will be an event for every set C of pairs of real 
numbers that most readers will be able to imagine. 

In this section and the next two sections, we shall discuss convenient ways to 
characterize and do computations with bivariate distributions. In Sec. 3.7, these 
considerations will be extended to the joint distribution of an arbitrary finite number 
of random variables. 


Discrete Joint Distributions 


Theater Patrons. Suppose that a sample of 10 people is selected at random from a 
theater with 200 patrons. One random variable of interest might be the number X 
of people in the sample who are over 60 years of age, and another random variable 
might be the number Y of people in the sample who live more than 25 miles from 
the theater. For each ordered pair (x, y) with x =0,...,10 and y=0,..., 10, we 
might wish to compute Pr((X, Y) = (x, y)), the probability that there are x people in 
the sample who are over 60 years of age and there are y people in the sample who 
live more than 25 miles away. 


Discrete Joint Distribution. Let X and Y be random variables, and consider the ordered 
pair (X, Y). If there are only finitely or at most countably many different possible 
values (x, y) for the pair (X, Y), then we say that X and Y have a discrete joint 
distribution. 

The two random variables in Example 3.4.2 have a discrete joint distribution. 


Suppose that two random variables X and Y each have a discrete distribution. Then 
X and Y have a discrete joint distribution. 


Proof If both X and Y have only finitely many possible values, then there will be 
only a finite number of different possible values (x, y) for the pair (X, Y). On the 
other hand, if either X or Y or both can take a countably infinite number of possible 
values, then there will also be a countably infinite number of possible values for the 
pair (X, Y). In all of these cases, the pair (X, Y) has a discrete joint distribution. 


When we define continuous joint distribution shortly, we shall see that the obvious 
analog of Theorem 3.4.1 is not true. 


Joint Probability Function, p.f. The joint probability function, or the joint p.f., of X and 
Y is defined as the function f such that for every point (x, y) in the xy-plane, 

\[
f(x, y) = \Pr(X = x \text{ and } Y = y).
\]


The following result is easy to prove because there are at most countably many 
pairs (x, y) that must account for all of the probability of a discrete joint distribution. 


Let X and Y have a discrete joint distribution. If (x, y) is not one of the possible 
values of the pair (X, Y), then f(x, y) = 0. Also, 

\[
\sum_{\text{All } (x, y)} f(x, y) = 1.
\]

Finally, for each set C of ordered pairs, 

\[
\Pr[(X, Y) \in C] = \sum_{(x, y) \in C} f(x, y).
\]


Specifying a Discrete Joint Distribution by a Table of Probabilities. In a certain suburban 
area, each household reported the number of cars and the number of television sets 
that they owned. Let X stand for the number of cars owned by a randomly selected 
household in this area. Let Y stand for the number of television sets owned by that 
same randomly selected household. In this case, X takes only the values 1, 2, and 3; 
Y takes only the values 1, 2, 3, and 4; and the joint p.f. f of X and Y is as specified in 
Table 3.2. 


Table 3.2 Joint p.f. f(x, y) for Example 3.4.3 

Figure 3.10 The joint p.f. of X and Y in Example 3.4.3. 


This joint p.f. is sketched in Fig. 3.10. We shall determine the probability that 
the randomly selected household owns at least two of both cars and televisions. In 
symbols, this is $\Pr(X \ge 2 \text{ and } Y \ge 2)$. 

By summing f(x, y) over all values of $x \ge 2$ and $y \ge 2$, we obtain the value 

\[
\Pr(X \ge 2 \text{ and } Y \ge 2) = f(2, 2) + f(2, 3) + f(2, 4) + f(3, 2) + f(3, 3) + f(3, 4) = 0.5.
\]

Next, we shall determine the probability that the randomly selected household owns 
exactly one car, namely Pr(X = 1). By summing the probabilities in the first row of 
the table, we obtain the value 

\[
\Pr(X = 1) = \sum_{y=1}^{4} f(1, y) = 0.2.
\]
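
The two kinds of sums used in this example (a sum over a set C of pairs, and a sum along one row of the table) can be sketched in Python as follows. This is only an illustration: the probabilities below are hypothetical values chosen for the sketch, not the entries of Table 3.2, and the names `joint_pf` and `prob` are assumptions.

```python
from fractions import Fraction as Fr

# Hypothetical joint p.f. (NOT the entries of Table 3.2). Keys are pairs (x, y);
# any pair that is not listed has f(x, y) = 0.
joint_pf = {
    (1, 1): Fr(1, 10), (1, 2): Fr(2, 10),
    (2, 1): Fr(3, 10), (2, 2): Fr(1, 10),
    (3, 2): Fr(3, 10),
}


def f(x, y):
    """Joint p.f.: f(x, y) = Pr(X = x and Y = y)."""
    return joint_pf.get((x, y), Fr(0))


def prob(in_C):
    """Pr[(X, Y) in C], where in_C(x, y) is True exactly when (x, y) is in C."""
    return sum(p for (x, y), p in joint_pf.items() if in_C(x, y))


total = sum(joint_pf.values())                          # must equal 1
p_both_at_least_2 = prob(lambda x, y: x >= 2 and y >= 2)
p_x_equals_1 = prob(lambda x, y: x == 1)                # sum along the row x = 1
print(total, p_both_at_least_2, p_x_equals_1)
```

Representing the set C as a predicate keeps the code close to the statement that Pr[(X, Y) ∈ C] is the sum of f(x, y) over the pairs in C.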


Continuous Joint Distributions 


Demands for Utilities. Consider again the joint distribution of X and Y in Example 3.4.1. 
When we first calculated probabilities for these two random variables back 
in Example 1.5.4 on page 19 (even before we named them or called them random 
variables), we assumed that the probability of each subset of the sample space was 
proportional to the area of the subset. Since the area of the sample space is 29,204, 
the probability that the pair (X, Y) lies in a region C is the area of C divided by 29,204. 
We can also write this relation as 

\[
\Pr((X, Y) \in C) = \iint_C \frac{1}{29{,}204}\,dx\,dy, \tag{3.4.1}
\]

assuming that the integral exists. 


If one looks carefully at Eq. (3.4.1), one will notice the similarity to Eqs. (3.2.2) 
and (3.2.1). We formalize this connection as follows. 


Continuous Joint Distribution/Joint p.d.f./Support. Two random variables X and Y have 
a continuous joint distribution if there exists a nonnegative function f defined over 
the entire xy-plane such that for every subset C of the plane, 

\[
\Pr[(X, Y) \in C] = \iint_C f(x, y)\,dx\,dy,
\]

if the integral exists. The function f is called the joint probability density function 
(abbreviated joint p.d.f.) of X and Y. The closure of the set {(x, y): f(x, y) > 0} is 
called the support of (the distribution of) (X, Y). 


Figure 3.11 An example of a joint p.d.f. 


Demands for Utilities. In Example 3.4.4, it is clear from Eq. (3.4.1) that the joint p.d.f. 
of X and Y is the function 

\[
f(x, y) = \begin{cases}
\dfrac{1}{29{,}204} & \text{for } 4 \le x \le 200 \text{ and } 1 \le y \le 150, \\
0 & \text{otherwise.}
\end{cases} \tag{3.4.2}
\]


It is clear from Definition 3.4.4 that the joint p.d.f. of two random variables 
characterizes their joint distribution. The following result is also straightforward. 


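A joint p.d.f. must satisfy the following two conditions: 

\[
f(x, y) \ge 0 \quad \text{for } -\infty < x < \infty \text{ and } -\infty < y < \infty,
\]

and 

\[
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1.
\]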


Any function that satisfies the two displayed formulas in Theorem 3.4.3 is the joint 
p.d.f. for some probability distribution. 

An example of the graph of a joint p.d.f. is presented in Fig. 3.11. 

The total volume beneath the surface z = f(x, y) and above the xy-plane must be 
1. The probability that the pair (X, Y) will belong to the rectangle C is equal to the 
volume of the solid figure with base C shown in Fig. 3.11. The top of this solid figure 
is formed by the surface z = f(x, y). 

In Sec. 3.5, we will show that if X and Y have a continuous joint distribution, 
then X and Y each have a continuous distribution when considered separately. This 
seems reasonable intuitively. However, the converse of this statement is not true, and 
the following result helps to show why. 


Theorem 3.4.4. For every continuous joint distribution on the xy-plane, the following two statements 
hold: 


i. Every individual point, and every infinite sequence of points, in the xy-plane 
has probability 0. 


ii. Let f be a continuous function of one real variable defined on a (possibly 
unbounded) interval (a, b). The sets {(x, y): y= f(x), a < x < b} and {(x, y): 
x = f(y), a < y <b} have probability 0. 


Proof According to Definition 3.4.4, the probability that a continuous joint distri- 
bution assigns to a specified region of the xy-plane can be found by integrating the 
joint p.d.f. f(x, y) over that region, if the integral exists. If the region is a single point, 
the integral will be 0. By Axiom 3 of probability, the probability for any countable 
collection of points must also be 0. The integral of a function of two variables over 
the graph of a continuous function in the xy-plane is also 0. 


Example 3.4.6. Not a Continuous Joint Distribution. It follows from (ii) of Theorem 3.4.4 that the 
probability that (X, Y) will lie on each specified straight line in the plane is 0. If 
X has a continuous distribution and if Y = X, then both X and Y have continuous 
distributions, but the probability is 1 that (X, Y) lies on the straight line y = x. Hence, 
X and Y cannot have a continuous joint distribution. 


Example 3.4.7. Calculating a Normalizing Constant. Suppose that the joint p.d.f. of X and Y is specified 
as follows: 

\[
f(x, y) = \begin{cases}
c x^2 y & \text{for } x^2 \le y \le 1, \\
0 & \text{otherwise.}
\end{cases}
\]

We shall determine the value of the constant c. 
The support S of (X, Y) is sketched in Fig. 3.12. Since f(x, y) = 0 outside S, it 
follows that 

\[
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = \iint_S f(x, y)\,dx\,dy
= \int_{-1}^{1} \int_{x^2}^{1} c x^2 y\,dy\,dx = \frac{4}{21}c. \tag{3.4.3}
\]

Since the value of this integral must be 1, the value of c must be 21/4. 

The limits of integration on the last integral in (3.4.3) were determined as follows. 
We have our choice of whether to integrate x or y as the inner integral, and we chose 
y. So, we must find, for each x, the interval of y values over which to integrate. From 
Fig. 3.12, we see that, for each x, y runs from the curve where $y = x^2$ to the line 
where y = 1. The interval of x values for the outer integral is from −1 to 1 according 
to Fig. 3.12. If we had chosen to integrate x on the inside, then for each y, we see that 
x runs from $-\sqrt{y}$ to $\sqrt{y}$, while y runs from 0 to 1. The final answer would have been 
the same. 


Example 3.4.8. Calculating Probabilities from a Joint p.d.f. For the joint distribution in Example 3.4.7, 
we shall now determine the value of $\Pr(X \ge Y)$. 
The subset $S_0$ of S where $x \ge y$ is sketched in Fig. 3.13. Hence, 

\[
\Pr(X \ge Y) = \iint_{S_0} f(x, y)\,dx\,dy = \int_{0}^{1} \int_{x^2}^{x} \frac{21}{4} x^2 y\,dy\,dx = \frac{3}{20}.
\]
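
The normalizing constant 21/4 and the probability 3/20 can be checked numerically. The sketch below is an illustration only and is not the method used in the text: it approximates each double integral with a midpoint rule, with y as the inner variable and x-dependent limits. The function name `double_integral` and the grid size `n` are assumptions made for this sketch.

```python
def double_integral(g, x_lo, x_hi, y_lo, y_hi, n=400):
    """Midpoint-rule approximation of the integral of g(x, y) over the region
    x_lo <= x <= x_hi, y_lo(x) <= y <= y_hi(x)."""
    total = 0.0
    dx = (x_hi - x_lo) / n
    for i in range(n):
        x = x_lo + (i + 0.5) * dx
        a, b = y_lo(x), y_hi(x)      # limits of the inner integral for this x
        dy = (b - a) / n
        for j in range(n):
            y = a + (j + 0.5) * dy
            total += g(x, y) * dx * dy
    return total


# Integral of x^2 * y over the support S = {(x, y): x^2 <= y <= 1}.
integral_S = double_integral(lambda x, y: x * x * y,
                             -1.0, 1.0, lambda x: x * x, lambda x: 1.0)
c = 1.0 / integral_S                 # approximately 21/4

# Pr(X >= Y): integrate the p.d.f. over S_0, where x^2 <= y <= x and 0 <= x <= 1.
p = double_integral(lambda x, y: c * x * x * y,
                    0.0, 1.0, lambda x: x * x, lambda x: x)
print(integral_S, c, p)
```

The inner limits are passed as functions of x, which mirrors the way the limits of integration were read off Figs. 3.12 and 3.13.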

Figure 3.12 The support S of (X, Y) in Example 3.4.8. 


Figure 3.13 The subset $S_0$ of the support S where $x \ge y$ in Example 3.4.8. 


Example 3.4.9. Determining a Joint p.d.f. by Geometric Methods. Suppose that a point (X, Y) is 
selected at random from inside the circle $x^2 + y^2 \le 9$. We shall determine the joint 
p.d.f. of X and Y. 
The support of (X, Y) is the set S of points on and inside the circle $x^2 + y^2 \le 9$. 
The statement that the point (X, Y) is selected at random from inside the circle is 
interpreted to mean that the joint p.d.f. of X and Y is constant over S and is 0 outside S. 
Thus, 

\[
f(x, y) = \begin{cases}
c & \text{for } (x, y) \in S, \\
0 & \text{otherwise.}
\end{cases}
\]

We must have 

\[
\iint_S f(x, y)\,dx\,dy = c \times (\text{area of } S) = 1.
\]

Since the area of the circle S is $9\pi$, the value of the constant c must be $1/(9\pi)$. 


Mixed Bivariate Distributions 


Example 3.4.10. A Clinical Trial. Consider a clinical trial (such as the one described in Example 2.1.12) 
in which each patient with depression receives a treatment and is followed to see 
whether they have a relapse into depression. Let X be the indicator of whether or 
not the first patient is a “success” (no relapse). That is, X = 1 if the patient does not 
relapse and X = 0 if the patient relapses. Also, let P be the proportion of patients 
who have no relapse among all patients who might receive the treatment. It is clear 
that X must have a discrete distribution, but it might be sensible to think of P as 
a continuous random variable taking its value anywhere in the interval [0, 1]. Even 
though X and P can have neither a joint discrete distribution nor a joint continuous 
distribution, we can still be interested in the joint distribution of X and P. 

Prior to Example 3.4.10, we have discussed bivariate distributions that were 
either discrete or continuous. Occasionally, one must consider a mixed bivariate dis- 
tribution in which one of the random variables is discrete and the other is continuous. 
We shall use a function f(x, y) to characterize such a joint distribution in much the 
same way that we use a joint p.f. to characterize a discrete joint distribution or a joint 
p.d.f. to characterize a continuous joint distribution. 


Joint p.f./p.d.f. Let X and Y be random variables such that X is discrete and Y is 
continuous. Suppose that there is a function f(x, y) defined on the xy-plane such 
that, for every pair A and B of subsets of the real numbers, 

\[
\Pr(X \in A \text{ and } Y \in B) = \int_B \sum_{x \in A} f(x, y)\,dy, \tag{3.4.4}
\]

if the integral exists. Then the function f is called the joint p.f./p.d.f. of X and Y. 


Clearly, Definition 3.4.5 can be modified in an obvious way if Y is discrete and X 
is continuous. Every joint p.f./p.d.f. must satisfy two conditions. If X is the discrete 
random variable with possible values $x_1, x_2, \ldots$ and Y is the continuous random 
variable, then $f(x, y) \ge 0$ for all x, y and 

\[
\int_{-\infty}^{\infty} \sum_{i=1}^{\infty} f(x_i, y)\,dy = 1. \tag{3.4.5}
\]

Because f is nonnegative, the sum and integral in Eqs. (3.4.4) and (3.4.5) can be done 
in whichever order is more convenient. 


Note: Probabilities of More General Sets. For a general set C of pairs of real 
numbers, we can compute $\Pr((X, Y) \in C)$ using the joint p.f./p.d.f. of X and Y. For 
each x, let $C_x = \{y : (x, y) \in C\}$. Then 

\[
\Pr((X, Y) \in C) = \sum_{\text{All } x} \int_{C_x} f(x, y)\,dy,
\]

if all of the integrals exist. Alternatively, for each y, define $C^y = \{x : (x, y) \in C\}$, and 
then 

\[
\Pr((X, Y) \in C) = \int_{-\infty}^{\infty} \Biggl[ \sum_{x \in C^y} f(x, y) \Biggr] dy,
\]

if the integral exists. 


A Joint p.f./p.d.f. Suppose that the joint p.f./p.d.f. of X and Y is 

\[
f(x, y) = \frac{x y^{x-1}}{3}, \quad \text{for } x = 1, 2, 3 \text{ and } 0 < y < 1.
\]

We should check to make sure that this function satisfies (3.4.5). It is easier to 
integrate over the y values first, so we compute 

\[
\sum_{x=1}^{3} \int_{0}^{1} \frac{x y^{x-1}}{3}\,dy = \sum_{x=1}^{3} \frac{1}{3} = 1.
\]

Suppose that we wish to compute the probability that $Y \ge 1/2$ and $X \ge 2$. That is, we 
want $\Pr(X \in A \text{ and } Y \in B)$ with $A = [2, \infty)$ and $B = [1/2, \infty)$. So, we apply Eq. (3.4.4) 
to get the probability 

\[
\sum_{x=2}^{3} \int_{1/2}^{1} \frac{x y^{x-1}}{3}\,dy = \sum_{x=2}^{3} \frac{1 - (1/2)^x}{3} = 0.5417.
\]

For illustration, we shall compute the sum and integral in the other order also. 
For each $y \in [1/2, 1)$, $\sum_{x=2}^{3} f(x, y) = 2y/3 + y^2$. For $y \ge 1$, the sum is 0. So, the 
probability is 

\[
\int_{1/2}^{1} \left[ \frac{2}{3} y + y^2 \right] dy
= \frac{1}{3}\left[ 1 - \left(\frac{1}{2}\right)^2 \right] + \frac{1}{3}\left[ 1 - \left(\frac{1}{2}\right)^3 \right] = 0.5417.
\]
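
Both orders of summation and integration in this example can be reproduced exactly with rational arithmetic. The sketch below is an illustration under stated assumptions: it relies on the antiderivatives $y^x/3$ (for each fixed x) and $y^2/3 + y^3/3$ (for the sum over x = 2, 3), and the function names are hypothetical.

```python
from fractions import Fraction as Fr

XS = (1, 2, 3)  # possible values of the discrete random variable X


def integral_in_y(x, a, b):
    """Exact integral of f(x, y) = x * y**(x - 1) / 3 over a <= y <= b,
    for 0 <= a <= b <= 1, using the antiderivative y**x / 3."""
    return (Fr(b) ** x - Fr(a) ** x) / 3


# Check of Eq. (3.4.5): integrate over y first, then sum over x.
total = sum(integral_in_y(x, 0, 1) for x in XS)

# Pr(X >= 2 and Y >= 1/2): integrate over y for each x, then sum over x = 2, 3.
p_integrate_first = sum(integral_in_y(x, Fr(1, 2), 1) for x in XS if x >= 2)


def summed_antiderivative(y):
    """Antiderivative of 2y/3 + y**2, which is the sum of f(x, y) over x = 2, 3."""
    y = Fr(y)
    return y ** 2 / 3 + y ** 3 / 3


# Same probability in the other order: sum over x first, then integrate over y.
p_sum_first = summed_antiderivative(1) - summed_antiderivative(Fr(1, 2))

print(total, p_integrate_first, p_sum_first, round(float(p_sum_first), 4))
```

Both orders give the same fraction, 13/24, which rounds to the value 0.5417 reported in the example.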


A Clinical Trial. A possible joint p.f./p.d.f. for X and P in Example 3.4.10 is 

\[
f(x, p) = p^x (1 - p)^{1-x}, \quad \text{for } x = 0, 1 \text{ and } 0 < p < 1.
\]

Here, X is discrete and P is continuous. The function f is nonnegative, and the 
reader should be able to demonstrate that it satisfies (3.4.5). Suppose that we wish 
to compute $\Pr(X \le 0 \text{ and } P \le 1/2)$. This can be computed as 

\[
\int_{0}^{1/2} (1 - p)\,dp = -\frac{1}{2}\left[ (1 - 1/2)^2 - (1 - 0)^2 \right] = \frac{3}{8}.
\]

Suppose that we also wish to compute Pr(X = 1). This time, we apply Eq. (3.4.4) with 
A = {1} and B = (0, 1). In this case, 

\[
\Pr(X = 1) = \int_{0}^{1} p\,dp = \frac{1}{2}.
\]


A more complicated type of joint distribution can also arise in a practical prob- 
lem. 


A Complicated Joint Distribution. Suppose that X and Y are the times at which two 
specific components in an electronic system fail. There might be a certain probability 
p (0 < p <1) that the two components will fail at the same time and a certain 
probability 1 — p that they will fail at different times. Furthermore, if they fail at 
the same time, then their common failure time might be distributed according to a 
certain p.d.f. f(x); if they fail at different times, then these times might be distributed 
according to a certain joint p.d.f. g(x, y). 

The joint distribution of X and Y in this example is not continuous, because 
there is positive probability p that (X, Y) will lie on the line x = y. Nor does the joint 
distribution have a joint p.f./p.d.f. or any other simple function to describe it. There 
are ways to deal with such joint distributions, but we shall not discuss them in this 
text. 


Bivariate Cumulative Distribution Functions 


The first calculation in Example 3.4.12, namely, $\Pr(X \le 0 \text{ and } P \le 1/2)$, is a 
generalization of the calculation of a c.d.f. to a bivariate distribution. We formalize the 
generalization as follows. 


Joint (Cumulative) Distribution Function/c.d.f. The joint distribution function or joint 
cumulative distribution function (joint c.d.f.) of two random variables X and Y is 
defined as the function F such that for all values of x and y ($-\infty < x < \infty$ and 
$-\infty < y < \infty$), 

\[
F(x, y) = \Pr(X \le x \text{ and } Y \le y).
\]


Figure 3.14 The probability of a rectangle. 


It is clear from Definition 3.4.6 that F(x, y) is monotone increasing in x for each fixed 
y and is monotone increasing in y for each fixed x. 

If the joint c.d.f. of two arbitrary random variables X and Y is F, then the 
probability that the pair (X, Y) will lie in a specified rectangle in the xy-plane can be 
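found from F as follows: For given numbers a < b and c < d, 

\[
\begin{aligned}
\Pr(a < X \le b \text{ and } c < Y \le d)
&= \Pr(a < X \le b \text{ and } Y \le d) - \Pr(a < X \le b \text{ and } Y \le c) \\
&= [\Pr(X \le b \text{ and } Y \le d) - \Pr(X \le a \text{ and } Y \le d)] \\
&\quad - [\Pr(X \le b \text{ and } Y \le c) - \Pr(X \le a \text{ and } Y \le c)] \\
&= F(b, d) - F(a, d) - F(b, c) + F(a, c).
\end{aligned} \tag{3.4.6}
\]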


Hence, the probability of the rectangle C sketched in Fig. 3.14 is given by the 
combination of values of F just derived. It should be noted that two sides of the 
rectangle are included in the set C and the other two sides are excluded. Thus, if there 
are points or line segments on the boundary of C that have positive probability, it is 
important to distinguish between the weak inequalities and the strict inequalities in 
Eq. (3.4.6). 
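
The role of the included and excluded sides can be seen in a small discrete case. The Python sketch below is illustrative only: it assumes a hypothetical pair (X, Y) that is uniform on the four points (0, 0), (0, 1), (1, 0), (1, 1), and the helper name `rect_prob` is an assumption. It evaluates the combination of values of F from Eq. (3.4.6).

```python
from fractions import Fraction

# Illustrative assumption: (X, Y) is uniform on these four points.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def F(x, y):
    """Joint c.d.f. F(x, y) = Pr(X <= x and Y <= y) for the four-point example."""
    count = sum(1 for (u, v) in POINTS if u <= x and v <= y)
    return Fraction(count, len(POINTS))


def rect_prob(F, a, b, c, d):
    """Pr(a < X <= b and c < Y <= d) computed from a joint c.d.f. F, Eq. (3.4.6)."""
    if not (a < b and c < d):
        raise ValueError("need a < b and c < d")
    return F(b, d) - F(a, d) - F(b, c) + F(a, c)


# The sides x = 0 and y = 0 are excluded, so only the point (1, 1) is counted.
print(rect_prob(F, 0, 1, 0, 1))    # 1/4
# Moving the lower-left corner to (-1, -1) brings all four points inside.
print(rect_prob(F, -1, 1, -1, 1))  # 1
```

Because three of the four points lie on the excluded sides of the first rectangle, changing a strict inequality to a weak one would change the answer, which is the distinction the text warns about.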


Let X and Y have a joint c.d.f. F. The c.d.f. $F_1$ of just the single random variable X 
can be derived from the joint c.d.f. F as $F_1(x) = \lim_{y \to \infty} F(x, y)$. Similarly, the c.d.f. 
$F_2$ of Y equals $F_2(y) = \lim_{x \to \infty} F(x, y)$, for $-\infty < y < \infty$. 


Proof We prove the claim about $F_1$ as the claim about $F_2$ is similar. Let $-\infty < x < \infty$. 
Define 

\[
\begin{aligned}
B_0 &= \{X \le x \text{ and } Y \le 0\}, \\
B_n &= \{X \le x \text{ and } n - 1 < Y \le n\}, \quad \text{for } n = 1, 2, \ldots, \\
A_m &= \bigcup_{n=0}^{m} B_n, \quad \text{for } m = 1, 2, \ldots.
\end{aligned}
\]

Then $\{X \le x\} = \bigcup_{n=0}^{\infty} B_n$, and $A_m = \{X \le x \text{ and } Y \le m\}$ for m = 1, 2, .... It follows 
that $\Pr(A_m) = F(x, m)$ for each m. Also, 

\[
\begin{aligned}
F_1(x) = \Pr(X \le x) &= \Pr\Biggl( \bigcup_{n=0}^{\infty} B_n \Biggr) \\
&= \sum_{n=0}^{\infty} \Pr(B_n) = \lim_{m \to \infty} \Pr(A_m) \\
&= \lim_{m \to \infty} F(x, m) = \lim_{y \to \infty} F(x, y),
\end{aligned}
\]

where the third equality follows from countable additivity and the fact that the $B_n$ 
events are disjoint, and the last equality follows from the fact that F(x, y) is monotone 
increasing in y for each fixed x. 


Other relationships involving the univariate distribution of X, the univariate distri- 
bution of Y, and their joint bivariate distribution will be presented in the next section. 

Finally, if X and Y have a continuous joint distribution with joint p.d.f. f, then 
the joint c.d.f. at (x, y) is 

\[
F(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f(r, s)\,dr\,ds.
\]

Here, the symbols r and s are used simply as dummy variables of integration. The 
joint p.d.f. can be derived from the joint c.d.f. by using the relations 

\[
f(x, y) = \frac{\partial^2 F(x, y)}{\partial x\,\partial y} = \frac{\partial^2 F(x, y)}{\partial y\,\partial x}
\]

at every point (x, y) at which these second-order derivatives exist. 


Determining a Joint p.d.f. from a Joint c.d.f. Suppose that X and Y are random variables 
that take values only in the intervals $0 \le X \le 2$ and $0 \le Y \le 2$. Suppose also that the 
joint c.d.f. of X and Y, for $0 \le x \le 2$ and $0 \le y \le 2$, is as follows: 

\[
F(x, y) = \frac{1}{16} x y (x + y). \tag{3.4.7}
\]

We shall first determine the c.d.f. $F_1$ of just the random variable X and then determine 
the joint p.d.f. f of X and Y. 

The value of F(x, y) at any point (x, y) in the xy-plane that does not represent 
a pair of possible values of X and Y can be calculated from (3.4.7) and the fact that 
$F(x, y) = \Pr(X \le x \text{ and } Y \le y)$. Thus, if either x < 0 or y < 0, then F(x, y) = 0. If both 
x > 2 and y > 2, then F(x, y) = 1. If $0 \le x \le 2$ and y > 2, then F(x, y) = F(x, 2), and 
it follows from Eq. (3.4.7) that 

\[
F(x, y) = \frac{1}{8} x (x + 2).
\]

Similarly, if $0 \le y \le 2$ and x > 2, then 

\[
F(x, y) = \frac{1}{8} y (y + 2).
\]

The function F(x, y) has now been specified for every point in the xy-plane. 
By letting $y \to \infty$, we find that the c.d.f. of just the random variable X is 

\[
F_1(x) = \begin{cases}
0 & \text{for } x < 0, \\
\frac{1}{8} x (x + 2) & \text{for } 0 \le x \le 2, \\
1 & \text{for } x > 2.
\end{cases}
\]

Furthermore, for 0 < x < 2 and 0 < y < 2, 

\[
\frac{\partial^2 F(x, y)}{\partial x\,\partial y} = \frac{1}{8}(x + y).
\]

Also, if x < 0, y < 0, x > 2, or y > 2, then 

\[
\frac{\partial^2 F(x, y)}{\partial x\,\partial y} = 0.
\]

Hence, the joint p.d.f. of X and Y is 

\[
f(x, y) = \begin{cases}
\frac{1}{8}(x + y) & \text{for } 0 < x < 2 \text{ and } 0 < y < 2, \\
0 & \text{otherwise.}
\end{cases}
\]
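
The steps of this example can be checked numerically. The sketch below is an illustration, not part of the original example: it extends the c.d.f. of Eq. (3.4.7) to the whole plane by clamping, evaluates the marginal c.d.f. of X at a large value of y, and estimates the mixed partial derivative with a central finite difference. The names `F1` and `mixed_partial` and the step size `h` are assumptions.

```python
def F(x, y):
    """Joint c.d.f. of Eq. (3.4.7), extended to every point of the xy-plane."""
    if x < 0 or y < 0:
        return 0.0
    # F stops changing in x once x exceeds 2, and in y once y exceeds 2.
    x, y = min(x, 2.0), min(y, 2.0)
    return x * y * (x + y) / 16.0


def F1(x, big=1e6):
    """Marginal c.d.f. of X: the limit of F(x, y) as y grows, reached once y >= 2."""
    return F(x, big)


def mixed_partial(x, y, h=1e-3):
    """Central finite-difference estimate of the second mixed partial of F."""
    return (F(x + h, y + h) - F(x + h, y - h)
            - F(x - h, y + h) + F(x - h, y - h)) / (4.0 * h * h)


print(F1(1.0))                  # compare with (1/8) * 1 * (1 + 2)
print(mixed_partial(1.0, 0.5))  # compare with (1 + 0.5) / 8
print(mixed_partial(3.0, 1.0))  # outside the square, so close to 0
```

Clamping x and y at 2 reproduces the cases derived in the example, for instance F(x, y) = F(x, 2) when y > 2.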


Demands for Utilities. We can compute the joint c.d.f. for water and electric demand 
in Example 3.4.4 by using the joint p.d.f. that was given in Eq. (3.4.2). If either x < 4 or 
y < 1, then F(x, y) = 0 because either $X \le x$ or $Y \le y$ would be impossible. Similarly, 
if both $x \ge 200$ and $y \ge 150$, F(x, y) = 1 because both $X \le x$ and $Y \le y$ would be sure 
events. For other values of x and y, we compute 

\[
F(x, y) = \begin{cases}
\displaystyle \int_{4}^{x} \int_{1}^{y} \frac{1}{29{,}204}\,dy\,dx = \frac{(x - 4)(y - 1)}{29{,}204} & \text{for } 4 \le x \le 200,\ 1 \le y \le 150, \\[2ex]
\displaystyle \int_{4}^{x} \int_{1}^{150} \frac{1}{29{,}204}\,dy\,dx = \frac{x - 4}{196} & \text{for } 4 \le x \le 200,\ y > 150, \\[2ex]
\displaystyle \int_{4}^{200} \int_{1}^{y} \frac{1}{29{,}204}\,dy\,dx = \frac{y - 1}{149} & \text{for } x > 200,\ 1 \le y \le 150.
\end{cases}
\]

The reason that we need three cases in the formula for F(x, y) is that the joint p.d.f. 
in Eq. (3.4.2) drops to 0 when x crosses above 200 or when y crosses above 150; 
hence, we never want to integrate 1/29,204 beyond x = 200 or beyond y = 150. If 
one takes the limit as $y \to \infty$ of F(x, y) (for fixed $4 \le x \le 200$), one gets the second 
case in the formula above, which then is the c.d.f. of X, $F_1(x)$. Similarly, if one takes 
$\lim_{x \to \infty} F(x, y)$ (for fixed $1 \le y \le 150$), one gets the third case in the formula, 
which then is the c.d.f. of Y, $F_2(y)$. 


Summary 


The joint c.d.f. of two random variables X and Y is $F(x, y) = \Pr(X \le x \text{ and } Y \le y)$. 
The joint p.d.f. of two continuous random variables is a nonnegative function f such 
that the probability of the pair (X, Y) being in a set C is the integral of f(x, y) over the 
set C, if the integral exists. The joint p.d.f. is also the second mixed partial derivative 
of the joint c.d.f. with respect to both variables. The joint p.f. of two discrete random 
variables is a nonnegative function f such that the probability of the pair (X, Y) being 
in a set C is the sum of f(x, y) over all points in C. A joint p.f. can be strictly positive at 
countably many pairs (x, y) at most. The joint p.f./p.d.f. of a discrete random variable 
X and a continuous random variable Y is a nonnegative function f such that the 
probability of the pair (X, Y) being in a set C is obtained by summing f(x, y) over 
all x such that $(x, y) \in C$ for each y and then integrating the resulting function of y. 


Exercises 


1. Suppose that the joint p.d.f. of a pair of random variables 
(X, Y) is constant on the rectangle where $0 \le x \le 2$ 
and $0 \le y \le 1$, and suppose that the p.d.f. is 0 off of this 
rectangle. 


a. Find the constant value of the p.d.f. on the rectangle. 
b. Find $\Pr(X \ge Y)$. 


2. Suppose that in an electric display sign there are three 
light bulbs in the first row and four light bulbs in the second 
row. Let X denote the number of bulbs in the first row that 
will be burned out at a specified time t, and let Y denote 
the number of bulbs in the second row that will be burned 
out at the same time t. Suppose that the joint p.f. of X and 
Y is as specified in the following table: 


                  Y
 X      0      1      2      3      4

 0    0.08   0.07   0.06   0.01   0.01
 1    0.06   0.10   0.12   0.05   0.02
 2    0.05   0.06   0.09   0.04   0.03
 3    0.02   0.03   0.03   0.03   0.04


Determine each of the following probabilities: 


a. $\Pr(X = 2)$ 
b. $\Pr(Y \ge 2)$ 
c. $\Pr(X \le 2 \text{ and } Y \le 2)$ 
d. $\Pr(X = Y)$ 
e. $\Pr(X > Y)$ 


3. Suppose that X and Y have a discrete joint distribution 
for which the joint p.f. is defined as follows: 

\[
f(x, y) = \begin{cases}
c\,|x + y| & \text{for } x = -2, -1, 0, 1, 2 \text{ and } y = -2, -1, 0, 1, 2, \\
0 & \text{otherwise.}
\end{cases}
\]

Determine (a) the value of the constant c; (b) $\Pr(X = 0 \text{ and } Y = -2)$; 
(c) $\Pr(X = 1)$; (d) $\Pr(|X - Y| \le 1)$. 


4. Suppose that X and Y have a continuous joint distribution 
for which the joint p.d.f. is defined as follows: 

\[
f(x, y) = \begin{cases}
c y^2 & \text{for } 0 \le x \le 2 \text{ and } 0 \le y \le 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Determine (a) the value of the constant c; (b) $\Pr(X + Y > 2)$; 
(c) $\Pr(Y < 1/2)$; (d) $\Pr(X \le 1)$; (e) $\Pr(X = 3Y)$. 


5. Suppose that the joint p.d.f. of two random variables X 
and Y is as follows: 

\[
f(x, y) = \begin{cases}
c (x^2 + y) & \text{for } 0 \le y \le 1 - x^2, \\
0 & \text{otherwise.}
\end{cases}
\]

Determine (a) the value of the constant c; 
(b) $\Pr(0 \le X \le 1/2)$; (c) $\Pr(Y \le X + 1)$; 
(d) $\Pr(Y = X^2)$. 


6. Suppose that a point (X, Y) is chosen at random from 
the region S in the xy-plane containing all points (x, y) 
such that $x \ge 0$, $y \ge 0$, and $4y + x \le 4$. 


a. Determine the joint p.d.f. of X and Y. 


b. Suppose that $S_0$ is a subset of the region S having area 
α and determine $\Pr[(X, Y) \in S_0]$. 


7. Suppose that a point (X, Y) is to be chosen from the 
square S in the xy-plane containing all points (x, y) such 
that $0 \le x \le 1$ and $0 \le y \le 1$. Suppose that the probability 
that the chosen point will be the corner (0, 0) is 0.1, 
the probability that it will be the corner (1, 0) is 0.2, the 
probability that it will be the corner (0, 1) is 0.4, and the 
probability that it will be the corner (1, 1) is 0.1. Suppose 
also that if the chosen point is not one of the four corners 
of the square, then it will be an interior point of the 
square and will be chosen according to a constant p.d.f. 
over the interior of the square. Determine (a) $\Pr(X \le 1/4)$ 
and (b) $\Pr(X + Y \le 1)$. 


8. Suppose that X and Y are random variables such that
(X, Y) must belong to the rectangle in the xy-plane
containing all points (x, y) for which 0 ≤ x ≤ 3 and 0 ≤ y ≤ 4.
Suppose also that the joint c.d.f. of X and Y at every point
(x, y) in this rectangle is specified as follows:

\[
F(x, y) = \frac{1}{156}\, x y \,(x^2 + y).
\]

Determine (a) Pr(1 < X < 2 and 1 < Y < 2);
(b) Pr(2 < X < 4 and 2 < Y < 4); (c) the c.d.f. of Y;
(d) the joint p.d.f. of X and Y; (e) Pr(Y < X).


9. In Example 3.4.5, compute the probability that water 
demand X is greater than electric demand Y. 


10. Let Y be the rate (calls per hour) at which calls arrive 
at a switchboard. Let X be the number of calls during a 
two-hour period. A popular choice of joint p.f./p.d.f. for 
(X, Y) in this example would be one like 


\[
f(x, y) =
\begin{cases}
\dfrac{(2y)^x}{x!}\, e^{-3y} & \text{if } y > 0 \text{ and } x = 0, 1, \ldots, \\
0 & \text{otherwise.}
\end{cases}
\]


a. Verify that f is a joint p.f./p.d.f. Hint: First, sum over
the x values using the well-known formula for the
power series expansion of e^{2y}.


b. Find Pr(X = 0). 
11. Consider the clinical trial of depression drugs in Example 2.1.4.
Suppose that a patient is selected at random
from the 150 patients in that study and we record Y, an
indicator of the treatment group for that patient, and X, an
indicator of whether or not the patient relapsed. Table 3.3
contains the joint p.f. of X and Y.

a. Calculate the probability that a patient selected at
random from this study used Lithium (either alone
or in combination with Imipramine) and did not relapse.
b. Calculate the probability that the patient had a relapse
(without regard to the treatment group).


Table 3.3 Proportions in clinical depression study for Exercise 11

                                Treatment group (Y)
Response (X)      Imipramine (1)   Lithium (2)   Combination (3)   Placebo (4)
Relapse (0)            0.120          0.087           0.146           0.160
No relapse (1)         0.147          0.166           0.107           0.067

3.5 Marginal Distributions


Earlier in this chapter, we introduced distributions for random variables, and
in Sec. 3.4 we discussed a generalization to joint distributions of two random
variables simultaneously. Often, we start with a joint distribution of two random
variables and we then want to find the distribution of just one of them. The
distribution of one random variable X computed from a joint distribution is also
called the marginal distribution of X. Each random variable will have a marginal
c.d.f. as well as a marginal p.d.f. or p.f. We also introduce the concept of independent
random variables, which is a natural generalization of independent events.


Deriving a Marginal p.f. or a Marginal p.d.f.


We have seen in Theorem 3.4.5 that if the joint c.d.f. F of two random variables X
and Y is known, then the c.d.f. F_1 of the random variable X can be derived from
F. We saw an example of this derivation in Example 3.4.15. If X has a continuous
distribution, we can also derive the p.d.f. of X from the joint distribution.


Example 3.5.1  Demands for Utilities. Look carefully at the formula for F(x, y) in Example 3.4.15,
specifically the last two branches that we identified as F_1(x) and F_2(y), the c.d.f.'s of
the two individual random variables X and Y. It is apparent from those two formulas
and Theorem 3.3.5 that the p.d.f. of X alone is

\[
f_1(x) =
\begin{cases}
\dfrac{1}{196} & \text{for } 4 \le x \le 200, \\
0 & \text{otherwise,}
\end{cases}
\]

which matches what we already found in Example 3.2.1. Similarly, the p.d.f. of Y
alone is

\[
f_2(y) =
\begin{cases}
\dfrac{1}{149} & \text{for } 1 \le y \le 150, \\
0 & \text{otherwise.}
\end{cases}
\]


The ideas employed in Example 3.5.1 lead to the following definition. 


Definition 3.5.1  Marginal c.d.f./p.f./p.d.f. Suppose that X and Y have a joint distribution. The c.d.f. of
X derived by Theorem 3.4.5 is called the marginal c.d.f. of X. Similarly, the p.f. or p.d.f.
of X associated with the marginal c.d.f. of X is called the marginal p.f. or marginal
p.d.f. of X.


To obtain a specific formula for the marginal p.f. or marginal p.d.f., we start with
a discrete joint distribution.


Theorem 3.5.1  If X and Y have a discrete joint distribution for which the joint p.f. is f, then the
marginal p.f. f_1 of X is

\[
f_1(x) = \sum_{\text{All } y} f(x, y). \tag{3.5.1}
\]

Similarly, the marginal p.f. f_2 of Y is f_2(y) = Σ_{All x} f(x, y).


Proof We prove the result for f_1, as the proof for f_2 is similar. We illustrate the
proof in Fig. 3.15. In that figure, the set of points in the dashed box is the set of
pairs with first coordinate x. The event {X = x} can be expressed as the union of the
events represented by the pairs in the dashed box, namely, B_y = {X = x and Y = y}
for all possible y. The B_y events are disjoint and Pr(B_y) = f(x, y). Since Pr(X = x) =
Σ_{All y} Pr(B_y), Eq. (3.5.1) holds.


Figure 3.15 Computing f_1(x) from the joint p.f.


Example 3.5.2  Deriving a Marginal p.f. from a Table of Probabilities. Suppose that X and Y are the
random variables in Example 3.4.3 on page 119. These are respectively the numbers
of cars and televisions owned by a randomly selected household in a certain suburban
area. Table 3.2 on page 119 gives their joint p.f., and we repeat that table in Table 3.4
together with row and column totals added to the margins.

The marginal p.f. f_1 of X can be read from the row totals of Table 3.4. The
numbers were obtained by summing the values in each row of this table from the four
columns in the central part of the table (those labeled y = 1, 2, 3, 4). In this way, it is
found that f_1(1) = 0.2, f_1(2) = 0.6, f_1(3) = 0.2, and f_1(x) = 0 for all other values of x.
This marginal p.f. gives the probabilities that a randomly selected household owns
1, 2, or 3 cars. Similarly, the marginal p.f. f_2 of Y, the probabilities that a household
owns 1, 2, 3, or 4 televisions, can be read from the column totals. These numbers were
obtained by adding the numbers in each of the columns from the three rows in the
central part of the table (those labeled x = 1, 2, 3).
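The row and column totals described above can be reproduced with a few lines of code. The following is a minimal sketch, not part of the original text: it stores the joint p.f. of Table 3.4 as a nested dictionary (a representation chosen here purely for illustration) and applies Eq. (3.5.1) using exact fractions. The names `table`, `marginal_x`, and `marginal_y` are hypothetical.

```python
from fractions import Fraction

# Joint p.f. of Table 3.4: outer keys are x = 1, 2, 3; inner keys are y = 1, 2, 3, 4.
table = {
    1: {1: "0.1", 2: "0", 3: "0.1", 4: "0"},
    2: {1: "0.3", 2: "0", 3: "0.1", 4: "0.2"},
    3: {1: "0", 2: "0.2", 3: "0", 4: "0"},
}
# Convert the decimal strings to exact fractions to avoid rounding noise.
joint = {x: {y: Fraction(p) for y, p in row.items()} for x, row in table.items()}


def marginal_x(joint):
    """Eq. (3.5.1): for each x, sum f(x, y) over all y (the row totals)."""
    return {x: sum(row.values()) for x, row in joint.items()}


def marginal_y(joint):
    """For each y, sum f(x, y) over all x (the column totals)."""
    totals = {}
    for row in joint.values():
        for y, p in row.items():
            totals[y] = totals.get(y, Fraction(0)) + p
    return totals


f1 = marginal_x(joint)
f2 = marginal_y(joint)
print({x: float(p) for x, p in f1.items()})
print({y: float(p) for y, p in f2.items()})
```

The two printed dictionaries correspond to the Total column and the Total row of Table 3.4, respectively.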


The name marginal distribution derives from the fact that the marginal distributions
are the totals that appear in the margins of tables like Table 3.4.


Table 3.4 Joint p.f. f(x, y) with marginal p.f.'s for Example 3.5.2

                  y
x         1     2     3     4    Total
1        0.1    0    0.1    0     0.2
2        0.3    0    0.1   0.2    0.6
3         0    0.2    0     0     0.2
Total    0.4   0.2   0.2   0.2    1.0


If X and Y have a continuous joint distribution for which the joint p.d.f. is f, then
the marginal p.d.f. f_1 of X is again determined in the manner shown in Eq. (3.5.1), but
the sum over all possible values of Y is now replaced by the integral over all possible
values of Y.


Theorem 3.5.2  If X and Y have a continuous joint distribution with joint p.d.f. f, then the marginal
p.d.f. f_1 of X is

\[
f_1(x) = \int_{-\infty}^{\infty} f(x, y)\, dy \quad \text{for } -\infty < x < \infty. \tag{3.5.2}
\]

Similarly, the marginal p.d.f. f_2 of Y is

\[
f_2(y) = \int_{-\infty}^{\infty} f(x, y)\, dx \quad \text{for } -\infty < y < \infty. \tag{3.5.3}
\]


Proof We prove (3.5.2) as the proof of (3.5.3) is similar. For each x, Pr(X ≤ x) can be
written as Pr((X, Y) ∈ C), where C = {(r, s) : r ≤ x}. We can compute this probability
directly from the joint p.d.f. of X and Y as

\[
\begin{aligned}
\Pr((X, Y) \in C) &= \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(r, s)\, ds\, dr \\
&= \int_{-\infty}^{x} \left[ \int_{-\infty}^{\infty} f(r, s)\, ds \right] dr.
\end{aligned}
\tag{3.5.4}
\]

The inner integral in the last expression of Eq. (3.5.4) is a function of r and it
can easily be recognized as f_1(r), where f_1 is defined in Eq. (3.5.2). It follows that
Pr(X ≤ x) = ∫_{−∞}^{x} f_1(r) dr, so f_1 is the marginal p.d.f. of X.


Example 3.5.3  Deriving a Marginal p.d.f. Suppose that the joint p.d.f. of X and Y is as specified in
Example 3.4.8, namely,

\[
f(x, y) =
\begin{cases}
\dfrac{21}{4}\, x^2 y & \text{for } x^2 \le y \le 1, \\
0 & \text{otherwise.}
\end{cases}
\]

The set S of points (x, y) for which f(x, y) > 0 is sketched in Fig. 3.16. We shall
determine first the marginal p.d.f. f_1 of X and then the marginal p.d.f. f_2 of Y.

It can be seen from Fig. 3.16 that X cannot take any value outside the interval
[−1, 1]. Therefore, f_1(x) = 0 for x < −1 or x > 1. Furthermore, for −1 ≤ x ≤ 1, it is
seen from Fig. 3.16 that f(x, y) = 0 unless x² ≤ y ≤ 1. Therefore, for −1 ≤ x ≤ 1,

\[
f_1(x) = \int_{-\infty}^{\infty} f(x, y)\, dy = \int_{x^2}^{1} \frac{21}{4}\, x^2 y\, dy = \frac{21}{8}\, x^2 (1 - x^4).
\]

This marginal p.d.f. of X is sketched in Fig. 3.17.

Next, it can be seen from Fig. 3.16 that Y cannot take any value outside the
interval [0, 1]. Therefore, f_2(y) = 0 for y < 0 or y > 1. Furthermore, for 0 ≤ y ≤ 1, it
is seen from Fig. 3.12 that f(x, y) = 0 unless −√y ≤ x ≤ √y. Therefore, for 0 ≤ y ≤ 1,

\[
f_2(y) = \int_{-\infty}^{\infty} f(x, y)\, dx = \int_{-\sqrt{y}}^{\sqrt{y}} \frac{21}{4}\, x^2 y\, dx = \frac{7}{2}\, y^{5/2}.
\]

This marginal p.d.f. of Y is sketched in Fig. 3.18.


Figure 3.16 The set S where f(x, y) > 0 in Example 3.5.3.

Figure 3.17 The marginal p.d.f. of X in Example 3.5.3.

Figure 3.18 The marginal p.d.f. of Y in Example 3.5.3.
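As a numerical sanity check on this example, one can integrate the joint p.d.f. directly and compare with the closed forms (21/8)x²(1 − x⁴) and (7/2)y^{5/2}. The sketch below is illustrative only: the midpoint rule, the grid size `n`, and all function names are assumptions made here, not part of the text.

```python
def joint_pdf(x, y):
    """Joint p.d.f. of the example: (21/4) x^2 y on x^2 <= y <= 1, zero elsewhere."""
    return 21.0 / 4.0 * x * x * y if x * x <= y <= 1.0 else 0.0


def integrate(g, a, b, n=20000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))


def f1_numeric(x):
    # Eq. (3.5.2): integrate the joint p.d.f. over y (Y lies in [0, 1]).
    return integrate(lambda y: joint_pdf(x, y), 0.0, 1.0)


def f2_numeric(y):
    # Eq. (3.5.3): integrate the joint p.d.f. over x (X lies in [-1, 1]).
    return integrate(lambda x: joint_pdf(x, y), -1.0, 1.0)


def f1_closed(x):
    return 21.0 / 8.0 * x**2 * (1.0 - x**4) if -1.0 <= x <= 1.0 else 0.0


def f2_closed(y):
    return 3.5 * y**2.5 if 0.0 <= y <= 1.0 else 0.0


for x in (-0.5, 0.3, 0.8):
    print(x, f1_numeric(x), f1_closed(x))
for y in (0.25, 0.6):
    print(y, f2_numeric(y), f2_closed(y))
```

The numerical and closed-form columns should agree to about three decimal places; the small discrepancy comes from the jump of the integrand at the boundary of S.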


If X has a discrete distribution and Y has a continuous distribution, we can derive
the marginal p.f. of X and the marginal p.d.f. of Y from the joint p.f./p.d.f. in the same
ways that we derived a marginal p.f. or a marginal p.d.f. from a joint p.f. or a joint
p.d.f. The following result can be proven by combining the techniques used in the
proofs of Theorems 3.5.1 and 3.5.2.


Theorem 3.5.3  Let f be the joint p.f./p.d.f. of X and Y, with X discrete and Y continuous. Then the
marginal p.f. of X is

\[
f_1(x) = \Pr(X = x) = \int_{-\infty}^{\infty} f(x, y)\, dy, \quad \text{for all } x,
\]

and the marginal p.d.f. of Y is

\[
f_2(y) = \sum_{x} f(x, y), \quad \text{for } -\infty < y < \infty.
\]


Example 3.5.4  Determining a Marginal p.f. and Marginal p.d.f. from a Joint p.f./p.d.f. Suppose that the
joint p.f./p.d.f. of X and Y is as in Example 3.4.11 on page 124. The marginal p.f. of X
is obtained by integrating

\[
f_1(x) = \int_0^1 \frac{x y^{x-1}}{3}\, dy = \frac{1}{3},
\]

for x = 1, 2, 3. The marginal p.d.f. of Y is obtained by summing

\[
f_2(y) = \frac{1}{3} + \frac{2y}{3} + y^2, \quad \text{for } 0 < y < 1.
\]

Although the marginal distributions of X and Y can be derived from their 
joint distribution, it is not possible to reconstruct the joint distribution of X and 
Y from their marginal distributions without additional information. For instance, 
the marginal p.d.f.’s sketched in Figs. 3.17 and 3.18 reveal no information about the 
relationship between X and Y. In fact, by definition, the marginal distribution of 
X specifies probabilities for X without regard for the values of any other random 
variables. This property of a marginal p.d.f. can be further illustrated by another
example. 


Example 3.5.5  Marginal and Joint Distributions. Suppose that a penny and a nickel are each tossed n
times so that every pair of sequences of tosses (n tosses in each sequence) is equally
likely to occur. Consider the following two definitions of X and Y: (i) X is the number
of heads obtained with the penny, and Y is the number of heads obtained with the
nickel. (ii) Both X and Y are the number of heads obtained with the penny, so the
random variables X and Y are actually identical.

In case (i), the marginal distribution of X and the marginal distribution of Y will
be identical binomial distributions. The same pair of marginal distributions of X and
Y will also be obtained in case (ii). However, the joint distribution of X and Y will
not be the same in the two cases. In case (i), X and Y can take different values. Because
each of the 2^{2n} pairs of sequences is equally likely, their joint p.f. is

\[
f(x, y) =
\begin{cases}
\dbinom{n}{x} \dbinom{n}{y} \left( \dfrac{1}{2} \right)^{2n} & \text{for } x = 0, 1, \ldots, n \text{ and } y = 0, 1, \ldots, n, \\
0 & \text{otherwise.}
\end{cases}
\]

In case (ii), X and Y must take the same value, and their joint p.f. is

\[
f(x, y) =
\begin{cases}
\dbinom{n}{x} \left( \dfrac{1}{2} \right)^{n} & \text{for } x = y = 0, 1, \ldots, n, \\
0 & \text{otherwise.}
\end{cases}
\]
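The point of the penny-and-nickel example, that two different joint distributions can share the same pair of marginals, can be checked by brute force for a small n. The sketch below assumes fair coins (which is what "every pair of sequences is equally likely" implies) and picks n = 3 arbitrarily; the dictionary layout and the helper names `binom_pf` and `marginals` are illustrative choices, not from the text.

```python
from fractions import Fraction
from math import comb

n = 3  # number of tosses of each coin; an arbitrary small value for illustration


def binom_pf(k):
    """Binomial(n, 1/2) p.f.: probability of k heads in n fair tosses."""
    return Fraction(comb(n, k), 2**n)


# Case (i): X counts penny heads, Y counts nickel heads (separate coins).
joint_i = {(x, y): binom_pf(x) * binom_pf(y)
           for x in range(n + 1) for y in range(n + 1)}

# Case (ii): X and Y are the same count, so all probability sits on x == y.
joint_ii = {(x, y): (binom_pf(x) if x == y else Fraction(0))
            for x in range(n + 1) for y in range(n + 1)}


def marginals(joint):
    """Return (f1, f2): sums of the joint p.f. over y and over x, respectively."""
    f1 = {x: sum(p for (a, _), p in joint.items() if a == x) for x in range(n + 1)}
    f2 = {y: sum(p for (_, b), p in joint.items() if b == y) for y in range(n + 1)}
    return f1, f2


print(marginals(joint_i) == marginals(joint_ii))  # are the marginals identical?
print(joint_i == joint_ii)                        # are the joint p.f.'s identical?
```

The first comparison holds and the second does not, which is exactly why the marginals alone cannot determine the joint distribution.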


Independent Random Variables


Example 3.5.6  Demands for Utilities. In Examples 3.4.15 and 3.5.1, we found the marginal c.d.f.'s of
water and electric demand were, respectively,

\[
F_1(x) =
\begin{cases}
0 & \text{for } x < 4, \\
\dfrac{x - 4}{196} & \text{for } 4 \le x \le 200, \\
1 & \text{for } x > 200,
\end{cases}
\qquad
F_2(y) =
\begin{cases}
0 & \text{for } y < 1, \\
\dfrac{y - 1}{149} & \text{for } 1 \le y \le 150, \\
1 & \text{for } y > 150.
\end{cases}
\]

The product of these two functions is precisely the same as the joint c.d.f. of X and
Y given in Example 3.4.15. One consequence of this fact is that, for every x and
y, Pr(X ≤ x and Y ≤ y) = Pr(X ≤ x) Pr(Y ≤ y). This equation makes X and Y an
example of the next definition.


Definition 3.5.2  Independent Random Variables. It is said that two random variables X and Y are
independent if, for every two sets A and B of real numbers such that {X ∈ A} and
{Y ∈ B} are events,

\[
\Pr(X \in A \text{ and } Y \in B) = \Pr(X \in A) \Pr(Y \in B). \tag{3.5.5}
\]

In other words, let E be any event the occurrence or nonoccurrence of which depends
only on the value of X (such as E = {X ∈ A}), and let D be any event the occurrence or
nonoccurrence of which depends only on the value of Y (such as D = {Y ∈ B}). Then
X and Y are independent random variables if and only if E and D are independent
events for all such events E and D.

If X and Y are independent, then for all real numbers x and y, it must be true
that

\[
\Pr(X \le x \text{ and } Y \le y) = \Pr(X \le x) \Pr(Y \le y). \tag{3.5.6}
\]

Moreover, since all probabilities for X and Y of the type appearing in Eq. (3.5.5) can
be derived from probabilities of the type appearing in Eq. (3.5.6), it can be shown that
if Eq. (3.5.6) is satisfied for all values of x and y, then X and Y must be independent.
The proof of this statement is beyond the scope of this book and is omitted, but we
summarize it as the following theorem.


Theorem 3.5.4  Let the joint c.d.f. of X and Y be F, let the marginal c.d.f. of X be F_1, and let the
marginal c.d.f. of Y be F_2. Then X and Y are independent if and only if, for all real
numbers x and y, F(x, y) = F_1(x) F_2(y).


For example, the demands for water and electricity in Example 3.5.6 are independent. 
If one returns to Example 3.5.1, one also sees that the product of the marginal p.d.f.’s 
of water and electric demand equals their joint p.d.f. given in Eq. (3.4.2). This relation 
is characteristic of independent random variables whether discrete or continuous. 


Theorem 3.5.5  Suppose that X and Y are random variables that have a joint p.f., p.d.f., or p.f./p.d.f. f.
Then X and Y will be independent if and only if f can be represented in the following
form for −∞ < x < ∞ and −∞ < y < ∞:

\[
f(x, y) = h_1(x) h_2(y), \tag{3.5.7}
\]

where h_1 is a nonnegative function of x alone and h_2 is a nonnegative function of y
alone.


Proof We shall give the proof only for the case in which X is discrete and Y is
continuous. The other cases are similar. For the "if" part, assume that Eq. (3.5.7)
holds. Write

\[
f_1(x) = \int_{-\infty}^{\infty} h_1(x) h_2(y)\, dy = c_1 h_1(x),
\]

where c_1 = ∫_{−∞}^{∞} h_2(y) dy must be finite and strictly positive, otherwise f_1 wouldn't be
a p.f. So, h_1(x) = f_1(x)/c_1. Similarly,

\[
f_2(y) = \sum_{x} h_1(x) h_2(y) = h_2(y) \sum_{x} \frac{1}{c_1} f_1(x) = \frac{1}{c_1} h_2(y).
\]

So, h_2(y) = c_1 f_2(y). Since f(x, y) = h_1(x) h_2(y), it follows that

\[
f(x, y) = \frac{f_1(x)}{c_1}\, c_1 f_2(y) = f_1(x) f_2(y). \tag{3.5.8}
\]

Now let A and B be sets of real numbers. Assuming the integrals exist, we can write

\[
\begin{aligned}
\Pr(X \in A \text{ and } Y \in B) &= \sum_{x \in A} \int_B f(x, y)\, dy \\
&= \int_B \sum_{x \in A} f_1(x) f_2(y)\, dy \\
&= \sum_{x \in A} f_1(x) \int_B f_2(y)\, dy,
\end{aligned}
\]

where the first equality is from Definition 3.4.5, the second is from Eq. (3.5.8), and the
third is straightforward rearrangement. The last expression equals Pr(X ∈ A) Pr(Y ∈ B),
so X and Y are independent according to Definition 3.5.2.

For the "only if" part, assume that X and Y are independent. Let A and B be sets
of real numbers. Let f_1 be the marginal p.f. of X, and let f_2 be the marginal p.d.f. of
Y. Then

\[
\begin{aligned}
\Pr(X \in A \text{ and } Y \in B) &= \sum_{x \in A} f_1(x) \int_B f_2(y)\, dy \\
&= \int_B \sum_{x \in A} f_1(x) f_2(y)\, dy
\end{aligned}
\]

(if the integral exists), where the first equality follows from Definition 3.5.2 and the
second is a straightforward rearrangement. We now see that f_1(x) f_2(y) satisfies the
conditions needed to be f(x, y) as stated in Definition 3.4.5.


A simple corollary follows from Theorem 3.5.5.


Corollary 3.5.1  Two random variables X and Y are independent if and only if the following
factorization is satisfied for all real numbers x and y:

\[
f(x, y) = f_1(x) f_2(y). \tag{3.5.9}
\]

As stated in Sec. 3.2 (see page 102), in a continuous distribution the values of a
p.d.f. can be changed arbitrarily at any countable set of points. Therefore, for such a
distribution it would be more precise to state that the random variables X and Y are
independent if and only if it is possible to choose versions of f, f_1, and f_2 such that
Eq. (3.5.9) is satisfied for −∞ < x < ∞ and −∞ < y < ∞.


The Meaning of Independence We have given a mathematical definition of independent
random variables in Definition 3.5.2, but we have not yet given any interpretation
of the concept of independent random variables. Because of the close
connection between independent events and independent random variables, the
interpretation of independent random variables should be closely related to the
interpretation of independent events. We model two events as independent if learning
that one of them occurs does not change the probability that the other one occurs.
It is easiest to extend this idea to discrete random variables. Suppose that X and Y
have a discrete joint distribution. If, for each y, learning that Y = y does not change
any of the probabilities of the events {X = x}, we would like to say that X and Y are
independent. From Corollary 3.5.1 and the definition of marginal p.f., we see that
indeed X and Y are independent if and only if, for each y and x such that Pr(Y = y) > 0,
Pr(X = x | Y = y) = Pr(X = x), that is, learning the value of Y doesn't change any of
the probabilities associated with X. When we formally define conditional distributions
in Sec. 3.6, we shall see that this interpretation of independent discrete random
variables extends to all bivariate distributions. In summary, if we are trying to decide
whether or not to model two random variables X and Y as independent, we should
think about whether we would change the distribution of X after we learned the value
of Y or vice versa.


Table 3.5 Joint p.f. f(x, y) for Example 3.5.7

                         y
x          1      2      3      4      5      6     Total
0         1/24   1/24   1/24   1/24   1/24   1/24    1/4
1         1/12   1/12   1/12   1/12   1/12   1/12    1/2
2         1/24   1/24   1/24   1/24   1/24   1/24    1/4
Total     1/6    1/6    1/6    1/6    1/6    1/6    1.000


Example 3.5.7  Games of Chance. A carnival game consists of rolling a fair die, tossing a fair coin
two times, and recording both outcomes. Let Y stand for the number on the die,
and let X stand for the number of heads in the two tosses. It seems reasonable to
believe that all of the events determined by the roll of the die are independent of all
of the events determined by the flips of the coin. Hence, we can assume that X and Y
are independent random variables. The marginal distribution of Y is the uniform
distribution on the integers 1, ..., 6, while the distribution of X is the binomial
distribution with parameters 2 and 1/2. The marginal p.f.'s and the joint p.f. of X
and Y are given in Table 3.5, where the joint p.f. was constructed using Eq. (3.5.9).
The Total column gives the marginal p.f. f_1 of X, and the Total row gives the marginal
p.f. f_2 of Y.
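The construction of Table 3.5 from its margins can be written out directly. This is a small sketch under the independence assumption stated in the example; the dictionary representation and variable names are choices made here for illustration.

```python
from fractions import Fraction
from math import comb

# Marginal p.f. of X: binomial with parameters 2 and 1/2 (heads in two fair tosses).
f1 = {x: Fraction(comb(2, x), 4) for x in range(3)}
# Marginal p.f. of Y: uniform on the integers 1, ..., 6 (a fair die).
f2 = {y: Fraction(1, 6) for y in range(1, 7)}

# Eq. (3.5.9): under independence, the joint p.f. is the product of the marginals.
joint = {(x, y): f1[x] * f2[y] for x in f1 for y in f2}

for x in f1:
    row = [str(joint[(x, y)]) for y in f2]
    print(x, row, "total", f1[x])
```

Each printed row reproduces the corresponding row of Table 3.5, and the rows are proportional to one another, as they must be for independent X and Y.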


Example 3.5.8  Determining Whether Random Variables Are Independent in a Clinical Trial. Return to
the clinical trial of depression drugs in Exercise 11 of Sec. 3.4 (on page 129). In that
trial, a patient is selected at random from the 150 patients in the study and we record
Y, an indicator of the treatment group for that patient, and X, an indicator of whether
or not the patient relapsed. Table 3.6 repeats the joint p.f. of X and Y along with the
marginal distributions in the margins. We shall determine whether or not X and Y
are independent.

In Eq. (3.5.9), f(x, y) is the probability in the row labeled x and the column labeled y of the
table, f_1(x) is the number in the Total column of the row labeled x, and f_2(y) is the number
in the Total row of the column labeled y. It is seen in the table that f(0, 2) = 0.087, while
f_1(0) = 0.513 and f_2(2) = 0.253. Hence, f(0, 2) ≠ f_1(0) f_2(2) ≈ 0.130. It follows that
X and Y are not independent.
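The cell-by-cell comparison used for the clinical trial table generalizes to any finite table. The following sketch checks Eq. (3.5.9) in every cell of Table 3.6; the tolerance `tol` is an assumption introduced here to allow for the three-decimal rounding of the table, and the function names are hypothetical.

```python
# Joint p.f. from Table 3.6. Keys are (x, y): x = 0 for relapse, 1 for no relapse;
# y = 1, ..., 4 indexes the treatment group.
joint = {
    (0, 1): 0.120, (0, 2): 0.087, (0, 3): 0.146, (0, 4): 0.160,
    (1, 1): 0.147, (1, 2): 0.166, (1, 3): 0.107, (1, 4): 0.067,
}


def marginals(joint):
    """Return (f1, f2): the row totals and the column totals of the joint table."""
    f1, f2 = {}, {}
    for (x, y), p in joint.items():
        f1[x] = f1.get(x, 0.0) + p
        f2[y] = f2.get(y, 0.0) + p
    return f1, f2


def is_independent(joint, tol=1e-3):
    """Check f(x, y) = f1(x) f2(y) in every cell, up to a rounding tolerance."""
    f1, f2 = marginals(joint)
    return all(abs(p - f1[x] * f2[y]) <= tol for (x, y), p in joint.items())


f1, f2 = marginals(joint)
print(round(f1[0], 3), round(f2[2], 3), round(f1[0] * f2[2], 3))
print(is_independent(joint))
```

A single cell that violates the factorization is enough to rule out independence, so the function stops being true as soon as any product of margins differs from the table entry by more than the tolerance.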


It should be noted from Examples 3.5.7 and 3.5.8 that X and Y will be independent
if and only if the rows of the table specifying their joint p.f. are proportional to
one another, or equivalently, if and only if the columns of the table are proportional
to one another.


Table 3.6 Proportions with marginals in Example 3.5.8

                                Treatment group (Y)
Response (X)      Imipramine (1)   Lithium (2)   Combination (3)   Placebo (4)   Total
Relapse (0)            0.120          0.087           0.146           0.160      0.513
No relapse (1)         0.147          0.166           0.107           0.067      0.487
Total                  0.267          0.253           0.253           0.227      1.0


Example 3.5.9  Calculating a Probability Involving Independent Random Variables. Suppose that two
measurements X and Y are made of the rainfall at a certain location on May 1 in two
consecutive years. It might be reasonable, given knowledge of the history of rainfall
on May 1, to treat the random variables X and Y as independent. Suppose that the
p.d.f. g of each measurement is as follows:

\[
g(x) =
\begin{cases}
2x & \text{for } 0 \le x \le 1, \\
0 & \text{otherwise.}
\end{cases}
\]

We shall determine the value of Pr(X + Y ≤ 1).

Since X and Y are independent and each has the p.d.f. g, it follows from Eq. (3.5.9)
that for all values of x and y the joint p.d.f. f(x, y) of X and Y will be specified by
the relation f(x, y) = g(x)g(y). Hence,

\[
f(x, y) =
\begin{cases}
4xy & \text{for } 0 \le x \le 1 \text{ and } 0 \le y \le 1, \\
0 & \text{otherwise.}
\end{cases}
\]

The set S in the xy-plane, where f(x, y) > 0, and the subset S_0, where x + y ≤ 1, are
sketched in Fig. 3.19. Thus,

\[
\Pr(X + Y \le 1) = \iint_{S_0} f(x, y)\, dx\, dy = \int_0^1 \int_0^{1 - x} 4xy\, dy\, dx = \frac{1}{6}.
\]

As a final note, if the two measurements X and Y had been made on the same day at
nearby locations, then it might not make as much sense to treat them as independent,
since we would expect them to be more similar to each other than to historical
rainfalls. For example, if we first learn that X is small compared to historical rainfall
on the date in question, we might then expect Y to be smaller than the historical
distribution would suggest.


Figure 3.19 The subset S_0 where x + y ≤ 1 in Example 3.5.9.
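The value 1/6 obtained for the rainfall example can be confirmed numerically by building the joint p.d.f. as a product of the two marginals and summing it over the region where x + y ≤ 1. The sketch below uses a simple two-dimensional midpoint rule; the grid size `n` and the function names are illustrative assumptions.

```python
def g(x):
    """Marginal p.d.f. of each rainfall measurement: 2x on [0, 1], zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def joint_pdf(x, y):
    # Independence: the joint p.d.f. is the product of the marginals, Eq. (3.5.9).
    return g(x) * g(y)


def prob_sum_at_most_one(n=400):
    """Midpoint-rule approximation of the integral of the joint p.d.f. over x + y <= 1."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n):
            y = (j + 0.5) * h
            if x + y <= 1.0:
                total += joint_pdf(x, y)
    return total * h * h


print(prob_sum_at_most_one(), 1.0 / 6.0)
```

The approximation is accurate only to roughly the grid spacing, because cells straddling the line x + y = 1 are counted either fully or not at all.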


Theorem 3.5.5 says that X and Y are independent if and only if, for all values of
x and y, f can be factored into the product of an arbitrary nonnegative function of x
and an arbitrary nonnegative function of y. However, it should be emphasized that,
just as in Eq. (3.5.9), the factorization in Eq. (3.5.7) must be satisfied for all values of
x and y (−∞ < x < ∞ and −∞ < y < ∞).


Example 3.5.10  Dependent Random Variables. Suppose that the joint p.d.f. of X and Y has the
following form:

\[
f(x, y) =
\begin{cases}
k x^2 y^2 & \text{for } x^2 + y^2 \le 1, \\
0 & \text{otherwise.}
\end{cases}
\]

We shall show that X and Y are not independent.

It is evident that at each point inside the circle x² + y² ≤ 1, f(x, y) can be factored
as in Eq. (3.5.7). However, this same factorization cannot also be satisfied at every
point outside this circle. For example, f(0.9, 0.9) = 0, but neither f_1(0.9) = 0 nor
f_2(0.9) = 0. (In Exercise 13, you can verify this feature of f_1 and f_2.)

The important feature of this example is that the values of X and Y are constrained
to lie inside a circle. The joint p.d.f. of X and Y is positive inside the circle
and zero outside the circle. Under these conditions, X and Y cannot be independent,
because for every given value y of Y, the possible values of X will depend on y. For
example, if Y = 0, then X can have any value such that X² ≤ 1; if Y = 1/2, then X
must have a value such that X² ≤ 3/4.


Example 3.5.10 shows that one must be careful when trying to apply Theorem 3.5.5.
The situation that arose in that example will occur whenever {(x, y) :
f(x, y) > 0} has boundaries that are curved or not parallel to the coordinate axes.
There is one important special case in which it is easy to check the conditions of
Theorem 3.5.5. The proof is left as an exercise.


Theorem 3.5.6  Let X and Y have a continuous joint distribution. Suppose that {(x, y) : f(x, y) > 0}
is a rectangular region R (possibly unbounded) with sides (if any) parallel to the
coordinate axes. Then X and Y are independent if and only if Eq. (3.5.7) holds for
all (x, y) ∈ R.
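The support argument behind Example 3.5.10 can be illustrated on a grid. If X and Y were independent, then any x with f_1(x) > 0 paired with any y with f_2(y) > 0 would have to give f(x, y) > 0 (for suitable versions of the p.d.f.'s). The sketch below looks for pairs that break this; the grid, the placeholder value k = 1.0 (only the sign of the density matters here), and the names are assumptions for illustration.

```python
def joint_pdf(x, y, k=1.0):
    """Joint p.d.f. of Example 3.5.10; k = 1.0 is a placeholder, not the true constant."""
    return k * x * x * y * y if x * x + y * y <= 1.0 else 0.0


grid = [i / 10.0 for i in range(-10, 11)]  # hypothetical evaluation grid on [-1, 1]

# Grid values of x (resp. y) at which the joint p.d.f. is positive for some partner.
xs = sorted({x for x in grid for y in grid if joint_pdf(x, y) > 0})
ys = sorted({y for x in grid for y in grid if joint_pdf(x, y) > 0})

# Pairs where each coordinate is individually possible, yet the joint density is zero.
violations = [(x, y) for x in xs for y in ys if joint_pdf(x, y) == 0]

print(len(violations) > 0)
print((0.9, 0.9) in violations)
```

For a rectangular support with sides parallel to the axes, as in Theorem 3.5.6, the same search would return an empty list, which is why only the factorization inside R needs to be checked in that case.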


Example 3.5.11  Verifying the Factorization of a Joint p.d.f. Suppose that the joint p.d.f. f of X and Y is
as follows:

\[
f(x, y) =
\begin{cases}
k e^{-(x + 2y)} & \text{for } x \ge 0 \text{ and } y \ge 0, \\
0 & \text{otherwise,}
\end{cases}
\]

where k is some constant. We shall first determine whether X and Y are independent
and then determine their marginal p.d.f.'s.

In this example, f(x, y) = 0 outside of an unbounded rectangular region R whose
sides are the lines x = 0 and y = 0. Furthermore, at each point inside R, f(x, y) can
be factored as in Eq. (3.5.7) by letting h_1(x) = k e^{−x} and h_2(y) = e^{−2y}. Therefore, X
and Y are independent.

It follows that in this case, except for constant factors, h_1(x) for x ≥ 0 and h_2(y)
for y ≥ 0 must be the marginal p.d.f.'s of X and Y. By choosing constants that make
h_1(x) and h_2(y) integrate to unity, we can conclude that the marginal p.d.f.'s f_1 and
f_2 of X and Y must be as follows:

\[
f_1(x) =
\begin{cases}
e^{-x} & \text{for } x \ge 0, \\
0 & \text{otherwise,}
\end{cases}
\]

and

\[
f_2(y) =
\begin{cases}
2 e^{-2y} & \text{for } y \ge 0, \\
0 & \text{otherwise.}
\end{cases}
\]

If we multiply f_1(x) times f_2(y) and compare the product to f(x, y), we see that
k = 2.


Note: Separate Functions of Independent Random Variables Are Independent. If
X and Y are independent, then h(X) and g(Y) are independent no matter what the
functions h and g are. This is true because for every t, the event {h(X) ≤ t} can always
be written as {X ∈ A}, where A = {x : h(x) ≤ t}. Similarly, {g(Y) ≤ u} can be written
as {Y ∈ B}, so Eq. (3.5.6) for h(X) and g(Y) follows from Eq. (3.5.5) for X and Y.


Summary 


Let f(x, y) be a joint p.f., joint p.d.f., or joint p.f./p.d.f. of two random variables X
and Y. The marginal p.f. or p.d.f. of X is denoted by f_1(x), and the marginal p.f. or
p.d.f. of Y is denoted by f_2(y). To obtain f_1(x), compute Σ_y f(x, y) if Y is discrete
or ∫_{−∞}^{∞} f(x, y) dy if Y is continuous. Similarly, to obtain f_2(y), compute Σ_x f(x, y)
if X is discrete or ∫_{−∞}^{∞} f(x, y) dx if X is continuous. The random variables X and
Y are independent if and only if f(x, y) = f_1(x) f_2(y) for all x and y. This is true
regardless of whether X and/or Y is continuous or discrete. A sufficient condition for
two continuous random variables to be independent is that R = {(x, y) : f(x, y) > 0}
be rectangular with sides parallel to the coordinate axes and that f(x, y) factors into
separate functions of x and y in R.


1. Suppose that X and Y have a continuous joint distribution
for which the joint p.d.f. is

\[
f(x, y) =
\begin{cases}
k & \text{for } a \le x \le b \text{ and } c \le y \le d, \\
0 & \text{otherwise,}
\end{cases}
\]

where a < b, c < d, and k > 0. Find the marginal distributions
of X and Y.


2. Suppose that X and Y have a discrete joint distribution
for which the joint p.f. is defined as follows:

\[
f(x, y) =
\begin{cases}
\dfrac{1}{30}\,(x + y) & \text{for } x = 0, 1, 2 \text{ and } y = 0, 1, 2, 3, \\
0 & \text{otherwise.}
\end{cases}
\]

a. Determine the marginal p.f.'s of X and Y.
b. Are X and Y independent?


3. Suppose that X and Y have a continuous joint distribution
for which the joint p.d.f. is defined as follows:

\[
f(x, y) =
\begin{cases}
\dfrac{3}{2}\, y^2 & \text{for } 0 \le x \le 2 \text{ and } 0 \le y \le 1, \\
0 & \text{otherwise.}
\end{cases}
\]

a. Determine the marginal p.d.f.'s of X and Y.
b. Are X and Y independent?
c. Are the event {X < 1} and the event {Y > 1/2} independent?


4. Suppose that the joint p.d.f. of X and Y is as follows:

\[
f(x, y) =
\begin{cases}
\dfrac{15}{4}\, x^2 & \text{for } 0 \le y \le 1 - x^2, \\
0 & \text{otherwise.}
\end{cases}
\]

a. Determine the marginal p.d.f.'s of X and Y.
b. Are X and Y independent?


5. A certain drugstore has three public telephone booths.
For i = 0, 1, 2, 3, let p_i denote the probability that exactly
i telephone booths will be occupied on any Monday
evening at 8:00 p.m.; and suppose that p_0 = 0.1, p_1 = 0.2,
p_2 = 0.4, and p_3 = 0.3. Let X and Y denote the number of
booths that will be occupied at 8:00 p.m. on two independent
Monday evenings. Determine: (a) the joint p.f. of X
and Y; (b) Pr(X = Y); (c) Pr(X > Y).


6. Suppose that in a certain drug the concentration of a
particular chemical is a random variable with a continuous
distribution for which the p.d.f. g is as follows:

\[
g(x) =
\begin{cases}
\dfrac{3}{8}\, x^2 & \text{for } 0 \le x \le 2, \\
0 & \text{otherwise.}
\end{cases}
\]

Suppose that the concentrations X and Y of the chemical
in two separate batches of the drug are independent random
variables for each of which the p.d.f. is g. Determine
(a) the joint p.d.f. of X and Y; (b) Pr(X = Y); (c) Pr(X > Y);
(d) Pr(X + Y < 1).


7. Suppose that the joint p.d.f. of X and Y is as follows:

\[
f(x, y) =
\begin{cases}
2x e^{-y} & \text{for } 0 \le x \le 1 \text{ and } 0 < y < \infty, \\
0 & \text{otherwise.}
\end{cases}
\]

Are X and Y independent?

8. Suppose that the joint p.d.f. of X and Y is as follows:

\[
f(x, y) =
\begin{cases}
24xy & \text{for } x \ge 0,\ y \ge 0, \text{ and } x + y \le 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Are X and Y independent?


9. Suppose that a point (X, Y) is chosen at random from 
the rectangle S defined as follows: 


S={(x, y) :0<x <2and1<y<4}. 


a. Determine the joint p.d.f. of X and Y, the marginal 
p.d.f. of X, and the marginal p.d.f. of Y.
b. Are X and Y independent? 


10. Suppose that a point (X, Y) is chosen at random from
the circle S defined as follows:

\[
S = \{ (x, y) : x^2 + y^2 \le 1 \}.
\]

a. Determine the joint p.d.f. of X and Y, the marginal
p.d.f. of X, and the marginal p.d.f. of Y.


b. Are X and Y independent?


11. Suppose that two persons make an appointment to
meet between 5 p.m. and 6 p.m. at a certain location, and
they agree that neither person will wait more than 10
minutes for the other person. If they arrive independently
at random times between 5 p.m. and 6 p.m., what is the
probability that they will meet?


12. Prove Theorem 3.5.6. 


13. In Example 3.5.10, verify that X and Y have the same
marginal p.d.f.'s and that

\[
f_1(x) =
\begin{cases}
\dfrac{2k x^2 (1 - x^2)^{3/2}}{3} & \text{if } -1 \le x \le 1, \\
0 & \text{otherwise.}
\end{cases}
\]


14. For the joint p.d.f. in Example 3.4.7, determine
whether or not X and Y are independent. 


15. A painting process consists of two stages. In the first 
stage, the paint is applied, and in the second stage, a pro- 
tective coat is added. Let X be the time spent on the first 
stage, and let Y be the time spent on the second stage. The 
first stage involves an inspection. If the paint fails the in- 
spection, one must wait three minutes and apply the paint 
again. After a second application, there is no further in- 
spection. The joint p.d.f. of X and Y is 


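\[
f(x, y) =
\begin{cases}
\dfrac{1}{3} & \text{if } 1 < x < 3 \text{ and } 0 < y < 1, \\
\dfrac{1}{6} & \text{if } 6 < x < 8 \text{ and } 0 < y < 1, \\
0 & \text{otherwise.}
\end{cases}
\]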


a. Sketch the region where f(x, y) > 0. Note that it is 
not exactly a rectangle. 


b. Find the marginal p.d.f’s of X and Y. 

c. Show that X and Y are independent. 
This problem does not contradict Theorem 3.5.6. In that 
theorem the conditions, including that the set where 


f(x, y) > 0 be rectangular, are sufficient but not neces- 
sary. 


3.6 Conditional Distributions 


We generalize the concept of conditional probability to conditional distributions. 
Recall that distributions are just collections of probabilities of events determined 
by random variables. Conditional distributions will be the probabilities of events 
determined by some random variables conditional on events determined by other 
random variables. The idea is that there will typically be many random variables of 
interest in an applied problem. After we observe some of those random variables, 
we want to be able to adjust the probabilities associated with the ones that have not 
yet been observed. The conditional distribution of one random variable X given 
another Y will be the distribution that we would use for X after we learn the value
of Y.


Table 3.7 Joint p.f. for Example 3.6.1

                           Brand Y
Stolen X      1       2       3       4       5     Total
0           0.129   0.298   0.161   0.280   0.108   0.976
1           0.010   0.010   0.001   0.002   0.001   0.024
Total       0.139   0.308   0.162   0.282   0.109   1.000


Discrete Conditional Distributions 


Example 3.6.1 Auto Insurance. Insurance companies keep track of how likely various cars are to be
stolen. Suppose that a company in a particular area computes the joint distribution 
of car brands and the indicator of whether the car will be stolen during a particular 
year that appears in Table 3.7. 

We let X = 1 mean that a car is stolen, and we let X = 0 mean that the car is not 
stolen. We let Y take one of the values from 1 to 5 to indicate the brand of car as 
indicated in Table 3.7. If a customer applies for insurance for a particular brand of 
car, the company needs to compute the distribution of the random variable X as part 
of its premium determination. The insurance company might adjust its premium
according to a risk factor such as likelihood of being stolen. Although, overall, the
probability that a car will be stolen is 0.024, if we assume that we know the brand
of car, the probability might change quite a bit. This section introduces the formal
concepts for addressing this type of problem. ◀


Suppose that X and Y are two random variables having a discrete joint distribution
for which the joint p.f. is f. As before, we shall let f_1 and f_2 denote the marginal
p.f.'s of X and Y, respectively. After we observe that Y = y, the probability that the
random variable X will take a particular value x is specified by the following
conditional probability:


\[
\Pr(X = x \mid Y = y) = \frac{\Pr(X = x \text{ and } Y = y)}{\Pr(Y = y)} = \frac{f(x, y)}{f_2(y)}. \tag{3.6.1}
\]


In other words, if it is known that Y = y, then the probability that X =x will be 
updated to the value in Eq. (3.6.1). Next, we consider the entire distribution of X 
after learning that Y = y. 


Definition 3.6.1 Conditional Distribution/p.f. Let X and Y have a discrete joint distribution with joint
p.f. f. Let f_2 denote the marginal p.f. of Y. For each y such that f_2(y) > 0, define

\[
g_1(x \mid y) = \frac{f(x, y)}{f_2(y)}. \tag{3.6.2}
\]

Then g_1 is called the conditional p.f. of X given Y. The discrete distribution whose p.f.
is g_1(·|y) is called the conditional distribution of X given that Y = y.


Table 3.8 Conditional p.f. of X given Y for Example 3.6.3

                        Brand Y
Stolen X      1       2       3       4       5
0           0.928   0.968   0.994   0.993   0.991
1           0.072   0.032   0.006   0.007   0.009


We should verify that g_1(x|y) is actually a p.f. as a function of x for each y. Let y be
such that f_2(y) > 0. Then g_1(x|y) ≥ 0 for all x and

\[
\sum_x g_1(x \mid y) = \frac{1}{f_2(y)} \sum_x f(x, y) = \frac{1}{f_2(y)} f_2(y) = 1.
\]

Notice that we do not bother to define g_1(x|y) for those y such that f_2(y) = 0.
Similarly, if x is a given value of X such that f_1(x) = Pr(X = x) > 0, and if g_2(y|x)
is the conditional p.f. of Y given that X = x, then

\[
g_2(y \mid x) = \frac{f(x, y)}{f_1(x)}. \tag{3.6.3}
\]

For each x such that f_1(x) > 0, the function g_2(y|x) will be a p.f. as a function of y.


Example 3.6.2 Calculating a Conditional p.f. from a Joint p.f. Suppose that the joint p.f. of X and Y is
as specified in Table 3.4 in Example 3.5.2. We shall determine the conditional p.f. of
Y given that X = 2.

The marginal p.f. of X appears in the Total column of Table 3.4, so f_1(2) = Pr(X =
2) = 0.6. Therefore, the conditional probability g_2(y|2) that Y will take a particular
value y is

\[
g_2(y \mid 2) = \frac{f(2, y)}{0.6}.
\]

It should be noted that for all possible values of y, the conditional probabilities
g_2(y|2) must be proportional to the joint probabilities f(2, y). In this example, each
value of f(2, y) is simply divided by the constant f_1(2) = 0.6 in order that the sum of
the results will be equal to 1. Thus,

\[
g_2(1 \mid 2) = 1/2, \quad g_2(2 \mid 2) = 0, \quad g_2(3 \mid 2) = 1/6, \quad g_2(4 \mid 2) = 1/3.
\]
◀


Example 3.6.3 Auto Insurance. Consider again the probabilities of car brands and cars being stolen
in Example 3.6.1. The conditional distribution of X (being stolen) given Y (brand)
is given in Table 3.8. It appears that Brand 1 is much more likely to be stolen than
other cars in this area, with Brand 2 also having a significant chance of being stolen. ◀
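The calculation behind Table 3.8 can be sketched in a few lines of Python. This is only an illustration of Eq. (3.6.2) applied to the numbers in Table 3.7; the dictionary layout and the function names `marginal_y` and `conditional_x_given_y` are choices made for this sketch and do not come from the text.

```python
# Joint p.f. f(x, y) from Table 3.7: keys are (x, y), x = stolen indicator, y = brand.
joint = {
    (0, 1): 0.129, (0, 2): 0.298, (0, 3): 0.161, (0, 4): 0.280, (0, 5): 0.108,
    (1, 1): 0.010, (1, 2): 0.010, (1, 3): 0.001, (1, 4): 0.002, (1, 5): 0.001,
}


def marginal_y(joint):
    """Marginal p.f. f_2(y): sum the joint p.f. over all x."""
    f2 = {}
    for (x, y), p in joint.items():
        f2[y] = f2.get(y, 0.0) + p
    return f2


def conditional_x_given_y(joint):
    """Conditional p.f. g_1(x|y) = f(x, y) / f_2(y), only where f_2(y) > 0."""
    f2 = marginal_y(joint)
    return {(x, y): p / f2[y] for (x, y), p in joint.items() if f2[y] > 0}


g1 = conditional_x_given_y(joint)
for y in range(1, 6):
    # One column of Table 3.8 per brand: Pr(X = 0 | Y = y), Pr(X = 1 | Y = y).
    print(y, round(g1[(0, y)], 3), round(g1[(1, y)], 3))
```

Each printed pair sums to 1, as the verification following Eq. (3.6.2) requires.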


Continuous Conditional Distributions 


Example 3.6.4 Processing Times. A manufacturing process consists of two stages. The first stage
takes Y minutes, and the whole process takes X minutes (which includes the first
Y minutes). Suppose that X and Y have a joint continuous distribution with joint
p.d.f.

\[
f(x, y) =
\begin{cases}
e^{-x} & \text{for } 0 < y < x < \infty, \\
0 & \text{otherwise.}
\end{cases}
\]

After we learn how much time Y the first stage takes, we want to update our
distribution for the total time X. In other words, we would like to be able to compute
a conditional distribution for X given Y = y. We cannot argue the same way as we
did with discrete joint distributions, because {Y = y} is an event with probability 0
for all y. ◀


To facilitate the solutions of problems such as the one posed in Example 3.6.4, 
the concept of conditional probability will be extended by considering the definition 
of the conditional p.f. of X given in Eq. (3.6.2) and the analogy between a p.f. and a 
p.d.f. 


Definition 3.6.2 Conditional p.d.f. Let X and Y have a continuous joint distribution with joint p.d.f.
f and respective marginals f_1 and f_2. Let y be a value such that f_2(y) > 0. Then the
conditional p.d.f. g_1 of X given that Y = y is defined as follows:

\[
g_1(x \mid y) = \frac{f(x, y)}{f_2(y)} \quad \text{for } -\infty < x < \infty. \tag{3.6.4}
\]

For values of y such that f_2(y) = 0, we are free to define g_1(x|y) however we wish,
so long as g_1(x|y) is a p.d.f. as a function of x.


It should be noted that Eq. (3.6.2) and Eq. (3.6.4) are identical. However,
Eq. (3.6.2) was derived as the conditional probability that X = x given that Y = y,
whereas Eq. (3.6.4) was defined to be the value of the conditional p.d.f. of X given
that Y = y. In fact, we should verify that g_1(x|y) as defined above really is a p.d.f.

Theorem 3.6.1 For each y, g_1(x|y) defined in Definition 3.6.2 is a p.d.f. as a function of x.

Proof If f_2(y) = 0, then g_1 is defined to be any p.d.f. we wish, and hence it is a p.d.f.
If f_2(y) > 0, g_1 is defined by Eq. (3.6.4). For each such y, it is clear that g_1(x|y) ≥ 0
for all x. Also, if f_2(y) > 0, then

\[
\int_{-\infty}^{\infty} g_1(x \mid y)\, dx = \frac{\int_{-\infty}^{\infty} f(x, y)\, dx}{f_2(y)} = \frac{f_2(y)}{f_2(y)} = 1,
\]

by using the formula for f_2(y) in Eq. (3.5.3). ∎


Example 3.6.5 Processing Times. In Example 3.6.4, Y is the time that the first stage of a process takes,
while X is the total time of the two stages. We want to calculate the conditional p.d.f.
of X given Y. We can calculate the marginal p.d.f. of Y as follows: For each y, the
possible values of X are all x > y, so for each y > 0,

\[
f_2(y) = \int_y^{\infty} e^{-x}\, dx = e^{-y},
\]

and f_2(y) = 0 for y ≤ 0. For each y > 0, the conditional p.d.f. of X given Y = y is then

\[
g_1(x \mid y) = \frac{f(x, y)}{f_2(y)} = \frac{e^{-x}}{e^{-y}} = e^{y - x}, \quad \text{for } x > y,
\]

and g_1(x|y) = 0 for x ≤ y. So, for example, if we observe Y = 4 and we want the
conditional probability that X > 9, we compute

\[
\Pr(X > 9 \mid Y = 4) = \int_9^{\infty} e^{4 - x}\, dx = e^{-5} = 0.0067.
\]
◀

Figure 3.20 The conditional p.d.f. g_1(x|y_0) is proportional to f(x, y_0).
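As a rough numerical check of the last probability, the conditional p.d.f. e^(y - x) for x > y can be integrated numerically and compared with e^(-5). The sketch below uses a plain midpoint rule; the cutoff `upper`, the step count, and the function names are illustrative assumptions, not part of the example.

```python
import math


def g1(x, y):
    """Conditional p.d.f. of X given Y = y: e^(y - x) for x > y, else 0."""
    return math.exp(y - x) if x > y else 0.0


def tail_probability(a, y, upper=60.0, steps=200_000):
    """Approximate Pr(X > a | Y = y) by the midpoint rule on [a, upper].

    The mass beyond `upper` is about e^(y - upper), which is negligible
    for the values used here.
    """
    h = (upper - a) / steps
    return h * sum(g1(a + (i + 0.5) * h, y) for i in range(steps))


approx = tail_probability(9.0, 4.0)
exact = math.exp(-5.0)
print(approx, exact)  # both should be close to 0.0067
```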


Definition 3.6.2 has an interpretation that can be understood by considering
Fig. 3.20. The joint p.d.f. f defines a surface over the xy-plane for which the height
f(x, y) at each point (x, y) represents the relative likelihood of that point. For
instance, if it is known that Y = y_0, then the point (x, y) must lie on the line y = y_0 in
the xy-plane, and the relative likelihood of any point (x, y_0) on this line is f(x, y_0).
Hence, the conditional p.d.f. g_1(x|y_0) of X should be proportional to f(x, y_0). In other
words, g_1(x|y_0) is essentially the same as f(x, y_0), but it includes a constant factor
1/[f_2(y_0)], which is required to make the conditional p.d.f. integrate to unity over all
values of x.

Similarly, for each value of x such that f_1(x) > 0, the conditional p.d.f. of Y given
that X = x is defined as follows:

\[
g_2(y \mid x) = \frac{f(x, y)}{f_1(x)} \quad \text{for } -\infty < y < \infty. \tag{3.6.5}
\]

This equation is identical to Eq. (3.6.3), which was derived for discrete distributions.
If f_1(x) = 0, then g_2(y|x) is arbitrary so long as it is a p.d.f. as a function of y.


Example 3.6.6 Calculating a Conditional p.d.f. from a Joint p.d.f. Suppose that the joint p.d.f. of X and
Y is as specified in Example 3.4.8 on page 122. We shall first determine the conditional
p.d.f. of Y given that X = x and then determine some probabilities for Y given the
specific value X = 1/2.

The set S for which f(x, y) > 0 was sketched in Fig. 3.12 on page 123. Furthermore,
the marginal p.d.f. f_1 was derived in Example 3.5.3 on page 132 and sketched
in Fig. 3.17 on page 133. It can be seen from Fig. 3.17 that f_1(x) > 0 for -1 < x < 1 but
not for x = 0. Therefore, for each given value of x such that -1 < x < 0 or 0 < x < 1,
the conditional p.d.f. g_2(y|x) of Y will be as follows:

\[
g_2(y \mid x) =
\begin{cases}
\dfrac{2y}{1 - x^4} & \text{for } x^2 \le y \le 1, \\
0 & \text{otherwise.}
\end{cases}
\]

In particular, if it is known that X = 1/2, then Pr(Y ≥ 1/4 | X = 1/2) = 1 and

\[
\Pr\left(Y \ge \frac{3}{4} \,\middle|\, X = \frac{1}{2}\right) = \int_{3/4}^{1} g_2\left(y \,\middle|\, \frac{1}{2}\right) dy = \frac{7}{15}.
\]
◀
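The two probabilities in this example can be reproduced with exact rational arithmetic. The sketch below assumes the conditional p.d.f. 2y / (1 - x^4) on x^2 ≤ y ≤ 1 for this example and integrates it in closed form; the helper name `prob_y_at_least` is hypothetical.

```python
from fractions import Fraction


def prob_y_at_least(a, x):
    """Pr(Y >= a | X = x) when g_2(y|x) = 2y / (1 - x^4) on x^2 <= y <= 1.

    Integrating 2y from max(a, x^2) to 1 gives 1 - max(a, x^2)^2.
    Assumes 0 < |x| < 1 and a <= 1.
    """
    x = Fraction(x)
    lower = max(Fraction(a), x ** 2)
    return (1 - lower ** 2) / (1 - x ** 4)


print(prob_y_at_least(Fraction(1, 4), Fraction(1, 2)))  # 1
print(prob_y_at_least(Fraction(3, 4), Fraction(1, 2)))  # 7/15
```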


Note: A Conditional p.d.f. Is Not the Result of Conditioning on a Set of Probability
Zero. The conditional p.d.f. g_1(x|y) of X given Y = y is the p.d.f. we would use for
X if we were to learn that Y = y. This sounds as if we were conditioning on the event
{Y = y}, which has zero probability if Y has a continuous distribution. Actually, for
the cases we shall see in this text, the value of g_1(x|y) is a limit:

\[
g_1(x \mid y) = \lim_{\epsilon \to 0} \frac{\partial}{\partial x} \Pr(X \le x \mid y - \epsilon < Y \le y + \epsilon). \tag{3.6.6}
\]

The conditioning event {y - ε < Y ≤ y + ε} in Eq. (3.6.6) has positive probability if
the marginal p.d.f. of Y is positive at y. The mathematics required to make this rigorous
is beyond the scope of this text. (See Exercise 11 in this section and Exercises 25
and 26 in Sec. 3.11 for results that we can prove.) Another way to think about
conditioning on a continuous random variable is to notice that the conditional p.d.f.'s that
we compute are typically continuous as a function of the conditioning variable. This
means that conditioning on Y = y or on Y = y + ε for small ε will produce nearly
the same conditional distribution for X. So it does not matter much if we use Y = y
as a surrogate for Y close to y. Nevertheless, it is important to keep in mind that the
conditional p.d.f. of X given Y = y is better thought of as the conditional p.d.f. of X
given that Y is very close to y. This wording is awkward, so we shall not use it, but
we must remember the distinction between the conditional p.d.f. and conditioning
on an event with probability 0. Despite this distinction, it is still legitimate to treat Y
as the constant y when dealing with the conditional distribution of X given Y = y.
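To see the "Y close to y" reading in numbers, one can take the processing-times joint p.d.f. e^(-x) for 0 < y < x and condition on a band around Y = 4 instead of the point itself. The sketch below is an illustration under that assumption: both probabilities are available in closed form, and the function name and the chosen band widths are hypothetical.

```python
import math


def prob_given_band(eps, a=9.0, y=4.0):
    """Pr(X > a | y - eps < Y <= y + eps) for the joint p.d.f. e^(-x), 0 < y < x.

    Requires 0 < eps < y and y + eps < a, so the band lies inside (0, a).
    """
    numerator = 2.0 * eps * math.exp(-a)                       # Pr(X > a, Y in band)
    denominator = math.exp(-(y - eps)) - math.exp(-(y + eps))  # Pr(Y in band)
    return numerator / denominator


limit = math.exp(-5.0)  # Pr(X > 9 | Y = 4) computed from g_1(x|4) = e^(4 - x)
for eps in (1.0, 0.1, 0.01, 0.001):
    print(eps, prob_given_band(eps), limit)
```

As the band shrinks, the band-conditioned probability approaches the value obtained from the conditional p.d.f., which is the behavior the note describes.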

For mixed joint distributions, we continue to use Eqs. (3.6.2) and (3.6.3) to define 
conditional p.f’s and p.d.f’s. 


Definition 3.6.3 Conditional p.f. or p.d.f. from Mixed Distribution. Let X be discrete and let Y be
continuous with joint p.f./p.d.f. f. Then the conditional p.f. of X given Y = y is defined
by Eq. (3.6.2), and the conditional p.d.f. of Y given X = x is defined by Eq. (3.6.3).


Construction of the Joint Distribution 


Example 3.6.7 Defective Parts. Suppose that a certain machine produces defective and nondefective
parts, but we do not know what proportion of defectives we would find among
all parts that could be produced by this machine. Let P stand for the unknown
proportion of defective parts among all possible parts produced by the machine. If we
were to learn that P = p, we might be willing to say that the parts were independent
of each other and each had probability p of being defective. In other words, if we
condition on P = p, then we have the situation described in Example 3.1.9. As in
that example, suppose that we examine n parts and let X stand for the number of
defectives among the n examined parts. The distribution of X, assuming that we know
P = p, is the binomial distribution with parameters n and p. That is, we can let the
binomial p.f. (3.1.4) be the conditional p.f. of X given P = p, namely,

\[
g_1(x \mid p) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad \text{for } x = 0, \ldots, n.
\]

We might also believe that P has a continuous distribution with p.d.f. such as f_2(p) = 1
for 0 < p < 1. (This means that P has the uniform distribution on the interval [0, 1].)
We know that the conditional p.f. g_1 of X given P = p satisfies

\[
g_1(x \mid p) = \frac{f(x, p)}{f_2(p)},
\]

where f is the joint p.f./p.d.f. of X and P. If we multiply both sides of this equation
by f_2(p), it follows that the joint p.f./p.d.f. of X and P is

\[
f(x, p) = g_1(x \mid p) f_2(p) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad \text{for } x = 0, \ldots, n, \text{ and } 0 < p < 1.
\]
◀
The construction in Example 3.6.7 is available in general, as we explain next. 


Generalizing the Multiplication Rule for Conditional Probabilities A special case
of Theorem 2.1.2, the multiplication rule for conditional probabilities, says that if
A and B are two events, then Pr(A ∩ B) = Pr(A) Pr(B|A). The following theorem,
whose proof is immediate from Eqs. (3.6.4) and (3.6.5), generalizes Theorem 2.1.2 to
the case of two random variables.


Theorem 3.6.2 Multiplication Rule for Distributions. Let X and Y be random variables such that X
has p.f. or p.d.f. f_1(x) and Y has p.f. or p.d.f. f_2(y). Also, assume that the conditional
p.f. or p.d.f. of X given Y = y is g_1(x|y) while the conditional p.f. or p.d.f. of Y given
X = x is g_2(y|x). Then for each y such that f_2(y) > 0 and each x,

\[
f(x, y) = g_1(x \mid y) f_2(y), \tag{3.6.7}
\]

where f is the joint p.f., p.d.f., or p.f./p.d.f. of X and Y. Similarly, for each x such that
f_1(x) > 0 and each y,

\[
f(x, y) = f_1(x) g_2(y \mid x). \tag{3.6.8}
\]
∎


In Theorem 3.6.2, if f_2(y_0) = 0 for some value y_0, then it can be assumed without
loss of generality that f(x, y_0) = 0 for all values of x. In this case, both sides of
Eq. (3.6.7) will be 0, and the fact that g_1(x|y_0) is not uniquely defined becomes
irrelevant. Hence, Eq. (3.6.7) will be satisfied for all values of x and y. A similar
statement applies to Eq. (3.6.8).


Example 3.6.8 Waiting in a Queue. Let X be the amount of time that a person has to wait for service
in a queue. The faster the server works in the queue, the shorter should be the
waiting time. Let Y stand for the rate at which the server works, which we will take
to be unknown. A common choice of conditional distribution for X given Y = y has
conditional p.d.f. for each y > 0:

\[
g_1(x \mid y) =
\begin{cases}
y e^{-xy} & \text{for } x > 0, \\
0 & \text{otherwise.}
\end{cases}
\]

We shall assume that Y has a continuous distribution with p.d.f. f_2(y) = e^{-y} for y > 0.
Now we can construct the joint p.d.f. of X and Y using Theorem 3.6.2:

\[
f(x, y) = g_1(x \mid y) f_2(y) =
\begin{cases}
y e^{-y(x + 1)} & \text{for } x > 0,\ y > 0, \\
0 & \text{otherwise.}
\end{cases}
\]
◀
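As a supplementary worked step that is not part of the original example, the joint p.d.f. y e^{-y(x+1)} can be integrated over y to obtain the marginal p.d.f. of the waiting time X. The calculation uses the standard integral of y e^{-ay} over (0, ∞), which equals 1/a^2, with a = x + 1; the name f_1 for this marginal follows the convention of this section. For x > 0:

```latex
\[
f_1(x) = \int_0^{\infty} y\, e^{-y(x+1)}\, dy = \frac{1}{(x+1)^2},
\qquad
\int_0^{\infty} \frac{dx}{(x+1)^2} = 1 .
\]
```

The second integral confirms that the result integrates to 1, as a marginal p.d.f. must.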
Example 3.6.9 Defective Parts. Let X be the number of defective parts in a sample of size n, and
let P be the proportion of defectives among all parts, as in Example 3.6.7. The joint
p.f./p.d.f. of X and P was calculated there as

\[
f(x, p) = g_1(x \mid p) f_2(p) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad \text{for } x = 0, \ldots, n \text{ and } 0 < p < 1.
\]

We could now compute the conditional p.d.f. of P given X = x by first finding the
marginal p.f. of X:

\[
f_1(x) = \int_0^1 \binom{n}{x} p^x (1 - p)^{n - x}\, dp. \tag{3.6.9}
\]

The conditional p.d.f. of P given X = x is then

\[
g_2(p \mid x) = \frac{f(x, p)}{f_1(x)} = \frac{p^x (1 - p)^{n - x}}{\int_0^1 q^x (1 - q)^{n - x}\, dq}, \quad \text{for } 0 < p < 1. \tag{3.6.10}
\]

The integral in the denominator of Eq. (3.6.10) can be tedious to calculate, but it can
be found. For example, if n = 2 and x = 1, we get

\[
\int_0^1 q(1 - q)\, dq = \frac{1}{2} - \frac{1}{3} = \frac{1}{6}.
\]

In this case, g_2(p|1) = 6p(1 - p) for 0 < p < 1. ◀
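The denominator of Eq. (3.6.10) can be evaluated exactly for any n and x by expanding (1 - q)^(n - x) with the binomial theorem and integrating term by term. The sketch below does this with rational arithmetic; the function names `beta_integral`, `marginal_x`, and `posterior_density` are illustrative choices for this sketch, not notation from the text.

```python
from fractions import Fraction
from math import comb


def beta_integral(x, n):
    """Exact integral of q^x (1 - q)^(n - x) over [0, 1].

    Expands (1 - q)^(n - x) by the binomial theorem and integrates each term.
    """
    m = n - x
    return sum(Fraction((-1) ** k * comb(m, k), x + k + 1) for k in range(m + 1))


def marginal_x(x, n):
    """f_1(x) from Eq. (3.6.9): C(n, x) times the integral above."""
    return comb(n, x) * beta_integral(x, n)


def posterior_density(p, x, n):
    """g_2(p|x) from Eq. (3.6.10), evaluated at a rational p in [0, 1]."""
    p = Fraction(p)
    return p ** x * (1 - p) ** (n - x) / beta_integral(x, n)


print(beta_integral(1, 2))                       # 1/6
print(posterior_density(Fraction(1, 2), 1, 2))   # 3/2, which equals 6 p (1 - p) at p = 1/2
```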


Bayes’ Theorem and the Law of Total Probability for Random Variables The 
calculation done in Eq. (3.6.9) is an example of the generalization of the law of total 
probability to random variables. Also, the calculation in Eq. (3.6.10) is an example of 
the generalization of Bayes’ theorem to random variables. The proofs of these results 
are straightforward and not given here. 


Theorem 3.6.3 Law of Total Probability for Random Variables. If f_2(y) is the marginal p.f. or p.d.f. of a
random variable Y and g_1(x|y) is the conditional p.f. or p.d.f. of X given Y = y, then
the marginal p.f. or p.d.f. of X is

\[
f_1(x) = \sum_y g_1(x \mid y) f_2(y), \tag{3.6.11}
\]

if Y is discrete. If Y is continuous, the marginal p.f. or p.d.f. of X is

\[
f_1(x) = \int_{-\infty}^{\infty} g_1(x \mid y) f_2(y)\, dy. \tag{3.6.12}
\]
∎


There are versions of Eqs. (3.6.11) and (3.6.12) with x and y switched and the 
subscripts 1 and 2 switched. These versions would be used if the joint distribution 
of X and Y were constructed from the conditional distribution of Y given X and the 
marginal distribution of X. 


Theorem 3.6.4 Bayes’ Theorem for Random Variables. If f_2(y) is the marginal p.f. or p.d.f. of a random
variable Y and g_1(x|y) is the conditional p.f. or p.d.f. of X given Y = y, then the
conditional p.f. or p.d.f. of Y given X = x is

\[
g_2(y \mid x) = \frac{g_1(x \mid y) f_2(y)}{f_1(x)}, \tag{3.6.13}
\]

where f_1(x) is obtained from Eq. (3.6.11) or (3.6.12). Similarly, the conditional p.f.
or p.d.f. of X given Y = y is

\[
g_1(x \mid y) = \frac{g_2(y \mid x) f_1(x)}{f_2(y)}, \tag{3.6.14}
\]

where f_2(y) is obtained from Eq. (3.6.11) or (3.6.12) with x and y switched and with
the subscripts 1 and 2 switched. ∎

Figure 3.21 The marginal p.d.f. of Y in Example 3.6.10.


Example 3.6.10 Choosing Points from Uniform Distributions. Suppose that a point X is chosen from
the uniform distribution on the interval [0, 1], and that after the value X = x has been
observed (0 < x < 1), a point Y is then chosen from the uniform distribution on the
interval [x, 1]. We shall derive the marginal p.d.f. of Y.

Since X has a uniform distribution, the marginal p.d.f. of X is as follows:

\[
f_1(x) =
\begin{cases}
1 & \text{for } 0 < x < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Similarly, for each value X = x (0 < x < 1), the conditional distribution of Y is the
uniform distribution on the interval [x, 1]. Since the length of this interval is 1 - x,
the conditional p.d.f. of Y given that X = x will be

\[
g_2(y \mid x) =
\begin{cases}
\dfrac{1}{1 - x} & \text{for } x < y < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

It follows from Eq. (3.6.8) that the joint p.d.f. of X and Y will be

\[
f(x, y) =
\begin{cases}
\dfrac{1}{1 - x} & \text{for } 0 < x < y < 1, \\
0 & \text{otherwise.}
\end{cases}
\tag{3.6.15}
\]

Thus, for 0 < y < 1, the value of the marginal p.d.f. f_2(y) of Y will be

\[
f_2(y) = \int_{-\infty}^{\infty} f(x, y)\, dx = \int_0^y \frac{1}{1 - x}\, dx = -\log(1 - y). \tag{3.6.16}
\]

Furthermore, since Y cannot be outside the interval 0 < y < 1, then f_2(y) = 0 for
y ≤ 0 or y ≥ 1. This marginal p.d.f. f_2 is sketched in Fig. 3.21. It is interesting to note
that in this example the function f_2 is unbounded.

We can also find the conditional p.d.f. of X given Y = y by applying Bayes’ theorem
(3.6.14). The product of g_2(y|x) and f_1(x) was already calculated in Eq. (3.6.15).
The ratio of this product to f_2(y) from Eq. (3.6.16) is

\[
g_1(x \mid y) =
\begin{cases}
\dfrac{-1}{(1 - x) \log(1 - y)} & \text{for } 0 < x < y, \\
0 & \text{otherwise.}
\end{cases}
\]
◀
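The marginal p.d.f. -log(1 - y) can be checked by simulating the two-stage sampling scheme directly. Integrating -log(1 - t) from 0 to y gives the c.d.f. y + (1 - y) log(1 - y), which the sketch below compares with empirical frequencies. The seed, the sample size, and the function names are arbitrary choices for illustration.

```python
import math
import random


def sample_y(rng):
    """Draw X uniform on [0, 1], then Y uniform on [X, 1], and return Y."""
    x = rng.random()
    return rng.uniform(x, 1.0)


def marginal_cdf_y(y):
    """Integral of -log(1 - t) over [0, y]: y + (1 - y) log(1 - y), for 0 <= y < 1."""
    return y + (1.0 - y) * math.log(1.0 - y)


rng = random.Random(12345)  # fixed seed so the run is repeatable
n = 200_000
draws = [sample_y(rng) for _ in range(n)]
for y in (0.25, 0.5, 0.75):
    empirical = sum(d <= y for d in draws) / n
    print(y, round(empirical, 4), round(marginal_cdf_y(y), 4))
```

The empirical and theoretical values should agree to within ordinary sampling error, roughly two decimal places at this sample size.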


Theorem 3.6.5 Independent Random Variables. Suppose that X and Y are two random variables
having a joint p.f., p.d.f., or p.f./p.d.f. f. Then X and Y are independent if and only if
for every value of y such that f_2(y) > 0 and every value of x,

\[
g_1(x \mid y) = f_1(x). \tag{3.6.17}
\]

Proof Theorem 3.5.4 says that X and Y are independent if and only if f(x, y) can be
factored in the following form for -∞ < x < ∞ and -∞ < y < ∞:

\[
f(x, y) = f_1(x) f_2(y),
\]

which holds if and only if, for all x and all y such that f_2(y) > 0,

\[
f_1(x) = \frac{f(x, y)}{f_2(y)}. \tag{3.6.18}
\]

But the right side of Eq. (3.6.18) is the formula for g_1(x|y). Hence, X and Y are
independent if and only if Eq. (3.6.17) holds for all x and all y such that f_2(y) > 0. ∎


Theorem 3.6.5 says that X and Y are independent if and only if the conditional p.f. or
p.d.f. of X given Y = y is the same as the marginal p.f. or p.d.f. of X for all y such that
f_2(y) > 0. Because g_1(x|y) is arbitrary when f_2(y) = 0, we cannot expect Eq. (3.6.17)
to hold in that case.

Similarly, it follows from Eq. (3.6.8) that X and Y are independent if and only if

\[
g_2(y \mid x) = f_2(y), \tag{3.6.19}
\]

for every value of x such that f_1(x) > 0. Theorem 3.6.5 and Eq. (3.6.19) give the
mathematical justification for the meaning of independence that we presented on
page 136.


Note: Conditional Distributions Behave Just Like Distributions. As we noted on 
page 59, conditional probabilities behave just like probabilities. Since distributions 
are just collections of probabilities, it follows that conditional distributions behave 
just like distributions. For example, to compute the conditional probability that a 
discrete random variable X is in some interval [a, b] given Y = y, we must add g_1(x|y)
for all values of x in the interval. Also, theorems that we have proven or shall prove
about distributions will have versions conditional on additional random variables. 
We shall postpone examples of such theorems until Sec. 3.7 because they rely on 
joint distributions of more than two random variables. 


Summary 


The conditional distribution of one random variable X given an observed value y
of another random variable Y is the distribution we would use for X if we were to
learn that Y = y. When dealing with the conditional distribution of X given Y = y,
it is safe to behave as if Y were the constant y. If X and Y have joint p.f., p.d.f.,
or p.f./p.d.f. f(x, y), then the conditional p.f. or p.d.f. of X given Y = y is g_1(x|y) =
f(x, y)/f_2(y), where f_2 is the marginal p.f. or p.d.f. of Y. When it is convenient to
specify a conditional distribution directly, the joint distribution can be constructed
from the conditional together with the other marginal. For example,

\[
f(x, y) = g_1(x \mid y) f_2(y) = f_1(x) g_2(y \mid x).
\]

In this case, we have versions of the law of total probability and Bayes’ theorem for
random variables that allow us to calculate the other marginal and conditional.
Two random variables X and Y are independent if and only if the conditional p.f.
or p.d.f. of X given Y = y is the same as the marginal p.f. or p.d.f. of X for all y such
that f_2(y) > 0. Equivalently, X and Y are independent if and only if the conditional
p.f. or p.d.f. of Y given X = x is the same as the marginal p.f. or p.d.f. of Y for all x
such that f_1(x) > 0.


Exercises 


1. Suppose that two random variables X and Y have the
joint p.d.f. in Example 3.5.10 on page 139. Compute the
conditional p.d.f. of X given Y = y for each y.


2. Each student in a certain high school was classified according
to her year in school (freshman, sophomore, junior,
or senior) and according to the number of times that
she had visited a certain museum (never, once, or more
than once). The proportions of students in the various
classifications are given in the following table:

                 Never    Once    More than once
Freshmen         0.08     0.10    0.04
Sophomores       0.04     0.10    0.04
Juniors          0.04     0.20    0.09
Seniors          0.02     0.15    0.10

a. If a student selected at random from the high school
is a junior, what is the probability that she has never
visited the museum?

b. If a student selected at random from the high school
has visited the museum three times, what is the probability
that she is a senior?


3. Suppose that a point (X, Y) is chosen at random from
the disk S defined as follows:

\[
S = \{(x, y) : (x - 1)^2 + (y + 2)^2 \le 9\}.
\]

Determine (a) the conditional p.d.f. of Y for every given
value of X, and (b) Pr(Y > 0 | X = 2).


4. Suppose that the joint p.d.f. of two random variables X
and Y is as follows:

\[
f(x, y) =
\begin{cases}
c(x + y^2) & \text{for } 0 < x < 1 \text{ and } 0 < y < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Determine (a) the conditional p.d.f. of X for every given
value of Y, and (b) Pr(X < 1/2 | Y = 1/2).


5. Suppose that the joint p.d.f. of two points X and Y
chosen by the process described in Example 3.6.10 is as
given by Eq. (3.6.15). Determine (a) the conditional p.d.f.
of X for every given value of Y, and (b) Pr(X > 1/2 | Y = 3/4).


6. Suppose that the joint p.d.f. of two random variables X
and Y is as follows:

\[
f(x, y) =
\begin{cases}
c \sin x & \text{for } 0 < x < \pi/2 \text{ and } 0 < y < 3, \\
0 & \text{otherwise.}
\end{cases}
\]

Determine (a) the conditional p.d.f. of Y for every given
value of X, and (b) Pr(1 < Y < 2 | X = 0.73).


7. Suppose that the joint p.d.f. of two random variables X
and Y is as follows:

\[
f(x, y) =
\begin{cases}
\frac{3}{16}(4 - 2x - y) & \text{for } x > 0,\ y > 0, \text{ and } 2x + y < 4, \\
0 & \text{otherwise.}
\end{cases}
\]

Determine (a) the conditional p.d.f. of Y for every given
value of X, and (b) Pr(Y > 2 | X = 0.5).


8. Suppose that a person’s score X on a mathematics aptitude
test is a number between 0 and 1, and that his score
Y on a music aptitude test is also a number between 0
and 1. Suppose further that in the population of all college
students in the United States, the scores X and Y are
distributed according to the following joint p.d.f.:

\[
f(x, y) =
\begin{cases}
\frac{2}{5}(2x + 3y) & \text{for } 0 < x < 1 \text{ and } 0 < y < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

a. What proportion of college students obtain a score
greater than 0.8 on the mathematics test?

b. If a student’s score on the music test is 0.3, what is the
probability that his score on the mathematics test will
be greater than 0.8?

c. If a student’s score on the mathematics test is 0.3,
what is the probability that his score on the music
test will be greater than 0.8?


9. Suppose that either of two instruments might be used
for making a certain measurement. Instrument 1 yields a
measurement whose p.d.f. h_1 is

\[
h_1(x) =
\begin{cases}
2x & \text{for } 0 < x < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Instrument 2 yields a measurement whose p.d.f. h_2 is

\[
h_2(x) =
\begin{cases}
3x^2 & \text{for } 0 < x < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Suppose that one of the two instruments is chosen at random
and a measurement X is made with it.

a. Determine the marginal p.d.f. of X.

b. If the value of the measurement is X = 1/4, what is
the probability that instrument 1 was used?


10. In a large collection of coins, the probability X that a
head will be obtained when a coin is tossed varies from one
coin to another, and the distribution of X in the collection
is specified by the following p.d.f.:

\[
f_1(x) =
\begin{cases}
6x(1 - x) & \text{for } 0 < x < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Suppose that a coin is selected at random from the collection
and tossed once, and that a head is obtained. Determine
the conditional p.d.f. of X for this coin.


11. The definition of the conditional p.d.f. of X given Y =
y is arbitrary if f_2(y) = 0. The reason that this causes no
serious problem is that it is highly unlikely that we will
observe Y close to a value y_0 such that f_2(y_0) = 0. To be
more precise, let f_2(y_0) = 0, and let A_0 = [y_0 - ε, y_0 + ε].
Also, let y_1 be such that f_2(y_1) > 0, and let A_1 = [y_1 - ε,
y_1 + ε]. Assume that f_2 is continuous at both y_0 and y_1.
Show that

\[
\lim_{\epsilon \to 0} \frac{\Pr(Y \in A_0)}{\Pr(Y \in A_1)} = 0.
\]

That is, the probability that Y is close to y_0 is much smaller
than the probability that Y is close to y_1.


12. Let Y be the rate (calls per hour) at which calls arrive
at a switchboard. Let X be the number of calls during a
two-hour period. Suppose that the marginal p.d.f. of Y is

\[
f_2(y) =
\begin{cases}
e^{-y} & \text{if } y > 0, \\
0 & \text{otherwise,}
\end{cases}
\]

and that the conditional p.f. of X given Y = y is

\[
g_1(x \mid y) =
\begin{cases}
\dfrac{(2y)^x}{x!} e^{-2y} & \text{if } x = 0, 1, \ldots, \\
0 & \text{otherwise.}
\end{cases}
\]

a. Find the marginal p.f. of X. (You may use the formula
\int_0^{\infty} y^k e^{-y}\, dy = k!.)

b. Find the conditional p.d.f. g_2(y|0) of Y given X = 0.

c. Find the conditional p.d.f. g_2(y|1) of Y given X = 1.

d. For what values of y is g_2(y|1) > g_2(y|0)? Does this
agree with the intuition that the more calls you see,
the higher you should think the rate is?


13. Start with the joint distribution of treatment group 
and response in Table 3.6 on page 138. For each treatment 
group, compute the conditional distribution of response 
given the treatment group. Do they appear to be very 
similar or quite different? 


3.7 Multivariate Distributions 


In this section, we shall extend the results that were developed in Sections 3.4,
3.5, and 3.6 for two random variables X and Y to an arbitrary finite number
n of random variables X_1, ..., X_n. In general, the joint distribution of more
than two random variables is called a multivariate distribution. The theory of
statistical inference (the subject of the part of this book beginning with Chapter 7)
relies on mathematical models for observable data in which each observation is
a random variable. For this reason, multivariate distributions arise naturally in
the mathematical models for data. The most commonly used model will be one in
which the individual data random variables are conditionally independent given
one or two other random variables.


Joint Distributions


Example 3.7.1 A Clinical Trial. Suppose that m patients with a certain medical condition are given a
treatment, and each patient either recovers from the condition or fails to recover. For
each i = 1, ..., m, we can let X_i = 1 if patient i recovers and X_i = 0 if not. We might
also believe that there is a random variable P having a continuous distribution taking
values between 0 and 1 such that, if we knew that P = p, we would say that the m
patients recover or fail to recover independently of each other, each with probability
p of recovery. We now have named n = m + 1 random variables in which we are
interested. ◀


The situation described in Example 3.7.1 requires us to construct a joint distri- 
bution for n random variables. We shall now provide definitions and examples of the 
important concepts needed to discuss multivariate distributions. 


Definition 3.7.1 Joint Distribution Function/c.d.f. The joint c.d.f. of n random variables X_1, ..., X_n is
the function F whose value at every point (x_1, ..., x_n) in n-dimensional space R^n is
specified by the relation

\[
F(x_1, \ldots, x_n) = \Pr(X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n). \tag{3.7.1}
\]

Every multivariate c.d.f. satisfies properties similar to those given earlier for univariate
and bivariate c.d.f.'s.


Example 3.7.2 Failure Times. Suppose that a machine has three parts, and part i will fail at time X_i
for i = 1, 2, 3. The following function might be the joint c.d.f. of X_1, X_2, and X_3:

\[
F(x_1, x_2, x_3) =
\begin{cases}
(1 - e^{-x_1})(1 - e^{-2x_2})(1 - e^{-3x_3}) & \text{for } x_1, x_2, x_3 \ge 0, \\
0 & \text{otherwise.}
\end{cases}
\]
◀


Vector Notation In the study of the joint distribution of n random variables
X_1, ..., X_n, it is often convenient to use the vector notation X = (X_1, ..., X_n) and
to refer to X as a random vector. Instead of speaking of the joint distribution of
the random variables X_1, ..., X_n with a joint c.d.f. F(x_1, ..., x_n), we can simply
speak of the distribution of the random vector X with c.d.f. F(x). When this vector
notation is used, it must be kept in mind that if X is an n-dimensional random vector,
then its c.d.f. is defined as a function on n-dimensional space R^n. At each point
x = (x_1, ..., x_n) ∈ R^n, the value of F(x) is specified by Eq. (3.7.1).


Definition 3.7.2 Joint Discrete Distribution/p.f. It is said that n random variables X_1, ..., X_n have a
discrete joint distribution if the random vector (X_1, ..., X_n) can have only a finite
number or an infinite sequence of different possible values (x_1, ..., x_n) in R^n. The
joint p.f. of X_1, ..., X_n is then defined as the function f such that for every point
(x_1, ..., x_n) ∈ R^n,

\[
f(x_1, \ldots, x_n) = \Pr(X_1 = x_1, \ldots, X_n = x_n).
\]

In vector notation, Definition 3.7.2 says that the random vector X has a discrete
distribution and that its p.f. is specified at every point x ∈ R^n by the relation

\[
f(x) = \Pr(X = x).
\]
The following result is a simple generalization of Theorem 3.4.2.

Theorem 3.7.1 If X has a joint discrete distribution with joint p.f. f, then for every subset C ⊂ R^n,

\[
\Pr(X \in C) = \sum_{x \in C} f(x).
\]
∎

It is easy to show that, if each of X_1, ..., X_n has a discrete distribution, then
X = (X_1, ..., X_n) has a discrete joint distribution.


Example 3.7.3 A Clinical Trial. Consider the m patients in Example 3.7.1. Suppose for now that
P = p is known so that we don’t treat it as a random variable. The joint p.f. of
X = (X_1, ..., X_m) is

\[
f(x) = p^{x_1 + \cdots + x_m} (1 - p)^{m - x_1 - \cdots - x_m}
\]

for all x_i ∈ {0, 1} and 0 otherwise. ◀
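A small sketch can confirm that this joint p.f. sums to 1 over all 2^m outcome vectors. The values m = 4 and p = 0.3 below are illustrative choices, not taken from the text, and `joint_pf` is a hypothetical helper name.

```python
from itertools import product


def joint_pf(xs, p):
    """f(x) = p^(x_1+...+x_m) (1-p)^(m - x_1 - ... - x_m) if every x_i is 0 or 1, else 0."""
    if any(x not in (0, 1) for x in xs):
        return 0.0
    successes = sum(xs)
    return p ** successes * (1 - p) ** (len(xs) - successes)


m, p = 4, 0.3  # illustrative values only
total = sum(joint_pf(xs, p) for xs in product((0, 1), repeat=m))
print(total)                       # approximately 1.0
print(joint_pf((1, 0, 1, 1), p))   # p^3 (1 - p)
```

The p.f. depends on the outcome vector only through the number of recoveries, which is why every vector with the same sum receives the same probability.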


Definition 3.7.3 Continuous Distribution/p.d.f. It is said that n random variables X_1, ..., X_n have a
continuous joint distribution if there is a nonnegative function f defined on R^n such
that for every subset C ⊂ R^n,

\[
\Pr[(X_1, \ldots, X_n) \in C] = \int \cdots \int_C f(x_1, \ldots, x_n)\, dx_1 \cdots dx_n, \tag{3.7.2}
\]

if the integral exists. The function f is called the joint p.d.f. of X_1, ..., X_n.

In vector notation, f(x) denotes the p.d.f. of the random vector X and Eq. (3.7.2)
could be rewritten more simply in the form

\[
\Pr(X \in C) = \int \cdots \int_C f(x)\, dx.
\]

Theorem 3.7.2 If the joint distribution of X_1, ..., X_n is continuous, then the joint p.d.f. f can be
derived from the joint c.d.f. F by using the relation

\[
f(x_1, \ldots, x_n) = \frac{\partial^n F(x_1, \ldots, x_n)}{\partial x_1 \cdots \partial x_n}
\]

at all points (x_1, ..., x_n) at which the derivative in this relation exists. ∎


Failure Times. We can find the joint p.d.f. for the three random variables in
Example 3.7.2 by applying Theorem 3.7.2. The third-order mixed partial is easily
calculated to be

\[ f(x_1, x_2, x_3) = \begin{cases} 6 e^{-x_1 - 2x_2 - 3x_3} & \text{for } x_1, x_2, x_3 > 0, \\ 0 & \text{otherwise.} \end{cases} \] ◄


It is important to note that, even if each of $X_1, \ldots, X_n$ has a continuous
distribution, the vector $\mathbf{X} = (X_1, \ldots, X_n)$ might not have a continuous
joint distribution. See Exercise 9 in this section.


Service Times in a Queue. A queue is a system in which customers line up for service
and receive their service according to some algorithm. A simple model is the
single-server queue, in which all customers wait for a single server to serve
everyone ahead of them in the line and then they get served. Suppose that $n$
customers arrive at a single-server queue for service. Let $X_i$ be the time that the
server spends serving customer $i$ for $i = 1, \ldots, n$. We might use a joint
distribution for $\mathbf{X} = (X_1, \ldots, X_n)$ with joint p.d.f. of the form

\[ f(\mathbf{x}) = \begin{cases} \dfrac{c}{\left(2 + \sum_{i=1}^{n} x_i\right)^{n+1}} & \text{for all } x_i > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.7.3} \]

We shall now find the value of $c$ such that the function in Eq. (3.7.3) is a joint
p.d.f. We can do this by integrating over each variable $x_1, \ldots, x_n$ in
succession (starting with $x_n$). The first integral is

\[ \int_0^{\infty} \frac{c}{(2 + x_1 + \cdots + x_n)^{n+1}}\, dx_n = \frac{c/n}{(2 + x_1 + \cdots + x_{n-1})^{n}}. \tag{3.7.4} \]

The right-hand side of Eq. (3.7.4) is in the same form as the original p.d.f. except
that $n$ has been reduced to $n - 1$ and $c$ has been divided by $n$. It follows that
when we integrate over the variable $x_i$ (for $i = n - 1, n - 2, \ldots, 1$), the
result will be in the same form with $n$ reduced to $i - 1$ and $c$ divided by
$n(n - 1) \cdots i$. The result of integrating all coordinates except $x_1$ is then

\[ \frac{c/n!}{(2 + x_1)^2} \quad \text{for } x_1 > 0. \]

Integrating $x_1$ out of this yields $c/[2(n!)]$, which must equal 1, so
$c = 2(n!)$. ◄
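The successive integrations in this example can be traced mechanically. The following Python sketch is only an illustration (the helper name `total_mass` is ours, not the text's): it tracks the coefficient and the exponent of the integrand as each coordinate is integrated out by the reduction in Eq. (3.7.4), and checks that $c = 2(n!)$ gives total probability 1.

```python
from fractions import Fraction
from math import factorial


def total_mass(c, n):
    """Integrate c / (2 + x_1 + ... + x_n)**(n + 1) over all x_i > 0.

    Each step applies the reduction in Eq. (3.7.4): integrating one
    coordinate from 0 to infinity divides the coefficient by (power - 1)
    and lowers the power by 1.
    """
    coef, power = Fraction(c), n + 1
    for _ in range(n):
        coef /= power - 1
        power -= 1
    # No coordinates remain; the integrand has collapsed to coef / 2**power
    # with power == 1.
    return coef / 2**power


for n in (1, 2, 5):
    assert total_mass(2 * factorial(n), n) == 1
```

Exact rational arithmetic is used so that the check is an equality rather than a floating-point approximation.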


Mixed Distributions 


Arrivals at a Queue. In Example 3.7.5, we introduced the single-server queue and
discussed service times. Some features that influence the performance of a queue are
the rate at which customers arrive and the rate at which customers are served. Let $Z$
stand for the rate at which customers are served, and let $Y$ stand for the rate at
which customers arrive at the queue. Finally, let $W$ stand for the number of
customers that arrive during one day. Then $W$ is discrete while $Y$ and $Z$ could be
continuous random variables. A possible joint p.f./p.d.f. for these three random
variables is

\[ f(y, z, w) = \begin{cases} 6 e^{-3z - 10y} (8y)^w / w! & \text{for } z, y > 0 \text{ and } w = 0, 1, \ldots, \\ 0 & \text{otherwise.} \end{cases} \]

We shall verify this claim shortly. ◄


Joint p.f./p.d.f. Let $X_1, \ldots, X_n$ be random variables, some of which have a
continuous joint distribution and some of which have discrete distributions; their
joint distribution would then be represented by a function $f$ that we call the joint
p.f./p.d.f. The function has the property that the probability that $\mathbf{X}$ lies
in a subset $C \subset \mathbb{R}^n$ is calculated by summing $f(\mathbf{x})$ over the
values of the coordinates of $\mathbf{x}$ that correspond to the discrete random
variables and integrating over those coordinates that correspond to the continuous
random variables for all points $\mathbf{x} \in C$.


Arrivals at a Queue. We shall now verify that the proposed p.f./p.d.f. in
Example 3.7.6 actually sums and integrates to 1 over all values of $(y, z, w)$. We
must sum over $w$ and integrate over $y$ and $z$. We have our choice of in what order
to do them. It is not difficult to see that we can factor $f$ as
$f(y, z, w) = h_2(z) h_{13}(y, w)$, where

\[ h_2(z) = \begin{cases} 6 e^{-3z} & \text{for } z > 0, \\ 0 & \text{otherwise,} \end{cases} \qquad h_{13}(y, w) = \begin{cases} e^{-10y} (8y)^w / w! & \text{for } y > 0 \text{ and } w = 0, 1, \ldots, \\ 0 & \text{otherwise.} \end{cases} \]

So we can integrate $z$ out first to get

\[ \int_{-\infty}^{\infty} f(y, z, w)\, dz = h_{13}(y, w) \int_0^{\infty} 6 e^{-3z}\, dz = 2 h_{13}(y, w). \]

Integrating $y$ out of $h_{13}(y, w)$ is possible, but not pleasant. Instead, notice
that $(8y)^w / w!$ is the $w$th term in the Taylor expansion of $e^{8y}$. Hence,

\[ \sum_{w=0}^{\infty} 2 h_{13}(y, w) = 2 e^{-10y} \sum_{w=0}^{\infty} \frac{(8y)^w}{w!} = 2 e^{-10y} e^{8y} = 2 e^{-2y}, \]

for $y > 0$ and 0 otherwise. Finally, integrating over $y$ yields 1. ◄
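As a rough numerical cross-check of this verification, the sketch below follows the same factorization $f = h_2 h_{13}$: it integrates $h_2$ over $z$, sums $h_{13}$ over $w$, and integrates the result over $y$. The midpoint rule, the truncation of the sum at 120 terms, and the upper integration limit of 20 are choices made for this illustration only.

```python
import math


def h2(z):
    """Factor of the joint p.f./p.d.f. that depends on z only."""
    return 6 * math.exp(-3 * z) if z > 0 else 0.0


def h13(y, w):
    """Factor of the joint p.f./p.d.f. that depends on (y, w)."""
    if y <= 0 or w < 0:
        return 0.0
    return math.exp(-10 * y) * (8 * y) ** w / math.factorial(w)


def midpoint(g, lo, hi, steps):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))


def summed_over_w(y, terms=120):
    """Truncated sum of h13(y, w) over w = 0, 1, ..., terms - 1."""
    return sum(h13(y, w) for w in range(terms))


z_part = midpoint(h2, 0.0, 20.0, 20000)            # close to 2
y_part = midpoint(summed_over_w, 0.0, 20.0, 4000)  # close to 1/2
total = z_part * y_part                            # close to 1
```

The two factors come out near 2 and 1/2, matching the intermediate results $2h_{13}(y, w)$ and $2e^{-2y}$ in the example.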


A Clinical Trial. In Example 3.7.1, one of the random variables $P$ has a continuous
distribution, and the others $X_1, \ldots, X_m$ have discrete distributions. A
possible joint p.f./p.d.f. for $(X_1, \ldots, X_m, P)$ is

\[ f(\mathbf{x}, p) = \begin{cases} p^{x_1 + \cdots + x_m} (1 - p)^{m - x_1 - \cdots - x_m} & \text{for all } x_i \in \{0, 1\} \text{ and } 0 < p < 1, \\ 0 & \text{otherwise.} \end{cases} \]

We can find probabilities based on this function. Suppose, for example, that we want
the probability that there is exactly one success among the first two patients, that
is, $\Pr(X_1 + X_2 = 1)$. We must integrate $f(\mathbf{x}, p)$ over $p$ and sum over
all values of $\mathbf{x}$ that have $x_1 + x_2 = 1$. For purposes of illustration,
suppose that $m = 4$. First, factor out
$p^{x_1 + x_2} (1 - p)^{2 - x_1 - x_2} = p(1 - p)$, which yields

\[ f(\mathbf{x}, p) = [p(1 - p)]\, p^{x_3 + x_4} (1 - p)^{2 - x_3 - x_4}, \]

for $x_3, x_4 \in \{0, 1\}$, $0 < p < 1$, and $x_1 + x_2 = 1$. Summing over $x_3$
yields

\[ [p(1 - p)] \left( p^{x_4} (1 - p)^{1 - x_4} (1 - p) + p\, p^{x_4} (1 - p)^{1 - x_4} \right) = [p(1 - p)]\, p^{x_4} (1 - p)^{1 - x_4}. \]

Summing this over $x_4$ gives $p(1 - p)$. Next, integrate over $p$ to get
$\int_0^1 p(1 - p)\, dp = 1/6$. Finally, note that there are two $(x_1, x_2)$ vectors,
$(1, 0)$ and $(0, 1)$, that have $x_1 + x_2 = 1$, so
$\Pr(X_1 + X_2 = 1) = (1/6) + (1/6) = 1/3$. ◄
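The same probability can be obtained by brute force, which is a useful check on the factoring argument. The sketch below sums over every $\mathbf{x} \in \{0, 1\}^m$ with $x_1 + x_2 = 1$ and integrates each term over $p$ exactly. It relies on the standard identity $\int_0^1 p^{s}(1-p)^{t}\,dp = s!\,t!/(s+t+1)!$ for nonnegative integers $s$ and $t$, which the text does not state; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def integral_over_p(successes, failures):
    """Exact integral of p**successes * (1 - p)**failures over 0 <= p <= 1."""
    return Fraction(
        factorial(successes) * factorial(failures),
        factorial(successes + failures + 1),
    )


def prob_one_success_in_first_two(m=4):
    """Pr(X_1 + X_2 = 1): sum over x with x_1 + x_2 = 1, integrate over p."""
    total = Fraction(0)
    for x in product((0, 1), repeat=m):
        if x[0] + x[1] == 1:
            s = sum(x)  # exponent of p; m - s is the exponent of (1 - p)
            total += integral_over_p(s, m - s)
    return total


assert prob_one_success_in_first_two(4) == Fraction(1, 3)
```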


Marginal Distributions 


Deriving a Marginal p.d.f. If the joint distribution of $n$ random variables
$X_1, \ldots, X_n$ is known, then the marginal distribution of each single random
variable $X_i$ can be derived from this joint distribution. For example, if the joint
p.d.f. of $X_1, \ldots, X_n$ is $f$, then the marginal p.d.f. $f_1$ of $X_1$ is
specified at every value $x_1$ by the relation

\[ f_1(x_1) = \underbrace{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}}_{n-1} f(x_1, \ldots, x_n)\, dx_2 \cdots dx_n. \]

More generally, the marginal joint p.d.f. of any $k$ of the $n$ random variables
$X_1, \ldots, X_n$ can be found by integrating the joint p.d.f. over all possible
values of the other $n - k$ variables. For example, if $f$ is the joint p.d.f. of four
random variables $X_1$, $X_2$, $X_3$, and $X_4$, then the marginal bivariate p.d.f.
$f_{24}$ of $X_2$ and $X_4$ is specified at each point $(x_2, x_4)$ by the relation

\[ f_{24}(x_2, x_4) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x_1, x_2, x_3, x_4)\, dx_1\, dx_3. \]


Service Times in a Queue. Suppose that $n = 5$ in Example 3.7.5 and that we want the
marginal bivariate p.d.f. of $(X_1, X_4)$. We must integrate Eq. (3.7.3) over $x_2$,
$x_3$, and $x_5$. Since the joint p.d.f. is symmetric with respect to permutations of
the coordinates of $\mathbf{x}$, we shall just integrate over the last three variables
and then change the names of the remaining variables to $x_1$ and $x_4$. We already
saw how to do this in Example 3.7.5. The result is

\[ f_{12}(x_1, x_2) = \begin{cases} \dfrac{4}{(2 + x_1 + x_2)^3} & \text{for } x_1, x_2 > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.7.5} \]

Then $f_{14}$ is just like (3.7.5) with all the 2 subscripts changed to 4. The
univariate marginal p.d.f. of each $X_i$ is

\[ f_i(x_i) = \begin{cases} \dfrac{2}{(2 + x_i)^2} & \text{for } x_i > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.7.6} \]

So, for example, if we want to know how likely it is that a customer will have to wait
longer than three time units, we can calculate $\Pr(X_i > 3)$ by integrating the
function in Eq. (3.7.6) from 3 to $\infty$. The result is 0.4. ◄
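For completeness, the integral behind the value 0.4 can be written out from Eq. (3.7.6):

\[ \Pr(X_i > 3) = \int_3^{\infty} \frac{2}{(2 + x)^2}\, dx = \left[ -\frac{2}{2 + x} \right]_3^{\infty} = \frac{2}{5} = 0.4. \]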


If $n$ random variables $X_1, \ldots, X_n$ have a discrete joint distribution, then
the marginal joint p.f. of each subset of the $n$ variables can be obtained from
relations similar to those for continuous distributions. In the new relations, the
integrals are replaced by sums.


Deriving a Marginal c.d.f. Consider now a joint distribution for which the joint
c.d.f. of $X_1, \ldots, X_n$ is $F$. The marginal c.d.f. $F_1$ of $X_1$ can be
obtained from the following relation:

\[ F_1(x_1) = \Pr(X_1 \le x_1) = \Pr(X_1 \le x_1, X_2 < \infty, \ldots, X_n < \infty) = \lim_{x_2, \ldots, x_n \to \infty} F(x_1, x_2, \ldots, x_n). \]


Failure Times. We can find the marginal c.d.f. of $X_1$ from the joint c.d.f. in
Example 3.7.2 by letting $x_2$ and $x_3$ go to $\infty$. The limit is
$F_1(x_1) = 1 - e^{-x_1}$ for $x_1 > 0$ and 0 otherwise. ◄


More generally, the marginal joint c.d.f. of any $k$ of the $n$ random variables
$X_1, \ldots, X_n$ can be found by computing the limiting value of the
$n$-dimensional c.d.f. $F$ as $x_j \to \infty$ for each of the other $n - k$ variables
$x_j$. For example, if $F$ is the joint c.d.f. of four random variables $X_1$, $X_2$,
$X_3$, and $X_4$, then the marginal bivariate c.d.f. $F_{24}$ of $X_2$ and $X_4$ is
specified at every point $(x_2, x_4)$ by the relation

\[ F_{24}(x_2, x_4) = \lim_{x_1, x_3 \to \infty} F(x_1, x_2, x_3, x_4). \]


Failure Times. We can find the marginal bivariate c.d.f. of $X_1$ and $X_3$ from the
joint c.d.f. in Example 3.7.2 by letting $x_2$ go to $\infty$. The limit is

\[ F_{13}(x_1, x_3) = \begin{cases} (1 - e^{-x_1})(1 - e^{-3x_3}) & \text{for } x_1, x_3 > 0, \\ 0 & \text{otherwise.} \end{cases} \] ◄


Independent Random Variables 
Independent Random Variables. It is said that $n$ random variables
$X_1, \ldots, X_n$ are independent if, for every $n$ sets $A_1, A_2, \ldots, A_n$ of
real numbers,

\[ \Pr(X_1 \in A_1, X_2 \in A_2, \ldots, X_n \in A_n) = \Pr(X_1 \in A_1) \Pr(X_2 \in A_2) \cdots \Pr(X_n \in A_n). \]


If $X_1, \ldots, X_n$ are independent, it follows easily that the random variables in
every nonempty subset of $X_1, \ldots, X_n$ are also independent. (See Exercise 11.)
There is a generalization of Theorem 3.5.4.


Let $F$ denote the joint c.d.f. of $X_1, \ldots, X_n$, and let $F_i$ denote the
marginal univariate c.d.f. of $X_i$ for $i = 1, \ldots, n$. The variables
$X_1, \ldots, X_n$ are independent if and only if, for all points
$(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$,

\[ F(x_1, x_2, \ldots, x_n) = F_1(x_1) F_2(x_2) \cdots F_n(x_n). \] ■

Theorem 3.7.3 says that $X_1, \ldots, X_n$ are independent if and only if their joint
c.d.f. is the product of their $n$ individual marginal c.d.f.’s. It is easy to check
that the three random variables in Example 3.7.2 are independent using Theorem 3.7.3.

There is also a generalization of Corollary 3.5.1.


If $X_1, \ldots, X_n$ have a continuous, discrete, or mixed joint distribution for
which the joint p.d.f., joint p.f., or joint p.f./p.d.f. is $f$, and if $f_i$ is the
marginal univariate p.d.f. or p.f. of $X_i$ ($i = 1, \ldots, n$), then
$X_1, \ldots, X_n$ are independent if and only if the following relation is satisfied
at all points $(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$:

\[ f(x_1, x_2, \ldots, x_n) = f_1(x_1) f_2(x_2) \cdots f_n(x_n). \tag{3.7.7} \] ■


Service Times in a Queue. In Example 3.7.9, we can multiply together the two
univariate marginal p.d.f.’s of $X_1$ and $X_2$ calculated using Eq. (3.7.6) and see
that the product does not equal the bivariate marginal p.d.f. of $(X_1, X_2)$ in
Eq. (3.7.5). So $X_1$ and $X_2$ are not independent. ◄
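A single point at which the factorization in Eq. (3.7.7) fails is enough to rule out independence. The sketch below evaluates Eqs. (3.7.5) and (3.7.6) exactly at the arbitrarily chosen point $(x_1, x_2) = (1, 1)$; the function names are illustrative.

```python
from fractions import Fraction


def f12(x1, x2):
    """Bivariate marginal p.d.f. of (X_1, X_2), Eq. (3.7.5)."""
    if x1 > 0 and x2 > 0:
        return Fraction(4) / (2 + x1 + x2) ** 3
    return Fraction(0)


def f_marginal(x):
    """Univariate marginal p.d.f. of each X_i, Eq. (3.7.6)."""
    return Fraction(2) / (2 + x) ** 2 if x > 0 else Fraction(0)


joint = f12(1, 1)                             # 4/64 = 1/16
product = f_marginal(1) * f_marginal(1)       # (2/9)**2 = 4/81
assert joint != product
```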


Random Samples/i.i.d./Sample Size. Consider a given probability distribution on the
real line that can be represented by either a p.f. or a p.d.f. $f$. It is said that
$n$ random variables $X_1, \ldots, X_n$ form a random sample from this distribution if
these random variables are independent and the marginal p.f. or p.d.f. of each of them
is $f$. Such random variables are also said to be independent and identically
distributed, abbreviated i.i.d. We refer to the number $n$ of random variables as the
sample size.


Definition 3.7.6 says that $X_1, \ldots, X_n$ form a random sample from the
distribution represented by $f$ if their joint p.f. or p.d.f. $g$ is specified as
follows at all points $(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$:

\[ g(x_1, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n). \]

Clearly, an i.i.d. sample cannot have a mixed joint distribution.


Lifetimes of Light Bulbs. Suppose that the lifetime of each light bulb produced in a
certain factory is distributed according to the following p.d.f.:

\[ f(x) = \begin{cases} x e^{-x} & \text{for } x > 0, \\ 0 & \text{otherwise.} \end{cases} \]

We shall determine the joint p.d.f. of the lifetimes of a random sample of $n$ light
bulbs drawn from the factory’s production.

The lifetimes $X_1, \ldots, X_n$ of the selected bulbs will form a random sample from
the p.d.f. $f$. For typographical simplicity, we shall use the notation $\exp(v)$ to
denote the exponential $e^v$ when the expression for $v$ is complicated. Then the
joint p.d.f. $g$ of $X_1, \ldots, X_n$ will be as follows: If $x_i > 0$ for
$i = 1, \ldots, n$,

\[ g(x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i) = \left( \prod_{i=1}^{n} x_i \right) \exp\left( -\sum_{i=1}^{n} x_i \right). \]

Otherwise, $g(x_1, \ldots, x_n) = 0$.


Every probability involving the $n$ lifetimes $X_1, \ldots, X_n$ can in principle be
determined by integrating this joint p.d.f. over the appropriate subset of
$\mathbb{R}^n$. For example, if $C$ is the subset of points $(x_1, \ldots, x_n)$ such
that $x_i > 0$ for $i = 1, \ldots, n$ and $\sum_{i=1}^{n} x_i < a$, where $a$ is a
given positive number, then

\[ \Pr\left( \sum_{i=1}^{n} X_i < a \right) = \int \cdots \int_C \left( \prod_{i=1}^{n} x_i \right) \exp\left( -\sum_{i=1}^{n} x_i \right) dx_1 \cdots dx_n. \] ◄


The evaluation of the integral given at the end of Example 3.7.13 may require
a considerable amount of time without the aid of tables or a computer. Certain
other probabilities, however, can be evaluated easily from the basic properties of
continuous distributions and random samples. For example, suppose that for the
conditions of Example 3.7.13 it is desired to find
$\Pr(X_1 < X_2 < \cdots < X_n)$. Since the random variables $X_1, \ldots, X_n$ have a
continuous joint distribution, the probability that at least two of these random
variables will have the same value is 0. In fact, the probability is 0 that the vector
$(X_1, \ldots, X_n)$ will belong to each specific subset of $\mathbb{R}^n$ for which
the $n$-dimensional volume is 0. Furthermore, since $X_1, \ldots, X_n$ are independent
and identically distributed, each of these variables is equally likely to be the
smallest of the $n$ lifetimes, and each is equally likely to be the largest. More
generally, if the lifetimes $X_1, \ldots, X_n$ are arranged in order from the smallest
to the largest, each particular ordering of $X_1, \ldots, X_n$ is as likely to be
obtained as any other ordering. Since there are $n!$ different possible orderings, the
probability that the particular ordering $X_1 < X_2 < \cdots < X_n$ will be obtained
is $1/n!$. Hence,

\[ \Pr(X_1 < X_2 < \cdots < X_n) = \frac{1}{n!}. \]
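The symmetry argument can be checked by simulation. In the sketch below, the p.d.f. $x e^{-x}$ of Example 3.7.13 is sampled as a gamma distribution with shape 2 and scale 1 (an identification made here for convenience; the text has not introduced gamma distributions at this point). The function name, seed, and trial count are illustrative choices.

```python
import random


def estimate_ordered_probability(n, trials, seed=0):
    """Monte Carlo estimate of Pr(X_1 < X_2 < ... < X_n) for i.i.d. lifetimes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each lifetime has p.d.f. x * exp(-x) for x > 0.
        xs = [rng.gammavariate(2.0, 1.0) for _ in range(n)]
        if all(xs[i] < xs[i + 1] for i in range(n - 1)):
            hits += 1
    return hits / trials


# For n = 3 the estimate should be near 1/3! = 1/6.
estimate = estimate_ordered_probability(3, 60000)
```

Because the argument uses only the i.i.d. assumption and continuity, replacing the sampler with any other continuous distribution should give an estimate near the same value.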


Conditional Distributions 


Suppose that $n$ random variables $X_1, \ldots, X_n$ have a continuous joint
distribution for which the joint p.d.f. is $f$ and that $f_0$ denotes the marginal
joint p.d.f. of the $k < n$ random variables $X_1, \ldots, X_k$. Then for all values
of $x_1, \ldots, x_k$ such that $f_0(x_1, \ldots, x_k) > 0$, the conditional p.d.f. of
$(X_{k+1}, \ldots, X_n)$ given that $X_1 = x_1, \ldots, X_k = x_k$ is defined as
follows:

\[ g_{k+1 \cdots n}(x_{k+1}, \ldots, x_n \mid x_1, \ldots, x_k) = \frac{f(x_1, x_2, \ldots, x_n)}{f_0(x_1, \ldots, x_k)}. \]

The definition above generalizes to arbitrary joint distributions as follows.


Definition 3.7.7 Conditional p.f., p.d.f., or p.f./p.d.f. Suppose that the random
vector $\mathbf{X} = (X_1, \ldots, X_n)$ is divided into two subvectors $\mathbf{Y}$
and $\mathbf{Z}$, where $\mathbf{Y}$ is a $k$-dimensional random vector comprising $k$
of the $n$ random variables in $\mathbf{X}$, and $\mathbf{Z}$ is an
$(n - k)$-dimensional random vector comprising the other $n - k$ random variables in
$\mathbf{X}$. Suppose also that the $n$-dimensional joint p.f., p.d.f., or p.f./p.d.f.
of $(\mathbf{Y}, \mathbf{Z})$ is $f$ and that the marginal $(n - k)$-dimensional p.f.,
p.d.f., or p.f./p.d.f. of $\mathbf{Z}$ is $f_2$. Then for every given point
$\mathbf{z} \in \mathbb{R}^{n-k}$ such that $f_2(\mathbf{z}) > 0$, the conditional
$k$-dimensional p.f., p.d.f., or p.f./p.d.f. $g_1$ of $\mathbf{Y}$ given
$\mathbf{Z} = \mathbf{z}$ is defined as follows:

\[ g_1(\mathbf{y} \mid \mathbf{z}) = \frac{f(\mathbf{y}, \mathbf{z})}{f_2(\mathbf{z})} \quad \text{for } \mathbf{y} \in \mathbb{R}^k. \tag{3.7.8} \]

Eq. (3.7.8) can be rewritten as

\[ f(\mathbf{y}, \mathbf{z}) = g_1(\mathbf{y} \mid \mathbf{z}) f_2(\mathbf{z}), \tag{3.7.9} \]

which allows construction of the joint distribution from a conditional distribution
and a marginal distribution. As in the bivariate case, it is safe to assume that
$f(\mathbf{y}, \mathbf{z}) = 0$ whenever $f_2(\mathbf{z}) = 0$. Then Eq. (3.7.9) holds
for all $\mathbf{y}$ and $\mathbf{z}$ even though $g_1(\mathbf{y} \mid \mathbf{z})$ is
not uniquely defined.
Example 3.7.14 Service Times in a Queue. In Example 3.7.9, we calculated the marginal
bivariate distribution of two service times $\mathbf{Z} = (X_1, X_2)$. We can now find
the conditional three-dimensional p.d.f. of $\mathbf{Y} = (X_3, X_4, X_5)$ given
$\mathbf{Z} = (x_1, x_2)$ for every pair $(x_1, x_2)$ such that $x_1, x_2 > 0$:

\[ g_1(x_3, x_4, x_5 \mid x_1, x_2) = \frac{f(x_1, \ldots, x_5)}{f_{12}(x_1, x_2)} = \left( \frac{240}{(2 + x_1 + \cdots + x_5)^6} \right) \left( \frac{4}{(2 + x_1 + x_2)^3} \right)^{-1} = \frac{60 (2 + x_1 + x_2)^3}{(2 + x_1 + \cdots + x_5)^6} \tag{3.7.10} \]

for $x_3, x_4, x_5 > 0$, and 0 otherwise. The joint p.d.f. in (3.7.10) looks like a
bunch of symbols, but it can be quite useful. Suppose that we observe $X_1 = 4$ and
$X_2 = 6$. Then

\[ g_1(x_3, x_4, x_5 \mid 4, 6) = \begin{cases} \dfrac{103{,}680}{(12 + x_3 + x_4 + x_5)^6} & \text{for } x_3, x_4, x_5 > 0, \\ 0 & \text{otherwise.} \end{cases} \]

We can now calculate the conditional probability that $X_3 > 3$ given $X_1 = 4$,
$X_2 = 6$:

\begin{align*}
\Pr(X_3 > 3 \mid X_1 = 4, X_2 = 6)
  &= \int_3^{\infty} \int_0^{\infty} \int_0^{\infty} \frac{103{,}680}{(12 + x_3 + x_4 + x_5)^6}\, dx_5\, dx_4\, dx_3 \\
  &= \int_3^{\infty} \int_0^{\infty} \frac{20{,}736}{(12 + x_3 + x_4)^5}\, dx_4\, dx_3 \\
  &= \int_3^{\infty} \frac{5184}{(12 + x_3)^4}\, dx_3 \\
  &= \frac{1728}{15^3} = 0.512.
\end{align*}

Compare this to the calculation of $\Pr(X_3 > 3) = 0.4$ at the end of Example 3.7.9.
After learning that the first two service times are a bit longer than three time
units, we revise the probability that $X_3 > 3$ upward to reflect what we learned from
the first two observations. If the first two service times had been small, the
conditional probability that $X_3 > 3$ would have been smaller than 0.4. For example,
$\Pr(X_3 > 3 \mid X_1 = 1, X_2 = 1.5) = 0.216$. ◄
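Carrying out the three integrations above with general $x_1, x_2$ and a general threshold $t$ in place of 3 gives $\Pr(X_3 > t \mid x_1, x_2) = [s/(s + t)]^3$ with $s = 2 + x_1 + x_2$ (each integration divides the constant by 5, 4, and 3 in turn, so the factor 60 cancels). The short sketch below encodes this closed form, which is our restatement rather than a formula from the text, and reproduces both numbers quoted in the example.

```python
def prob_third_service_exceeds(t, x1, x2):
    """Pr(X_3 > t | X_1 = x1, X_2 = x2) obtained by integrating Eq. (3.7.10)."""
    s = 2 + x1 + x2
    return (s / (s + t)) ** 3


long_first_two = prob_third_service_exceeds(3, 4, 6)     # (12/15)**3
short_first_two = prob_third_service_exceeds(3, 1, 1.5)  # (4.5/7.5)**3
```

Setting $x_1 = x_2 = 0$ in this expression is not meaningful as a conditional probability, but it does show the shape of the dependence: larger observed service times push the probability toward 1.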


Determining a Marginal Bivariate p.d.f. Suppose that $Z$ is a random variable for
which the p.d.f. $f_0$ is as follows:

\[ f_0(z) = \begin{cases} 2 e^{-2z} & \text{for } z > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.7.11} \]

Suppose, furthermore, that for every given value $Z = z > 0$ two other random
variables $X_1$ and $X_2$ are independent and identically distributed and the
conditional p.d.f. of each of these variables is as follows:

\[ g(x \mid z) = \begin{cases} z e^{-zx} & \text{for } x > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.7.12} \]

We shall determine the marginal joint p.d.f. of $(X_1, X_2)$.
Since $X_1$ and $X_2$ are i.i.d. for each given value of $Z$, their conditional joint
p.d.f. when $Z = z > 0$ is

\[ g_{12}(x_1, x_2 \mid z) = \begin{cases} z^2 e^{-z(x_1 + x_2)} & \text{for } x_1, x_2 > 0, \\ 0 & \text{otherwise.} \end{cases} \]

The joint p.d.f. $f$ of $(Z, X_1, X_2)$ will be positive only at those points
$(z, x_1, x_2)$ such that $x_1, x_2, z > 0$. It now follows that, at every such point,

\[ f(z, x_1, x_2) = f_0(z)\, g_{12}(x_1, x_2 \mid z) = 2 z^2 e^{-z(2 + x_1 + x_2)}. \]

For $x_1 > 0$ and $x_2 > 0$, the marginal joint p.d.f. $f_{12}(x_1, x_2)$ of $X_1$ and
$X_2$ can be determined either using integration by parts or some special results that
will arise in Sec. 5.7:

\[ f_{12}(x_1, x_2) = \int_0^{\infty} f(z, x_1, x_2)\, dz = \frac{4}{(2 + x_1 + x_2)^3}, \]

for $x_1, x_2 > 0$. The reader will note that this p.d.f. is the same as the marginal
bivariate p.d.f. of $(X_1, X_2)$ found in Eq. (3.7.5).


From this marginal bivariate p.d.f., we can evaluate probabilities involving $X_1$
and $X_2$, such as $\Pr(X_1 + X_2 < 4)$. We have

\[ \Pr(X_1 + X_2 < 4) = \int_0^4 \int_0^{4 - x_2} \frac{4}{(2 + x_1 + x_2)^3}\, dx_1\, dx_2 = \frac{4}{9}. \] ◄
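The two integrals in this example can be written out in a few lines. Writing $s = 2 + x_1 + x_2$ and using $\int_0^{\infty} z^2 e^{-sz}\, dz = 2/s^3$ (two integrations by parts), the marginal p.d.f. is

\[ f_{12}(x_1, x_2) = \int_0^{\infty} 2 z^2 e^{-sz}\, dz = 2 \cdot \frac{2}{s^3} = \frac{4}{(2 + x_1 + x_2)^3}. \]

For the probability, the inner integral is $\int_0^{4 - x_2} 4 (2 + x_1 + x_2)^{-3}\, dx_1 = 2 (2 + x_2)^{-2} - 2/36$, so that

\[ \Pr(X_1 + X_2 < 4) = \int_0^4 \left[ \frac{2}{(2 + x_2)^2} - \frac{1}{18} \right] dx_2 = 2 \left( \frac{1}{2} - \frac{1}{6} \right) - \frac{4}{18} = \frac{2}{3} - \frac{2}{9} = \frac{4}{9}. \]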
Service Times in a Queue. We can think of the random variable $Z$ in Example 3.7.15
as the rate at which customers are served in the queue of Example 3.7.5. With this
interpretation, it is useful to find the conditional distribution of the rate $Z$
after we observe some of the service times such as $X_1$ and $X_2$.

For every value of $z$, the conditional p.d.f. of $Z$ given $X_1 = x_1$ and
$X_2 = x_2$ is

\[ g_0(z \mid x_1, x_2) = \frac{f(z, x_1, x_2)}{f_{12}(x_1, x_2)} = \begin{cases} \frac{1}{2} (2 + x_1 + x_2)^3 z^2 e^{-z(2 + x_1 + x_2)} & \text{for } z > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.7.13} \]

Finally, we shall evaluate $\Pr(Z < 1 \mid X_1 = 1, X_2 = 4)$. We have

\[ \Pr(Z < 1 \mid X_1 = 1, X_2 = 4) = \int_0^1 g_0(z \mid 1, 4)\, dz = \int_0^1 171.5\, z^2 e^{-7z}\, dz = 0.9704. \] ◄
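The value 0.9704 can be reproduced without numerical integration. Integrating Eq. (3.7.13) by parts twice gives $\Pr(Z \le z \mid x_1, x_2) = 1 - e^{-u}(1 + u + u^2/2)$ with $u = z(2 + x_1 + x_2)$; this closed form is derived here and does not appear in the text. A minimal sketch:

```python
import math


def conditional_cdf_of_rate(z, x1, x2):
    """Pr(Z <= z | X_1 = x1, X_2 = x2) for the p.d.f. in Eq. (3.7.13)."""
    if z <= 0:
        return 0.0
    u = z * (2 + x1 + x2)
    return 1.0 - math.exp(-u) * (1.0 + u + u * u / 2.0)


answer = conditional_cdf_of_rate(1, 1, 4)  # u = 7
```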
0 


Law of Total Probability and Bayes’ Theorem Example 3.7.15 contains an example
of the multivariate version of the law of total probability, while Example 3.7.16
contains an example of the multivariate version of Bayes’ theorem. The proofs of
the general versions are straightforward consequences of Definition 3.7.7.


Multivariate Law of Total Probability and Bayes’ Theorem. Assume the conditions and
notation given in Definition 3.7.7. If $\mathbf{Z}$ has a continuous joint
distribution, the marginal p.d.f. of $\mathbf{Y}$ is

\[ f_1(\mathbf{y}) = \underbrace{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}}_{n-k} g_1(\mathbf{y} \mid \mathbf{z}) f_2(\mathbf{z})\, d\mathbf{z}, \tag{3.7.14} \]

and the conditional p.d.f. of $\mathbf{Z}$ given $\mathbf{Y} = \mathbf{y}$ is

\[ g_2(\mathbf{z} \mid \mathbf{y}) = \frac{g_1(\mathbf{y} \mid \mathbf{z}) f_2(\mathbf{z})}{f_1(\mathbf{y})}. \tag{3.7.15} \]

If $\mathbf{Z}$ has a discrete joint distribution, then the multiple integral in
(3.7.14) must be replaced by a multiple summation. If $\mathbf{Z}$ has a mixed joint
distribution, the multiple integral must be replaced by integration over those
coordinates with continuous distributions and summation over those coordinates with
discrete distributions. ■


Conditionally Independent Random Variables In Examples 3.7.15 and 3.7.16,
$\mathbf{Z}$ is the single random variable $Z$ and $\mathbf{Y} = (X_1, X_2)$. These
examples also illustrate the use of conditionally independent random variables. That
is, $X_1$ and $X_2$ are conditionally independent given $Z = z$ for all $z > 0$. In
Example 3.7.16, we said that $Z$ was the rate at which customers were served. When
this rate is unknown, it is a major source of uncertainty. Partitioning the sample
space by the values of the rate $Z$ and then conditioning on each value of $Z$ removes
a major source of uncertainty for part of the calculation.

In general, conditional independence for random variables is similar to conditional
independence for events.


Definition 3.7.8 Conditionally Independent Random Variables. Let $\mathbf{Z}$ be a
random vector with joint p.f., p.d.f., or p.f./p.d.f. $f_0(\mathbf{z})$. Several
random variables $X_1, \ldots, X_n$ are conditionally independent given $\mathbf{Z}$
if, for all $\mathbf{z}$ such that $f_0(\mathbf{z}) > 0$, we have

\[ g(\mathbf{x} \mid \mathbf{z}) = \prod_{i=1}^{n} g_i(x_i \mid \mathbf{z}), \]

where $g(\mathbf{x} \mid \mathbf{z})$ stands for the conditional multivariate p.f.,
p.d.f., or p.f./p.d.f. of $\mathbf{X}$ given $\mathbf{Z} = \mathbf{z}$ and
$g_i(x_i \mid \mathbf{z})$ stands for the conditional univariate p.f. or p.d.f. of
$X_i$ given $\mathbf{Z} = \mathbf{z}$.


In Example 3.7.15, $g_i(x_i \mid z) = z e^{-z x_i}$ for $x_i > 0$ and $i = 1, 2$.


A Clinical Trial. In Example 3.7.8, the joint p.f./p.d.f. given there was constructed
by assuming that $X_1, \ldots, X_m$ were conditionally independent given $P = p$ each
with the same conditional p.f., $g_i(x_i \mid p) = p^{x_i} (1 - p)^{1 - x_i}$ for
$x_i \in \{0, 1\}$ and that $P$ had the uniform distribution on the interval $[0, 1]$.
These assumptions produce, in the notation of Definition 3.7.8,

\[ g(\mathbf{x} \mid p) = \begin{cases} p^{x_1 + \cdots + x_m} (1 - p)^{m - x_1 - \cdots - x_m} & \text{for all } x_i \in \{0, 1\}, \\ 0 & \text{otherwise,} \end{cases} \]

for $0 < p < 1$. Combining this with the marginal p.d.f. of $P$, $f_2(p) = 1$ for
$0 < p < 1$ and 0 otherwise, we get the joint p.f./p.d.f. given in Example 3.7.8. ◄


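Conditional Versions of Past and Future Theorems We mentioned earlier that
conditional distributions behave just like distributions. Hence, all theorems that we
have proven and will prove in the future have conditional versions. For example,
the law of total probability in Eq. (3.7.14) has the following version conditional on
another random vector $\mathbf{W} = \mathbf{w}$:

\[ f_1(\mathbf{y} \mid \mathbf{w}) = \underbrace{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}}_{n-k} g_1(\mathbf{y} \mid \mathbf{z}, \mathbf{w}) f_2(\mathbf{z} \mid \mathbf{w})\, d\mathbf{z}, \tag{3.7.16} \]

where $f_1(\mathbf{y} \mid \mathbf{w})$ stands for the conditional p.d.f., p.f., or
p.f./p.d.f. of $\mathbf{Y}$ given $\mathbf{W} = \mathbf{w}$,
$g_1(\mathbf{y} \mid \mathbf{z}, \mathbf{w})$ stands for the conditional p.d.f., p.f.,
or p.f./p.d.f. of $\mathbf{Y}$ given
$(\mathbf{Z}, \mathbf{W}) = (\mathbf{z}, \mathbf{w})$, and
$f_2(\mathbf{z} \mid \mathbf{w})$ stands for the conditional p.d.f. of $\mathbf{Z}$
given $\mathbf{W} = \mathbf{w}$. Using the same notation, the conditional version of
Bayes’ theorem is

\[ g_2(\mathbf{z} \mid \mathbf{y}, \mathbf{w}) = \frac{g_1(\mathbf{y} \mid \mathbf{z}, \mathbf{w}) f_2(\mathbf{z} \mid \mathbf{w})}{f_1(\mathbf{y} \mid \mathbf{w})}. \tag{3.7.17} \]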


Conditioning on Random Variables in Sequence. In Example 3.7.15, we found the
conditional p.d.f. of $Z$ given $(X_1, X_2) = (x_1, x_2)$. Suppose now that there are
three more observations available, $X_3$, $X_4$, and $X_5$, and suppose that all of
$X_1, \ldots, X_5$ are conditionally i.i.d. given $Z = z$ with p.d.f. $g(x \mid z)$.
We shall use the conditional version of Bayes’ theorem to compute the conditional
p.d.f. of $Z$ given $(X_1, \ldots, X_5) = (x_1, \ldots, x_5)$. First, we shall find
the conditional p.d.f. $g_{345}(x_3, x_4, x_5 \mid x_1, x_2, z)$ of
$\mathbf{Y} = (X_3, X_4, X_5)$ given $Z = z$ and
$\mathbf{W} = (X_1, X_2) = (x_1, x_2)$. We shall use the notation for p.d.f.’s in the
discussion immediately preceding this example. Since $X_1, \ldots, X_5$ are
conditionally i.i.d. given $Z$, we have that
$g_1(\mathbf{y} \mid z, \mathbf{w})$ does not depend on $\mathbf{w}$. In fact,

\[ g_1(\mathbf{y} \mid z, \mathbf{w}) = g(x_3 \mid z)\, g(x_4 \mid z)\, g(x_5 \mid z) = z^3 e^{-z(x_3 + x_4 + x_5)} \]

for $x_3, x_4, x_5 > 0$. We also need the conditional p.d.f. of $Z$ given
$\mathbf{W} = \mathbf{w}$, which was calculated in Eq. (3.7.13), and we now denote it

\[ f_2(z \mid \mathbf{w}) = \tfrac{1}{2} (2 + x_1 + x_2)^3 z^2 e^{-z(2 + x_1 + x_2)}. \]

Finally, we need the conditional p.d.f. of the last three observations given the first
two. This was calculated in Example 3.7.14, and we now denote it

\[ f_1(\mathbf{y} \mid \mathbf{w}) = \frac{60 (2 + x_1 + x_2)^3}{(2 + x_1 + \cdots + x_5)^6}. \]

Now combine these using Bayes’ theorem (3.7.17) to obtain

\[ g_2(z \mid \mathbf{y}, \mathbf{w}) = \frac{z^3 e^{-z(x_3 + x_4 + x_5)} \cdot \frac{1}{2} (2 + x_1 + x_2)^3 z^2 e^{-z(2 + x_1 + x_2)}}{\dfrac{60 (2 + x_1 + x_2)^3}{(2 + x_1 + \cdots + x_5)^6}} = \frac{1}{120} (2 + x_1 + \cdots + x_5)^6 z^5 e^{-z(2 + x_1 + \cdots + x_5)}, \]

for $z > 0$. ◄
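The two conditional p.d.f.'s of $Z$ found so far share one pattern: after $n$ observations the density is $s^{n+1} z^n e^{-zs}/n!$ with $s = 2 + x_1 + \cdots + x_n$. That general form is our own extrapolation from the cases $n = 2$ and $n = 5$, obtained by normalizing $2e^{-2z} \cdot z^n e^{-z \sum x_i}$. The sketch below, with illustrative function names and made-up observations, checks that updating in two stages through Eq. (3.7.17) agrees with conditioning on all five observations at once.

```python
import math


def conditional_pdf_of_rate(z, xs):
    """Conditional p.d.f. of Z given the observed service times xs.

    Assumes the model of Example 3.7.15: Z has p.d.f. 2*exp(-2z) and the
    observations are conditionally i.i.d. with p.d.f. z*exp(-z*x).
    """
    if z <= 0:
        return 0.0
    n = len(xs)
    s = 2 + sum(xs)
    return s ** (n + 1) * z ** n * math.exp(-z * s) / math.factorial(n)


def two_stage_update(z, first_two, last_three):
    """Right-hand side of Eq. (3.7.17) with W = first_two and Y = last_three."""
    s2 = 2 + sum(first_two)
    s5 = s2 + sum(last_three)
    g1 = z ** 3 * math.exp(-z * sum(last_three))  # g_1(y | z, w)
    f2 = conditional_pdf_of_rate(z, first_two)    # f_2(z | w), Eq. (3.7.13)
    f1 = 60 * s2 ** 3 / s5 ** 6                   # f_1(y | w), Eq. (3.7.10)
    return g1 * f2 / f1


xs = [1.0, 4.0, 2.0, 0.5, 3.0]  # hypothetical observations for illustration
z = 0.3
assert math.isclose(two_stage_update(z, xs[:2], xs[2:]),
                    conditional_pdf_of_rate(z, xs))
```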


Note: Simple Rule for Creating Conditional Versions of Results. If you ever wish to
determine the conditional version given $\mathbf{W} = \mathbf{w}$ of a result that you
have proven, here is a simple method. Just add “conditional on
$\mathbf{W} = \mathbf{w}$” to every probabilistic statement in the result. This
includes all probabilities, c.d.f.’s, quantiles, names of distributions, p.d.f.’s,
p.f.’s, and so on. It also includes all future probabilistic concepts that we
introduce in later chapters (such as expected values and variances in Chapter 4).


Note: Independence is a Special Case of Conditional Independence. Let
$X_1, \ldots, X_n$ be independent random variables, and let $W$ be a constant random
variable. That is, there is a constant $c$ such that $\Pr(W = c) = 1$. Then
$X_1, \ldots, X_n$ are also conditionally independent given $W = c$. The proof is
straightforward and is left to the reader (Exercise 15). This result is not
particularly interesting in its own right. Its value is the following: If we prove a
result for conditionally independent random variables or conditionally i.i.d. random
variables, then the same result will hold for independent random variables or i.i.d.
random variables as the case may be.


Histograms 


Rate of Service. In Examples 3.7.5 and 3.7.6, we considered customers arriving at a
queue and being served. Let $Z$ stand for the rate at which customers were served,
and let $X_1, X_2, \ldots$ stand for the times that the successive customers required
for service. Assume that $X_1, X_2, \ldots$ are conditionally i.i.d. given $Z = z$
with p.d.f.

\[ g(x \mid z) = \begin{cases} z e^{-zx} & \text{for } x > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.7.18} \]

This is the same as (3.7.12) from Example 3.7.15. In that example, we modeled $Z$ as
a random variable with p.d.f. $f(z) = 2 \exp(-2z)$ for $z > 0$. In this example, we
shall assume that $X_1, \ldots, X_n$ will be observed for some large value $n$, and we
want to think about what these observations tell us about $Z$. To be specific, suppose
that we observe $n = 100$ service times. The first 10 times are listed here:

1.39, 0.61, 2.47, 3.35, 2.56, 3.60, 0.32, 1.43, 0.51, 0.94.

The smallest and largest observed service times from the entire sample are 0.004 and
9.60, respectively. It would be nice to have a graphical display of the entire sample
of $n = 100$ service times without having to list them separately. ◄

Figure 3.22 Histogram of service times for Example 3.7.20 with $a = 0$, $b = 10$,
$k = 10$, and $r = 100$.


The histogram, defined below, is a graphical display of a collection of numbers. 
It is particularly useful for displaying the observed values of a collection of random 
variables that have been modeled as conditionally i.i.d. 


Histogram. Let $x_1, \ldots, x_n$ be a collection of numbers that all lie between two
values $a < b$. That is, $a \le x_i \le b$ for all $i = 1, \ldots, n$. Choose some
integer $k \ge 1$ and divide the interval $[a, b]$ into $k$ equal-length subintervals
of length $(b - a)/k$. For each subinterval, count how many of the numbers
$x_1, \ldots, x_n$ are in the subinterval. Let $c_i$ be the count for subinterval $i$
for $i = 1, \ldots, k$. Choose a number $r > 0$. (Typically, $r = 1$ or $r = n$ or
$r = n(b - a)/k$.) Draw a two-dimensional graph with the horizontal axis running from
$a$ to $b$. For each subinterval $i = 1, \ldots, k$, draw a rectangular bar of width
$(b - a)/k$ and height equal to $c_i/r$ over the midpoint of the $i$th interval. Such
a graph is called a histogram.


The choice of the number r in the definition of histogram depends on what one 
wishes to be displayed on the vertical axis. The shape of the histogram is identical 
regardless of what value one chooses for r. With r = 1, the height of each bar is the raw
count for each subinterval, and counts are displayed on the vertical axis. With r =n, 
the height of each bar is the proportion of the set of numbers in each subinterval, 
and the vertical axis displays proportions. With r = n(b — a)/k, the area of each bar 
is the proportion of the set of numbers in each subinterval. 
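The three choices of $r$ can be compared directly in code. The sketch below computes bar heights $c_i/r$ for the ten service times listed in Example 3.7.19 (only those ten, not the full sample of 100, which the text does not list). The function name is illustrative, and the rule that a value falling exactly on an interior boundary is counted in the subinterval to its right is an assumption made here; the definition leaves that choice open.

```python
def histogram_heights(xs, a, b, k, r):
    """Bar heights c_i / r for k equal-length subintervals of [a, b]."""
    width = (b - a) / k
    counts = [0] * k
    for x in xs:
        if not a <= x <= b:
            raise ValueError(f"{x} lies outside [{a}, {b}]")
        # Boundary values go to the right-hand subinterval, except x == b,
        # which is placed in the last subinterval.
        i = min(int((x - a) / width), k - 1)
        counts[i] += 1
    return [c / r for c in counts]


times = [1.39, 0.61, 2.47, 3.35, 2.56, 3.60, 0.32, 1.43, 0.51, 0.94]
n, a, b, k = len(times), 0.0, 10.0, 10

raw_counts = histogram_heights(times, a, b, k, r=1)
proportions = histogram_heights(times, a, b, k, r=n)
densities = histogram_heights(times, a, b, k, r=n * (b - a) / k)

# With the third choice of r, the bar areas add up to 1.
total_area = sum(h * (b - a) / k for h in densities)
```

Here the subintervals have length 1, so the second and third choices of $r$ coincide; with any other subinterval length the two sets of heights differ by the factor $(b - a)/k$.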


Rate of Service. The $n = 100$ observed service times in Example 3.7.19 all lie
between 0 and 10. It is convenient, in this example, to draw a histogram with
horizontal axis running from 0 to 10 and divided into 10 subintervals of length 1
each. Other choices are possible, but this one will do for illustration. Figure 3.22
contains the histogram of the 100 observed service times with $r = 100$. One sees that
the numbers of observed service times in the subintervals decrease as the center of
the subinterval increases. This matches the behavior of the conditional p.d.f.
$g(x \mid z)$ of the service times as a function of $x$ for fixed $z$. ◄


Histograms are useful as more than just graphical displays of large sets of numbers.
After we see the law of large numbers (Theorem 6.2.4), we can show that the
histogram of a large (conditionally) i.i.d. sample of continuous random variables is
an approximation to the (conditional) p.d.f. of the random variables in the sample,
so long as one uses the third choice of $r$, namely, $r = n(b - a)/k$.


Note: More General Histograms. Sometimes it is convenient to divide the range of
the numbers to be plotted in a histogram into unequal-length subintervals. In such a
case, one would typically let the height of each bar be $c_i/r_i$, where $c_i$ is the
raw count and $r_i$ is proportional to the length of the $i$th subinterval. In this
way, the area of each bar is still proportional to the count or proportion in each
subinterval.


Summary

A finite collection of random variables is called a random vector. We have defined
joint distributions for arbitrary random vectors. Every random vector has a joint
c.d.f. Continuous random vectors have a joint p.d.f. Discrete random vectors have a
joint p.f. Mixed distribution random vectors have a joint p.f./p.d.f. The coordinates
of an $n$-dimensional random vector $\mathbf{X}$ are independent if the joint p.f.,
p.d.f., or p.f./p.d.f. $f(\mathbf{x})$ factors into $\prod_{i=1}^{n} f_i(x_i)$.

We can compute marginal distributions of subvectors of a random vector, and
we can compute the conditional distribution of one subvector given the rest of the
vector. We can construct a joint distribution for a random vector by piecing together
a marginal distribution for part of the vector and a conditional distribution for the
rest given the first part. There are versions of Bayes’ theorem and the law of total
probability for random vectors.

An $n$-dimensional random vector $\mathbf{X}$ has coordinates that are conditionally
independent given $\mathbf{Z}$ if the conditional p.f., p.d.f., or p.f./p.d.f.
$g(\mathbf{x} \mid \mathbf{z})$ of $\mathbf{X}$ given $\mathbf{Z} = \mathbf{z}$
factors into $\prod_{i=1}^{n} g_i(x_i \mid \mathbf{z})$. There are versions of Bayes’
theorem, the law of total probability, and all future theorems about random variables
and random vectors conditional on an arbitrary additional random vector.
Exercises 


1. Suppose that three random variables $X_1$, $X_2$, and $X_3$ have a continuous joint
distribution with the following joint p.d.f.:

\[ f(x_1, x_2, x_3) = \begin{cases} c(x_1 + 2x_2 + 3x_3) & \text{for } 0 < x_i < 1 \ (i = 1, 2, 3), \\ 0 & \text{otherwise.} \end{cases} \]

Determine (a) the value of the constant $c$;
(b) the marginal joint p.d.f. of $X_1$ and $X_3$; and


(c) Pr (x3 < |X = is XxX) = 3). 


2. Suppose that three random variables $X_1$, $X_2$, and $X_3$
have a mixed joint distribution with p.f./p.d.f.:
\[
f(x_1, x_2, x_3) =
\begin{cases}
c\,x_1^{1 + x_2 + x_3}(1 - x_1)^{3 - x_2 - x_3} & \text{if } 0 < x_1 < 1 \text{ and } x_2, x_3 \in \{0, 1\}, \\
0 & \text{otherwise.}
\end{cases}
\]
(Notice that $X_1$ has a continuous distribution and $X_2$ and
$X_3$ have discrete distributions.) Determine (a) the value of
the constant c; (b) the marginal joint p.f. of $X_2$ and $X_3$; and
(c) the conditional p.d.f. of $X_1$ given $X_2 = 1$ and $X_3 = 1$.


3. Suppose that three random variables $X_1$, $X_2$, and $X_3$
have a continuous joint distribution with the following
joint p.d.f.:
\[
f(x_1, x_2, x_3) =
\begin{cases}
c\,e^{-(x_1 + 2x_2 + 3x_3)} & \text{for } x_i > 0 \ (i = 1, 2, 3), \\
0 & \text{otherwise.}
\end{cases}
\]
Determine (a) the value of the constant c; (b) the marginal
joint p.d.f. of $X_1$ and $X_3$; and (c) $\Pr(X_1 < 1 \mid X_2 = 2, X_3 = 1)$.


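4. Suppose that a point $(X_1, X_2, X_3)$ is chosen at random,
that is, in accordance with the uniform p.d.f., from the
following set S:
\[
S = \{(x_1, x_2, x_3) : 0 \le x_i \le 1 \text{ for } i = 1, 2, 3\}.
\]
Determine:

a. $\Pr\left[(X_1 - \tfrac{1}{2})^2 + (X_2 - \tfrac{1}{2})^2 + (X_3 - \tfrac{1}{2})^2 \le \tfrac{1}{4}\right]$
b. $\Pr(X_1^2 + X_2^2 + X_3^2 \le 1)$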


5. Suppose that an electronic system contains n compo- 
nents that function independently of each other and that 
the probability that component i will function properly is
$p_i$ ($i = 1, \ldots, n$). It is said that the components are connected
in series if a necessary and sufficient condition for
the system to function properly is that all n components 
function properly. It is said that the components are con- 
nected in parallel if a necessary and sufficient condition for 
the system to function properly is that at least one of the 
n components functions properly. The probability that the 
system will function properly is called the reliability of the 
system. Determine the reliability of the system, (a) assum- 
ing that the components are connected in series, and (b) 
assuming that the components are connected in parallel. 


6. Suppose that the n random variables $X_1, \ldots, X_n$ form a
random sample from a discrete distribution for which the
p.f. is f. Determine the value of $\Pr(X_1 = X_2 = \cdots = X_n)$.


7. Suppose that the n random variables $X_1, \ldots, X_n$ form a
random sample from a continuous distribution for which
the p.d.f. is f. Determine the probability that at least k
of these n random variables will lie in a specified interval
$a \le x \le b$.


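8. Suppose that the p.d.f. of a random variable X is as
follows:
\[
f(x) =
\begin{cases}
\dfrac{1}{n!}\,x^n e^{-x} & \text{for } x > 0, \\
0 & \text{otherwise.}
\end{cases}
\]
Suppose also that for any given value X = x (x > 0), the n
random variables $Y_1, \ldots, Y_n$ are i.i.d. and the conditional
p.d.f. g of each of them is as follows:
\[
g(y \mid x) =
\begin{cases}
\dfrac{1}{x} & \text{for } 0 < y < x, \\
0 & \text{otherwise.}
\end{cases}
\]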


Determine (a) the marginal joint p.d.f. of $Y_1, \ldots, Y_n$ and
(b) the conditional p.d.f. of X for any given values of
$Y_1, \ldots, Y_n$.


9. Let X be a random variable with a continuous distribution.
Let $X_1 = X_2 = X$.

a. Prove that both $X_1$ and $X_2$ have a continuous distribution.

b. Prove that $\mathbf{X} = (X_1, X_2)$ does not have a continuous
joint distribution.


10. Return to the situation described in Example 3.7.18.
Let $\mathbf{X} = (X_1, \ldots, X_5)$ and compute the conditional p.d.f.
of Z given $\mathbf{X} = \mathbf{x}$ directly in one step, as if all of $\mathbf{X}$ were
observed at the same time.


11. Suppose that $X_1, \ldots, X_n$ are independent. Let k < n
and let $i_1, \ldots, i_k$ be distinct integers between 1 and n.
Prove that $X_{i_1}, \ldots, X_{i_k}$ are independent.

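12. Let $\mathbf{X}$ be a random vector that is split into three parts,
$\mathbf{X} = (\mathbf{Y}, \mathbf{Z}, \mathbf{W})$. Suppose that $\mathbf{X}$ has a continuous joint
distribution with p.d.f. $f(\mathbf{y}, \mathbf{z}, \mathbf{w})$. Let $g_1(\mathbf{y}, \mathbf{z} \mid \mathbf{w})$ be the
conditional p.d.f. of $(\mathbf{Y}, \mathbf{Z})$ given $\mathbf{W} = \mathbf{w}$, and let $g_2(\mathbf{y} \mid \mathbf{w})$
be the conditional p.d.f. of $\mathbf{Y}$ given $\mathbf{W} = \mathbf{w}$. Prove that
$g_2(\mathbf{y} \mid \mathbf{w}) = \int g_1(\mathbf{y}, \mathbf{z} \mid \mathbf{w})\, d\mathbf{z}$.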


13. Let $X_1, X_2, X_3$ be conditionally independent given
Z = z for all z with the conditional p.d.f. g(x|z) in Eq.
(3.7.12). Also, let the marginal p.d.f. of Z be $f_0$ in
Eq. (3.7.11). Prove that the conditional p.d.f. of $X_3$ given
$(X_1, X_2) = (x_1, x_2)$ is $\int_0^{\infty} g(x_3 \mid z)\, g_0(z \mid x_1, x_2)\, dz$, where $g_0$ is
defined in Eq. (3.7.13). (You can prove this even if you
cannot compute the integral in closed form.)


14. Consider the situation described in Example 3.7.14.
Suppose that $X_1 = 5$ and $X_2 = 7$ are observed.

a. Compute the conditional p.d.f. of $X_3$ given $(X_1, X_2) =
(5, 7)$. (You may use the result stated in Exercise 12.)

b. Find the conditional probability that $X_3 > 3$ given
$(X_1, X_2) = (5, 7)$ and compare it to the value of
$\Pr(X_3 > 3)$ found in Example 3.7.9. Can you suggest
a reason why the conditional probability should be
higher than the marginal probability?


15. Let $X_1, \ldots, X_n$ be independent random variables, and
let W be a random variable such that Pr(W = c) = 1 for
some constant c. Prove that $X_1, \ldots, X_n$ are conditionally
independent given W = c.


3.8 Functions of a Random Variable 


Often we find that after we compute the distribution of a random variable X, we
really want the distribution of some function of X. For example, if X is the rate at
which customers are served in a queue, then 1/X is the average waiting time. If we
have the distribution of X, we should be able to determine the distribution of 1/X
or of any other function of X. How to do that is the subject of this section.


Random Variable with a Discrete Distribution


Example 3.8.1  Distance from the Middle. Let X have the uniform distribution on the integers
1, 2, ..., 9. Suppose that we are interested in how far X is from the middle of the
distribution, namely, 5. We could define Y = |X − 5| and compute probabilities such
as $\Pr(Y = 1) = \Pr(X \in \{4, 6\}) = 2/9$. ◁


Example 3.8.1 illustrates the general procedure for finding the distribution of a 
function of a discrete random variable. The general result is straightforward. 


Theorem 3.8.1  Function of a Discrete Random Variable. Let X have a discrete distribution with p.f. f,
and let Y = r(X) for some function r defined on the set of possible values of X.
For each possible value y of Y, the p.f. g of Y is
\[
g(y) = \Pr(Y = y) = \Pr[r(X) = y] = \sum_{x : r(x) = y} f(x). \qquad ∎
\]


Example 3.8.2  Distance from the Middle. The possible values of Y in Example 3.8.1 are 0, 1, 2, 3,
and 4. We see that Y = 0 if and only if X = 5, so g(0) = f(5) = 1/9. For all other
values of Y, there are two values of X that give that value of Y. For example,
$\{Y = 4\} = \{X = 1\} \cup \{X = 9\}$. So, g(y) = 2/9 for y = 1, 2, 3, 4. ◁

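The sum in Theorem 3.8.1 can be carried out mechanically. The following is a minimal sketch in Python, not part of the original text; the function name `pf_of_function` and the dictionary representation of a p.f. are assumptions made for illustration. It reproduces the p.f. found in Example 3.8.2.

```python
from collections import defaultdict
from fractions import Fraction


def pf_of_function(f, r):
    """Return the p.f. g of Y = r(X), where f maps each possible x to Pr(X = x)."""
    g = defaultdict(Fraction)
    for x, prob in f.items():
        # Add f(x) to the running total for the value y = r(x).
        g[r(x)] += prob
    return dict(g)


# X uniform on the integers 1, ..., 9 and Y = |X - 5|, as in Example 3.8.2.
f = {x: Fraction(1, 9) for x in range(1, 10)}
g = pf_of_function(f, lambda x: abs(x - 5))

for y in sorted(g):
    print(y, g[y])
```

Exact fractions are used so that the result can be compared with 1/9 and 2/9 without rounding error.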

Random Variable with a Continuous Distribution 


If a random variable X has a continuous distribution, then the procedure for deriving
the probability distribution of a function of X differs from that given for a discrete 
distribution. One way to proceed is by direct calculation as in Example 3.8.3. 


Example 3.8.3  Average Waiting Time. Let Z be the rate at which customers are served in a queue,
and suppose that Z has a continuous c.d.f. F. The average waiting time is Y = 1/Z.
If we want to find the c.d.f. G of Y, we can write
\[
G(y) = \Pr(Y \le y) = \Pr\left(\frac{1}{Z} \le y\right) = \Pr\left(Z \ge \frac{1}{y}\right)
     = \Pr\left(Z > \frac{1}{y}\right) = 1 - F\left(\frac{1}{y}\right),
\]
where the fourth equality follows from the fact that Z has a continuous distribution
so that Pr(Z = 1/y) = 0. ◁


In general, suppose that the p.d.f. of X is f and that another random variable is 
defined as Y = r(X). For each real number y, the c.d.f. G(y) of Y can be derived as 
follows: 


\[
G(y) = \Pr(Y \le y) = \Pr[r(X) \le y] = \int_{\{x : r(x) \le y\}} f(x)\, dx.
\]

If the random variable Y also has a continuous distribution, its p.d.f. g can be obtained
from the relation
\[
g(y) = \frac{dG(y)}{dy}.
\]


This relation is satisfied at every point y at which G is differentiable. 


Example 3.8.4  Deriving the p.d.f. of $X^2$ when X Has a Uniform Distribution. Suppose that X has the
uniform distribution on the interval [−1, 1], so
\[
f(x) =
\begin{cases}
1/2 & \text{for } -1 \le x \le 1, \\
0 & \text{otherwise.}
\end{cases}
\]
We shall determine the p.d.f. of the random variable $Y = X^2$.
Since $Y = X^2$, then Y must belong to the interval $0 \le Y \le 1$. Thus, for each value
of y such that 0 < y < 1, the c.d.f. G(y) of Y is
\[
G(y) = \Pr(Y \le y) = \Pr(X^2 \le y) = \Pr(-y^{1/2} \le X \le y^{1/2})
     = \int_{-y^{1/2}}^{y^{1/2}} f(x)\, dx = y^{1/2}.
\]
For 0 < y < 1, it follows that the p.d.f. g(y) of Y is
\[
g(y) = \frac{dG(y)}{dy} = \frac{1}{2y^{1/2}}.
\]
This p.d.f. of Y is sketched in Fig. 3.23. It should be noted that although Y is
simply the square of a random variable with a uniform distribution, the p.d.f. of Y is
unbounded in the neighborhood of y = 0. ◁


Figure 3.23  The p.d.f. of $Y = X^2$ in Example 3.8.4.

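The c.d.f. derived in Example 3.8.4 can be checked by simulation. The sketch below is an illustration only and is not from the original text; the sample size and seed are arbitrary choices, and `empirical_cdf` is a helper name introduced here. It compares the fraction of simulated values of $X^2$ that are at most y with $y^{1/2}$.

```python
import math
import random

random.seed(0)  # arbitrary seed so that the run is repeatable

n = 100_000
# Draw X uniformly on [-1, 1] and square it.
ys = [random.uniform(-1.0, 1.0) ** 2 for _ in range(n)]


def empirical_cdf(sample, y):
    """Fraction of the sample that is less than or equal to y."""
    return sum(1 for v in sample if v <= y) / len(sample)


# Each line shows y, the simulated Pr(Y <= y), and the derived value sqrt(y).
for y in (0.04, 0.25, 0.64):
    print(y, empirical_cdf(ys, y), math.sqrt(y))
```

The two columns should agree to about two decimal places for a sample of this size.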

Linear functions are very useful transformations, and the p.d.f. of a linear func- 
tion of a continuous random variable is easy to derive. The proof of the following 
result is left to the reader in Exercise 5. 


Theorem 3.8.2  Linear Function. Suppose that X is a random variable for which the p.d.f. is f and that
Y = aX + b (a ≠ 0). Then the p.d.f. of Y is
\[
g(y) = \frac{1}{|a|}\, f\left(\frac{y - b}{a}\right) \quad \text{for } -\infty < y < \infty, \tag{3.8.1}
\]
and 0 otherwise. ∎

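Eq. (3.8.1) translates directly into code. The following sketch is an illustration under stated assumptions: the helper name `linear_transform_pdf` is hypothetical, and the uniform p.d.f. on [0, 1] together with a = −2 and b = 3 are arbitrary choices used only to exercise the formula.

```python
def linear_transform_pdf(f, a, b):
    """Return the p.d.f. g of Y = a*X + b given the p.d.f. f of X, following Eq. (3.8.1)."""
    if a == 0:
        raise ValueError("a must be nonzero")

    def g(y):
        return f((y - b) / a) / abs(a)

    return g


def uniform_pdf(x):
    """p.d.f. of the uniform distribution on [0, 1] (illustrative choice)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


g = linear_transform_pdf(uniform_pdf, a=-2.0, b=3.0)
print(g(2.0))  # y = 2 corresponds to x = 0.5, inside [0, 1]
print(g(0.5))  # y = 0.5 corresponds to x = 1.25, outside [0, 1]
```

With these choices Y = −2X + 3 is uniform on [1, 3], so g equals 1/2 on that interval and 0 elsewhere.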

The Probability Integral Transformation 


Example 3.8.5  Let X be a continuous random variable with p.d.f. f(x) = exp(−x) for x > 0 and 0
otherwise. The c.d.f. of X is F(x) = 1 − exp(−x) for x > 0 and 0 otherwise. If we let
F be the function r in the earlier results of this section, we can find the distribution
of Y = F(X). The c.d.f. of Y is, for 0 < y < 1,
\[
G(y) = \Pr(Y \le y) = \Pr(1 - \exp(-X) \le y) = \Pr(X \le -\log(1 - y))
     = F(-\log(1 - y)) = 1 - \exp(-[-\log(1 - y)]) = y,
\]
which is the c.d.f. of the uniform distribution on the interval [0, 1]. It follows that Y
has the uniform distribution on the interval [0, 1]. ◁


The result in Example 3.8.5 is quite general. 


Theorem 3.8.3  Probability Integral Transformation. Let X have a continuous c.d.f. F, and let Y = F(X).
(This transformation from X to Y is called the probability integral transformation.)
The distribution of Y is the uniform distribution on the interval [0, 1].


Proof  First, because F is the c.d.f. of a random variable, then $0 \le F(x) \le 1$ for
$-\infty < x < \infty$. Therefore, Pr(Y < 0) = Pr(Y > 1) = 0. Since F is continuous, the set
of x such that F(x) = y is a nonempty closed and bounded interval $[x_0, x_1]$ for each y
in the interval (0, 1). Let $F^{-1}(y)$ denote the lower endpoint $x_0$ of this interval, which
was called the y quantile of F in Definition 3.3.2. In this way, $Y \le y$ if and only if
$X \le x_1$. Let G denote the c.d.f. of Y. Then
\[
G(y) = \Pr(Y \le y) = \Pr(X \le x_1) = F(x_1) = y.
\]
Hence, G(y) = y for 0 < y < 1. Because this function is the c.d.f. of the uniform
distribution on the interval [0, 1], this uniform distribution is the distribution of Y. ∎


Because $\Pr(X = F^{-1}(Y)) = 1$ in the proof of Theorem 3.8.3, we have the following
corollary.


Corollary 3.8.1  Let Y have the uniform distribution on the interval [0, 1], and let F be a continuous
c.d.f. with quantile function $F^{-1}$. Then $X = F^{-1}(Y)$ has c.d.f. F. ∎


Theorem 3.8.3 and its corollary give us a method for transforming an arbitrary 
continuous random variable X into another random variable Z with any desired 
continuous distribution. To be specific, let X have a continuous c.d.f. F, and let G 
be another continuous c.d.f. Then Y = F(X) has the uniform distribution on the 
interval [0, 1] according to Theorem 3.8.3, and $Z = G^{-1}(Y)$ has the c.d.f. G according
to Corollary 3.8.1. Combining these, we see that $Z = G^{-1}[F(X)]$ has c.d.f. G.

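The two-step transformation $Z = G^{-1}[F(X)]$ can be illustrated numerically. The sketch below is not from the original text: it takes X with c.d.f. F(x) = 1 − exp(−x), as in Example 3.8.5, and assumes a hypothetical target c.d.f. $G(z) = z^2$ on [0, 1], whose quantile function is the square root. The sample size and seed are arbitrary.

```python
import math
import random

random.seed(1)  # arbitrary seed so that the run is repeatable

n = 100_000
# X has c.d.f. F(x) = 1 - exp(-x) for x > 0.
xs = [random.expovariate(1.0) for _ in range(n)]

# Probability integral transformation: Y = F(X) should look uniform on [0, 1].
ys = [1.0 - math.exp(-x) for x in xs]

# Hypothetical target c.d.f. G(z) = z**2 on [0, 1]; its quantile function is sqrt.
zs = [math.sqrt(y) for y in ys]


def fraction_at_most(sample, c):
    """Fraction of the sample that is less than or equal to c."""
    return sum(1 for v in sample if v <= c) / len(sample)


print(fraction_at_most(ys, 0.3))  # should be near 0.3
print(fraction_at_most(zs, 0.5))  # should be near G(0.5) = 0.25
```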

Simulation 


Pseudo-Random Numbers Most computer packages that do statistical analyses 
also produce what are called pseudo-random numbers. These numbers appear to 
have some of the properties that a random sample would have, even though they 
are generated by deterministic algorithms. The most fundamental of these programs 
are the ones that generate pseudo-random numbers that appear to have the uniform 
distribution on the interval [0, 1]. We shall refer to such functions as uniform pseudo- 
random number generators. The important features that a uniform pseudo-random 
number generator must have are the following. 

The numbers that it produces need to be spread somewhat uniformly over the
interval [0, 1], and they need to appear to be observed values of independent random
variables. This last feature is very complicated to word precisely. An example of a
sequence that does not appear to be observations of independent random variables
would be one that was perfectly evenly spaced. Another example would be one with
the following behavior: Suppose that we look at the sequence $X_1, X_2, \ldots$ one at a
time, and every time we find an $X_i > 0.5$, we write down the next number $X_{i+1}$. If the
subsequence of numbers that we write down is not spread approximately uniformly
over the interval [0, 1], then the original sequence does not look like observations
of independent random variables with the uniform distribution on the interval [0, 1].
The reason is that the conditional distribution of $X_{i+1}$ given that $X_i > 0.5$ is supposed
to be uniform over the interval [0, 1], according to independence.

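The check described above (record $X_{i+1}$ whenever $X_i > 0.5$ and see whether the recorded values look uniform) is easy to sketch. The code below is an illustration only; it uses Python's built-in generator as a stand-in for whatever uniform pseudo-random number generator is being examined, and the sequence length, the seed, and the choice of ten bins are arbitrary.

```python
import random

random.seed(2)  # arbitrary seed so that the run is repeatable

xs = [random.random() for _ in range(200_000)]

# Keep X_{i+1} whenever X_i > 0.5.
followers = [xs[i + 1] for i in range(len(xs) - 1) if xs[i] > 0.5]

# Count how many recorded values fall in each of ten equal-width bins of [0, 1).
counts = [0] * 10
for v in followers:
    counts[min(int(v * 10), 9)] += 1

proportions = [c / len(followers) for c in counts]
print([round(p, 3) for p in proportions])
```

If the generator behaves well, each proportion should be close to 0.1. Passing this one check does not by itself establish that a generator is adequate.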

Generating Pseudo-Random Numbers Having a Specified Distribution A uniform 
pseudo-random number generator can be used to generate values of a random 
variable Y having any specified continuous c.d.f. G. If a random variable X has the 
uniform distribution on the interval [0, 1] and if the quantile function $G^{-1}$ is defined
as before, then it follows from Corollary 3.8.1 that the c.d.f. of the random variable
$Y = G^{-1}(X)$ will be G. Hence, if a value of X is produced by a uniform pseudo-
random number generator, then the corresponding value of Y will have the desired
property. If n independent values $X_1, \ldots, X_n$ are produced by the generator, then
the corresponding values $Y_1, \ldots, Y_n$ will appear to form a random sample of size n
from the distribution with the c.d.f. G. 


Example 3.8.6  Generating Independent Values from a Specified p.d.f. Suppose that a uniform pseudo-
random number generator is to be used to generate three independent values from
the distribution for which the p.d.f. g is as follows:
\[
g(y) =
\begin{cases}
\frac{1}{2}(2 - y) & \text{for } 0 < y < 2, \\
0 & \text{otherwise.}
\end{cases}
\]
For 0 < y < 2, the c.d.f. G of the given distribution is
\[
G(y) = y - \frac{y^2}{4}.
\]
Also, for 0 < x < 1, the inverse function $y = G^{-1}(x)$ can be found by solving the
equation x = G(y) for y. The result is
\[
y = G^{-1}(x) = 2\left[1 - (1 - x)^{1/2}\right]. \tag{3.8.2}
\]
The next step is to generate three uniform pseudo-random numbers $x_1$, $x_2$, and $x_3$
using the generator. Suppose that the three generated values are
\[
x_1 = 0.4125, \quad x_2 = 0.0894, \quad x_3 = 0.8302.
\]
When these values of $x_1$, $x_2$, and $x_3$ are substituted successively into Eq. (3.8.2),
the values of y that are obtained are $y_1 = 0.47$, $y_2 = 0.09$, and $y_3 = 1.18$. These are
then treated as the observed values of three independent random variables with the
distribution for which the p.d.f. is g. ◁

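The substitution into Eq. (3.8.2) can be reproduced in a few lines. This is a minimal sketch rather than part of the original text; the function name `quantile` is introduced here for illustration, and the three uniform values are the ones quoted in Example 3.8.6.

```python
import math


def quantile(x):
    """G^{-1}(x) = 2[1 - (1 - x)^{1/2}] from Eq. (3.8.2), for 0 < x < 1."""
    return 2.0 * (1.0 - math.sqrt(1.0 - x))


uniforms = [0.4125, 0.0894, 0.8302]  # the three values quoted in the example
values = [round(quantile(x), 2) for x in uniforms]
print(values)  # [0.47, 0.09, 1.18]
```

In practice the list `uniforms` would come from a uniform pseudo-random number generator instead of being typed in.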

If G is a general c.d.f., there is a method similar to Corollary 3.8.1 that can be 
used to transform a uniform random variable into a random variable with c.d.f. G. 
See Exercise 12 in this section. There are other computer methods for generating 
values from certain specified distributions that are faster and more accurate than 
using the quantile function. These topics are discussed in the books by Kennedy and
Gentle (1980) and Rubinstein (1981). Chapter 12 of this text contains techniques and
examples that show how simulation can be used to solve statistical problems. 


General Function In general, if X has a continuous distribution and if Y =r(X), 
then it is not necessarily true that Y will also have a continuous distribution. For ex- 
ample, suppose that r(x) = c, where c is a constant, for all values of x in some interval 
a < x < b, and that Pr(a < X < b) > 0. Then Pr(Y = c) > 0. Since the distribution of Y
assigns positive probability to the value c, this distribution cannot be continuous. In 
order to derive the distribution of Y in a case like this, the c.d.f. of Y must be derived 
by applying methods like those described above. For certain functions r, however, 
the distribution of Y will be continuous; and it will then be possible to derive the 
p.d.f. of Y directly without first deriving its c.d.f. We shall develop this case in detail 
at the end of this section. 


Direct Derivation of the p.d.f. When r is One-to-One and Differentiable 


Example 3.8.7  Average Waiting Time. Consider Example 3.8.3 again. The p.d.f. g of Y can be
computed from G(y) = 1 − F(1/y) because F and 1/y both have derivatives at enough
places. We apply the chain rule for differentiation to obtain
\[
g(y) = \frac{dG(y)}{dy} = -\left.\frac{dF(x)}{dx}\right|_{x = 1/y}\left(-\frac{1}{y^2}\right)
     = f\left(\frac{1}{y}\right)\frac{1}{y^2},
\]
except at y = 0 and at those values of y such that F(x) is not differentiable at x = 1/y. ◁


Differentiable One-To-One Functions  The method used in Example 3.8.7 generalizes
to very arbitrary differentiable one-to-one functions. Before stating the general
result, we should recall some properties of differentiable one-to-one functions from
calculus. Let r be a differentiable one-to-one function on the open interval (a, b).
Then r is either strictly increasing or strictly decreasing. Because r is also continuous,
it will map the interval (a, b) to another open interval $(\alpha, \beta)$, called the image of
(a, b) under r. That is, for each $x \in (a, b)$, $r(x) \in (\alpha, \beta)$, and for each $y \in (\alpha, \beta)$ there is
$x \in (a, b)$ such that y = r(x), and this x is unique because r is one-to-one. So the inverse
s of r will exist on the interval $(\alpha, \beta)$, meaning that for $x \in (a, b)$ and $y \in (\alpha, \beta)$ we
have r(x) = y if and only if s(y) = x. The derivative of s will exist (possibly infinite),
and it is related to the derivative of r by
\[
\frac{ds(y)}{dy} = \left(\left.\frac{dr(x)}{dx}\right|_{x = s(y)}\right)^{-1}.
\]


Theorem 3.8.4  Let X be a random variable for which the p.d.f. is f and for which Pr(a < X < b) = 1.
(Here, a and/or b can be either finite or infinite.) Let Y = r(X), and suppose that r(x)
is differentiable and one-to-one for a < x < b. Let $(\alpha, \beta)$ be the image of the interval
(a, b) under the function r. Let s(y) be the inverse function of r(x) for $\alpha < y < \beta$.
Then the p.d.f. g of Y is
\[
g(y) =
\begin{cases}
f[s(y)]\left|\dfrac{ds(y)}{dy}\right| & \text{for } \alpha < y < \beta, \\
0 & \text{otherwise.}
\end{cases}
\tag{3.8.3}
\]
Proof  If r is increasing, then s is increasing, and for each $y \in (\alpha, \beta)$,
\[
G(y) = \Pr(Y \le y) = \Pr[r(X) \le y] = \Pr[X \le s(y)] = F[s(y)].
\]
It follows that G is differentiable at all y where both s is differentiable and where
F(x) is differentiable at x = s(y). Using the chain rule for differentiation, it follows
that the p.d.f. g(y) for $\alpha < y < \beta$ will be
\[
g(y) = \frac{dG(y)}{dy} = \frac{dF[s(y)]}{dy} = f[s(y)]\,\frac{ds(y)}{dy}. \tag{3.8.4}
\]
Because s is increasing, ds(y)/dy is positive; hence, it equals |ds(y)/dy| and Eq.
(3.8.4) implies Eq. (3.8.3). Similarly, if r is decreasing, then s is decreasing, and for
each $y \in (\alpha, \beta)$,
\[
G(y) = \Pr[r(X) \le y] = \Pr[X \ge s(y)] = 1 - F[s(y)].
\]
Using the chain rule again, we differentiate G to get the p.d.f. of Y:
\[
g(y) = \frac{dG(y)}{dy} = -f[s(y)]\,\frac{ds(y)}{dy}. \tag{3.8.5}
\]
Since s is strictly decreasing, ds(y)/dy is negative so that −ds(y)/dy equals
|ds(y)/dy|. It follows that Eq. (3.8.5) implies Eq. (3.8.3). ∎


Example 3.8.8  Microbial Growth. A popular model for populations of microscopic organisms in
large environments is exponential growth. At time 0, suppose that v organisms are
introduced into a large tank of water, and let X be the rate of growth. After time
t, we would predict a population size of $ve^{Xt}$. Assume that X is unknown but has a
continuous distribution with p.d.f.
\[
f(x) =
\begin{cases}
3(1 - x)^2 & \text{for } 0 < x < 1, \\
0 & \text{otherwise.}
\end{cases}
\]
We are interested in the distribution of $Y = ve^{Xt}$ for known values of v and t. For
concreteness, let v = 10 and t = 5, so that $r(x) = 10e^{5x}$.

In this example, Pr(0 < X < 1) = 1 and r is a continuous and strictly increasing
function of x for 0 < x < 1. As x varies over the interval (0, 1), it is found that
y = r(x) varies over the interval $(10, 10e^5)$. Furthermore, for $10 < y < 10e^5$, the
inverse function is s(y) = log(y/10)/5. Hence, for $10 < y < 10e^5$,
\[
\frac{ds(y)}{dy} = \frac{1}{5y}.
\]
It follows from Eq. (3.8.3) that g(y) will be
\[
g(y) =
\begin{cases}
\dfrac{3(1 - \log(y/10)/5)^2}{5y} & \text{for } 10 < y < 10e^5, \\
0 & \text{otherwise.}
\end{cases}
\]
◁

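The recipe $g(y) = f[s(y)]\,|ds(y)/dy|$ from Eq. (3.8.3) can be written out for the microbial growth example. The sketch below is an illustration, not part of the original text; the function names and the midpoint-rule check are choices made here. As a sanity check, it confirms numerically that the resulting g integrates to approximately 1 over $(10, 10e^5)$.

```python
import math


def f(x):
    """p.d.f. of the growth rate X in the example: 3(1 - x)^2 on (0, 1)."""
    return 3.0 * (1.0 - x) ** 2 if 0.0 < x < 1.0 else 0.0


def s(y):
    """Inverse of r(x) = 10 * exp(5x)."""
    return math.log(y / 10.0) / 5.0


def g(y):
    """p.d.f. of Y = r(X) from Eq. (3.8.3): f(s(y)) * |ds(y)/dy|, with ds/dy = 1/(5y)."""
    if not 10.0 < y < 10.0 * math.exp(5.0):
        return 0.0
    return f(s(y)) * abs(1.0 / (5.0 * y))


# Midpoint-rule check that g integrates to approximately 1 over (10, 10e^5).
lo, hi, steps = 10.0, 10.0 * math.exp(5.0), 200_000
h = (hi - lo) / steps
total = sum(g(lo + (k + 0.5) * h) for k in range(steps)) * h
print(total)
```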

Summary 


We learned several methods for determining the distribution of a function of a
random variable. For a random variable X with a continuous distribution having
p.d.f. f, if r is strictly increasing or strictly decreasing with differentiable inverse
s (i.e., s(r(x)) = x and s is differentiable), then the p.d.f. of Y = r(X) is g(y) =
f(s(y))|ds(y)/dy|. A special transformation allows us to transform a random variable
X with the uniform distribution on the interval [0, 1] into a random variable Y with an
arbitrary continuous c.d.f. G by $Y = G^{-1}(X)$. This method can be used in conjunction
with a uniform pseudo-random number generator to generate random variables with
arbitrary continuous distributions.


Exercises 


1. Suppose that the p.d.f. of a random variable X is as
follows:
\[
f(x) =
\begin{cases}
3x^2 & \text{for } 0 < x < 1, \\
0 & \text{otherwise.}
\end{cases}
\]
Also, suppose that $Y = 1 - X^2$. Determine the p.d.f. of Y.


2. Suppose that a random variable X can have each of the 
seven values —3, —2, —1, 0, 1, 2, 3 with equal probability. 
Determine the p.f. of $Y = X^2 - X$.


3. Suppose that the p.d.f. of a random variable X is as
follows:
\[
f(x) =
\begin{cases}
\frac{1}{2}x & \text{for } 0 < x < 2, \\
0 & \text{otherwise.}
\end{cases}
\]
Also, suppose that Y = X(2 − X). Determine the c.d.f. and
the p.d.f. of Y.


4. Suppose that the p.d.f. of X is as given in Exercise 3.
Determine the p.d.f. of $Y = 4 - X^3$.


5. Prove Theorem 3.8.2. (Hint: Either apply Theorem
3.8.4 or first compute the c.d.f. separately for a > 0 and
a < 0.)


6. Suppose that the p.d.f. of X is as given in Exercise 3. 
Determine the p.d.f. of Y = 3X +2. 


7. Suppose that a random variable X has the uniform
distribution on the interval [0, 1]. Determine the p.d.f. of
(a) $X^2$, (b) $-X^3$, and (c) $X^{1/2}$.


8. Suppose that the p.d.f. of X is as follows:
\[
f(x) =
\begin{cases}
e^{-x} & \text{for } x > 0, \\
0 & \text{for } x \le 0.
\end{cases}
\]
Determine the p.d.f. of $Y = X^{1/2}$.


9. Suppose that X has the uniform distribution on the
interval [0, 1]. Construct a random variable Y = r(X) for
which the p.d.f. will be
\[
g(y) =
\begin{cases}
\frac{3}{8}y^2 & \text{for } 0 < y < 2, \\
0 & \text{otherwise.}
\end{cases}
\]


10. Let X be a random variable for which the p.d.f. f is as
given in Exercise 3. Construct a random variable Y = r(X)
for which the p.d.f. g is as given in Exercise 9.


11. Explain how to use a uniform pseudo-random number
generator to generate four independent values from a
distribution for which the p.d.f. is
\[
g(y) =
\begin{cases}
\frac{1}{2}(2y + 1) & \text{for } 0 < y < 1, \\
0 & \text{otherwise.}
\end{cases}
\]


12. Let F be an arbitrary c.d.f. (not necessarily discrete,
not necessarily continuous, not necessarily either). Let
$F^{-1}$ be the quantile function from Definition 3.3.2. Let X
have the uniform distribution on the interval [0, 1]. Define
$Y = F^{-1}(X)$. Prove that the c.d.f. of Y is F. Hint: Compute
$\Pr(Y \le y)$ in two cases. First, do the case in which y is the
unique value of x such that F(x) = F(y). Second, do the
case in which there is an entire interval of x values such
that F(x) = F(y).


13. Let Z be the rate at which customers are served in a
queue. Assume that Z has the p.d.f.
\[
f(z) =
\begin{cases}
2e^{-2z} & \text{for } z > 0, \\
0 & \text{otherwise.}
\end{cases}
\]
Find the p.d.f. of the average waiting time T = 1/Z.


14. Let X have the uniform distribution on the interval 
[a, b], and let c > 0. Prove that cX +d has the uniform 
distribution on the interval [ca + d, cb + d]. 


15. Most of the calculation in Example 3.8.4 is quite general.
Suppose that X has a continuous distribution with
p.d.f. f. Let $Y = X^2$, and show that the p.d.f. of Y is
\[
g(y) = \frac{1}{2y^{1/2}}\left[f(y^{1/2}) + f(-y^{1/2})\right].
\]


16. In Example 3.8.4, the p.d.f. of $Y = X^2$ is much larger
for values of y near 0 than for values of y near 1 despite
the fact that the p.d.f. of X is flat. Give an intuitive reason
why this occurs in this example.


17. An insurance agent sells a policy which has a $100 deductible
and a $5000 cap. This means that when the policy
holder files a claim, the policy holder must pay the first
$100. After the first $100, the insurance company pays the
rest of the claim up to a maximum payment of $5000. Any
excess must be paid by the policy holder. Suppose that the
dollar amount X of a claim has a continuous distribution
with p.d.f. f(x) = 1/(1 + x)^2 for x > 0 and 0 otherwise. Let
Y be the amount that the insurance company has to pay
on the claim.

a. Write Y as a function of X, i.e., Y = r(X).

b. Find the c.d.f. of Y.

c. Explain why Y has neither a continuous nor a discrete
distribution.




3.9 Functions of Two or More Random Variables 


When we observe data consisting of the values of several random variables, we 
need to summarize the observed values in order to be able to focus on the infor- 
mation in the data. Summarizing consists of constructing one or a few functions 
of the random variables that capture the bulk of the information. In this section, 
we describe the techniques needed to determine the distribution of a function of 
two or more random variables. 


Random Variables with a Discrete Joint Distribution 


Example 3.9.1  Bull Market. Three different investment firms are trying to advertise their mutual
funds by showing how many perform better than a recognized standard. Each company
has 10 funds, so there are 30 in total. Suppose that the first 10 funds belong to the
first firm, the next 10 to the second firm, and the last 10 to the third firm. Let $X_i = 1$
if fund i performs better than the standard and $X_i = 0$ otherwise, for i = 1, ..., 30.
Then, we are interested in the three functions
\[
\begin{aligned}
Y_1 &= X_1 + \cdots + X_{10}, \\
Y_2 &= X_{11} + \cdots + X_{20}, \\
Y_3 &= X_{21} + \cdots + X_{30}.
\end{aligned}
\]
We would like to be able to determine the joint distribution of $Y_1$, $Y_2$, and $Y_3$ from
the joint distribution of $X_1, \ldots, X_{30}$. ◁


The general method for solving problems like those of Example 3.9.1 is a straight- 
forward extension of Theorem 3.8.1. 


Theorem 3.9.1  Functions of Discrete Random Variables. Suppose that n random variables $X_1, \ldots, X_n$
have a discrete joint distribution for which the joint p.f. is f, and that m functions
$Y_1, \ldots, Y_m$ of these n random variables are defined as follows:
\[
\begin{aligned}
Y_1 &= r_1(X_1, \ldots, X_n), \\
Y_2 &= r_2(X_1, \ldots, X_n), \\
&\ \ \vdots \\
Y_m &= r_m(X_1, \ldots, X_n).
\end{aligned}
\]
For given values $y_1, \ldots, y_m$ of the m random variables $Y_1, \ldots, Y_m$, let A denote the
set of all points $(x_1, \ldots, x_n)$ such that
\[
\begin{aligned}
r_1(x_1, \ldots, x_n) &= y_1, \\
r_2(x_1, \ldots, x_n) &= y_2, \\
&\ \ \vdots \\
r_m(x_1, \ldots, x_n) &= y_m.
\end{aligned}
\]
Then the value of the joint p.f. g of $Y_1, \ldots, Y_m$ is specified at the point $(y_1, \ldots, y_m)$
by the relation
\[
g(y_1, \ldots, y_m) = \sum_{(x_1, \ldots, x_n) \in A} f(x_1, \ldots, x_n). \qquad ∎
\]

Example 3.9.2  Bull Market. Recall the situation in Example 3.9.1. Suppose that we want the joint
p.f. g of $(Y_1, Y_2, Y_3)$ at the point (3, 5, 8). That is, we want $g(3, 5, 8) = \Pr(Y_1 = 3, Y_2 =
5, Y_3 = 8)$. The set A as defined in Theorem 3.9.1 is
\[
A = \{(x_1, \ldots, x_{30}) : x_1 + \cdots + x_{10} = 3,\ x_{11} + \cdots + x_{20} = 5,\ x_{21} + \cdots + x_{30} = 8\}.
\]
Two of the points in the set A are
(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0),
(1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1).

A counting argument like those developed in Sec. 1.8 can be used to discover that
there are
\[
\binom{10}{3}\binom{10}{5}\binom{10}{8} = 1{,}360{,}800
\]
points in A. Unless the joint distribution of $X_1, \ldots, X_{30}$ has some simple structure,
it will be extremely tedious to compute g(3, 5, 8) as well as most other values of g.
For example, if all of the $2^{30}$ possible values of the vector $(X_1, \ldots, X_{30})$ are equally
likely, then
\[
g(3, 5, 8) = \frac{1{,}360{,}800}{2^{30}} = 1.27 \times 10^{-3}.
\]
◁

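The count of points in A and the resulting value of g(3, 5, 8) are quick to verify. The following lines are a small illustrative check, not part of the original text; the variable names are introduced here.

```python
from math import comb

# Number of 0/1 vectors with 3 ones among x_1..x_10, 5 among x_11..x_20,
# and 8 among x_21..x_30.
points_in_A = comb(10, 3) * comb(10, 5) * comb(10, 8)

# If all 2**30 vectors are equally likely, g(3, 5, 8) is the fraction lying in A.
g_358 = points_in_A / 2**30

print(points_in_A)  # 1360800
print(g_358)
```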

The next result gives an important example of a function of discrete random variables. 


Theorem 3.9.2  Binomial and Bernoulli Distributions. Assume that $X_1, \ldots, X_n$ are i.i.d. random variables
having the Bernoulli distribution with parameter p. Let $Y = X_1 + \cdots + X_n$.
Then Y has the binomial distribution with parameters n and p.


Proof  It is clear that Y = y if and only if exactly y of $X_1, \ldots, X_n$ equal 1 and the
other n − y equal 0. There are $\binom{n}{y}$ distinct possible values for the vector $(X_1, \ldots, X_n)$
that have y ones and n − y zeros. Each such vector has probability $p^y(1 - p)^{n-y}$ of
being observed; hence the probability that Y = y is the sum of the probabilities of
those vectors, namely, $\binom{n}{y}p^y(1 - p)^{n-y}$ for y = 0, ..., n. From Definition 3.1.7, we
see that Y has the binomial distribution with parameters n and p. ∎

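The argument in the proof can be mirrored by brute force for a small n. The sketch below is an illustration only; the function name `pf_of_sum` and the values n = 4 and p = 1/3 are arbitrary choices. It sums the probabilities of all 0/1 vectors with a given number of ones and compares the result with the binomial p.f.

```python
from fractions import Fraction
from itertools import product
from math import comb


def pf_of_sum(n, p):
    """p.f. of Y = X_1 + ... + X_n for i.i.d. Bernoulli(p), by enumerating all 2**n vectors."""
    g = {y: Fraction(0) for y in range(n + 1)}
    for xs in product((0, 1), repeat=n):
        y = sum(xs)
        # Each vector with y ones and n - y zeros has probability p^y (1 - p)^(n - y).
        g[y] += p**y * (1 - p) ** (n - y)
    return g


n, p = 4, Fraction(1, 3)  # illustrative values
g = pf_of_sum(n, p)
binomial = {y: comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(n + 1)}
print(g == binomial)
```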

Example 3.9.3  Sampling Parts. Suppose that two machines are producing parts. For i = 1, 2, the
probability is $p_i$ that machine i will produce a defective part, and we shall assume
that all parts from both machines are independent. Assume that the first $n_1$ parts
are produced by machine 1 and that the last $n_2$ parts are produced by machine 2,
with $n = n_1 + n_2$ being the total number of parts sampled. Let $X_i = 1$ if the ith part
is defective and $X_i = 0$ otherwise for i = 1, ..., n. Define $Y_1 = X_1 + \cdots + X_{n_1}$ and
$Y_2 = X_{n_1 + 1} + \cdots + X_n$. These are the total numbers of defective parts produced by
each machine. The assumptions stated in the problem allow us to conclude that $Y_1$ and
$Y_2$ are independent according to the note about separate functions of independent
random variables on page 140. Furthermore, Theorem 3.9.2 says that $Y_j$ has the
binomial distribution with parameters $n_j$ and $p_j$ for j = 1, 2. These two marginal
distributions, together with the fact that $Y_1$ and $Y_2$ are independent, give the entire
joint distribution. So, for example, if g is the joint p.f. of $Y_1$ and $Y_2$, we can compute
\[
g(y_1, y_2) = \binom{n_1}{y_1} p_1^{y_1}(1 - p_1)^{n_1 - y_1} \binom{n_2}{y_2} p_2^{y_2}(1 - p_2)^{n_2 - y_2},
\]
for $y_1 = 0, \ldots, n_1$ and $y_2 = 0, \ldots, n_2$, while $g(y_1, y_2) = 0$ otherwise. There is no need
to find a set A as in Example 3.9.2, because of the simplifying structure of the joint
distribution of $X_1, \ldots, X_n$. ◁


Figure 3.24  The set $A_y$ in Example 3.9.4 and in the proof of Theorem 3.9.4.


Random Variables with a Continuous Joint Distribution 


Example 3.9.4  Total Service Time. Suppose that the first two customers in a queue plan to leave
together. Let $X_i$ be the time it takes to serve customer i for i = 1, 2. Suppose also that
$X_1$ and $X_2$ are independent random variables with common distribution having p.d.f.
$f(x) = 2e^{-2x}$ for x > 0 and 0 otherwise. Since the customers will leave together, they
are interested in the total time it takes to serve both of them, namely, $Y = X_1 + X_2$.
We can now find the p.d.f. of Y.

For each y, let
\[
A_y = \{(x_1, x_2) : x_1 + x_2 \le y\}.
\]
Then $Y \le y$ if and only if $(X_1, X_2) \in A_y$. The set $A_y$ is pictured in Fig. 3.24. If we let
G(y) denote the c.d.f. of Y, then, for y > 0,
\[
\begin{aligned}
G(y) = \Pr((X_1, X_2) \in A_y)
  &= \int_0^y \int_0^{y - x_2} 4e^{-2x_1 - 2x_2}\, dx_1\, dx_2 \\
  &= \int_0^y 2e^{-2x_2}\left[1 - e^{-2(y - x_2)}\right] dx_2
   = \int_0^y \left[2e^{-2x_2} - 2e^{-2y}\right] dx_2 \\
  &= 1 - e^{-2y} - 2ye^{-2y}.
\end{aligned}
\]
Taking the derivative of G(y) with respect to y, we get the p.d.f.
\[
g(y) = \frac{d}{dy}\left[1 - e^{-2y} - 2ye^{-2y}\right] = 4ye^{-2y},
\]
for y > 0 and 0 otherwise. ◁

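The c.d.f. obtained in Example 3.9.4 can be compared with a simulation of the total service time. The sketch below is an illustration, not part of the original text; the sample size and seed are arbitrary, and `empirical_cdf` is a helper introduced here.

```python
import math
import random

random.seed(3)  # arbitrary seed so that the run is repeatable

n = 200_000
# X_1 and X_2 are independent with p.d.f. 2*exp(-2x); expovariate takes the rate, here 2.
ys = [random.expovariate(2.0) + random.expovariate(2.0) for _ in range(n)]


def G(y):
    """c.d.f. of Y derived in the example: 1 - exp(-2y) - 2y*exp(-2y) for y > 0."""
    if y <= 0:
        return 0.0
    return 1.0 - math.exp(-2.0 * y) - 2.0 * y * math.exp(-2.0 * y)


def empirical_cdf(sample, y):
    """Fraction of the sample that is less than or equal to y."""
    return sum(1 for v in sample if v <= y) / len(sample)


# Each line shows y, the simulated Pr(Y <= y), and the derived G(y).
for y in (0.5, 1.0, 2.0):
    print(y, empirical_cdf(ys, y), G(y))
```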

The transformation in Example 3.9.4 is an example of a brute-force method that is
always available for finding the distribution of a function of several random variables;
however, it might be difficult to apply in individual cases.


Theorem 3.9.3  Brute-Force Distribution of a Function. Suppose that the joint p.d.f. of $\mathbf{X} = (X_1, \ldots, X_n)$
is $f(\mathbf{x})$ and that $Y = r(\mathbf{X})$. For each real number y, define $A_y = \{\mathbf{x} : r(\mathbf{x}) \le y\}$. Then
the c.d.f. G(y) of Y is
\[
G(y) = \int \cdots \int_{A_y} f(\mathbf{x})\, d\mathbf{x}. \tag{3.9.1}
\]


Proof  From the definition of c.d.f.,
\[
G(y) = \Pr(Y \le y) = \Pr[r(\mathbf{X}) \le y] = \Pr(\mathbf{X} \in A_y),
\]
which equals the right side of Eq. (3.9.1) by Definition 3.7.3. ∎


If the distribution of Y also is continuous, then the p.d.f. of Y can be found by
differentiating the c.d.f. G(y).
A popular special case of Theorem 3.9.3 is the following.


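Theorem 3.9.4  Linear Function of Two Random Variables. Let $X_1$ and $X_2$ have joint p.d.f. $f(x_1, x_2)$,
and let $Y = a_1X_1 + a_2X_2 + b$ with $a_1 \ne 0$. Then Y has a continuous distribution whose
p.d.f. is
\[
g(y) = \int_{-\infty}^{\infty} f\left(\frac{y - b - a_2x_2}{a_1},\, x_2\right) \frac{1}{|a_1|}\, dx_2. \tag{3.9.2}
\]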


Proof First, we shall find the c.d.f. $G$ of $Y$ whose derivative we will see is the function
$g$ in Eq. (3.9.2). For each $y$, let $A_y = \{(x_1, x_2) : a_1x_1 + a_2x_2 + b \le y\}$. The set $A_y$ has
the same general form as the set in Fig. 3.24. We shall write the integral over the set
$A_y$ with $x_2$ in the outer integral and $x_1$ in the inner integral. Assume that $a_1 > 0$. The
other case is similar. According to Theorem 3.9.3,

$$G(y) = \iint_{A_y} f(x_1, x_2)\, dx_1\, dx_2 = \int_{-\infty}^{\infty} \int_{-\infty}^{(y - b - a_2x_2)/a_1} f(x_1, x_2)\, dx_1\, dx_2. \tag{3.9.3}$$

For the inner integral, perform the change of variable $z = a_1x_1 + a_2x_2 + b$ whose
inverse is $x_1 = (z - b - a_2x_2)/a_1$, so that $dx_1 = dz/a_1$. The inner integral, after this
change of variable, becomes

$$\int_{-\infty}^{y} f\left(\frac{z - b - a_2x_2}{a_1},\, x_2\right) \frac{1}{a_1}\, dz.$$

We can now substitute this expression for the inner integral into Eq. (3.9.3):

$$\begin{aligned}
G(y) &= \int_{-\infty}^{\infty} \int_{-\infty}^{y} f\left(\frac{z - b - a_2x_2}{a_1},\, x_2\right) \frac{1}{a_1}\, dz\, dx_2 \\
&= \int_{-\infty}^{y} \int_{-\infty}^{\infty} f\left(\frac{z - b - a_2x_2}{a_1},\, x_2\right) \frac{1}{a_1}\, dx_2\, dz.
\end{aligned} \tag{3.9.4}$$

Let $g(z)$ denote the inner integral on the far right side of Eq. (3.9.4). Then we have
$G(y) = \int_{-\infty}^{y} g(z)\, dz$, whose derivative is $g(y)$, the function in Eq. (3.9.2). ∎


The special case of Theorem 3.9.4 in which $X_1$ and $X_2$ are independent, $a_1 = a_2 = 1$,
and $b = 0$ is called convolution.


Definition 3.9.1
Convolution. Let $X_1$ and $X_2$ be independent continuous random variables and let
$Y = X_1 + X_2$. The distribution of $Y$ is called the convolution of the distributions of
$X_1$ and $X_2$. The p.d.f. of $Y$ is sometimes called the convolution of the p.d.f.'s of $X_1$ and
$X_2$.


If we let the p.d.f. of $X_i$ be $f_i$ for $i = 1, 2$ in Definition 3.9.1, then Theorem 3.9.4 (with
$a_1 = a_2 = 1$ and $b = 0$) says that the p.d.f. of $Y = X_1 + X_2$ is

$$g(y) = \int_{-\infty}^{\infty} f_1(y - z) f_2(z)\, dz. \tag{3.9.5}$$

Equivalently, by switching the names of $X_1$ and $X_2$, we obtain the alternative form
for the convolution:

$$g(y) = \int_{-\infty}^{\infty} f_1(z) f_2(y - z)\, dz. \tag{3.9.6}$$

The p.d.f. found in Example 3.9.4 is the special case of (3.9.5) with $f_1(x) = f_2(x) =
2e^{-2x}$ for $x > 0$ and 0 otherwise.


Example 3.9.5
An Investment Portfolio. Suppose that an investor wants to purchase both stocks and
bonds. Let $X_1$ be the value of the stocks at the end of one year, and let $X_2$ be the
value of the bonds at the end of one year. Suppose that $X_1$ and $X_2$ are independent.
Let $X_1$ have the uniform distribution on the interval [1000, 4000], and let $X_2$ have the
uniform distribution on the interval [800, 1200]. The sum $Y = X_1 + X_2$ is the value at
the end of the year of the portfolio consisting of both the stocks and the bonds. We
shall find the p.d.f. of $Y$. The function $f_1(z) f_2(y - z)$ in Eq. (3.9.6) is

$$f_1(z) f_2(y - z) = \begin{cases}
8.333 \times 10^{-7} & \text{for } 1000 \le z \le 4000 \text{ and } 800 \le y - z \le 1200, \\
0 & \text{otherwise.}
\end{cases} \tag{3.9.7}$$

We need to integrate the function in Eq. (3.9.7) over $z$ for each value of $y$ to get
the marginal p.d.f. of $Y$. It is helpful to look at a graph of the set of $(y, z)$ pairs for
which the function in Eq. (3.9.7) is positive. Figure 3.25 shows the region shaded. For
$1800 < y \le 2200$, we must integrate $z$ from 1000 to $y - 800$. For $2200 < y \le 4800$, we
must integrate $z$ from $y - 1200$ to $y - 800$. For $4800 < y < 5200$, we must integrate $z$
from $y - 1200$ to 4000. Since the function in Eq. (3.9.7) is constant when it is positive,
the integral equals the constant times the length of the interval of $z$ values. So, the
p.d.f. of $Y$ is

$$g(y) = \begin{cases}
8.333 \times 10^{-7}(y - 1800) & \text{for } 1800 < y \le 2200, \\
3.333 \times 10^{-4} & \text{for } 2200 < y \le 4800, \\
8.333 \times 10^{-7}(5200 - y) & \text{for } 4800 < y < 5200, \\
0 & \text{otherwise.}
\end{cases}$$ ◁
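As a sanity check on the piecewise p.d.f. of the portfolio value, the sketch below evaluates the convolution integral in Eq. (3.9.6) with a midpoint rule and compares it with the closed form. It is an illustration under stated assumptions: the function names and grid sizes are not from the text.

```python
def f1(z):
    """Uniform p.d.f. on [1000, 4000] (value of the stocks)."""
    return 1.0 / 3000.0 if 1000.0 <= z <= 4000.0 else 0.0

def f2(z):
    """Uniform p.d.f. on [800, 1200] (value of the bonds)."""
    return 1.0 / 400.0 if 800.0 <= z <= 1200.0 else 0.0

def g_closed_form(y):
    """Piecewise p.d.f. of Y = X1 + X2 derived in the example."""
    c = 1.0 / (3000.0 * 400.0)  # about 8.333e-7
    if 1800.0 < y <= 2200.0:
        return c * (y - 1800.0)
    if 2200.0 < y <= 4800.0:
        return c * 400.0        # about 3.333e-4
    if 4800.0 < y < 5200.0:
        return c * (5200.0 - y)
    return 0.0

def g_numeric(y, steps=30000):
    """Midpoint rule for the integral of f1(z)*f2(y - z) over 1000 <= z <= 4000."""
    h = 3000.0 / steps
    total = 0.0
    for i in range(steps):
        z = 1000.0 + (i + 0.5) * h
        total += f1(z) * f2(y - z)
    return total * h
```

Evaluating both functions at one point in each of the three pieces (for instance 2000, 3000, and 5000) should give matching values, and the closed form should integrate to 1 over (1800, 5200).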


Figure 3.25 The region where the function in Eq. (3.9.7) is positive.


As another example of the brute-force method, we consider the largest and
smallest observations in a random sample. These functions give an idea of how spread
out the sample is. For example, meteorologists often report record high and low
temperatures for specific days as well as record high and low rainfalls for months
and years.


Example 3.9.6


Maximum and Minimum of a Random Sample. Suppose that $X_1, \ldots, X_n$ form a random
sample of size $n$ from a distribution for which the p.d.f. is $f$ and the c.d.f. is $F$. The
largest value $Y_n$ and the smallest value $Y_1$ in the random sample are defined as follows:

$$\begin{aligned}
Y_n &= \max\{X_1, \ldots, X_n\}, \\
Y_1 &= \min\{X_1, \ldots, X_n\}.
\end{aligned} \tag{3.9.8}$$

Consider $Y_n$ first. Let $G_n$ stand for its c.d.f., and let $g_n$ be its p.d.f. For every given
value of $y$ ($-\infty < y < \infty$),

$$\begin{aligned}
G_n(y) &= \Pr(Y_n \le y) = \Pr(X_1 \le y, X_2 \le y, \ldots, X_n \le y) \\
&= \Pr(X_1 \le y) \Pr(X_2 \le y) \cdots \Pr(X_n \le y) \\
&= F(y) F(y) \cdots F(y) = [F(y)]^n,
\end{aligned}$$

where the third equality follows from the fact that the $X_i$ are independent and
the fourth follows from the fact that all of the $X_i$ have the same c.d.f. $F$. Thus,
$G_n(y) = [F(y)]^n$.

Now, $g_n$ can be determined by differentiating the c.d.f. $G_n$. The result is

$$g_n(y) = n[F(y)]^{n-1} f(y) \quad \text{for } -\infty < y < \infty.$$

Next, consider $Y_1$ with c.d.f. $G_1$ and p.d.f. $g_1$. For every given value of $y$ ($-\infty <
y < \infty$),

$$\begin{aligned}
G_1(y) &= \Pr(Y_1 \le y) = 1 - \Pr(Y_1 > y) \\
&= 1 - \Pr(X_1 > y, X_2 > y, \ldots, X_n > y) \\
&= 1 - \Pr(X_1 > y) \Pr(X_2 > y) \cdots \Pr(X_n > y) \\
&= 1 - [1 - F(y)][1 - F(y)] \cdots [1 - F(y)] \\
&= 1 - [1 - F(y)]^n.
\end{aligned}$$

Thus, $G_1(y) = 1 - [1 - F(y)]^n$.

Then $g_1$ can be determined by differentiating the c.d.f. $G_1$. The result is

$$g_1(y) = n[1 - F(y)]^{n-1} f(y) \quad \text{for } -\infty < y < \infty.$$


Figure 3.26 The p.d.f. of the uniform distribution on the interval [0, 1] together with
the p.d.f.'s of the minimum and maximum of samples of size n = 5. The p.d.f. of
the range of a sample of size n = 5 (see Example 3.9.7) is also included. (Curves
shown: single random variable, minimum of 5, maximum of 5, range of 5.)


Figure 3.26 shows the p.d.f. of the uniform distribution on the interval [0, 1]
together with the p.d.f.'s of $Y_1$ and $Y_n$ for the case $n = 5$. It also shows the p.d.f. of
$Y_5 - Y_1$, which will be derived in Example 3.9.7. Notice that the p.d.f. of $Y_1$ is highest
near 0 and lowest near 1, while the opposite is true of the p.d.f. of $Y_n$, as one would
expect.

Finally, we shall determine the joint distribution of $Y_1$ and $Y_n$. For every pair
of values $(y_1, y_n)$ such that $-\infty < y_1 < y_n < \infty$, the event $\{Y_1 \le y_1\} \cap \{Y_n \le y_n\}$ is the
same as $\{Y_n \le y_n\} \cap \{Y_1 > y_1\}^c$. If $G$ denotes the bivariate joint c.d.f. of $Y_1$ and $Y_n$, then

$$\begin{aligned}
G(y_1, y_n) &= \Pr(Y_1 \le y_1 \text{ and } Y_n \le y_n) \\
&= \Pr(Y_n \le y_n) - \Pr(Y_n \le y_n \text{ and } Y_1 > y_1) \\
&= \Pr(Y_n \le y_n) - \Pr(y_1 < X_1 \le y_n, y_1 < X_2 \le y_n, \ldots, y_1 < X_n \le y_n) \\
&= G_n(y_n) - \prod_{i=1}^{n} \Pr(y_1 < X_i \le y_n) \\
&= [F(y_n)]^n - [F(y_n) - F(y_1)]^n.
\end{aligned}$$

The bivariate joint p.d.f. $g$ of $Y_1$ and $Y_n$ can be found from the relation

$$g(y_1, y_n) = \frac{\partial^2 G(y_1, y_n)}{\partial y_1\, \partial y_n}.$$

Thus, for $-\infty < y_1 < y_n < \infty$,

$$g(y_1, y_n) = n(n - 1)[F(y_n) - F(y_1)]^{n-2} f(y_1) f(y_n). \tag{3.9.9}$$

Also, for all other values of $y_1$ and $y_n$, $g(y_1, y_n) = 0$. ◁
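The three c.d.f.'s derived in this example are easy to compare with simulated samples. The sketch below is illustrative only: it assumes the uniform distribution on [0, 1] for $F$, and the function names, seed, and number of trials are arbitrary choices rather than anything prescribed by the text.

```python
import random

def cdf_max(F, n, y):
    """G_n(y) = [F(y)]^n."""
    return F(y) ** n

def cdf_min(F, n, y):
    """G_1(y) = 1 - [1 - F(y)]^n."""
    return 1.0 - (1.0 - F(y)) ** n

def joint_cdf(F, n, y1, yn):
    """G(y1, yn) = [F(yn)]^n - [F(yn) - F(y1)]^n, valid for y1 < yn."""
    return F(yn) ** n - (F(yn) - F(y1)) ** n

def uniform_cdf(x):
    """c.d.f. of the uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, x))

def empirical(n, y1, yn, trials=100_000, seed=0):
    """Relative frequencies of {max <= yn}, {min <= y1}, and both events together."""
    rng = random.Random(seed)
    hits_max = hits_min = hits_joint = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        lo, hi = min(xs), max(xs)
        hits_max += hi <= yn
        hits_min += lo <= y1
        hits_joint += (lo <= y1 and hi <= yn)
    return hits_max / trials, hits_min / trials, hits_joint / trials
```

For example, `empirical(5, 0.2, 0.8)` can be compared with `cdf_max(uniform_cdf, 5, 0.8)`, `cdf_min(uniform_cdf, 5, 0.2)`, and `joint_cdf(uniform_cdf, 5, 0.2, 0.8)`; the simulated frequencies should agree with the formulas up to sampling error.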


A popular way to describe the spread of a random sample is to use the
distance from the minimum to the maximum, which is called the range of the random
sample. We can combine the result from the end of Example 3.9.6 with Theorem 3.9.4
to find the p.d.f. of the range.


Example 3.9.7
The Distribution of the Range of a Random Sample. Consider the same situation as in
Example 3.9.6. The random variable $W = Y_n - Y_1$ is called the range of the sample.
The joint p.d.f. $g(y_1, y_n)$ of $Y_1$ and $Y_n$ was presented in Eq. (3.9.9). We can now apply
Theorem 3.9.4 with $a_1 = -1$, $a_2 = 1$, and $b = 0$ to get the p.d.f. $h$ of $W$:

$$h(w) = \int_{-\infty}^{\infty} g(y_n - w, y_n)\, dy_n = \int_{-\infty}^{\infty} g(z, z + w)\, dz, \tag{3.9.10}$$

where, for the last equality, we have made the change of variable $z = y_n - w$. ◁


Here is a special case in which the integral in Eq. (3.9.10) can be computed in
closed form.


The Range of a Random Sample from a Uniform Distribution. Suppose that the $n$ random
variables $X_1, \ldots, X_n$ form a random sample from the uniform distribution on the
interval [0, 1]. We shall determine the p.d.f. of the range of the sample.

In this example,

$$f(x) = \begin{cases} 1 & \text{for } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases}$$

Also, $F(x) = x$ for $0 < x < 1$. We can write $g(y_1, y_n)$ from Eq. (3.9.9) in this case as

$$g(y_1, y_n) = \begin{cases} n(n - 1)(y_n - y_1)^{n-2} & \text{for } 0 < y_1 < y_n < 1, \\ 0 & \text{otherwise.} \end{cases}$$

Therefore, in Eq. (3.9.10), $g(z, z + w) = 0$ unless $0 < w < 1$ and $0 < z < 1 - w$. For
values of $w$ and $z$ satisfying these conditions, $g(z, w + z) = n(n - 1)w^{n-2}$. The p.d.f.
in Eq. (3.9.10) is then, for $0 < w < 1$,

$$h(w) = \int_0^{1 - w} n(n - 1)w^{n-2}\, dz = n(n - 1)w^{n-2}(1 - w).$$

Otherwise, $h(w) = 0$. This p.d.f. is shown in Fig. 3.26 for the case $n = 5$. ◁
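The closed form for the range can be checked against Eq. (3.9.10) directly. The following is a small sketch under stated assumptions (a midpoint rule with an arbitrary number of steps, and illustrative function names) that integrates the joint p.d.f. of the minimum and maximum along the line $y_n = y_1 + w$.

```python
def joint_pdf_min_max_uniform(n, y1, yn):
    """Eq. (3.9.9) specialized to the uniform distribution on [0, 1]."""
    if 0.0 < y1 < yn < 1.0:
        return n * (n - 1) * (yn - y1) ** (n - 2)
    return 0.0

def range_pdf_closed_form(n, w):
    """h(w) = n(n-1) w^(n-2) (1 - w) for 0 < w < 1, and 0 otherwise."""
    if 0.0 < w < 1.0:
        return n * (n - 1) * w ** (n - 2) * (1.0 - w)
    return 0.0

def range_pdf_numeric(n, w, steps=20000):
    """Midpoint rule for the integral of g(z, z + w) dz over 0 < z < 1."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        z = (i + 0.5) * h
        total += joint_pdf_min_max_uniform(n, z, z + w)
    return total * h
```

With $n = 5$ and $w = 0.5$, both functions should return values close to 1.25, which matches the height of the range curve in Fig. 3.26 at that point.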


Direct Transformation of a Multivariate p.d.f. 


Next, we state without proof a generalization of Theorem 3.8.4 to the case of several
random variables. The proof of Theorem 3.9.5 is based on the theory of differentiable
one-to-one transformations in advanced calculus.


Theorem 3.9.5
Multivariate Transformation. Let $X_1, \ldots, X_n$ have a continuous joint distribution
for which the joint p.d.f. is $f$. Assume that there is a subset $S$ of $R^n$ such that
$\Pr[(X_1, \ldots, X_n) \in S] = 1$. Define $n$ new random variables $Y_1, \ldots, Y_n$ as follows:

$$\begin{aligned}
Y_1 &= r_1(X_1, \ldots, X_n), \\
Y_2 &= r_2(X_1, \ldots, X_n), \\
&\;\;\vdots \\
Y_n &= r_n(X_1, \ldots, X_n),
\end{aligned} \tag{3.9.11}$$

where we assume that the $n$ functions $r_1, \ldots, r_n$ define a one-to-one differentiable
transformation of $S$ onto a subset $T$ of $R^n$. Let the inverse of this transformation be
given as follows:

$$\begin{aligned}
x_1 &= s_1(y_1, \ldots, y_n), \\
x_2 &= s_2(y_1, \ldots, y_n), \\
&\;\;\vdots \\
x_n &= s_n(y_1, \ldots, y_n).
\end{aligned} \tag{3.9.12}$$

Then the joint p.d.f. $g$ of $Y_1, \ldots, Y_n$ is

$$g(y_1, \ldots, y_n) = \begin{cases}
f(s_1, \ldots, s_n)\,|J| & \text{for } (y_1, \ldots, y_n) \in T, \\
0 & \text{otherwise,}
\end{cases} \tag{3.9.13}$$

where $J$ is the determinant

$$J = \det \begin{bmatrix}
\dfrac{\partial s_1}{\partial y_1} & \cdots & \dfrac{\partial s_1}{\partial y_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial s_n}{\partial y_1} & \cdots & \dfrac{\partial s_n}{\partial y_n}
\end{bmatrix}$$

and $|J|$ denotes the absolute value of the determinant $J$. ∎

Thus, the joint p.d.f. $g(y_1, \ldots, y_n)$ is obtained by starting with the joint p.d.f.
$f(x_1, \ldots, x_n)$, replacing each value $x_i$ by its expression $s_i(y_1, \ldots, y_n)$ in terms of
$y_1, \ldots, y_n$, and then multiplying the result by $|J|$. This determinant $J$ is called the
Jacobian of the transformation specified by the equations in (3.9.12).


Note: The Jacobian Is a Generalization of the Derivative of the Inverse. Eqs. (3.8.3)
and (3.9.13) are very similar. The former gives the p.d.f. of a single function of a
single random variable. Indeed, if $n = 1$ in (3.9.13), $J = ds_1(y_1)/dy_1$ and Eq. (3.9.13)
becomes the same as (3.8.3). The Jacobian merely generalizes the derivative of the
inverse of a single function of one variable to $n$ functions of $n$ variables.


The Joint p.d.f. of the Quotient and the Product of Two Random Variables. Suppose that
two random variables $X_1$ and $X_2$ have a continuous joint distribution for which the
joint p.d.f. is as follows:

$$f(x_1, x_2) = \begin{cases} 4x_1x_2 & \text{for } 0 < x_1 < 1 \text{ and } 0 < x_2 < 1, \\ 0 & \text{otherwise.} \end{cases}$$

We shall determine the joint p.d.f. of two new random variables $Y_1$ and $Y_2$, which are
defined by the relations

$$Y_1 = \frac{X_1}{X_2} \quad \text{and} \quad Y_2 = X_1X_2.$$

In the notation of Theorem 3.9.5, we would say that $Y_1 = r_1(X_1, X_2)$ and $Y_2 =
r_2(X_1, X_2)$, where

$$r_1(x_1, x_2) = \frac{x_1}{x_2} \quad \text{and} \quad r_2(x_1, x_2) = x_1x_2. \tag{3.9.14}$$

The inverse of the transformation in Eq. (3.9.14) is found by solving the equations
$y_1 = r_1(x_1, x_2)$ and $y_2 = r_2(x_1, x_2)$ for $x_1$ and $x_2$ in terms of $y_1$ and $y_2$. The result is

$$\begin{aligned}
x_1 &= s_1(y_1, y_2) = (y_1y_2)^{1/2}, \\
x_2 &= s_2(y_1, y_2) = \left(\frac{y_2}{y_1}\right)^{1/2}.
\end{aligned} \tag{3.9.15}$$

Let $S$ denote the set of points $(x_1, x_2)$ such that $0 < x_1 < 1$ and $0 < x_2 < 1$, so that
$\Pr[(X_1, X_2) \in S] = 1$. Let $T$ be the set of $(y_1, y_2)$ pairs such that $(y_1, y_2) \in T$ if and only
if $(s_1(y_1, y_2), s_2(y_1, y_2)) \in S$. Then $\Pr[(Y_1, Y_2) \in T] = 1$. The transformation defined by
the equations in (3.9.14) or, equivalently, by the equations in (3.9.15) specifies a
one-to-one relation between the points in $S$ and the points in $T$.


Figure 3.27 The sets S and T in Example 3.9.9.


We shall now show how to find the set $T$. We know that $(x_1, x_2) \in S$ if and only
if the following inequalities hold:

$$x_1 > 0, \quad x_1 < 1, \quad x_2 > 0, \quad \text{and} \quad x_2 < 1. \tag{3.9.16}$$

We can substitute the formulas for $x_1$ and $x_2$ in terms of $y_1$ and $y_2$ from Eq. (3.9.15)
into the inequalities in (3.9.16) to obtain

$$(y_1y_2)^{1/2} > 0, \quad (y_1y_2)^{1/2} < 1, \quad \left(\frac{y_2}{y_1}\right)^{1/2} > 0, \quad \text{and} \quad \left(\frac{y_2}{y_1}\right)^{1/2} < 1. \tag{3.9.17}$$

The first inequality transforms to ($y_1 > 0$ and $y_2 > 0$) or ($y_1 < 0$ and $y_2 < 0$). However,
since $y_1 = x_1/x_2$, we cannot have $y_1 < 0$, so we get only $y_1 > 0$ and $y_2 > 0$. The third
inequality in (3.9.17) transforms to the same thing. The second inequality in (3.9.17)
becomes $y_2 < 1/y_1$. The fourth inequality becomes $y_2 < y_1$. The region $T$ where
$(y_1, y_2)$ satisfy these new inequalities is shown in the right panel of Fig. 3.27 with
the set $S$ in the left panel.

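For the functions in (3.9.15),

$$\frac{\partial s_1}{\partial y_1} = \frac{1}{2}\left(\frac{y_2}{y_1}\right)^{1/2}, \qquad
\frac{\partial s_1}{\partial y_2} = \frac{1}{2}\left(\frac{y_1}{y_2}\right)^{1/2},$$

$$\frac{\partial s_2}{\partial y_1} = -\frac{1}{2}\left(\frac{y_2}{y_1^3}\right)^{1/2}, \qquad
\frac{\partial s_2}{\partial y_2} = \frac{1}{2}\left(\frac{1}{y_1y_2}\right)^{1/2}.$$

Hence,

$$J = \det \begin{bmatrix}
\dfrac{1}{2}\left(\dfrac{y_2}{y_1}\right)^{1/2} & \dfrac{1}{2}\left(\dfrac{y_1}{y_2}\right)^{1/2} \\
-\dfrac{1}{2}\left(\dfrac{y_2}{y_1^3}\right)^{1/2} & \dfrac{1}{2}\left(\dfrac{1}{y_1y_2}\right)^{1/2}
\end{bmatrix} = \frac{1}{2y_1}.$$

Since $y_1 > 0$ throughout the set $T$, $|J| = 1/(2y_1)$.

The joint p.d.f. $g(y_1, y_2)$ can now be obtained directly from Eq. (3.9.13) in the
following way: In the expression for $f(x_1, x_2)$, replace $x_1$ with $(y_1y_2)^{1/2}$, replace $x_2$
with $(y_2/y_1)^{1/2}$, and multiply the result by $|J| = 1/(2y_1)$. Therefore,

$$g(y_1, y_2) = \begin{cases} 2\left(\dfrac{y_2}{y_1}\right) & \text{for } (y_1, y_2) \in T, \\ 0 & \text{otherwise.} \end{cases}$$ ◁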
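The Jacobian in this example can be double-checked numerically. The sketch below is an illustration, not part of the original text: it estimates the partial derivatives of the inverse transformation in Eq. (3.9.15) by central finite differences and then evaluates Eq. (3.9.13). The step size `eps` and all function names are assumptions.

```python
import math

def inverse(y1, y2):
    """Eq. (3.9.15): (x1, x2) = (sqrt(y1*y2), sqrt(y2/y1)), for y1, y2 > 0."""
    return math.sqrt(y1 * y2), math.sqrt(y2 / y1)

def jacobian_numeric(y1, y2, eps=1e-6):
    """Central finite-difference estimate of det[ds_i/dy_j]."""
    a1, a2 = inverse(y1 + eps, y2)
    b1, b2 = inverse(y1 - eps, y2)
    c1, c2 = inverse(y1, y2 + eps)
    d1, d2 = inverse(y1, y2 - eps)
    ds1_dy1, ds2_dy1 = (a1 - b1) / (2 * eps), (a2 - b2) / (2 * eps)
    ds1_dy2, ds2_dy2 = (c1 - d1) / (2 * eps), (c2 - d2) / (2 * eps)
    return ds1_dy1 * ds2_dy2 - ds1_dy2 * ds2_dy1

def f(x1, x2):
    """Joint p.d.f. of (X1, X2): 4*x1*x2 on the open unit square."""
    return 4.0 * x1 * x2 if 0.0 < x1 < 1.0 and 0.0 < x2 < 1.0 else 0.0

def g(y1, y2):
    """Eq. (3.9.13): f at the inverse point times |J| = 1/(2*y1)."""
    if y1 <= 0.0 or y2 <= 0.0:
        return 0.0
    x1, x2 = inverse(y1, y2)
    return f(x1, x2) / (2.0 * y1)
```

At the point $(y_1, y_2) = (2, 0.3)$, which lies in $T$, the numerical Jacobian should be close to $1/(2y_1) = 0.25$ and `g` should return $2y_2/y_1 = 0.3$. Points outside $T$ map outside the unit square, so `g` returns 0 there without any explicit description of $T$.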


Service Time in a Queue. Let $X$ be the time that the server in a single-server queue
will spend on a particular customer, and let $Y$ be the rate at which the server can
operate. A popular model for the conditional distribution of $X$ given $Y = y$ is to say
that the conditional p.d.f. of $X$ given $Y = y$ is

$$g_1(x|y) = \begin{cases} ye^{-xy} & \text{for } x > 0, \\ 0 & \text{otherwise.} \end{cases}$$

Let $Y$ have the p.d.f. $f_2(y)$. The joint p.d.f. of $(X, Y)$ is then $g_1(x|y) f_2(y)$. Because
$1/Y$ can be interpreted as the average service time, $Z = XY$ measures how quickly,
compared to average, the customer is served. For example, $Z = 1$ corresponds
to an average service time, while $Z > 1$ means that this customer took longer than
average, and $Z < 1$ means that this customer was served more quickly than the
average customer. If we want the distribution of $Z$, we could compute the joint p.d.f.
of $(Z, Y)$ directly using the methods just illustrated. We could then integrate the joint
p.d.f. over $y$ to obtain the marginal p.d.f. of $Z$. However, it is simpler to transform the
conditional distribution of $X$ given $Y = y$ into the conditional distribution of $Z$ given
$Y = y$, since conditioning on $Y = y$ allows us to treat $Y$ as the constant $y$. Because
$X = Z/Y$, the inverse transformation is $x = s(z)$, where $s(z) = z/y$. The derivative of
this is $1/y$, and the conditional p.d.f. of $Z$ given $Y = y$ is

$$h_1(z|y) = \frac{1}{y}\, g_1\!\left(\frac{z}{y}\,\Big|\, y\right).$$

Because $Y$ is a rate, $Y > 0$ and $X = Z/Y > 0$ if and only if $Z > 0$. So,

$$h_1(z|y) = \begin{cases} e^{-z} & \text{for } z > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.9.18}$$

Notice that $h_1$ does not depend on $y$, so $Z$ is independent of $Y$ and $h_1$ is the marginal
p.d.f. of $Z$. The reader can verify all of this in Exercise 17. ◁
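A simulation gives an informal check that $Z = XY$ has the p.d.f. in Eq. (3.9.18) regardless of the rate. The sketch below is only an illustration: the text leaves $f_2$ unspecified, so the uniform distribution on [0.5, 2] used for $Y$ is a hypothetical choice, as are the seed, the number of trials, and the function names.

```python
import math
import random

EXPECTED_TAIL = math.exp(-1.0)  # Pr(Z > 1) when Z has p.d.f. exp(-z) for z > 0

def simulate_pairs(trials=100_000, seed=0):
    """Draw (Y, Z) pairs with Z = X * Y, where X given Y = y has p.d.f. y*exp(-x*y)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(trials):
        y = rng.uniform(0.5, 2.0)   # hypothetical f2; any distribution on positive rates would do
        x = rng.expovariate(y)      # exponential service time with rate y
        pairs.append((y, x * y))
    return pairs

def tail_fraction(zs):
    """Fraction of simulated Z values that exceed 1."""
    return sum(z > 1.0 for z in zs) / len(zs)
```

Splitting the simulated pairs into slow servers (say `y < 1.25`) and fast servers and computing `tail_fraction` for each group should give values near `EXPECTED_TAIL` in both groups, which is consistent with $Z$ being independent of $Y$.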


Note: Removing Dependence. The formula Z = XY in Example 3.9.10 makes it 
look as if Z should depend on Y. In reality, however, multiplying X by Y removes the 
dependence that X already has on Y and makes the result independent of Y. This type 
of transformation that removes the dependence of one random variable on another 
is a very powerful technique for finding marginal distributions of transformations of 
random variables. 

In Example 3.9.10, we mentioned that there was another, more straightforward 
but more tedious, way to compute the distribution of Z. That method, which is useful 
in many settings, is to transform (X, Y) into (Z, W) for some uninteresting random 
variable W and then integrate w out of the joint p.d.f. All that matters in the choice 
of W is that the transformation be one-to-one with differentiable inverse and that 
the calculations are feasible. Here is a specific example. 


One Function of Two Variables. In Example 3.9.9, suppose that we were interested
only in the quotient $Y_1 = X_1/X_2$ rather than both the quotient and the product
$Y_2 = X_1X_2$. Since we already have the joint p.d.f. of $(Y_1, Y_2)$, we will merely integrate
$y_2$ out rather than start from scratch. For each value of $y_1 > 0$, we need to look at the
set $T$ in Fig. 3.27 and find the interval of $y_2$ values to integrate over. For $0 < y_1 < 1$,
we integrate over $0 < y_2 < y_1$. For $y_1 > 1$, we integrate over $0 < y_2 < 1/y_1$. (For $y_1 = 1$
both intervals are the same.) So, the marginal p.d.f. of $Y_1$ is

$$g_1(y_1) = \begin{cases}
\displaystyle\int_0^{y_1} 2\left(\frac{y_2}{y_1}\right) dy_2 & \text{for } 0 < y_1 < 1, \\[2ex]
\displaystyle\int_0^{1/y_1} 2\left(\frac{y_2}{y_1}\right) dy_2 & \text{for } y_1 > 1,
\end{cases}
= \begin{cases}
y_1 & \text{for } 0 < y_1 < 1, \\[1ex]
\dfrac{1}{y_1^3} & \text{for } y_1 > 1.
\end{cases}$$

There are other transformations that would have made the calculation of $g_1$ simpler
if that had been all we wanted. See Exercise 21 for an example. ◁
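Integrating out the unwanted variable can also be done numerically, which gives a quick check of the marginal p.d.f. above. This is a minimal sketch with assumed names and step counts; it uses the fact that the set $T$ never extends above $y_2 = 1$, so integrating over $0 < y_2 < 1$ is enough.

```python
def joint_pdf(y1, y2):
    """g(y1, y2) = 2*y2/y1 on T = {y1 > 0, 0 < y2 < min(y1, 1/y1)}, and 0 elsewhere."""
    if y1 > 0.0 and 0.0 < y2 < min(y1, 1.0 / y1):
        return 2.0 * y2 / y1
    return 0.0

def marginal_closed_form(y1):
    """g_1(y1) = y1 for 0 < y1 <= 1 and 1/y1**3 for y1 > 1."""
    if 0.0 < y1 <= 1.0:
        return y1
    if y1 > 1.0:
        return 1.0 / y1 ** 3
    return 0.0

def marginal_numeric(y1, steps=10000):
    """Midpoint rule in y2 over (0, 1) for the integral of the joint p.d.f."""
    h = 1.0 / steps
    return sum(joint_pdf(y1, (i + 0.5) * h) for i in range(steps)) * h
```

For instance, `marginal_numeric(0.5)` and `marginal_numeric(2.0)` should be close to 0.5 and 0.125, the values given by `marginal_closed_form`.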
Theorem 3.9.6
Linear Transformations. Let $\mathbf{X} = (X_1, \ldots, X_n)$ have a continuous joint distribution for
which the joint p.d.f. is $f$. Define $\mathbf{Y} = (Y_1, \ldots, Y_n)$ by

$$\mathbf{Y} = A\mathbf{X}, \tag{3.9.19}$$

where $A$ is a nonsingular $n \times n$ matrix. Then $\mathbf{Y}$ has a continuous joint distribution
with p.d.f.

$$g(\mathbf{y}) = \frac{1}{|\det A|}\, f(A^{-1}\mathbf{y}) \quad \text{for } \mathbf{y} \in R^n, \tag{3.9.20}$$

where $A^{-1}$ is the inverse of $A$.


Proof Each $Y_i$ is a linear combination of $X_1, \ldots, X_n$. Because $A$ is nonsingular, the
transformation in Eq. (3.9.19) is a one-to-one transformation of the entire space $R^n$
onto itself. At every point $\mathbf{y} \in R^n$, the inverse transformation can be represented by
the equation

$$\mathbf{x} = A^{-1}\mathbf{y}. \tag{3.9.21}$$

The Jacobian $J$ of the transformation that is defined by Eq. (3.9.21) is simply $J =
\det A^{-1}$. Also, it is known from the theory of determinants that

$$\det A^{-1} = \frac{1}{\det A}.$$

Therefore, at every point $\mathbf{y} \in R^n$, the joint p.d.f. $g(\mathbf{y})$ can be evaluated in the
following way, according to Theorem 3.9.5: First, for $i = 1, \ldots, n$, the component $x_i$ in
$f(x_1, \ldots, x_n)$ is replaced with the $i$th component of the vector $A^{-1}\mathbf{y}$. Then, the result
is divided by $|\det A|$. This produces Eq. (3.9.20). ∎
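Eq. (3.9.20) translates almost line for line into code. The sketch below handles only the $2 \times 2$ case so that it stays self-contained; the joint p.d.f. of two independent exponential variables and the matrix $A$ are hypothetical choices made for illustration, and the helper names are not from the text.

```python
import math

def det2(A):
    """Determinant of a 2x2 matrix given as nested lists."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv2(A):
    """Inverse of a nonsingular 2x2 matrix."""
    d = det2(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

def transformed_pdf(f, A):
    """Return g with g(y) = f(A^{-1} y) / |det A|, as in Eq. (3.9.20)."""
    d = det2(A)
    if d == 0:
        raise ValueError("A must be nonsingular")
    Ainv = inv2(A)

    def g(y1, y2):
        x1 = Ainv[0][0] * y1 + Ainv[0][1] * y2
        x2 = Ainv[1][0] * y1 + Ainv[1][1] * y2
        return f(x1, x2) / abs(d)

    return g

def f(x1, x2):
    """Hypothetical joint p.d.f.: two independent exponential variables with rate 1."""
    return math.exp(-x1 - x2) if x1 > 0 and x2 > 0 else 0.0

A = [[1.0, 1.0], [1.0, -1.0]]   # Y1 = X1 + X2 and Y2 = X1 - X2
g = transformed_pdf(f, A)
```

With this choice, `g(y1, y2)` equals `exp(-y1) / 2` when `abs(y2) < y1` and 0 otherwise. Integrating `g(2.0, y2)` over `y2` gives `2 * exp(-2)`, the value at 2 of the p.d.f. of the sum of two such exponential variables.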


Summary 


We extended the construction of the distribution of a function of a random variable
to the case of several functions of several random variables. If one only wants the
distribution of one function $r_1$ of $n$ random variables, the usual way to find this is to
first find $n - 1$ additional functions $r_2, \ldots, r_n$ so that the $n$ functions together compose
a one-to-one transformation. Then find the joint p.d.f. of the $n$ functions and finally
find the marginal p.d.f. of the first function by integrating out the extra $n - 1$ variables.
The method is illustrated for the cases of the sum and the range of several random
variables.


Exercises 


1. Suppose that $X_1$ and $X_2$ are i.i.d. random variables and
that each of them has the uniform distribution on the
interval [0, 1]. Find the p.d.f. of $Y = X_1 + X_2$.


2. For the conditions of Exercise 1, find the p.d.f. of the
average $(X_1 + X_2)/2$.


3. Suppose that three random variables $X_1$, $X_2$, and $X_3$
have a continuous joint distribution for which the joint
p.d.f. is as follows:

$$f(x_1, x_2, x_3) = \begin{cases} 8x_1x_2x_3 & \text{for } 0 < x_i < 1 \ (i = 1, 2, 3), \\ 0 & \text{otherwise.} \end{cases}$$

Suppose also that $Y_1 = X_1$, $Y_2 = X_1X_2$, and $Y_3 = X_1X_2X_3$.
Find the joint p.d.f. of $Y_1$, $Y_2$, and $Y_3$.


4. Suppose that $X_1$ and $X_2$ have a continuous joint
distribution for which the joint p.d.f. is as follows:

$$f(x_1, x_2) = \begin{cases} x_1 + x_2 & \text{for } 0 < x_1 < 1 \text{ and } 0 < x_2 < 1, \\ 0 & \text{otherwise.} \end{cases}$$

Find the p.d.f. of $Y = X_1X_2$.


5. Suppose that the joint p.d.f. of $X_1$ and $X_2$ is as given in
Exercise 4. Find the p.d.f. of $Z = X_1/X_2$.


6. Let $X$ and $Y$ be random variables for which the joint
p.d.f. is as follows:

$$f(x, y) = \begin{cases} 2(x + y) & \text{for } 0 < x < y < 1, \\ 0 & \text{otherwise.} \end{cases}$$

Find the p.d.f. of $Z = X + Y$.


7. Suppose that $X_1$ and $X_2$ are i.i.d. random variables and
that the p.d.f. of each of them is as follows:

$$f(x) = \begin{cases} e^{-x} & \text{for } x > 0, \\ 0 & \text{otherwise.} \end{cases}$$

Find the p.d.f. of $Y = X_1 - X_2$.


8. Suppose that $X_1, \ldots, X_n$ form a random sample of size
$n$ from the uniform distribution on the interval [0, 1] and
that $Y_n = \max\{X_1, \ldots, X_n\}$. Find the smallest value of $n$
such that

$$\Pr\{Y_n \ge 0.99\} \ge 0.95.$$


9. Suppose that the $n$ variables $X_1, \ldots, X_n$ form a random
sample from the uniform distribution on the interval [0, 1]
and that the random variables $Y_1$ and $Y_n$ are defined as
in Eq. (3.9.8). Determine the value of $\Pr(Y_1 \le 0.1$ and
$Y_n \le 0.8)$.


10. For the conditions of Exercise 9, determine the value
of $\Pr(Y_1 \le 0.1$ and $Y_n \ge 0.8)$.


11. For the conditions of Exercise 9, determine the
probability that the interval from $Y_1$ to $Y_n$ will not contain the
point 1/3.


12. Let $W$ denote the range of a random sample of $n$
observations from the uniform distribution on the interval
[0, 1]. Determine the value of $\Pr(W > 0.9)$.


13. Determine the p.d.f. of the range of a random sample
of $n$ observations from the uniform distribution on the
interval [−3, 5].


14. Suppose that $X_1, \ldots, X_n$ form a random sample of $n$
observations from the uniform distribution on the interval
[0, 1], and let $Y$ denote the second largest of the
observations. Determine the p.d.f. of $Y$. Hint: First determine the
c.d.f. $G$ of $Y$ by noting that

$$\begin{aligned}
G(y) &= \Pr(Y \le y) \\
&= \Pr(\text{At least } n - 1 \text{ observations} \le y).
\end{aligned}$$


15. Show that if $X_1, X_2, \ldots, X_n$ are independent random
variables and if $Y_1 = r_1(X_1)$, $Y_2 = r_2(X_2)$, ..., $Y_n = r_n(X_n)$,
then $Y_1, Y_2, \ldots, Y_n$ are also independent random
variables.


16. Suppose that $X_1, X_2, \ldots, X_5$ are five random
variables for which the joint p.d.f. can be factored in the
following form for all points $(x_1, x_2, \ldots, x_5) \in R^5$:

$$f(x_1, x_2, \ldots, x_5) = g(x_1, x_2)\, h(x_3, x_4, x_5),$$

where $g$ and $h$ are certain nonnegative functions. Show
that if $Y_1 = r_1(X_1, X_2)$ and $Y_2 = r_2(X_3, X_4, X_5)$, then the
random variables $Y_1$ and $Y_2$ are independent.


17. In Example 3.9.10, use the Jacobian method (3.9.13)
to verify that $Y$ and $Z$ are independent and that Eq.
(3.9.18) is the marginal p.d.f. of $Z$.


18. Let the conditional p.d.f. of $X$ given $Y$ be $g_1(x|y) =
3x^2/y^3$ for $0 < x < y$ and 0 otherwise. Let the marginal
p.d.f. of $Y$ be $f_2(y)$, where $f_2(y) = 0$ for $y < 0$ but is
otherwise unspecified. Let $Z = X/Y$. Prove that $Z$ and $Y$ are
independent and find the marginal p.d.f. of $Z$.


19. Let $X_1$ and $X_2$ be as in Exercise 7. Find the p.d.f. of
$Y = X_1 + X_2$.


20. If $a_2 = 0$ in Theorem 3.9.4, show that Eq. (3.9.2)
becomes the same as Eq. (3.8.1) with $a = a_1$ and $f = f_1$.


21. In Examples 3.9.9 and 3.9.11, find the marginal p.d.f.
of $Z_1 = X_1/X_2$ by first transforming to $Z_1$ and $Z_2 = X_1$ and
then integrating $z_2$ out of the joint p.d.f.


* 3.10 Markov Chains


A popular model for systems that change over time in a random manner is the 
Markov chain model. A Markov chain is a sequence of random variables, one for 
each time. At each time, the corresponding random variable gives the state of the 
system. Also, the conditional distribution of each future state given the past states 
and the present state depends only on the present state. 


Stochastic Processes 


Occupied Telephone Lines. Suppose that a certain business office has five telephone
lines and that any number of these lines may be in use at any given time. During
a certain period of time, the telephone lines are observed at regular intervals of 2
minutes and the number of lines that are being used at each time is noted. Let $X_1$
denote the number of lines that are being used when the lines are first observed at the
beginning of the period; let $X_2$ denote the number of lines that are being used when
they are observed the second time, 2 minutes later; and in general, for $n = 1, 2, \ldots$,
let $X_n$ denote the number of lines that are being used when they are observed for the
$n$th time. ◁


Stochastic Process. A sequence of random variables $X_1, X_2, \ldots$ is called a stochastic
process or random process with discrete time parameter. The first random variable $X_1$
is called the initial state of the process; and for $n = 2, 3, \ldots$, the random variable $X_n$
is called the state of the process at time $n$.


In Example 3.10.1, the state of the process at any time is the number of lines
being used at that time. Therefore, each state must be an integer between 0 and 5.
Each of the random variables in a stochastic process has a marginal distribution,
and the entire process has a joint distribution. For convenience, in this text, we will
discuss only joint distributions for finitely many of $X_1, X_2, \ldots$ at a time. The meaning
of the phrase "discrete time parameter" is that the process, such as the numbers of
occupied phone lines, is observed only at discrete or separated points in time, rather
than continuously in time. In Sec. 5.4, we will introduce a different stochastic process
(called the Poisson process) with a continuous time parameter.

In a stochastic process with a discrete time parameter, the state of the process
varies in a random manner from time to time. To describe a complete probability
model for a particular process, it is necessary to specify the distribution for the
initial state $X_1$ and also to specify for each $n = 1, 2, \ldots$ the conditional distribution
of the subsequent state $X_{n+1}$ given $X_1, \ldots, X_n$. These conditional distributions are
equivalent to the collection of conditional c.d.f.'s of the following form:

$$\Pr(X_{n+1} \le b \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n).$$


Markov Chains 


A Markov chain is a special type of stochastic process, defined in terms of the 
conditional distributions of future states given the present and past states. 


Markov Chain. A stochastic process with discrete time parameter is a Markov chain
if, for each time $n$, the conditional distributions of all $X_{n+j}$ for $j \ge 1$ given $X_1, \ldots, X_n$
depend only on $X_n$ and not on the earlier states $X_1, \ldots, X_{n-1}$. In symbols, for
$n = 1, 2, \ldots$ and for each $b$ and each possible sequence of states $x_1, x_2, \ldots, x_n$,

$$\Pr(X_{n+1} \le b \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \Pr(X_{n+1} \le b \mid X_n = x_n).$$

A Markov chain is called finite if there are only finitely many possible states.


In the remainder of this section, we shall consider only finite Markov chains. This
assumption could be relaxed at the cost of more complicated theory and calculation.
For convenience, we shall reserve the symbol $k$ to stand for the number of possible
states of a general finite Markov chain for the remainder of the section. It will also
be convenient, when discussing a general finite Markov chain, to name the $k$ states
using the integers $1, \ldots, k$. That is, for each $n$ and $j$, $X_n = j$ will mean that the chain
is in state $j$ at time $n$. In specific examples, it may prove more convenient to label the
states in a more informative fashion. For example, if the states are the numbers of
phone lines in use at given times (as in the example that introduced this section), we
would label the states $0, \ldots, 5$ even though $k = 6$.

The following result follows from the multiplication rule for conditional proba- 
bilities, Theorem 2.1.2. 


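For a finite Markov chain, the joint p.f. for the first $n$ states equals

$$\begin{aligned}
\Pr(X_1 = x_1, &X_2 = x_2, \ldots, X_n = x_n) \\
&= \Pr(X_1 = x_1) \Pr(X_2 = x_2 \mid X_1 = x_1) \Pr(X_3 = x_3 \mid X_2 = x_2) \cdots \\
&\qquad \Pr(X_n = x_n \mid X_{n-1} = x_{n-1}).
\end{aligned} \tag{3.10.1}$$

Also, for each $n$ and each $m > 0$,

$$\begin{aligned}
\Pr(X_{n+1} = x_{n+1}, &X_{n+2} = x_{n+2}, \ldots, X_{n+m} = x_{n+m} \mid X_n = x_n) \\
&= \Pr(X_{n+1} = x_{n+1} \mid X_n = x_n) \Pr(X_{n+2} = x_{n+2} \mid X_{n+1} = x_{n+1}) \\
&\qquad \cdots \Pr(X_{n+m} = x_{n+m} \mid X_{n+m-1} = x_{n+m-1}).
\end{aligned} \tag{3.10.2}$$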


Eq. (3.10.1) is a discrete version of a generalization of conditioning in sequence that 
was illustrated in Example 3.7.18 with continuous random variables. Eq. (3.10.2) is a 
conditional version of (3.10.1) shifted forward in time. 
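The factorization in Eq. (3.10.1) can be sketched as a short function that multiplies the initial probability by the one-step conditional probabilities along a path. The two-state chain below, with states labeled "a" and "b" and the numbers shown, is a hypothetical example chosen only for illustration.

```python
from itertools import product

def path_probability(initial, transition, path):
    """Eq. (3.10.1): Pr(X1 = x1) times the product of Pr(X_{i+1} = x_{i+1} | X_i = x_i)."""
    prob = initial[path[0]]
    for prev, nxt in zip(path, path[1:]):
        prob *= transition[prev][nxt]
    return prob

def total_probability(initial, transition, states, length):
    """Sum of the path probabilities over every state sequence of the given length."""
    return sum(
        path_probability(initial, transition, path)
        for path in product(states, repeat=length)
    )

# Hypothetical chain: distribution of the initial state and one-step conditional p.f.'s.
initial = {"a": 0.5, "b": 0.5}
transition = {
    "a": {"a": 0.9, "b": 0.1},
    "b": {"a": 0.4, "b": 0.6},
}
```

For the path a, a, b, b the function multiplies 0.5, 0.9, 0.1, and 0.6. Summing over all paths of a fixed length returns 1, as a joint p.f. must.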


Shopping for Toothpaste. In Exercise 4 in Sec. 2.1, we considered a shopper who
chooses between two brands of toothpaste on several occasions. Let $X_i = 1$ if the
shopper chooses brand A on the $i$th purchase, and let $X_i = 2$ if the shopper chooses
brand B on the $i$th purchase. Then the sequence of states $X_1, X_2, \ldots$ is a stochastic
process with two possible states at each time. The probabilities of purchase were
specified by saying that the shopper will choose the same brand as on the previous
purchase with probability 1/3 and will switch with probability 2/3. Since this happens
regardless of purchases that are older than the previous one, we see that this
stochastic process is a Markov chain with

$$\begin{aligned}
\Pr(X_{n+1} = 1 \mid X_n = 1) &= \frac{1}{3}, & \Pr(X_{n+1} = 2 \mid X_n = 1) &= \frac{2}{3}, \\
\Pr(X_{n+1} = 1 \mid X_n = 2) &= \frac{2}{3}, & \Pr(X_{n+1} = 2 \mid X_n = 2) &= \frac{1}{3}.
\end{aligned}$$ ◁


Example 3.10.2 has an additional feature that puts it in a special class of Markov
chains. The probability of moving from one state at time $n$ to another state at time
$n + 1$ does not depend on $n$.


Transition Distributions/Stationary Transition Distributions. Consider a finite Markov
chain with $k$ possible states. The conditional distributions of the state at time $n + 1$
given the state at time $n$, that is, $\Pr(X_{n+1} = j \mid X_n = i)$ for $i, j = 1, \ldots, k$ and $n =
1, 2, \ldots$, are called the transition distributions of the Markov chain. If the transition
distribution is the same for every time $n$ ($n = 1, 2, \ldots$), then the Markov chain has
stationary transition distributions.


When a Markov chain with $k$ possible states has stationary transition
distributions, there exist probabilities $p_{ij}$ for $i, j = 1, \ldots, k$ such that, for all $n$,

$$\Pr(X_{n+1} = j \mid X_n = i) = p_{ij} \quad \text{for } n = 1, 2, \ldots. \tag{3.10.3}$$

The Markov chain in Example 3.10.2 has stationary transition distributions. For
example, $p_{11} = 1/3$.

In the language of multivariate distributions, when a Markov chain has stationary
transition distributions, specified by (3.10.3), we can write the conditional p.f. of $X_{n+1}$
given $X_n$ as

$$g(j|i) = p_{ij}, \tag{3.10.4}$$

for all $n$, $i$, $j$.


Occupied Telephone Lines. To illustrate the application of these concepts, we shall 
consider again the example involving the office with five telephone lines. In order 
for this stochastic process to be a Markov chain, the specified distribution for the 
number of lines that may be in use at each time must depend only on the number 
of lines that were in use when the process was observed most recently 2 minutes 
earlier and must not depend on any other observed values previously obtained. For 
example, if three lines were in use at time n, then the distribution for time n + 1 must 
be the same regardless of whether 0, 1, 2, 3, 4, or 5 lines were in use at time n — 1. 
In reality, however, the observation at time n — 1 might provide some information in 
regard to the length of time for which each of the three lines in use at time n had been 
occupied, and this information might be helpful in determining the distribution for 
time n + 1. Nevertheless, we shall suppose now that this process is a Markov chain. 
If this Markov chain is to have stationary transition distributions, it must be true that 
the rates at which incoming and outgoing telephone calls are made and the average 
duration of these telephone calls do not change during the entire period covered 
by the process. This requirement means that the overall period cannot include busy 
times when more calls are expected or quiet times when fewer calls are expected. For 
example, if only one line is in use at a particular observation time, regardless of when
this time occurs during the entire period covered by the process, then there must be
a specific probability $p_{1j}$ that exactly $j$ lines will be in use 2 minutes later. ◁


The Transition Matrix 


Shopping for Toothpaste. The notation for stationary transition distributions, $p_{ij}$,
suggests that they could be arranged in a matrix. The transition probabilities for
Example 3.10.2 can be arranged into the following matrix:

$$P = \begin{bmatrix} \frac{1}{3} & \frac{2}{3} \\[0.5ex] \frac{2}{3} & \frac{1}{3} \end{bmatrix}.$$ ◁


Every finite Markov chain with stationary transition distributions has a matrix like
the one constructed in Example 3.10.4.


Transition Matrix. Consider a finite Markov chain with stationary transition distributions
given by $p_{ij} = \Pr(X_{n+1} = j \mid X_n = i)$ for all n, i, j. The transition matrix of the
Markov chain is defined to be the k × k matrix P with elements $p_{ij}$. That is,

\[
P = \begin{bmatrix}
p_{11} & \cdots & p_{1k} \\
p_{21} & \cdots & p_{2k} \\
\vdots & \ddots & \vdots \\
p_{k1} & \cdots & p_{kk}
\end{bmatrix}. \tag{3.10.5}
\]


A transition matrix has several properties that are apparent from its definition.
For example, each element is nonnegative because all elements are probabilities.
Since each row of a transition matrix is a conditional p.f. for the next state given
some value of the current state, we have $\sum_{j=1}^{k} p_{ij} = 1$ for i = 1, ..., k. Indeed, row
i of the transition matrix specifies the conditional p.f. $g(\cdot \mid i)$ defined in (3.10.4).


Stochastic Matrix. A square matrix for which all elements are nonnegative and the 
sum of the elements in each row is 1 is called a stochastic matrix. 


It is clear that the transition matrix P for every finite Markov chain with stationary 
transition probabilities must be a stochastic matrix. Conversely, every k x k stochastic 
matrix can serve as the transition matrix of a finite Markov chain with k possible states 
and stationary transition distributions. 
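The two defining properties of a stochastic matrix can be checked mechanically. The following is a minimal sketch, not part of the original text: the function name `is_stochastic`, its tolerance argument, and the two small matrices are illustrative assumptions.

```python
from fractions import Fraction

def is_stochastic(matrix, tol=1e-12):
    """Return True if matrix is square, has nonnegative entries, and each row sums to 1."""
    k = len(matrix)
    for row in matrix:
        if len(row) != k:                    # must be k x k
            return False
        if any(entry < 0 for entry in row):  # entries are probabilities
            return False
        if abs(sum(row) - 1) > tol:          # each row is a conditional p.f.
            return False
    return True

# Illustrative two-state matrix (an assumption made for this sketch).
P = [[Fraction(1, 3), Fraction(2, 3)],
     [Fraction(2, 3), Fraction(1, 3)]]

print(is_stochastic(P))                         # True
print(is_stochastic([[0.5, 0.6], [0.5, 0.5]]))  # False: the first row sums to 1.1
```

Exact fractions avoid rounding issues in the row sums; the tolerance only matters when the entries are floating-point numbers.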


A Transition Matrix for the Number of Occupied Telephone Lines. Suppose that in the
example involving the office with five telephone lines, the numbers of lines being
used at times 1, 2, ... form a Markov chain with stationary transition distributions.
This chain has six possible states 0, 1, ..., 5, where i is the state in which exactly
i lines are being used at a given time (i = 0, 1, ..., 5). Suppose that the transition
matrix P is as follows:

           0     1     2     3     4     5
      0   0.1   0.4   0.2   0.1   0.1   0.1
      1   0.2   0.3   0.2   0.1   0.1   0.1
P =   2   0.1   0.2   0.3   0.2   0.1   0.1        (3.10.6)
      3   0.1   0.1   0.2   0.3   0.2   0.1
      4   0.1   0.1   0.1   0.2   0.3   0.2
      5   0.1   0.1   0.1   0.1   0.4   0.2


(a) Assuming that all five lines are in use at a certain observation time, we shall 
determine the probability that exactly four lines will be in use at the next observation 
time. (b) Assuming that no lines are in use at a certain time, we shall determine the 
probability that at least one line will be in use at the next observation time. 


(a) This probability is the element in the matrix P in the row corresponding to the 
state 5 and the column corresponding to the state 4. Its value is seen to be 0.4. 


(b) If no lines are in use at a certain time, then the element in the upper left corner
of the matrix P gives the probability that no lines will be in use at the next
observation time. Its value is seen to be 0.1. Therefore, the probability that at
least one line will be in use at the next observation time is 1 − 0.1 = 0.9. ◄

Figure 3.28 The generation following {Aa, Aa}.

Plant Breeding Experiment. A botanist is studying a certain variety of plant that is
monoecious (has male and female organs in separate flowers on a single plant). 
She begins with two plants I and II and cross-pollinates them by crossing male I 
with female II and female I with male II to produce two offspring for the next 
generation. The original plants are destroyed and the process is repeated as soon 
as the new generation of two plants is mature. Several replications of the study are 
run simultaneously. The botanist might be interested in the proportion of plants in 
any generation that have each of several possible genotypes for a particular gene. 
(See Example 1.6.4 on page 23.) Suppose that the gene has two alleles, A and a. 
The genotype of an individual will be one of the three combinations AA, Aa, or aa. 
When a new individual is born, it gets one of the two alleles (with probability 1/2 
each) from one of the parents, and it independently gets one of the two alleles from 
the other parent. The two offspring get their genotypes independently of each other. 
For example, if the parents have genotypes AA and Aa, then an offspring will get 
A for sure from the first parent and will get either A or a from the second parent 
with probability 1/2 each. Let the states of this population be the set of genotypes of 
the two members of the current population. We will not distinguish the set {AA, Aa} 
from {Aa, AA}. There are then six states: {AA, AA}, {AA, Aa}, {AA, aa}, {Aa, Aa}, 
{Aa, aa}, and {aa, aa}. For each state, we can calculate the probability that the next 
generation will be in each of the six states. For example, if the state is either {AA, AA} 
or {aa, aa}, the next generation will be in the same state with probability 1. If the state 
is {AA, aa}, the next generation will be in state {Aa, Aa} with probability 1. The other 
three states have more complicated transitions. 

If the current state is {Aa, Aa}, then all six states are possible for the next gen- 
eration. In order to compute the transition distribution, it helps to first compute the 
probability that a given offspring will have each of the three genotypes. Figure 3.28 
illustrates the possible offspring in this state. Each arrow going down in Fig. 3.28 
is a possible inheritance of an allele, and each combination of arrows terminating 
in a genotype has probability 1/4. It follows that the probabilities of AA and aa are
both 1/4, while the probability of Aa is 1/2, because two different combinations of
arrows lead to this offspring. In order for the next state to be {AA, AA}, both offspring
must be AA independently, so the probability of this transition is 1/16. The
same argument implies that the probability of a transition to {aa, aa} is 1/16. A transition
to {AA, Aa} requires one offspring to be AA (probability 1/4) and the other to
be Aa (probability 1/2). But the two different genotypes could occur in either order,
so the whole probability of such a transition is 2 × (1/4) × (1/2) = 1/4. A similar argument
shows that a transition to {Aa, aa} also has probability 1/4. A transition to
{AA, aa} requires one offspring to be AA (probability 1/4) and the other to be aa
(probability 1/4). Once again, these can occur in two orders, so the whole probability
is 2 × (1/4) × (1/4) = 1/8. By subtraction, the probability of a transition to {Aa, Aa}
must be 1 − 1/16 − 1/16 − 1/4 − 1/4 − 1/8 = 1/4. Here is the entire transition matrix,
which can be verified in a manner similar to what has just been done:

            {AA, AA}  {AA, Aa}  {AA, aa}  {Aa, Aa}  {Aa, aa}  {aa, aa}
{AA, AA}     1.0000    0.0000    0.0000    0.0000    0.0000    0.0000
{AA, Aa}     0.2500    0.5000    0.0000    0.2500    0.0000    0.0000
{AA, aa}     0.0000    0.0000    0.0000    1.0000    0.0000    0.0000
{Aa, Aa}     0.0625    0.2500    0.1250    0.2500    0.2500    0.0625
{Aa, aa}     0.0000    0.0000    0.0000    0.2500    0.5000    0.2500
{aa, aa}     0.0000    0.0000    0.0000    0.0000    0.0000    1.0000
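The verification suggested for this matrix can also be carried out by enumeration. The sketch below is an illustration under the inheritance rules stated in the example (each parent passes on one of its two alleles with probability 1/2, independently, and the two offspring are independent); the names `offspring_distribution` and `transition_row` are hypothetical helpers, not notation from the text.

```python
from fractions import Fraction
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")
STATES = [("AA", "AA"), ("AA", "Aa"), ("AA", "aa"),
          ("Aa", "Aa"), ("Aa", "aa"), ("aa", "aa")]

def offspring_distribution(parent1, parent2):
    """Genotype probabilities for a single offspring of the two given parents."""
    dist = {g: Fraction(0) for g in GENOTYPES}
    for a1, a2 in product(parent1, parent2):   # each allele pair has probability 1/4
        genotype = "".join(sorted(a1 + a2))    # 'A' sorts before 'a', so "aA" becomes "Aa"
        dist[genotype] += Fraction(1, 4)
    return dist

def transition_row(state):
    """Probabilities of the six possible next states, in the order of STATES."""
    d = offspring_distribution(*state)
    row = {s: Fraction(0) for s in STATES}
    for g1, g2 in product(GENOTYPES, repeat=2):  # two independent offspring
        unordered = tuple(sorted((g1, g2), key=GENOTYPES.index))
        row[unordered] += d[g1] * d[g2]
    return [row[s] for s in STATES]

P = [transition_row(s) for s in STATES]
print([str(x) for x in P[3]])  # row for {Aa, Aa}: ['1/16', '1/4', '1/8', '1/4', '1/4', '1/16']
```

Sorting the pair of offspring genotypes is what implements the convention that {AA, Aa} and {Aa, AA} are the same state.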


The Transition Matrix for Several Steps 


Single Server Queue. A manager usually checks the server at her store every 5 minutes 
to see whether the server is busy or not. She models the state of the server (1 = busy 
or 2 = not busy) as a Markov chain with two possible states and stationary transition 
distributions given by the following matrix: 


                  Busy   Not busy
P =   Busy         0.9     0.1
      Not busy     0.6     0.4

The manager realizes that, later in the day, she will have to be away for 10 minutes
and will miss one server check. She wants to compute the conditional distribution of 
the state two time periods in the future given each of the possible states. She reasons 
as follows: If $X_n = 1$, for example, then the state will have to be either 1 or 2 at time
n + 1 even though she does not care now about the state at time n + 1. But, if she
computes the joint conditional distribution of $X_{n+1}$ and $X_{n+2}$ given $X_n = 1$, she can
sum over the possible values of $X_{n+1}$ to get the conditional distribution of $X_{n+2}$ given
$X_n = 1$. In symbols,

\[
\Pr(X_{n+2} = 1 \mid X_n = 1) = \Pr(X_{n+1} = 1, X_{n+2} = 1 \mid X_n = 1)
  + \Pr(X_{n+1} = 2, X_{n+2} = 1 \mid X_n = 1).
\]

By the second part of Theorem 3.10.1,

\[
\Pr(X_{n+1} = 1, X_{n+2} = 1 \mid X_n = 1)
  = \Pr(X_{n+1} = 1 \mid X_n = 1)\,\Pr(X_{n+2} = 1 \mid X_{n+1} = 1)
  = 0.9 \times 0.9 = 0.81.
\]

Similarly,

\[
\Pr(X_{n+1} = 2, X_{n+2} = 1 \mid X_n = 1)
  = \Pr(X_{n+1} = 2 \mid X_n = 1)\,\Pr(X_{n+2} = 1 \mid X_{n+1} = 2)
  = 0.1 \times 0.6 = 0.06.
\]

It follows that $\Pr(X_{n+2} = 1 \mid X_n = 1) = 0.81 + 0.06 = 0.87$, and hence
$\Pr(X_{n+2} = 2 \mid X_n = 1) = 1 - 0.87 = 0.13$. By similar reasoning, if $X_n = 2$,

\[
\Pr(X_{n+2} = 1 \mid X_n = 2) = 0.6 \times 0.9 + 0.4 \times 0.6 = 0.78,
\]

and $\Pr(X_{n+2} = 2 \mid X_n = 2) = 1 - 0.78 = 0.22$. ◄
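The manager's reasoning, summing over the unobserved middle state, can be sketched in a few lines. This is an illustration only: the function name `two_step` and the 0-based indexing (index 0 for busy, index 1 for not busy) are assumptions of the sketch.

```python
from fractions import Fraction

# One-step matrix from the example; index 0 = busy (state 1), index 1 = not busy (state 2).
P = [[Fraction(9, 10), Fraction(1, 10)],
     [Fraction(6, 10), Fraction(4, 10)]]

def two_step(P, i, j):
    """Pr(X_{n+2} = j | X_n = i), summing over the unobserved middle state r."""
    return sum(P[i][r] * P[r][j] for r in range(len(P)))

for i in range(2):
    print([float(two_step(P, i, j)) for j in range(2)])
# [0.87, 0.13]
# [0.78, 0.22]
```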


Generalizing the calculations in Example 3.10.7 to three or more transitions might
seem tedious. However, if one examines the calculations carefully, one sees a pattern
that will allow a compact calculation of transition distributions for several steps.
Consider a general Markov chain with k possible states 1, ..., k and the transition
matrix P given by Eq. (3.10.5). Assuming that the chain is in state i at a given time n,
we shall now determine the probability that the chain will be in state j at time n + 2.
In other words, we shall determine the conditional probability of $X_{n+2} = j$ given
$X_n = i$. The notation for this probability is $p_{ij}^{(2)}$.

We argue as the manager did in Example 3.10.7. Let r denote the value of $X_{n+1}$
that is not of primary interest but is helpful to the calculation. Then

\[
\begin{aligned}
p_{ij}^{(2)} &= \Pr(X_{n+2} = j \mid X_n = i) \\
&= \sum_{r=1}^{k} \Pr(X_{n+1} = r \text{ and } X_{n+2} = j \mid X_n = i) \\
&= \sum_{r=1}^{k} \Pr(X_{n+1} = r \mid X_n = i)\,\Pr(X_{n+2} = j \mid X_{n+1} = r, X_n = i) \\
&= \sum_{r=1}^{k} \Pr(X_{n+1} = r \mid X_n = i)\,\Pr(X_{n+2} = j \mid X_{n+1} = r) \\
&= \sum_{r=1}^{k} p_{ir}\, p_{rj},
\end{aligned}
\]

where the third equality follows from Theorem 2.1.3 and the fourth equality follows
from the definition of a Markov chain.

The value of $p_{ij}^{(2)}$ can be determined in the following manner: If the transition
matrix P is squared, that is, if the matrix $P^2 = PP$ is constructed, then the element in
the ith row and the jth column of the matrix $P^2$ will be $\sum_{r=1}^{k} p_{ir} p_{rj}$. Therefore, $p_{ij}^{(2)}$
will be the element in the ith row and the jth column of $P^2$.

By a similar argument, the probability that the chain will move from the state i to
the state j in three steps, or $p_{ij}^{(3)} = \Pr(X_{n+3} = j \mid X_n = i)$, can be found by constructing
the matrix $P^3 = P^2 P$. Then the probability $p_{ij}^{(3)}$ will be the element in the ith row and
the jth column of the matrix $P^3$.
In general, we have the following result. 


Multiple Step Transitions. Let P be the transition matrix of a finite Markov chain with
stationary transition distributions. For each m = 2, 3, ..., the mth power $P^m$ of the
matrix P has in row i and column j the probability $p_{ij}^{(m)}$ that the chain will move from
state i to state j in m steps. ■


Multiple Step Transition Matrix. Under the conditions of Theorem 3.10.2, the matrix
$P^m$ is called the m-step transition matrix of the Markov chain.


In summary, the ith row of the m-step transition matrix gives the conditional distribution
of $X_{n+m}$ given $X_n = i$ for all i = 1, ..., k and all n, m = 1, 2, ....


The Two-Step and Three-Step Transition Matrices for the Number of Occupied Telephone
Lines. Consider again the transition matrix P given by Eq. (3.10.6) for the Markov
chain based on five telephone lines. We shall assume first that i lines are in use at a
certain time, and we shall determine the probability that exactly j lines will be in use
two time periods later.

If we multiply the matrix P by itself, we obtain the following two-step transition
matrix:

              0      1      2      3      4      5
         0   0.14   0.23   0.20   0.15   0.16   0.12
         1   0.13   0.24   0.20   0.15   0.16   0.12
$P^2$ =  2   0.12   0.20   0.21   0.18   0.17   0.12        (3.10.7)
         3   0.11   0.17   0.19   0.20   0.20   0.13
         4   0.11   0.16   0.16   0.18   0.24   0.15
         5   0.11   0.16   0.15   0.17   0.25   0.16


From this matrix we can find any two-step transition probability for the chain, such 
as the following: 


i. If two lines are in use at a certain time, then the probability that four lines will
be in use two time periods later is 0.17. 


ii. If three lines are in use at a certain time, then the probability that three lines 
will again be in use two time periods later is 0.20. 


We shall now assume that i lines are in use at a certain time, and we shall
determine the probability that exactly j lines will be in use three time periods later.

If we construct the matrix $P^3 = P^2 P$, we obtain the following three-step transition
matrix:

              0       1       2       3       4       5
         0   0.123   0.208   0.192   0.166   0.183   0.128
         1   0.124   0.207   0.192   0.166   0.183   0.128
$P^3$ =  2   0.120   0.197   0.192   0.174   0.188   0.129        (3.10.8)
         3   0.117   0.186   0.186   0.179   0.199   0.133
         4   0.116   0.181   0.177   0.176   0.211   0.139
         5   0.116   0.180   0.174   0.174   0.215   0.141


From this matrix we can find any three-step transition probability for the chain, such
as the following:


i. If all five lines are in use at a certain time, then the probability that no lines will
be in use three time periods later is 0.116.


ii. If one line is in use at a certain time, then the probability that exactly one line
will again be in use three time periods later is 0.207. ◄
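The entries quoted from Eqs. (3.10.7) and (3.10.8) can be reproduced by forming powers of P directly. The sketch below is an illustration; `mat_mul` and `mat_pow` are hypothetical helper names, and the matrix is entered in tenths so that the arithmetic with fractions is exact.

```python
from fractions import Fraction

def mat_mul(A, B):
    """Ordinary matrix product of two k x k lists of lists."""
    k = len(A)
    return [[sum(A[i][r] * B[r][j] for r in range(k)) for j in range(k)]
            for i in range(k)]

def mat_pow(P, m):
    """m-step transition matrix P^m for m >= 1, by repeated multiplication."""
    result = P
    for _ in range(m - 1):
        result = mat_mul(result, P)
    return result

# Transition matrix (3.10.6), written in tenths; row and column index = lines in use.
tenths = [[1, 4, 2, 1, 1, 1],
          [2, 3, 2, 1, 1, 1],
          [1, 2, 3, 2, 1, 1],
          [1, 1, 2, 3, 2, 1],
          [1, 1, 1, 2, 3, 2],
          [1, 1, 1, 1, 4, 2]]
P = [[Fraction(x, 10) for x in row] for row in tenths]

P2, P3 = mat_pow(P, 2), mat_pow(P, 3)
print(float(P2[2][4]))  # 0.17: two lines now, four lines two periods later
print(float(P3[5][0]))  # 0.116: five lines now, none three periods later
```

Because every entry of P is a multiple of 0.1, the entries of the two-step and three-step matrices are exact multiples of 0.01 and 0.001, so the printed matrices involve no rounding.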


Plant Breeding Experiment. In Example 3.10.6, the transition matrix has many zeros, 
since many of the transitions will not occur. However, if we are willing to wait two 
steps, we will find that the only transitions that cannot occur in two steps are those 
from the first state to anything else and those from the last state to anything else. 


Here is the two-step transition matrix: 


            {AA, AA}  {AA, Aa}  {AA, aa}  {Aa, Aa}  {Aa, aa}  {aa, aa}
{AA, AA}     1.0000    0.0000    0.0000    0.0000    0.0000    0.0000
{AA, Aa}     0.3906    0.3125    0.0313    0.1875    0.0625    0.0156
{AA, aa}     0.0625    0.2500    0.1250    0.2500    0.2500    0.0625
{Aa, Aa}     0.1406    0.1875    0.0313    0.3125    0.1875    0.1406
{Aa, aa}     0.0156    0.0625    0.0313    0.1875    0.3125    0.3906
{aa, aa}     0.0000    0.0000    0.0000    0.0000    0.0000    1.0000


Indeed, if we look at the three-step or the four-step or the general m-step transition 
matrix, the first and last rows will always be the same. ◄


The first and last states in Example 3.10.9 have the property that, once the chain gets 
into one of those states, it can’t get out. Such states occur in many Markov chains 
and have a special name. 


Absorbing State. In a Markov chain, if $p_{ii} = 1$ for some state i, then that state is called
an absorbing state.


In Example 3.10.9, there is positive probability of getting into each absorbing state 
in two steps no matter where the chain starts. Hence, the probability is 1 that the 
chain will eventually be absorbed into one of the absorbing states if it is allowed to 
run long enough. 
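The claim about eventual absorption can be illustrated numerically for the plant breeding chain. The following sketch is an assumption-laden illustration rather than a proof: `absorbing_states` and `absorbed_by` are hypothetical helper names, states are indexed 0 to 5 in the order used in the transition matrix, and the step counts 2, 10, and 50 are arbitrary.

```python
def mat_mul(A, B):
    """Ordinary matrix product of two k x k lists of lists."""
    k = len(A)
    return [[sum(A[i][r] * B[r][j] for r in range(k)) for j in range(k)]
            for i in range(k)]

# Plant breeding chain; states ordered {AA,AA}, {AA,Aa}, {AA,aa}, {Aa,Aa}, {Aa,aa}, {aa,aa}.
P = [[1, 0, 0, 0, 0, 0],
     [0.25, 0.5, 0, 0.25, 0, 0],
     [0, 0, 0, 1, 0, 0],
     [0.0625, 0.25, 0.125, 0.25, 0.25, 0.0625],
     [0, 0, 0, 0.25, 0.5, 0.25],
     [0, 0, 0, 0, 0, 1]]

def absorbing_states(P):
    """Indices i with p_ii = 1."""
    return [i for i, row in enumerate(P) if row[i] == 1]

def absorbed_by(P, m):
    """For each starting state, the probability of being in an absorbing state after m steps."""
    absorbing = absorbing_states(P)
    Pm = P
    for _ in range(m - 1):
        Pm = mat_mul(Pm, P)
    return [sum(row[j] for j in absorbing) for row in Pm]

print(absorbing_states(P))  # [0, 5]
for m in (2, 10, 50):
    print(m, [round(x, 4) for x in absorbed_by(P, m)])
```

After two steps every starting state already has positive probability of having been absorbed, and the probabilities increase toward 1 as m grows.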


The Initial Distribution 


Single Server Queue. The manager in Example 3.10.7 enters the store thinking that the 
probability is 0.3 that the server will be busy the first time that she checks. Hence, the 
probability is 0.7 that the server will be not busy. These values specify the marginal 
distribution of the state at time 1, $X_1$. We can represent this distribution by the vector
v = (0.3, 0.7) that gives the probabilities of the two states at time 1 in the same order
that they appear in the transition matrix. ◄


The vector giving the marginal distribution of $X_1$ in Example 3.10.10 has a special
name. 


Probability Vector/Initial Distribution. A vector consisting of nonnegative numbers 
that add to 1 is called a probability vector. A probability vector whose coordinates 
specify the probabilities that a Markov chain will be in each of its states at time 1 is 
called the initial distribution of the chain or the initial probability vector.


For Example 3.10.2, the initial distribution was given in Exercise 4 in Sec. 2.1 as 
v = (0.5, 0.5). 

The initial distribution and the transition matrix together determine the entire 
joint distribution of the Markov chain. Indeed, Theorem 3.10.1 shows how to con- 
struct the joint distribution of the chain from the initial probability vector and the 
transition matrix. Letting $v = (v_1, \ldots, v_k)$ denote the initial distribution, Eq. (3.10.1)
can be rewritten as

\[
\Pr(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = v_{x_1}\, p_{x_1 x_2} \cdots p_{x_{n-1} x_n}. \tag{3.10.9}
\]

The marginal distributions of states at times later than 1 can be found from the
joint distribution.
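As a hedged illustration of Eq. (3.10.9), the sketch below multiplies the initial probability by the one-step probabilities along a path, and then finds a marginal distribution by brute-force summation of the joint p.f. over all earlier states. The function names are hypothetical, states are 0-based indices (0 = busy, 1 = not busy), and the numbers are those of the single-server queue with v = (0.3, 0.7).

```python
from fractions import Fraction
from itertools import product

def path_probability(v, P, path):
    """Eq. (3.10.9): Pr(X_1 = path[0], ..., X_n = path[-1]), states as 0-based indices."""
    prob = v[path[0]]
    for current, nxt in zip(path, path[1:]):
        prob *= P[current][nxt]
    return prob

def marginal_by_summation(v, P, n):
    """Marginal p.f. of X_n, summing the joint p.f. over x_1, ..., x_{n-1}."""
    k = len(v)
    marginal = [Fraction(0)] * k
    for path in product(range(k), repeat=n):
        marginal[path[-1]] += path_probability(v, P, path)
    return marginal

# Single-server queue: index 0 = busy, index 1 = not busy.
v = [Fraction(3, 10), Fraction(7, 10)]
P = [[Fraction(9, 10), Fraction(1, 10)],
     [Fraction(6, 10), Fraction(4, 10)]]

print(path_probability(v, P, (0, 0, 1)))  # busy, busy, not busy: 27/1000
print([str(x) for x in marginal_by_summation(v, P, 2)])
```

The summation visits all k^n paths, so it is only practical for very small n; it is meant to make the definition concrete, not to be efficient.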


Marginal Distributions at Times Other Than 1. Consider a finite Markov chain with
stationary transition distributions having initial distribution v and transition matrix
P. The marginal distribution of $X_n$, the state at time n, is given by the probability
vector $vP^{n-1}$.


Proof The marginal distribution of $X_n$ can be found from Eq. (3.10.9) by summing
over the possible values of $x_1, \ldots, x_{n-1}$. That is,

\[
\Pr(X_n = x_n) = \sum_{x_{n-1}=1}^{k} \cdots \sum_{x_2=1}^{k} \sum_{x_1=1}^{k}
  v_{x_1}\, p_{x_1 x_2}\, p_{x_2 x_3} \cdots p_{x_{n-1} x_n}. \tag{3.10.10}
\]

The innermost sum in Eq. (3.10.10) for $x_1 = 1, \ldots, k$ involves only the first two factors
$v_{x_1} p_{x_1 x_2}$ and produces the $x_2$ coordinate of vP. Similarly, the next innermost sum
over $x_2 = 1, \ldots, k$ involves only the $x_2$ coordinate of vP and $p_{x_2 x_3}$ and produces the
$x_3$ coordinate of $vPP = vP^2$. Proceeding in this way through all n − 1 summations
produces the $x_n$ coordinate of $vP^{n-1}$. ■


Probabilities for the Number of Occupied Telephone Lines. Consider again the office 
with five telephone lines and the Markov chain for which the transition matrix P is 
given by Eq. (3.10.6). Suppose that at the beginning of the observation process at 
time n = 1, the probability that no lines will be in use is 0.5, the probability that one 
line will be in use is 0.3, and the probability that two lines will be in use is 0.2. Then 
the initial probability vector is v = (0.5, 0.3, 0.2, 0, 0, 0). We shall first determine the 
distribution of the number of lines in use at time 2, one period later. 
By an elementary computation it will be found that 


vP = (0.13, 0.33, 0.22, 0.12, 0.10, 0.10). 


Since the first component of this probability vector is 0.13, the probability that no 
lines will be in use at time 2 is 0.13; since the second component is 0.33, the probability 
that exactly one line will be in use at time 2 is 0.33; and so on. 

Next, we shall determine the distribution of the number of lines that will be in 
use at time 3. 

By use of Eq. (3.10.7), it can be found that 


$vP^2 = (0.133, 0.227, 0.202, 0.156, 0.162, 0.120)$.


Since the first component of this probability vector is 0.133, the probability that 
no lines will be in use at time 3 is 0.133; since the second component is 0.227, the 
probability that exactly one line will be in use at time 3 is 0.227; and so on. ◄
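The two vectors in this example can be reproduced with repeated vector-matrix products. This is a minimal sketch with hypothetical helper names (`step`, `marginal`); the matrix is Eq. (3.10.6) entered in tenths so that the results are exact.

```python
from fractions import Fraction

def step(v, P):
    """One transition: the row vector v times the matrix P."""
    k = len(v)
    return [sum(v[i] * P[i][j] for i in range(k)) for j in range(k)]

def marginal(v, P, n):
    """Distribution of X_n, that is v P^(n-1), using n - 1 vector-matrix products."""
    for _ in range(n - 1):
        v = step(v, P)
    return v

tenths = [[1, 4, 2, 1, 1, 1],
          [2, 3, 2, 1, 1, 1],
          [1, 2, 3, 2, 1, 1],
          [1, 1, 2, 3, 2, 1],
          [1, 1, 1, 2, 3, 2],
          [1, 1, 1, 1, 4, 2]]
P = [[Fraction(x, 10) for x in row] for row in tenths]
v = [Fraction(5, 10), Fraction(3, 10), Fraction(2, 10), Fraction(0), Fraction(0), Fraction(0)]

print([float(x) for x in marginal(v, P, 2)])  # [0.13, 0.33, 0.22, 0.12, 0.1, 0.1]
print([float(x) for x in marginal(v, P, 3)])  # [0.133, 0.227, 0.202, 0.156, 0.162, 0.12]
```

Multiplying the vector through one step at a time avoids forming the matrix power explicitly, which is cheaper when only one initial distribution is of interest.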


Stationary Distributions 
A Special Initial Distribution for Telephone Lines. Suppose that the initial distribution 
for the number of occupied telephone lines is 

v = (0.119, 0.193, 0.186, 0.173, 0.196, 0.133). 


It can be shown, by matrix multiplication, that vP = v. This means that if v is the
initial distribution, then it is also the distribution after one transition. Hence, it will
also be the distribution after two or more transitions. ◄


Stationary Distribution. Let P be the transition matrix for a Markov chain. A probability
vector v that satisfies vP = v is called a stationary distribution for the Markov
chain.


The initial distribution in Example 3.10.12 is a stationary distribution for the tele- 
phone lines Markov chain. If the chain starts in this distribution, the distribution stays 
the same at all times. Every finite Markov chain with stationary transition distribu- 
tions has at least one stationary distribution. Some chains have a unique stationary 
distribution. 


Note: A Stationary Distribution Does Not Mean That the Chain is Not Moving. It 
is important to note that $vP^n$ gives the probabilities that the chain is in each of
its states after n transitions, calculated before the initial state of the chain or any 
transitions are observed. These are different from the probabilities of being in the 
various states after observing the initial state or after observing any of the intervening 
transitions. In addition, a stationary distribution does not imply that the Markov 
chain is staying put. If a Markov chain starts in a stationary distribution, then for 
each state i, the probability that the chain is in state i after n transitions is the same
as the probability that it is in state i at the start. But the Markov chain can still move
around from one state to the next at each transition. The one case in which a Markov 
chain does stay put is after it moves into an absorbing state. A distribution that is 
concentrated solely on absorbing states will necessarily be stationary because the 
Markov chain will never move if it starts in such a distribution. In such cases, all of 
the uncertainty surrounds the initial state, which will also be the state after every 
transition. 


Stationary Distributions for the Plant Breeding Experiment. Consider again the experi- 
ment described in Example 3.10.6. The first and sixth states, {AA, AA} and {aa, aa}, 
respectively, are absorbing states. It is easy to see that every initial distribution of the 
form v = (p, 0, 0, 0, 0, 1 − p) for 0 ≤ p ≤ 1 has the property that vP = v. Suppose
that the chain is in state 1 with probability p and in state 6 with probability 1 − p
at time 1. Because these two states are absorbing states, the chain will never move
and the event $X_1 = 1$ is the same as the event that $X_n = 1$ for all n. Similarly, $X_1 = 6$
is the same as $X_n = 6$. So, thinking ahead to where the chain is likely to be after n
transitions, we would also say that it will be in state 1 with probability p and in state
6 with probability 1 − p. ◄


Method for Finding Stationary Distributions. We can rewrite the equation vP = v
that defines stationary distributions as v[P − I] = 0, where I is a k × k identity matrix
and 0 is a k-dimensional vector of all zeros. Unfortunately, this system of equations 
has lots of solutions even if there is a unique stationary distribution. The reason is 
that whenever v solves the system, so does cv for all real c (including c = 0). Even 
though the system has k equations for k variables, there is at least one redundant 
equation. However, there is also one missing equation. We need to require that the 
solution vector v has coordinates that sum to 1. We can fix both of these problems by 
replacing one of the equations in the original system by the equation that says that 
the coordinates of v sum to 1. 

To be specific, define the matrix G to be P − I with its last column replaced by
a column of all ones. Then, solve the equation

\[
vG = (0, \ldots, 0, 1). \tag{3.10.11}
\]

If there is a unique stationary distribution, we will find it by solving (3.10.11). In this
case, the matrix G will have an inverse $G^{-1}$ that satisfies

\[
GG^{-1} = G^{-1}G = I.
\]

The solution of (3.10.11) will then be

\[
v = (0, \ldots, 0, 1)\,G^{-1},
\]

which is easily seen to be the bottom row of the matrix $G^{-1}$. This was the method
used to find the stationary distribution in Example 3.10.12. If the Markov chain
has multiple stationary distributions, then the matrix G will be singular, and this
method will not find any of the stationary distributions. That is what would happen
in Example 3.10.13 if one were to apply the method.
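The method can be sketched in code as follows. This illustration solves the linear system vG = (0, ..., 0, 1) by Gauss-Jordan elimination instead of forming the inverse of G explicitly; the two approaches give the same vector whenever G is invertible. The function name `stationary_distribution` is hypothetical, and the two-state matrix with entries 1/3 and 2/3 is used only as a small test case.

```python
from fractions import Fraction

def stationary_distribution(P):
    """Solve v G = (0, ..., 0, 1), where G is P - I with its last column set to ones."""
    k = len(P)
    G = [[Fraction(P[i][j]) - (1 if i == j else 0) for j in range(k)] for i in range(k)]
    for i in range(k):
        G[i][k - 1] = Fraction(1)
    # v G = e_k is the same as G^T v^T = e_k^T; build the augmented matrix [G^T | e_k].
    A = [[G[i][j] for i in range(k)] + [Fraction(1 if j == k - 1 else 0)]
         for j in range(k)]
    for col in range(k):
        pivot = next((r for r in range(col, k) if A[r][col] != 0), None)
        if pivot is None:
            raise ValueError("G is singular: no unique stationary distribution")
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]       # scale the pivot row
        for r in range(k):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[r][k] for r in range(k)]

P = [[Fraction(1, 3), Fraction(2, 3)],
     [Fraction(2, 3), Fraction(1, 3)]]
print([str(x) for x in stationary_distribution(P)])  # ['1/2', '1/2']
```

For a chain with more than one stationary distribution, such as one with two absorbing states, the elimination runs out of nonzero pivots and the sketch raises `ValueError`, which mirrors the singular case described above.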


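Stationary Distribution for Toothpaste Shopping. Consider the transition matrix P
given in Example 3.10.4. We can construct the matrix G as follows:

\[
P - I = \begin{bmatrix} -2/3 & 2/3 \\ 2/3 & -2/3 \end{bmatrix};
\quad \text{hence} \quad
G = \begin{bmatrix} -2/3 & 1 \\ 2/3 & 1 \end{bmatrix}.
\]

The inverse of G is

\[
G^{-1} = \begin{bmatrix} -3/4 & 3/4 \\ 1/2 & 1/2 \end{bmatrix}.
\]

We now see that the stationary distribution is the bottom row of $G^{-1}$, v = (1/2, 1/2). ◄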


There is a special case in which it is known that a unique stationary distribution 
exists and it has special properties. 


If there exists m such that every element of $P^m$ is strictly positive, then

• the Markov chain has a unique stationary distribution v,
• $\lim_{n \to \infty} P^n$ is a matrix with all rows equal to v, and
• no matter with what distribution the Markov chain starts, its distribution after
  n steps converges to v as n → ∞. ■


We shall not prove Theorem 3.10.4, although some evidence for the second 
claim can be seen in Eq. (3.10.8), where the six rows of $P^3$ are much more alike
than the rows of P and they are very similar to the stationary distribution given in 
Example 3.10.12. The third claim in Theorem 3.10.4 actually follows easily from the 
second claim. In Sec. 12.5, we shall introduce a method that makes use of the third 
claim in Theorem 3.10.4 in order to approximate distributions of random variables 
when those distributions are difficult to calculate exactly. 
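The third claim can be illustrated numerically with the single-server queue of Example 3.10.7, whose transition matrix has all elements strictly positive. The sketch below is an illustration, not a proof: the helper names, the choice of 30 steps, and the two starting distributions are assumptions, and the limit (6/7, 1/7) is obtained here by solving vP = v by hand for this particular matrix.

```python
def step(v, P):
    """One transition: the row vector v times the matrix P."""
    k = len(v)
    return [sum(v[i] * P[i][j] for i in range(k)) for j in range(k)]

def distribution_after(v, P, n):
    """Distribution of the chain after n transitions, starting from distribution v."""
    for _ in range(n):
        v = step(v, P)
    return v

# Single-server queue: index 0 = busy, index 1 = not busy.
P = [[0.9, 0.1],
     [0.6, 0.4]]

for start in ([1.0, 0.0], [0.0, 1.0]):
    print(start, distribution_after(start, P, 30))
# Both starting points end up near (6/7, 1/7), roughly (0.857, 0.143).
```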

The transition matrices in Examples 3.10.2, 3.10.5, and 3.10.7 satisfy the condi- 
tions of Theorem 3.10.4. The following example has a unique stationary distribution 
but does not satisfy the conditions of Theorem 3.10.4. 


Alternating Chain. Let the transition matrix for a two-state Markov chain be

\[
P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.
\]

The matrix G is easy to construct and invert, and we find that the unique stationary
distribution is v = (0.5, 0.5). However, as m increases, $P^m$ alternates between P and
the 2 × 2 identity matrix. It does not converge, and it never has all elements
strictly positive. If the initial distribution is $(v_1, v_2)$, the distribution after n steps
alternates between $(v_1, v_2)$ and $(v_2, v_1)$. ◄
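The alternation is easy to see by computing a few powers. The following sketch is illustrative only; `mat_mul` and `step` are hypothetical helpers and the starting distribution (0.2, 0.8) is an arbitrary choice.

```python
def mat_mul(A, B):
    """Ordinary matrix product of two k x k lists of lists."""
    k = len(A)
    return [[sum(A[i][r] * B[r][j] for r in range(k)) for j in range(k)]
            for i in range(k)]

def step(v, P):
    """One transition: the row vector v times the matrix P."""
    k = len(v)
    return [sum(v[i] * P[i][j] for i in range(k)) for j in range(k)]

P = [[0, 1],
     [1, 0]]

power = P
for m in range(1, 5):
    print(m, power)          # P for odd m, the identity matrix for even m
    power = mat_mul(power, P)

v = [0.2, 0.8]               # arbitrary initial distribution for illustration
for n in range(1, 4):
    v = step(v, P)
    print(n, v)              # swaps between [0.8, 0.2] and [0.2, 0.8]
```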


Another example that fails to satisfy the conditions of Theorem 3.10.4 is the 
gambler’s ruin problem from Sec. 2.4. 


Gambler’s Ruin. In Sec. 2.4, we described the gambler’s ruin problem, in which a
gambler wins one dollar with probability p and loses one dollar with probability 1 − p
on each play of a game. The sequence of amounts held by the gambler through the
course of those plays forms a Markov chain with two absorbing states, namely, 0 and
k. There are k − 1 other states, namely, 1, ..., k − 1. (This notation violates our use of
k to stand for the number of states, which is k + 1 in this example. We felt this was less
confusing than switching from the original notation of Sec. 2.4.) The transition matrix
has first and last row being (1, 0, ..., 0) and (0, ..., 0, 1), respectively. The ith row (for
i = 1, ..., k − 1) has 0 everywhere except in coordinate i − 1 where it has 1 − p and
in coordinate i + 1 where it has p. Unlike Example 3.10.15, this time the sequence
of matrices $P^n$ converges but there is no unique stationary distribution. The limit
of $P^n$ has as its last column the numbers $a_0, \ldots, a_k$, where $a_i$ is the probability that
the fortune of a gambler who starts with i dollars reaches k dollars before it reaches
0 dollars. The first column of the limit has the numbers $1 - a_0, \ldots, 1 - a_k$ and the
rest of the limit matrix is all zeros. The stationary distributions have the same form
as those in Example 3.10.13, namely, all probability is in the absorbing states. ◄
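The structure of the limit matrix can be checked numerically in a small case. The sketch below is an illustration under stated assumptions: it takes k = 4 and a fair game (p = 1/2), for which the standard result is that the probability of reaching k before 0 from fortune i equals i/k; the helper names and the number of squarings are arbitrary choices, and rows and columns are indexed by the fortunes 0, 1, ..., k.

```python
def gamblers_ruin_matrix(k, p):
    """(k+1) x (k+1) transition matrix on fortunes 0, ..., k; 0 and k are absorbing."""
    P = [[0.0] * (k + 1) for _ in range(k + 1)]
    P[0][0] = 1.0
    P[k][k] = 1.0
    for i in range(1, k):
        P[i][i - 1] = 1 - p   # lose one dollar
        P[i][i + 1] = p       # win one dollar
    return P

def mat_mul(A, B):
    """Ordinary matrix product of two square lists of lists."""
    n = len(A)
    return [[sum(A[i][r] * B[r][j] for r in range(n)) for j in range(n)]
            for i in range(n)]

def approximate_limit(P, squarings=12):
    """Approximate the limit of P^n by squaring P repeatedly, giving P^(2^squarings)."""
    for _ in range(squarings):
        P = mat_mul(P, P)
    return P

k = 4
limit = approximate_limit(gamblers_ruin_matrix(k, 0.5))
for i, row in enumerate(limit):
    print(i, round(row[0], 6), round(row[k], 6))  # first and last columns: 1 - i/k and i/k
```

All columns other than the first and last are numerically zero in the result, which matches the description of the limit matrix above.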


Summary 


A Markov chain is a stochastic process, a sequence of random variables giving the 
states of the process, in which the conditional distribution of the state at the next 
time given all of the past states depends on the past states only through the most 
recent state. For Markov chains with finitely many states and stationary transition 
distributions, the transitions over time can be described by a matrix giving the prob- 
abilities of transition from the state indexing the row to the state indexing the column 
(the transition matrix P). The initial probability vector v gives the distribution of the 
state at time 1. The transition matrix and initial probability vector together allow 
calculation of all probabilities associated with the Markov chain. In particular, $P^n$
gives the probabilities of transitions over n time periods, and $vP^n$ gives the distribution
of the state at time n + 1. A stationary distribution is a probability vector v
such that vP = v. Every finite Markov chain with stationary transition distributions
has at least one stationary distribution. For many Markov chains, there is a unique
stationary distribution and the distribution of the chain after n transitions converges
to the stationary distribution as n goes to ∞.


Exercises


1. Consider the Markov chain in Example 3.10.2 with initial
probability vector v = (1/2, 1/2).


a. Find the probability vector specifying the probabilities
of the states at time n = 2.


b. Find the two-step transition matrix.


2. Suppose that the weather can be only sunny or cloudy 
and the weather conditions on successive mornings form 
a Markov chain with stationary transition probabilities. 
Suppose also that the transition matrix is as follows: 


Sunny Cloudy 
Sunny 0.7 0.3 
Cloudy 0.6 0.4 


a. If it is cloudy on a given day, what is the probability 
that it will also be cloudy the next day? 


b. If it is sunny on a given day, what is the probability 
that it will be sunny on the next two days? 


c. If it is cloudy on a given day, what is the probability 
that it will be sunny on at least one of the next three 
days? 


3. Consider again the Markov chain described in Exer- 
cise 2. 


a. If it is sunny on a certain Wednesday, what is the 
probability that it will be sunny on the following 
Saturday? 


b. If it is cloudy on a certain Wednesday, what is the 
probability that it will be sunny on the following 
Saturday? 


4. Consider again the conditions of Exercises 2 and 3. 


a. If it is sunny on a certain Wednesday, what is the 
probability that it will be sunny on both the following 
Saturday and Sunday? 


b. If it is cloudy on a certain Wednesday, what is the 
probability that it will be sunny on both the following 
Saturday and Sunday? 


5. Consider again the Markov chain described in Exer- 
cise 2. Suppose that the probability that it will be sunny 
on a certain Wednesday is 0.2 and the probability that it 
will be cloudy is 0.8. 


a. Determine the probability that it will be cloudy on 
the next day, Thursday. 


b. Determine the probability that it will be cloudy on 
Friday. 

c. Determine the probability that it will be cloudy on 
Saturday. 


6. Suppose that a student will be either on time or late for 
a particular class and that the events that he is on time or 
late for the class on successive days form a Markov chain 
with stationary transition probabilities. Suppose also that 
if he is late on a given day, then the probability that he will 
be on time the next day is 0.8. Furthermore, if he is on time 
on a given day, then the probability that he will be late the 
next day is 0.5.


a. If the student is late on a certain day, what is the
probability that he will be on time on each of the next 
three days? 


b. If the student is on time on a given day, what is the 
probability that he will be late on each of the next 
three days? 


7. Consider again the Markov chain described in Exer- 
cise 6. 


a. If the student is late on the first day of class, what is 
the probability that he will be on time on the fourth 
day of class? 


b. If the student is on time on the first day of class, what
is the probability that he will be on time on the fourth 
day of class? 


8. Consider again the conditions of Exercises 6 and 7. 
Suppose that the probability that the student will be late 
on the first day of class is 0.7 and that the probability that 
he will be on time is 0.3. 


a. Determine the probability that he will be late on the 
second day of class. 


b. Determine the probability that he will be on time on 
the fourth day of class. 


9. Suppose that a Markov chain has four states 1, 2, 3, 4 
and stationary transition probabilities as specified by the 
following transition matrix: 


         1     2     3     4
    1   1/4   1/4    0    1/2
    2    0     1     0     0
    3   1/2    0    1/2    0
    4   1/4   1/4   1/4   1/4


a. If the chain is in state 3 at a given time n, what is the 
probability that it will be in state 2 at time n + 2? 


b. If the chain is in state 1 at a given time n, what is the
probability that it will be in state 3 at time n + 3? 


10. Let $X_1$ denote the initial state at time 1 of the Markov
chain for which the transition matrix is as specified in
Exercise 9, and suppose that the initial probabilities are
as follows:

$\Pr(X_1 = 1) = 1/8$, $\Pr(X_1 = 2) = 1/4$,
$\Pr(X_1 = 3) = 3/8$, $\Pr(X_1 = 4) = 1/4$.

Determine the probabilities that the chain will be in
states 1, 2, 3, and 4 at time n for each of the following values
of n: (a) n = 2; (b) n = 3; (c) n = 4.


11. Each time that a shopper purchases a tube of tooth- 
paste, she chooses either brand A or brand B. Suppose that 
the probability is 1/3 that she will choose the same brand 
chosen on her previous purchase, and the probability is 
2/3 that she will switch brands. 


a. If her first purchase is brand A, what is the probability 
that her fifth purchase will be brand B? 


b. If her first purchase is brand B, what is the probability 
that her fifth purchase will be brand B? 


12. Suppose that three boys A, B, and C are throwing a 
ball from one to another. Whenever A has the ball, he 
throws it to B with a probability of 0.2 and to C with a 
probability of 0.8. Whenever B has the ball, he throws it 
to A with a probability of 0.6 and to C with a probability of 
0.4. Whenever C has the ball, he is equally likely to throw 
it to either A or B. 


a. Consider this process to be a Markov chain and con- 
struct the transition matrix. 


b. If each of the three boys is equally likely to have the 
ball at a certain time n, which boy is most likely to 
have the ball at time n + 2? 


13. Suppose that a coin is tossed repeatedly in such a way 
that heads and tails are equally likely to appear on any 
given toss and that all tosses are independent, with the 
following exception: Whenever either three heads or three 
tails have been obtained on three successive tosses, then 
the outcome of the next toss is always of the opposite type. 
At time n (n ≥ 3), let the state of this process be specified 
by the outcomes on tosses n − 2, n − 1, and n. Show that 
this process is a Markov chain with stationary transition 
probabilities and construct the transition matrix. 


14. There are two boxes A and B, each containing red and 
green balls. Suppose that box A contains one red ball and 
two green balls and box B contains eight red balls and two 
green balls. Consider the following process: One ball is 
selected at random from box A, and one ball is selected 
at random from box B. The ball selected from box A is 
then placed in box B and the ball selected from box B is 
placed in box A. These operations are then repeated indefinitely. 
Show that the numbers of red balls in box A form a 
Markov chain with stationary transition probabilities, and 
construct the transition matrix of the Markov chain. 


15. Verify the rows of the transition matrix in Example 3.10.6 
that correspond to current states {AA, Aa} and 
{Aa, aa}. 


16. Let the initial probability vector in Example 3.10.6 be 
v = (1/16, 1/4, 1/8, 1/4, 1/4, 1/16). Find the probabilities 
of the six states after one generation. 


17. Return to Example 3.10.6. Assume that the state at 
time n − 1 is {Aa, aa}. 


a. Suppose that we learn that X_{n+1} is {AA, aa}. Find the 
conditional distribution of X_n. (That is, find all the 
probabilities for the possible states at time n given 
that the state at time n + 1 is {AA, aa}.) 


b. Suppose that we learn that X_{n+1} is {aa, aa}. Find the 
conditional distribution of X_n. 


18. Return to Example 3.10.13. Prove that the stationary 
distributions described there are the only stationary 
distributions for that Markov chain. 


19. Find the unique stationary distribution for the Markov 
chain in Exercise 2. 


20. The unique stationary distribution in Exercise 9 is v = 
(0, 1, 0, 0). This is an instance of the following general result: 
Suppose that a Markov chain has exactly one absorbing 
state. Suppose further that, for each non-absorbing 
state k, there is n such that the probability is positive of 
moving from state k to the absorbing state in n steps. Then 
the unique stationary distribution has probability 1 in the 
absorbing state. Prove this result. 


3.11 Supplementary Exercises 


1. Suppose that X and Y are independent random variables, 
that X has the uniform distribution on the integers 
1, 2, 3, 4, 5 (discrete), and that Y has the uniform distribution 
on the interval [0, 5] (continuous). Let Z be a random 
variable such that Z = X with probability 1/2 and Z = Y 
with probability 1/2. Sketch the c.d.f. of Z. 


2. Suppose that X and Y are independent random variables. 
Suppose that X has a discrete distribution concentrated 
on finitely many distinct values with p.f. f_1. Suppose 
that Y has a continuous distribution with p.d.f. f_2. Let 
Z = X + Y. Show that Z has a continuous distribution and 
find its p.d.f. Hint: First find the conditional p.d.f. of Z given 
X = x. 


3. Suppose that the random variable X has the following 
c.d.f: 


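\[
F(x) = \begin{cases}
0 & \text{for } x \le 0, \\
\frac{2}{5} x & \text{for } 0 < x \le 1, \\
\frac{3}{5} x - \frac{1}{5} & \text{for } 1 < x \le 2, \\
1 & \text{for } x > 2.
\end{cases}
\]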


Verify that X has a continuous distribution, and determine 
the p.d.f. of X. 


4. Suppose that the random variable X has a continuous 
distribution with the following p.d.f: 


\[
f(x) = \frac{1}{2} e^{-|x|} \quad \text{for } -\infty < x < \infty.
\]


Determine the value x_0 such that F(x_0) = 0.9, where F(x) 
is the c.d.f. of X. 


5. Suppose that X_1 and X_2 are i.i.d. random variables, 
and that each has the uniform distribution on the interval 
[0, 1]. Evaluate Pr(X_1^2 + X_2^2 ≤ 1). 


6. For each value of p > 1, let 

\[
c(p) = \sum_{x=1}^{\infty} \frac{1}{x^p}.
\]

Suppose that the random variable X has a discrete distribution 
with the following p.f.: 

\[
f(x) = \frac{1}{c(p)\, x^p} \quad \text{for } x = 1, 2, \ldots.
\]

a. For each fixed positive integer n, determine the probability 
that X will be divisible by n. 


b. Determine the probability that X will be odd. 


7. Suppose that X_1 and X_2 are i.i.d. random variables, 
each of which has the p.f. f(x) specified in Exercise 6. 
Determine the probability that X_1 + X_2 will be even. 


8. Suppose that an electronic system comprises four components, 
and let X_j denote the time until component j fails 
to operate (j = 1, 2, 3, 4). Suppose that X_1, X_2, X_3, and X_4 
are i.i.d. random variables, each of which has a continuous 
distribution with c.d.f. F(x). Suppose that the system will 
operate as long as both component 1 and at least one of 
the other three components operate. Determine the c.d.f. 
of the time until the system fails to operate. 


9. Suppose that a box contains a large number of tacks 
and that the probability X that a particular tack will land 
with its point up when it is tossed varies from tack to tack 


in accordance with the following p.d.f.: 

\[
f(x) = \begin{cases}
2(1 - x) & \text{for } 0 < x < 1, \\
0 & \text{otherwise.}
\end{cases}
\]


Suppose that a tack is selected at random from the box 
and that this tack is then tossed three times independently. 
Determine the probability that the tack will land with its 
point up on all three tosses. 


10. Suppose that the radius X of a circle is a random 
variable having the following p.d.f.: 
\[
f(x) = \begin{cases}
\frac{1}{8}(3x + 1) & \text{for } 0 < x < 2, \\
0 & \text{otherwise.}
\end{cases}
\]


Determine the p.d.f. of the area of the circle. 
11. Suppose that the random variable X has the following 
p.d.f.: 

\[
f(x) = \begin{cases}
2e^{-2x} & \text{for } x > 0, \\
0 & \text{otherwise.}
\end{cases}
\]


Construct a random variable Y = r(X) that has the uni- 
form distribution on the interval [0, 5]. 


12. Suppose that the 12 random variables X_1, ..., X_12 are 
i.i.d. and each has the uniform distribution on the interval 
[0, 20]. For j = 0, 1, ..., 19, let I_j denote the interval (j, 
j + 1). Determine the probability that none of the 20 disjoint 
intervals I_j will contain more than one of the random 
variables X_1, ..., X_12. 


13. Suppose that the joint distribution of X and Y is uni- 
form over a set A in the xy-plane. For which of the follow- 
ing sets A are X and Y independent? 
a. A circle with a radius of 1 and with its center at the 
origin 
b. A circle with a radius of 1 and with its center at the 
point (3, 5) 
c. A square with vertices at the four points (1, 1), 
(1, —1), (-1, —1), and (—1, 1) 
d. A rectangle with vertices at the four points (0, 0), 
(0, 3), (1, 3), and (1, 0) 


e. A square with vertices at the four points (0, 0), (1, 1), 
(0, 2), and (—1, 1) 


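14. Suppose that X and Y are independent random variables 
with the following p.d.f.'s: 

\[
f_1(x) = \begin{cases}
1 & \text{for } 0 < x < 1, \\
0 & \text{otherwise,}
\end{cases}
\qquad
f_2(y) = \begin{cases}
8y & \text{for } 0 < y < \frac{1}{2}, \\
0 & \text{otherwise.}
\end{cases}
\]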


Determine the value of Pr(X > Y). 


15. Suppose that, on a particular day, two persons A and 
B arrive at a certain store independently of each other. 
Suppose that A remains in the store for 15 minutes and B 
remains in the store for 10 minutes. If the time of arrival 
of each person has the uniform distribution over the hour 
between 9:00 a.m. and 10:00 a.m., what is the probability 
that A and B will be in the store at the same time? 


16. Suppose that X and Y have the following joint p.d.f.: 

\[
f(x, y) = \begin{cases}
2(x + y) & \text{for } 0 < x < y < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Determine (a) Pr(X < 1/2); (b) the marginal p.d.f. of X; 
and (c) the conditional p.d.f. of Y given that X = x. 
17. Suppose that X and Y are random variables. The marginal 
p.d.f. of X is 

\[
f_1(x) = \begin{cases}
3x^2 & \text{for } 0 < x < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Also, the conditional p.d.f. of Y given that X = x is 

\[
g_2(y|x) = \begin{cases}
\dfrac{3y^2}{x^3} & \text{for } 0 < y < x, \\
0 & \text{otherwise.}
\end{cases}
\]


Determine (a) the marginal p.d.f. of Y and (b) the condi- 
tional p.d.f. of X given that Y = y. 


18. Suppose that the joint distribution of X and Y is uni- 
form over the region in the xy-plane bounded by the four 
lines x = −1, x = 1, y = x + 1, and y = x − 1. Determine 
(a) Pr(XY > 0) and (b) the conditional p.d.f. of Y given 
that X = x. 


19. Suppose that the random variables X, Y, and Z have 
the following joint p.d.f.: 

\[
f(x, y, z) = \begin{cases}
6 & \text{for } 0 < x < y < z < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Determine the univariate marginal p.d.f.'s of X, Y, and Z. 


20. Suppose that the random variables X, Y, and Z have 
the following joint p.d.f.: 

\[
f(x, y, z) = \begin{cases}
2 & \text{for } 0 < x < y < 1 \text{ and } 0 < z < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Evaluate Pr(3X > Y | 1 < 4Z < 2). 


21. Suppose that X and Y are i.i.d. random variables, and 
that each has the following p.d.f.: 

\[
f(x) = \begin{cases}
e^{-x} & \text{for } x > 0, \\
0 & \text{otherwise.}
\end{cases}
\]

Also, let U = X/(X + Y) and V = X + Y. 
a. Determine the joint p.d.f. of U and V. 
b. Are U and V independent? 
22. Suppose that the random variables X and Y have the 
following joint p.d.f.: 

\[
f(x, y) = \begin{cases}
8xy & \text{for } 0 < x < y < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Also, let U = X/Y and V = Y. 
a. Determine the joint p.d.f. of U and V. 
b. Are X and Y independent? 
c. Are U and V independent? 
23. Suppose that X_1, ..., X_n are i.i.d. random variables, 
each having the following c.d.f.: 

\[
F(x) = \begin{cases}
0 & \text{for } x \le 0, \\
1 - e^{-x} & \text{for } x > 0.
\end{cases}
\]

Let Y_1 = min{X_1, ..., X_n} and Y_n = max{X_1, ..., X_n}. Determine 
the conditional p.d.f. of Y_1 given that Y_n = y_n. 


24. Suppose that X_1, X_2, and X_3 form a random sample of 
three observations from a distribution having the following 
p.d.f.: 

\[
f(x) = \begin{cases}
2x & \text{for } 0 < x < 1, \\
0 & \text{otherwise.}
\end{cases}
\]


Determine the p.d.f. of the range of the sample. 


25. In this exercise, we shall provide an approximate jus- 
tification for Eq. (3.6.6). First, remember that if a and b 
are close together, then 


\[
\int_a^b r(t)\,dt \approx (b - a)\, r\!\left(\frac{a + b}{2}\right). \tag{3.11.1}
\]

Throughout this problem, assume that X and Y have joint 
p.d.f. f. 

a. Use (3.11.1) to approximate Pr(y − ε < Y ≤ y + ε). 

b. Use (3.11.1) with r(t) = f(s, t) for fixed s to approximate 

\[
\Pr(X \le x \text{ and } y - \epsilon < Y \le y + \epsilon)
= \int_{-\infty}^{x} \int_{y-\epsilon}^{y+\epsilon} f(s, t)\,dt\,ds.
\]

c. Show that the ratio of the approximation in part (b) 
to the approximation in part (a) is $\int_{-\infty}^{x} g_1(s|y)\,ds$. 


26. Let X_1, X_2 be two independent random variables each 
with p.d.f. f_1(x) = e^{-x} for x > 0 and f_1(x) = 0 for x ≤ 0. Let 
Z = X_1 − X_2 and W = X_1/X_2. 

a. Find the joint p.d.f. of X_1 and Z. 

b. Prove that the conditional p.d.f. of X_1 given Z = 0 is 

\[
g_1(x_1|0) = \begin{cases}
2e^{-2x_1} & \text{for } x_1 > 0, \\
0 & \text{otherwise.}
\end{cases}
\]

c. Find the joint p.d.f. of X_1 and W. 

d. Prove that the conditional p.d.f. of X_1 given W = 1 is 

\[
h_1(x_1|1) = \begin{cases}
4x_1 e^{-2x_1} & \text{for } x_1 > 0, \\
0 & \text{otherwise.}
\end{cases}
\]

e. Notice that {Z = 0} = {W = 1}, but the conditional 
distribution of X_1 given Z = 0 is not the same as the 
conditional distribution of X_1 given W = 1. This discrepancy 
is known as the Borel paradox. In light 
of the discussion that begins on page 146 about 
how conditional p.d.f.'s are not like conditioning on 
events of probability 0, show how "Z very close to 
0" is not the same as "W very close to 1." Hint: Draw 
a set of axes for x_1 and x_2 and draw the two sets 
{(x_1, x_2) : |x_1 − x_2| < ε} and {(x_1, x_2) : |x_1/x_2 − 1| < ε} 
and see how much different they are. 


27. Three boys A, B, and C are playing table tennis. In 
each game, two of the boys play against each other and 
the third boy does not play. The winner of any given game 
n plays again in game n + 1 against the boy who did not 
play in game n, and the loser of game n does not play in 
game n + 1. The probability that A will beat B in any game 
that they play against each other is 0.3, the probability that 
A will beat C is 0.6, and the probability that B will beat 
C is 0.8. Represent this process as a Markov chain with 
stationary transition probabilities by defining the possible 
states and constructing the transition matrix. 
28. Consider again the Markov chain described in Exercise 27. 
(a) Determine the probability that the two boys 
who play against each other in the first game will play 
against each other again in the fourth game. (b) Show that 
this probability does not depend on which two boys play 
in the first game. 


29. Find the unique stationary distribution for the Markov 
chain in Exercise 27. 
4.1 The Expectation of a Random Variable 


The distribution of a random variable X contains all of the probabilistic infor- 
mation about X. The entire distribution of X, however, is usually too cumbersome 
for presenting this information. Summaries of the distribution, such as the average 
value, or expected value, can be useful for giving people an idea of where we expect 
X to be without trying to describe the entire distribution. The expected value also 
plays an important role in the approximation methods that arise in Chapter 6. 


Expectation for a Discrete Distribution 


Example 4.1.1 Fair Price for a Stock. An investor is considering whether or not to invest $18 per 
share in a stock for one year. The value of the stock after one year, in dollars, will be 
18 + X, where X is the amount by which the price changes over the year. At present 
X is unknown, and the investor would like to compute an "average value" for X in 
order to compare the return she expects from the investment to what she would get 
by putting the $18 in the bank at 5% interest. ◄ 


The idea of finding an average value as in Example 4.1.1 arises in many applications 
that involve a random variable. One popular choice is what we call the mean or 
expected value or expectation. 

The intuitive idea of the mean of a random variable is that it is the weighted 
average of the possible values of the random variable with the weights equal to the 
probabilities. 


Example 4.1.2 Stock Price Change. Suppose that the change in price of the stock in Example 4.1.1 
is a random variable X that can assume only the four different values −2, 0, 1, and 
4, and that Pr(X = −2) = 0.1, Pr(X = 0) = 0.4, Pr(X = 1) = 0.3, and Pr(X = 4) = 0.2. 
Then the weighted average of these values is 


−2(0.1) + 0(0.4) + 1(0.3) + 4(0.2) = 0.9. 


The investor now compares this with the interest that would be earned on $18 at 5% 
for one year, which is 18 × 0.05 = 0.9 dollars. From this point of view, the price of $18 
seems fair. ◄ 
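
The weighted average in Example 4.1.2 can be checked with a few lines of code. The following is a minimal sketch, not part of the text: the helper name discrete_mean and the dictionary representation of the p.f. are assumptions made for illustration, and exact fractions are used so that the comparison with the bank interest is not blurred by rounding.

```python
from fractions import Fraction

def discrete_mean(pf):
    """Return the sum of x * f(x) for a finite p.f. given as {value: probability}."""
    if sum(pf.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in pf.items())

# p.f. of the price change X from Example 4.1.2, written as exact fractions
price_change_pf = {
    -2: Fraction(1, 10),
    0: Fraction(4, 10),
    1: Fraction(3, 10),
    4: Fraction(2, 10),
}

mean_change = discrete_mean(price_change_pf)
bank_interest = 18 * Fraction(5, 100)  # one year of 5% interest on $18
print(float(mean_change), float(bank_interest))
```

Both printed values agree with the 0.9 dollars computed in the example.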

The calculation in Example 4.1.2 generalizes easily to every random variable that 
assumes only finitely many values. Possible problems arise with random variables 
that assume more than finitely many values, especially when the collection of possible 
values is unbounded. 


Definition 4.1.1 Mean of Bounded Discrete Random Variable. Let X be a bounded discrete random 
variable whose p.f. is f. The expectation of X, denoted by E(X), is a number defined 
as follows: 

\[
E(X) = \sum_{\text{All } x} x f(x). \tag{4.1.1}
\]


The expectation of X is also referred to as the mean of X or the expected value of X. 


In Example 4.1.2, E(X) = 0.9. Notice that 0.9 is not one of the possible values of X 
in that example. This is typically the case with discrete random variables. 


Example 4.1.3 Bernoulli Random Variable. Let X have the Bernoulli distribution with parameter p, 
that is, assume that X takes only the two values 0 and 1 with Pr(X = 1) = p. Then the 
mean of X is 


E(X) = 0 × (1 − p) + 1 × p = p. ◄ 


If X is unbounded, it might still be possible to define E(X) as the weighted 
average of its possible values. However, some care is needed. 


Definition 4.1.2 Mean of General Discrete Random Variable. Let X be a discrete random variable whose 
p.f. is f. Suppose that at least one of the following sums is finite: 

\[
\sum_{\text{Positive } x} x f(x), \qquad \sum_{\text{Negative } x} x f(x). \tag{4.1.2}
\]

Then the mean, expectation, or expected value of X is said to exist and is defined to be 

\[
E(X) = \sum_{\text{All } x} x f(x). \tag{4.1.3}
\]

If both of the sums in (4.1.2) are infinite, then E(X) does not exist. 


The reason that the expectation fails to exist if both of the sums in (4.1.2) are 
infinite is that, in such cases, the sum in (4.1.3) is not well-defined. It is known from 
calculus that the sum of an infinite series whose positive and negative terms both 
add to infinity either fails to converge or can be made to converge to many different 
values by rearranging the terms in different orders. We don’t want the meaning of 
expected value to depend on arbitrary choices about what order to add numbers. If 
only one of the two sums in (4.1.2) is infinite, then the expected value is also infinite with 
the same sign as that of the sum that is infinite. If both sums are finite, then the sum 
in (4.1.3) converges and doesn’t depend on the order in which the terms are added. 


Example 4.1.4 The Mean of X Does Not Exist. Let X be a random variable whose p.f. is 

\[
f(x) = \begin{cases}
\dfrac{1}{2|x|(|x| + 1)} & \text{if } x = \pm 1, \pm 2, \pm 3, \ldots, \\
0 & \text{otherwise.}
\end{cases}
\]

It can be verified that this function satisfies the conditions required to be a p.f. The 
two sums in (4.1.2) are 

\[
\sum_{x=-1}^{-\infty} x\, \frac{1}{2|x|(|x| + 1)} = -\infty
\quad \text{and} \quad
\sum_{x=1}^{\infty} x\, \frac{1}{2x(x + 1)} = \infty;
\]

hence, E(X) does not exist. ◄ 


Example 4.1.5 An Infinite Mean. Let X be a random variable whose p.f. is 

\[
f(x) = \begin{cases}
\dfrac{1}{x(x + 1)} & \text{if } x = 1, 2, 3, \ldots, \\
0 & \text{otherwise.}
\end{cases}
\]

The sum over negative values in Eq. (4.1.2) is 0, so the mean of X exists and is 

\[
E(X) = \sum_{x=1}^{\infty} x\, \frac{1}{x(x + 1)} = \infty.
\]

We say that the mean of X is infinite in this case. ◄ 
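
The behavior in Example 4.1.5 can be seen numerically. The sketch below is an illustration only (the function name partial_sums is an assumption, not something from the text): it accumulates the partial sums of f(x) and of x f(x) for f(x) = 1/(x(x + 1)). The first sequence approaches 1, as a p.f. must, while the second keeps growing without settling.

```python
from fractions import Fraction

def partial_sums(n_terms):
    """Partial sums of f(x) and of x * f(x) for f(x) = 1/(x(x+1)), x = 1..n_terms."""
    total_prob = Fraction(0)
    total_mean = Fraction(0)
    for x in range(1, n_terms + 1):
        fx = Fraction(1, x * (x + 1))
        total_prob += fx
        total_mean += x * fx
    return total_prob, total_mean

for n in (10, 100, 1000):
    prob, mean_part = partial_sums(n)
    print(n, float(prob), float(mean_part))
```

A finite computation cannot prove divergence, of course. It only shows the slow, harmonic-like growth of the partial sums of x f(x) = 1/(x + 1).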


Note: The Expectation of X Depends Only on the Distribution of X. Although 
E(X) is called the expectation of X, it depends only on the distribution of X. Every 
two random variables that have the same distribution will have the same expectation 
even if they have nothing to do with each other. For this reason, we shall often refer 
to the expectation of a distribution even if we do not have in mind a random variable 
with that distribution. 


Expectation for a Continuous Distribution 


The idea of computing a weighted average of the possible values can be generalized 
to continuous random variables by using integrals instead of sums. The distinction 
between bounded and unbounded random variables arises in this case for the same 
reasons. 


Definition 4.1.3 Mean of Bounded Continuous Random Variable. Let X be a bounded continuous 
random variable whose p.d.f. is f. The expectation of X, denoted E(X), is defined 
as follows: 

\[
E(X) = \int_{-\infty}^{\infty} x f(x)\,dx. \tag{4.1.4}
\]

Once again, the expectation is also called the mean or the expected value. 

Example 4.1.6 Expected Failure Time. An appliance has a maximum lifetime of one year. The time 
X until it fails is a random variable with a continuous distribution having p.d.f. 

\[
f(x) = \begin{cases}
2x & \text{for } 0 < x < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Then 

\[
E(X) = \int_0^1 x(2x)\,dx = \int_0^1 2x^2\,dx = \frac{2}{3}.
\]

We can also say that the expectation of the distribution with p.d.f. f is 2/3. ◄ 
For general continuous random variables, we modify Definition 4.1.2. 


Definition 4.1.4 Mean of General Continuous Random Variable. Let X be a continuous random variable 
whose p.d.f. is f. Suppose that at least one of the following integrals is finite: 

\[
\int_0^{\infty} x f(x)\,dx, \qquad \int_{-\infty}^{0} x f(x)\,dx. \tag{4.1.5}
\]

Then the mean, expectation, or expected value of X is said to exist and is defined to 
be 

\[
E(X) = \int_{-\infty}^{\infty} x f(x)\,dx. \tag{4.1.6}
\]

If both of the integrals in (4.1.5) are infinite, then E(X) does not exist. 


Example 4.1.7 Failure after Warranty. A product has a warranty of one year. Let X be the time at 
which the product fails. Suppose that X has a continuous distribution with the p.d.f. 

\[
f(x) = \begin{cases}
0 & \text{for } x < 1, \\
\dfrac{2}{x^3} & \text{for } x \ge 1.
\end{cases}
\]

The expected time to failure is then 

\[
E(X) = \int_1^{\infty} x\, \frac{2}{x^3}\,dx = \int_1^{\infty} \frac{2}{x^2}\,dx = 2. 
\]
◄ 


Example 4.1.8 A Mean That Does Not Exist. Suppose that a random variable X has a continuous 
distribution for which the p.d.f. is as follows: 

\[
f(x) = \frac{1}{\pi(1 + x^2)} \quad \text{for } -\infty < x < \infty. \tag{4.1.7}
\]

This distribution is called the Cauchy distribution. We can verify the fact that 
$\int_{-\infty}^{\infty} f(x)\,dx = 1$ by using the following standard result from elementary calculus: 

\[
\frac{d}{dx} \tan^{-1} x = \frac{1}{1 + x^2} \quad \text{for } -\infty < x < \infty.
\]

The two integrals in (4.1.5) are 

\[
\int_0^{\infty} \frac{x}{\pi(1 + x^2)}\,dx = \infty
\quad \text{and} \quad
\int_{-\infty}^{0} \frac{x}{\pi(1 + x^2)}\,dx = -\infty;
\]

hence, the mean of X does not exist. ◄ 
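
The divergence of the first integral can be made explicit by integrating up to a finite limit. The following worked step is an added illustration; the upper limit T is notation introduced here, not in the text.

\[
\int_0^{T} \frac{x}{\pi(1 + x^2)}\,dx = \frac{1}{2\pi} \log(1 + T^2) \longrightarrow \infty \quad \text{as } T \to \infty.
\]

Because the integrand is an odd function of x, the integral over (−T, 0) equals the negative of this quantity, which tends to −∞ in the same way.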


Interpretation of the Expectation 


Relation of the Mean to the Center of Gravity The expectation of a random 
variable or, equivalently, the mean of its distribution can be regarded as being the 
center of gravity of that distribution. To illustrate this concept, consider, for example, 
the p.f. sketched in Fig. 4.1. The x-axis may be regarded as a long weightless rod to 
which weights are attached. If a weight equal to f(x_j) is attached to this rod at each 
point x_j, then the rod will be balanced if it is supported at the point E(X). 

Figure 4.1 The mean of a discrete distribution. 

Figure 4.2 The mean of a continuous distribution. 

Now consider the p.d.f. sketched in Fig. 4.2. In this case, the x-axis may be 
regarded as a long rod over which the mass varies continuously. If the density of 
the rod at each point x is equal to f(x), then the center of gravity of the rod will be 
located at the point E(X), and the rod will be balanced if it is supported at that point. 

It can be seen from this discussion that the mean of a distribution can be affected 
greatly by even a very small change in the amount of probability that is assigned to 
a large value of x. For example, the mean of the distribution represented by the p.f. 
in Fig. 4.1 can be moved to any specified point on the x-axis, no matter how far from 
the origin that point may be, by removing an arbitrarily small but positive amount 
of probability from one of the points x_j and adding this amount of probability at a 
point far enough from the origin. 
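
This sensitivity can be written out in one line. The following is a sketch in notation that does not appear in the text: ε is the amount of probability moved (with 0 < ε ≤ f(x_j)), c is the distant point that receives it, X' is a random variable with the modified distribution, and m is the target value for the mean.

\[
E(X') = E(X) - \epsilon x_j + \epsilon c = E(X) + \epsilon (c - x_j),
\qquad \text{so } c = x_j + \frac{m - E(X)}{\epsilon} \text{ gives } E(X') = m.
\]

The smaller ε is, the farther from x_j the point c must be placed, but a suitable c exists for every positive ε.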

Suppose now that the p.f. or p.d.f. f of some distribution is symmetric with respect 
to a given point x_0 on the x-axis. In other words, suppose that f(x_0 + δ) = f(x_0 − δ) 
for all values of δ. Also assume that the mean E(X) of this distribution exists. In 
accordance with the interpretation that the mean is at the center of gravity, it follows 
that E(X) must be equal to x_0, which is the point of symmetry. The following example 
emphasizes the fact that it is necessary to make certain that the mean E(X) exists 
before it can be concluded that E(X) = x_0. 


Example 4.1.9 The Cauchy Distribution. Consider again the p.d.f. specified by Eq. (4.1.7), which is 
sketched in Fig. 4.3. This p.d.f. is symmetric with respect to the point x = 0. Therefore, 
if the mean of the Cauchy distribution existed, its value would have to be 0. However, 
we saw in Example 4.1.8 that the mean of X does not exist. 

The reason for the nonexistence of the mean of the Cauchy distribution is as 
follows: When the curve y = f(x) is sketched as in Fig. 4.3, its tails approach the x- 
axis rapidly enough to permit the total area under the curve to be equal to 1. On 
the other hand, if each value of f(x) is multiplied by x and the curve y = xf (x) is 
sketched, as in Fig. 4.4, the tails of this curve approach the x-axis so slowly that the 
total area between the x-axis and each part of the curve is infinite. ◄ 
Figure 4.3 The p.d.f. of a 
Cauchy distribution. 


Figure 4.4 The curve 
y =xf (x) for the Cauchy 
distribution. 
The Expectation of a Function 


Example 4.1.10 Failure Rate and Time to Failure. Suppose that appliances manufactured by a particular 
company fail at a rate of X per year, where X is currently unknown and hence is a 
random variable. If we are interested in predicting how long such an appliance will 
last before failure, we might use the mean of 1/X. How can we calculate the mean 
of Y = 1/X? ◄ 


Functions of a Single Random Variable If X is a random variable for which the 
p.d.f. is f, then the expectation of each real-valued function r(X) can be found by 
applying the definition of expectation to the distribution of r(X) as follows: Let 
Y =r(X), determine the probability distribution of Y, and then determine E(Y) 
by applying either Eq. (4.1.1) or Eq. (4.1.4). For example, suppose that Y has a 
continuous distribution with the p.d.f. g. Then 


\[
E[r(X)] = E(Y) = \int_{-\infty}^{\infty} y g(y)\,dy, \tag{4.1.8}
\]


if the expectation exists. 


Example 4.1.11 Failure Rate and Time to Failure. In Example 4.1.10, suppose that the p.d.f. of X is 

\[
f(x) = \begin{cases}
3x^2 & \text{if } 0 < x < 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Let r(x) = 1/x. Using the methods of Sec. 3.8, we can find the p.d.f. of Y = r(X) as 

\[
g(y) = \begin{cases}
3y^{-4} & \text{if } y > 1, \\
0 & \text{otherwise.}
\end{cases}
\]

Because g(y) = 0 for y ≤ 1, the mean of Y is then 

\[
E(Y) = \int_1^{\infty} y\, 3y^{-4}\,dy = \frac{3}{2}. 
\]
◄ 

Although the method of Example 4.1.11 can be used to find the mean of a 
continuous random variable, it is not actually necessary to determine the p.d.f. of 
r(X) in order to calculate the expectation E[r(X)]. In fact, it can be shown that the 
value of E[r(X)] can always be calculated directly using the following result. 


Theorem 4.1.1 Law of the Unconscious Statistician. Let X be a random variable, and let r be a 
real-valued function of a real variable. If X has a continuous distribution, then 

\[
E[r(X)] = \int_{-\infty}^{\infty} r(x) f(x)\,dx, \tag{4.1.9}
\]

if the mean exists. If X has a discrete distribution, then 

\[
E[r(X)] = \sum_{\text{All } x} r(x) f(x), \tag{4.1.10}
\]

if the mean exists. 


Proof A general proof will not be given here. However, we shall provide a proof 
for two special cases. First, suppose that the distribution of X is discrete. Then the 
distribution of Y must also be discrete. Let g be the p.f. of Y. For this case, 


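\[
\begin{aligned}
\sum_{y} y g(y) &= \sum_{y} y \Pr[r(X) = y] \\
&= \sum_{y} y \sum_{x : r(x) = y} f(x) \\
&= \sum_{y} \sum_{x : r(x) = y} r(x) f(x) = \sum_{x} r(x) f(x).
\end{aligned}
\]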


Hence, Eq. (4.1.10) yields the same value as one would obtain from Definition 4.1.1 
applied to Y. 

Second, suppose that the distribution of X is continuous. Suppose also, as in 
Sec. 3.8, that r(x) is either strictly increasing or strictly decreasing with differentiable 
inverse s(y). Then, if we change variables in Eq. (4.1.9) from x to y=r(x), 


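\[
\int_{-\infty}^{\infty} r(x) f(x)\,dx = \int_{-\infty}^{\infty} y f[s(y)] \left| \frac{ds(y)}{dy} \right| dy.
\]

It now follows from Eq. (3.8.3) that the right side of this equation is equal to 

\[
\int_{-\infty}^{\infty} y g(y)\,dy.
\]

Hence, Eqs. (4.1.8) and (4.1.9) yield the same value. ∎ 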

Theorem 4.1.1 is called the law of the unconscious statistician because many peo- 
ple treat Eqs. (4.1.9) and (4.1.10) as the definition of E[r(X)] and forget that the 
definition of the mean of Y = r(X) is given in Definitions 4.1.2 and 4.1.4. 


Example 4.1.12 Failure Rate and Time to Failure. In Example 4.1.11, we can apply Theorem 4.1.1 to 
find 

\[
E(Y) = \int_0^1 \frac{1}{x}\, 3x^2\,dx = \frac{3}{2},
\]

the same result we got in Example 4.1.11. ◄ 
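
The agreement between the two routes to E[r(X)] can also be checked by computation. The sketch below is illustrative only. The function names, the small discrete p.f., and the use of a midpoint rule are assumptions made here, not part of the text. The discrete part computes the mean of Y = r(X) once through the p.f. of Y and once through Eq. (4.1.10); the continuous part approximates the integral in Example 4.1.12.

```python
from collections import defaultdict
from fractions import Fraction

def mean_via_distribution_of_y(pf_x, r):
    """First build the p.f. g of Y = r(X), then compute the sum of y * g(y)."""
    g = defaultdict(Fraction)
    for x, p in pf_x.items():
        g[r(x)] += p
    return sum(y * p for y, p in g.items())

def mean_via_lotus(pf_x, r):
    """Compute the sum of r(x) * f(x) directly, as in Eq. (4.1.10)."""
    return sum(r(x) * p for x, p in pf_x.items())

def midpoint_integral(h, a, b, n=1000):
    """Midpoint-rule approximation of the integral of h over [a, b]."""
    width = (b - a) / n
    return width * sum(h(a + (i + 0.5) * width) for i in range(n))

# Hypothetical discrete p.f.: r(x) = x**2 maps both -1 and 1 to the same value of y.
pf_x = {-1: Fraction(1, 4), 0: Fraction(1, 4), 1: Fraction(1, 2)}
square = lambda x: x * x
print(mean_via_distribution_of_y(pf_x, square), mean_via_lotus(pf_x, square))

# Continuous case from Example 4.1.12: integrate (1/x) * 3x^2 over (0, 1).
lotus_value = midpoint_integral(lambda x: (1 / x) * 3 * x**2, 0.0, 1.0)
print(lotus_value)
```

The discrete check mirrors the grouping of terms by the value y = r(x) that is used in the proof of the discrete case.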


Example 4.1.13 Determining the Expectation of X^{1/2}. Suppose that the p.d.f. of X is as given in 
Example 4.1.6 and that Y = X^{1/2}. Then, by Eq. (4.1.9), 

\[
E(Y) = \int_0^1 x^{1/2}(2x)\,dx = 2\int_0^1 x^{3/2}\,dx = \frac{4}{5}.
\]

Note: In General, E[g(X)] ≠ g(E(X)). In Example 4.1.13, the mean of X^{1/2} is 4/5. 
The mean of X was computed in Example 4.1.6 as 2/3. Note that 4/5 ≠ (2/3)^{1/2}. In 
fact, unless g is a linear function, it is generally the case that E[g(X)] ≠ g(E(X)). A 
linear function g does satisfy E[g(X)] = g(E(X)), as we shall see in Theorem 4.2.1. 
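
The gap between E[g(X)] and g(E(X)) in this note is easy to see numerically. The following sketch assumes a simple midpoint rule for the integrals; the helper name midpoint_integral and the variable names are illustrative and not taken from the text.

```python
import math

def midpoint_integral(h, a, b, n=2000):
    """Midpoint-rule approximation of the integral of h over [a, b]."""
    width = (b - a) / n
    return width * sum(h(a + (i + 0.5) * width) for i in range(n))

pdf = lambda x: 2.0 * x  # p.d.f. of X from Example 4.1.6 on (0, 1)

mean_x = midpoint_integral(lambda x: x * pdf(x), 0.0, 1.0)
mean_sqrt_x = midpoint_integral(lambda x: math.sqrt(x) * pdf(x), 0.0, 1.0)
sqrt_of_mean = math.sqrt(mean_x)

print(mean_sqrt_x, sqrt_of_mean)
```

The first printed value approximates E(X^{1/2}) = 4/5 and the second approximates (2/3)^{1/2}, which is slightly larger.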


Example 4.1.14 Option Pricing. Suppose that common stock in the up-and-coming company A is 
currently priced at $200 per share. As an incentive to get you to work for company 
A, you might be offered an option to buy a certain number of shares of the stock, one 
year from now, at a price of $200. This could be quite valuable if you believed that the 
stock was very likely to rise in price over the next year. For simplicity, suppose that 
the price X of the stock one year from now is a discrete random variable that can take 
only two values (in dollars): 260 and 180. Let p be the probability that X = 260. You 
want to calculate the value of these stock options, either because you contemplate 
the possibility of selling them or because you want to compare Company A’s offer 
to what other companies are offering. Let Y be the value of the option for one share 
when it expires in one year. Since nobody would pay $200 for the stock if the price X 
is less than $200, the value of the stock option is 0 if X = 180. If X = 260, one could 
buy the stock for $200 per share and then immediately sell it for $260. This brings in a 
profit of $60 per share. (For simplicity, we shall ignore dividends and the transaction 
costs of buying and selling stocks.) Then Y = h(X) where 

\[
h(x) = \begin{cases}
0 & \text{if } x = 180, \\
60 & \text{if } x = 260.
\end{cases}
\]

Assume that an investor could earn 4% risk-free on any money invested for this 
same year. (Assume that the 4% includes any compounding.) If no other investment 
options were available, a fair cost of the option would then be what is called the 
present value of E(Y) in one year. This equals the value c such that E(Y) = 1.04c. 
That is, the expected value of the option equals the amount of money the investor 
would have after one year without buying the option. We can find E(Y) easily: 

\[
E(Y) = 0 \times (1 - p) + 60 \times p = 60p.
\]


So, the fair price of an option to buy one share would be c = 60p/1.04 = 57.69p. 
How should one determine the probability p? There is a standard method used 
in the finance industry for choosing p in this example. That method is to assume that 
the present value of the mean of X (the stock price in one year) is equal to the current 
value of the stock price. That is, assume that the expected value of buying one share 
of stock and waiting one year to sell is the same as the result of investing the current 
cost of the stock risk-free for one year (multiplying by 1.04 in this example). In our 
example, this means E(X) = 200 x 1.04. Since E(X) = 260p + 180(1 — p), we set 


200 x 1.04 = 260p + 180(1 — p), 


and obtain p = 0.35. The resulting price of an option to buy one share for $200 in 
one year would be $57.69 × 0.35 = $20.19. This price is called the risk-neutral price 
of the option. One can prove (see Exercise 14 in this section) that any price other than 
$20.19 for the option would lead to unpleasant consequences in the market. ◄ 
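
The arithmetic of Example 4.1.14 can be reproduced in a short script. This is a sketch under stated assumptions: the function names are invented for illustration, and the payoff is written as max(X − strike, 0), which matches h(180) = 0 and h(260) = 60 in the example but is a generalization that the text does not state.

```python
def risk_neutral_probability(current_price, up_price, down_price, growth):
    """Solve current_price * growth = up_price * p + down_price * (1 - p) for p."""
    return (current_price * growth - down_price) / (up_price - down_price)

def option_price(strike, up_price, down_price, p, growth):
    """Present value of E(Y), where Y = max(X - strike, 0) at the two possible prices."""
    payoff_up = max(up_price - strike, 0.0)
    payoff_down = max(down_price - strike, 0.0)
    expected_payoff = payoff_up * p + payoff_down * (1.0 - p)
    return expected_payoff / growth

# Numbers from Example 4.1.14: current price 200, outcomes 260 and 180, 4% risk-free growth.
p = risk_neutral_probability(200.0, 260.0, 180.0, 1.04)
price = option_price(200.0, 260.0, 180.0, p, 1.04)
print(round(p, 2), round(price, 2))
```

Rounded to two decimals, the results match p = 0.35 and the risk-neutral price $20.19 found in the example.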


Functions of Several Random Variables 


Example 4.1.15 The Expectation of a Function of Two Variables. Let X and Y have a joint p.d.f., and 
suppose that we want the mean of X^2 + Y^2. The most straightforward but most 
difficult way to do this would be to use the methods of Sec. 3.9 to find the distribution 
of Z = X^2 + Y^2 and then apply the definition of mean to Z. ◄ 


There is a version of Theorem 4.1.1 for functions of more than one random variable. 
Its proof is not given here. 


Theorem 4.1.2 Law of the Unconscious Statistician. Suppose that X_1, ..., X_n are random variables 
with the joint p.d.f. f(x_1, ..., x_n). Let r be a real-valued function of n real variables, 
and suppose that Y = r(X_1, ..., X_n). Then E(Y) can be determined directly from the 
relation 

\[
E(Y) = \int \cdots \int_{R^n} r(x_1, \ldots, x_n) f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n,
\]

if the mean exists. Similarly, if X_1, ..., X_n have a discrete joint distribution with p.f. 
f(x_1, ..., x_n), the mean of Y = r(X_1, ..., X_n) is 

\[
E(Y) = \sum_{\text{All } x_1, \ldots, x_n} r(x_1, \ldots, x_n) f(x_1, \ldots, x_n),
\]

if the mean exists. ∎ 


Determining the Expectation of a Function of Two Variables. Suppose that a point (X, Y)
is chosen at random from the square S containing all points (x, y) such that 0 ≤ x ≤ 1
and 0 ≤ y ≤ 1. We shall determine the expected value of X^2 + Y^2.

Since X and Y have the uniform distribution over the square S, and since the
area of S is 1, the joint p.d.f. of X and Y is


\[ f(x, y) = \begin{cases} 1 & \text{for } (x, y) \in S, \\ 0 & \text{otherwise.} \end{cases} \]


Therefore,


\[ E(X^2 + Y^2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x^2 + y^2) f(x, y) \, dx \, dy
   = \int_0^1 \int_0^1 (x^2 + y^2) \, dx \, dy = \frac{2}{3}. \]
◄
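

The double integral in Example 4.1.16 can also be approximated numerically, which illustrates that E[r(X, Y)] is computed directly from the joint p.d.f. without finding the distribution of r(X, Y). The sketch below uses a simple midpoint rule; the helper name `expect_uniform_square` and the grid size are assumptions made for illustration.

```python
def expect_uniform_square(r, n=400):
    """Approximate E[r(X, Y)] for (X, Y) uniform on the unit square
    using an n-by-n midpoint rule."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n):
            y = (j + 0.5) * h
            # The joint p.d.f. equals 1 everywhere on the square.
            total += r(x, y)
    return total * h * h


approx = expect_uniform_square(lambda x, y: x**2 + y**2)
print(approx)  # close to 2/3
```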


Note: More General Distributions. In Example 3.2.7, we introduced a type of distribution
that was neither discrete nor continuous. It is possible to define expectations
for such distributions also. The definition is rather cumbersome, and we shall not 
pursue it here. 


Summary 


The expectation, expected value, or mean of a random variable is a summary of its 
distribution. If the probability distribution is thought of as a distribution of mass 
along the real line, then the mean is the center of mass. The mean of a function r of a 
random variable X can be calculated directly from the distribution of X without first 
finding the distribution of r(X). Similarly, the mean of a function of a random vector 
X can be calculated directly from the distribution of X. 


Exercises 
1. Suppose that X has the uniform distribution on the 
interval [a, b]. Find the mean of X. 


2. If an integer between 1 and 100 is to be chosen at 
random, what is the expected value? 


3. In a class of 50 students, the number of students n_i of
each age i is shown in the following table:


Age i    n_i
18       20
19       22
20
21
25       1


If a student is to be selected at random from the class, what 
is the expected value of his age? 


4. Suppose that one word is to be selected at random from 
the sentence THE GIRL PUT ON HER BEAUTIFUL RED HAT. If X 
denotes the number of letters in the word that is selected, 
what is the value of E(X)? 


5. Suppose that one letter is to be selected at random from 
the 30 letters in the sentence given in Exercise 4. If Y 
denotes the number of letters in the word in which the 
selected letter appears, what is the value of E(Y)? 


6. Suppose that a random variable X has a continuous 
distribution with the p.d.f. f given in Example 4.1.6. Find 
the expectation of 1/X. 


7. Suppose that a random variable X has the uniform dis- 
tribution on the interval [0, 1]. Show that the expectation 
of 1/X is infinite. 


8. Suppose that X and Y have a continuous joint distribution
for which the joint p.d.f. is as follows:


\[ f(x, y) = \begin{cases} 12y^2 & \text{for } 0 \le y \le x \le 1, \\ 0 & \text{otherwise.} \end{cases} \]

Find the value of E(XY).


9. Suppose that a point is chosen at random on a stick of 
unit length and that the stick is broken into two pieces at 
that point. Find the expected value of the length of the 
longer piece. 


10. Suppose that a particle is released at the origin of
the xy-plane and travels into the half-plane where x > 0.
Suppose that the particle travels in a straight line and that
the angle between the positive half of the x-axis and this
line is α, which can be either positive or negative. Suppose,
finally, that the angle α has the uniform distribution on the
interval [−π/2, π/2]. Let Y be the ordinate of the point at
which the particle hits the vertical line x = 1. Show that
the distribution of Y is a Cauchy distribution.


11. Suppose that the random variables X_1, ..., X_n form
a random sample of size n from the uniform distribution
on the interval [0, 1]. Let Y_1 = min{X_1, ..., X_n}, and let
Y_n = max{X_1, ..., X_n}. Find E(Y_1) and E(Y_n).


12. Suppose that the random variables X_1, ..., X_n form
a random sample of size n from a continuous distribution
for which the c.d.f. is F, and let the random variables Y_1
and Y_n be defined as in Exercise 11. Find E[F(Y_1)] and
E[F(Y_n)].


13. A stock currently sells for $110 per share. Let the price 
of the stock at the end of a one-year period be X, which will 
take one of the values $100 or $300. Suppose that you have 
the option to buy shares of this stock at $150 per share 
at the end of that one-year period. Suppose that money 


could earn 5.8% risk-free over that one-year period. Find 
the risk-neutral price for the option to buy one share. 


14. Consider the situation of pricing a stock option as in 
Example 4.1.14. We want to prove that a price other than 
$20.19 for the option to buy one share in one year for $200 
would be unfair in some way. 


a. Suppose that an investor (who has several shares of 
the stock already) makes the following transactions. 
She buys three more shares of the stock at $200 per 
share and sells four options for $20.19 each. The in- 
vestor must borrow the extra $519.24 necessary to 
make these transactions at 4% for the year. At the 
end of the year, our investor might have to sell four 
shares for $200 each to the person who bought the 
options. In any event, she sells enough stock to pay 
back the amount borrowed plus the 4 percent inter- 
est. Prove that the investor has the same net worth 
(within rounding error) at the end of the year as she 
would have had without making these transactions, 
no matter what happens to the stock price. (A combi- 
nation of stocks and options that produces no change 
in net worth is called a risk-free portfolio.) 


b. Consider the same transactions as in part (a), but 
this time suppose that the option price is $x where 
x < 20.19. Prove that our investor loses |4.16x — 84| 
dollars of net worth no matter what happens to the 
stock price. 


c. Consider the same transactions as in part (a), but
this time suppose that the option price is $x where 
x > 20.19. Prove that our investor gains 4.16x — 84 
dollars of net worth no matter what happens to the 
stock price. 


The situations in parts (b) and (c) are called arbi- 
trage opportunities. Such opportunities rarely exist for any 
length of time in financial markets. Imagine what would 
happen if the three shares and four options were changed 
to three million shares and four million options. 


15. In Example 4.1.14, we showed how to price an option 
to buy one share of a stock at a particular price at a partic- 
ular time in the future. This type of option is called a call 
option. A put option is an option to sell a share of a stock 
at a particular price $y at a particular time in the future. 
(If you don’t own any shares when you wish to exercise 
the option, you can always buy one at the market price 
and then sell it for $y.) The same sort of reasoning as in 
Example 4.1.14 could be used to price a put option. Con- 
sider the same stock as in Example 4.1.14 whose price in 
one year is X with the same distribution as in the example 
and the same risk-free interest rate. Find the risk-neutral 
price for an option to sell one share of that stock in one 
year at a price of $220. 


16. Let Y be a discrete random variable whose p.f. is the 
function f in Example 4.1.4. Let X = |Y|. Prove that the 
distribution of X has the p.d.f. in Example 4.1.5. 


4.2 Properties of Expectations 


In this section, we present some results that simplify the calculation of expectations 
for some common functions of random variables. 


Basic Theorems 


Suppose that X is a random variable for which the expectation E(X) exists. We shall 
present several results pertaining to the basic properties of expectations. 


Theorem 
4.2.1 


Linear Function. If Y =aX + b, where a and b are finite constants, then 


E(Y) =aE(X) +b. 


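Proof We first shall assume, for convenience, that X has a continuous distribution
for which the p.d.f. is f. Then


\[ \begin{aligned}
E(Y) = E(aX + b) &= \int_{-\infty}^{\infty} (ax + b) f(x) \, dx \\
&= a \int_{-\infty}^{\infty} x f(x) \, dx + b \int_{-\infty}^{\infty} f(x) \, dx \\
&= aE(X) + b.
\end{aligned} \]


A similar proof can be given for a discrete distribution. ∎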


Corollary
4.2.1 


Example 
4.2.2 


Theorem 
4.2.2 


Calculating the Expectation of a Linear Function. Suppose that E(X) = 5. Then

E(3X − 5) = 3E(X) − 5 = 10

and

E(−3X + 15) = −3E(X) + 15 = 0. ◄


The following result follows from Theorem 4.2.1 with a = 0.


If X = c with probability 1, then E(X) = c. ∎


Investment. An investor is trying to choose between two possible stocks to buy for
a three-month investment. One stock costs $50 per share and has a rate of return of
R_1 dollars per share for the three-month period, where R_1 is a random variable. The
second stock costs $30 per share and has a rate of return of R_2 per share for the same
three-month period. The investor has a total of $6000 to invest. For this example,
suppose that the investor will buy shares of only one stock. (In Example 4.2.3, we
shall consider strategies in which the investor buys more than one stock.) Suppose
that R_1 has the uniform distribution on the interval [−10, 20] and that R_2 has the
uniform distribution on the interval [−4.5, 10]. We shall first compute the expected
dollar value of investing in each of the two stocks. For the first stock, the $6000 will
purchase 120 shares, so the return will be 120R_1, whose mean is 120E(R_1) = 600.
(Solve Exercise 1 in Sec. 4.1 to see why E(R_1) = 5.) For the second stock, the $6000
will purchase 200 shares, so the return will be 200R_2, whose mean is 200E(R_2) = 550.
The first stock has a higher expected return.

In addition to calculating expected return, we should also ask which of the two
investments is riskier. We shall now compute the value at risk (VaR) at probability
level 0.97 for each investment. (See Example 3.3.7 on page 113.) VaR will be the
negative of the 1 − 0.97 = 0.03 quantile for the return on each investment. For the
first stock, the return 120R_1 has the uniform distribution on the interval [−1200, 2400]
(see Exercise 14 in Sec. 3.8) whose 0.03 quantile is (according to Example 3.3.8 on
page 114) 0.03 × 2400 + 0.97 × (−1200) = −1092. So VaR = 1092. For the second
stock, the return 200R_2 has the uniform distribution on the interval [−900, 2000]
whose 0.03 quantile is 0.03 × 2000 + 0.97 × (−900) = −813. So VaR = 813. Even
though the first stock has higher expected return, the second stock seems to be
slightly less risky in terms of VaR. How should we balance risk and expected return
to choose between the two purchases? One way to answer this question is illustrated
in Example 4.8.10, after we learn about utility. ◄
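

The arithmetic in Example 4.2.2 can be retraced with a short Python sketch. It relies only on Theorem 4.2.1 and on the quantile formula for a uniform distribution quoted in the example; the helper names (`uniform_mean`, `uniform_quantile`, `summarize`) are hypothetical and chosen for illustration.

```python
def uniform_mean(a, b):
    """Mean of the uniform distribution on [a, b]."""
    return (a + b) / 2


def uniform_quantile(a, b, q):
    """q quantile of the uniform distribution on [a, b]."""
    return q * b + (1 - q) * a


def summarize(budget, price, low, high, level=0.97):
    """Mean return and VaR when the whole budget buys one stock whose
    per-share return is uniform on [low, high]."""
    shares = budget / price
    lo, hi = shares * low, shares * high  # the total return is uniform on [lo, hi]
    mean_return = uniform_mean(lo, hi)
    value_at_risk = -uniform_quantile(lo, hi, 1 - level)
    return mean_return, value_at_risk


print(summarize(6000, 50, -10, 20))   # approximately (600, 1092)
print(summarize(6000, 30, -4.5, 10))  # approximately (550, 813)
```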


If there exists a constant a such that Pr(X ≥ a) = 1, then E(X) ≥ a. If there exists a
constant b such that Pr(X ≤ b) = 1, then E(X) ≤ b.


Proof We shall assume again, for convenience, that X has a continuous distribution
for which the p.d.f. is f, and we shall suppose first that Pr(X ≥ a) = 1. Because X is
bounded below, the second integral in (4.1.5) is finite. Then


\[ \begin{aligned}
E(X) = \int_{-\infty}^{\infty} x f(x) \, dx &= \int_{a}^{\infty} x f(x) \, dx \\
&\ge \int_{a}^{\infty} a f(x) \, dx = a \Pr(X \ge a) = a.
\end{aligned} \]


The proof of the other part of the theorem and the proof for a discrete distribution
are similar. ∎


It follows from Theorem 4.2.2 that if Pr(a ≤ X ≤ b) = 1, then a ≤ E(X) ≤ b.


Suppose that E(X) = a and that either Pr(X ≥ a) = 1 or Pr(X ≤ a) = 1. Then
Pr(X = a) = 1.


Proof We shall provide a proof for the case in which X has a discrete distribution
and Pr(X ≥ a) = 1. The other cases are similar. Let x_1, x_2, ... include every value
x > a such that Pr(X = x) > 0, if any. Let p_0 = Pr(X = a). Then,


\[ E(X) = p_0 a + \sum_{j=1}^{\infty} x_j \Pr(X = x_j). \qquad (4.2.1) \]


Each x_j in the sum on the right side of Eq. (4.2.1) is greater than a. If we replace all
of the x_j's by a, the sum can't get larger, and hence


\[ E(X) \ge p_0 a + \sum_{j=1}^{\infty} a \Pr(X = x_j) = a. \qquad (4.2.2) \]


Furthermore, the inequality in Eq. (4.2.2) will be strict if there is even one x > a with
Pr(X = x) > 0. This contradicts E(X) = a. Hence, there can be no x > a such that
Pr(X = x) > 0. ∎


If X_1, ..., X_n are n random variables such that each expectation E(X_i) is finite
(i = 1, ..., n), then


E(X_1 + ··· + X_n) = E(X_1) + ··· + E(X_n).

Proof We shall first assume that n = 2 and also, for convenience, that X_1 and X_2 have
a continuous joint distribution for which the joint p.d.f. is f. Then


\[ \begin{aligned}
E(X_1 + X_2) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x_1 + x_2) f(x_1, x_2) \, dx_1 \, dx_2 \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x_1 f(x_1, x_2) \, dx_1 \, dx_2
 + \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x_2 f(x_1, x_2) \, dx_1 \, dx_2 \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x_1 f(x_1, x_2) \, dx_2 \, dx_1
 + \int_{-\infty}^{\infty} x_2 f_2(x_2) \, dx_2 \\
&= \int_{-\infty}^{\infty} x_1 f_1(x_1) \, dx_1 + \int_{-\infty}^{\infty} x_2 f_2(x_2) \, dx_2 \\
&= E(X_1) + E(X_2),
\end{aligned} \]


where f_1 and f_2 are the marginal p.d.f.'s of X_1 and X_2. The proof for a discrete
distribution is similar. Finally, the theorem can be established for each positive
integer n by an induction argument. ∎


It should be emphasized that, in accordance with Theorem 4.2.4, the expectation
of the sum of several random variables always equals the sum of their individual
expectations, regardless of what their joint distribution is. Even though the joint p.d.f.
of X_1 and X_2 appeared in the proof of Theorem 4.2.4, only the marginal p.d.f.'s figured
into the calculation of E(X_1 + X_2).

The next result follows easily from Theorems 4.2.1 and 4.2.4.


Assume that E(X_i) is finite for i = 1, ..., n. For all constants a_1, ..., a_n and b,


E(a_1 X_1 + ··· + a_n X_n + b) = a_1 E(X_1) + ··· + a_n E(X_n) + b. ∎




Example 
4.2.3 


Example 
4.2.4 


Definition 
4.2.1 


Theorem 
4.2.5 


Investment Portfolio. Suppose that the investor with $6000 in Example 4.2.2 can buy
shares of both of the two stocks. Suppose that the investor buys s_1 shares of the first
stock at $50 per share and s_2 shares of the second stock at $30 per share. Such a
combination of investments is called a portfolio. Ignoring possible problems with
fractional shares, the values of s_1 and s_2 must satisfy


50s_1 + 30s_2 = 6000,


in order to invest the entire $6000. The return on this portfolio will be s_1 R_1 + s_2 R_2.
The mean return will be


s_1 E(R_1) + s_2 E(R_2) = 5s_1 + 2.75s_2.


For example, if s_1 = 54 and s_2 = 110, then the mean return is 572.5. ◄


Sampling without Replacement. Suppose that a box contains red balls and blue balls
and that the proportion of red balls in the box is p (0 < p < 1). Suppose that n balls are
selected from the box at random without replacement, and let X denote the number
of red balls that are selected. We shall determine the value of E(X).

We shall begin by defining n random variables X_1, ..., X_n as follows: For i =
1, ..., n, let X_i = 1 if the ith ball that is selected is red, and let X_i = 0 if the ith ball
is blue. Since the n balls are selected without replacement, the random variables
X_1, ..., X_n are dependent. However, the marginal distribution of each X_i can be
derived easily (see Exercise 10 of Sec. 1.7). We can imagine that all the balls are
arranged in the box in some random order, and that the first n balls in this arrangement
are selected. Because of randomness, the probability that the ith ball in the
arrangement will be red is simply p. Hence, for i = 1, ..., n,


Pr(X_i = 1) = p and Pr(X_i = 0) = 1 − p. (4.2.3)


Therefore, E(X_i) = 1(p) + 0(1 − p) = p.

From the definition of X_1, ..., X_n, it follows that X_1 + ··· + X_n is equal to the
total number of red balls that are selected. Therefore, X = X_1 + ··· + X_n and, by
Theorem 4.2.4,


E(X) = E(X_1) + ··· + E(X_n) = np. (4.2.4)
◄


Note: In General, E[g(X)] ≠ g(E(X)). Theorems 4.2.1 and 4.2.4 imply that if g is a
linear function of a random vector X, then E[g(X)] = g(E(X)). For a nonlinear function
g, we have already seen Example 4.1.13 in which E[g(X)] ≠ g(E(X)). Jensen's
inequality (Theorem 4.2.5) gives a relationship between E[g(X)] and g(E(X)) for
another special class of functions.


Convex Functions. A function g of a vector argument is convex if, for every α ∈ (0, 1),
and every x and y,


g[αx + (1 − α)y] ≤ αg(x) + (1 − α)g(y).


The proof of Theorem 4.2.5 is not given, but one special case is left to the reader in
Exercise 13.


Jensen's Inequality. Let g be a convex function, and let X be a random vector with
finite mean. Then E[g(X)] ≥ g(E(X)). ∎


Example 
4.2.5 


Example 
4.2.6 


Sampling with Replacement. Suppose again that in a box containing red balls and
blue balls, the proportion of red balls is p (0 < p < 1). Suppose now, however, that
a random sample of n balls is selected from the box with replacement. If X denotes
the number of red balls in the sample, then X has the binomial distribution with
parameters n and p, as described in Sec. 3.1. We shall now determine the value
of E(X).

As before, for i = 1, ..., n, let X_i = 1 if the ith ball that is selected is red, and let
X_i = 0 otherwise. Then, as before, X = X_1 + ··· + X_n. In this problem, the random
variables X_1, ..., X_n are independent, and the marginal distribution of each X_i is
again given by Eq. (4.2.3). Therefore, E(X_i) = p for i = 1, ..., n, and it follows from
Theorem 4.2.4 that


E(X) = np. (4.2.5)


Thus, the mean of the binomial distribution with parameters n and p is np. The
p.f. f(x) of this binomial distribution is given by Eq. (3.1.4), and the mean can be
computed directly from the p.f. as follows, where q = 1 − p:


\[ E(X) = \sum_{x=0}^{n} x \binom{n}{x} p^x q^{n-x}. \qquad (4.2.6) \]


Hence, by Eq. (4.2.5), the value of the sum in Eq. (4.2.6) must be np. ◄


It is seen from Eqs. (4.2.4) and (4.2.5) that the expected number of red balls 
in a sample of n balls is np, regardless of whether the sample is selected with or 
without replacement. However, the distribution of the number of red balls is different 
depending on whether sampling is done with or without replacement (for n > 1). 
For example, Pr(X =n) is always smaller in Example 4.2.4 where sampling is done 
without replacement than in Example 4.2.5 where sampling is done with replacement, 
if n > 1. (See Exercise 27 in Sec. 4.9.)
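

The comparison between Examples 4.2.4 and 4.2.5 can be checked for one concrete box. The sketch below computes both means directly from the two p.f.'s; the counts (6 red, 14 blue, n = 5) and the function names are assumptions chosen only for illustration.

```python
from math import comb


def hypergeom_pf(x, red, blue, n):
    """Pr(X = x) when n balls are drawn without replacement."""
    return comb(red, x) * comb(blue, n - x) / comb(red + blue, n)


def binom_pf(x, n, p):
    """Pr(X = x) when n balls are drawn with replacement."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


red, blue, n = 6, 14, 5          # assumed box contents and sample size
p = red / (red + blue)           # proportion of red balls

mean_without = sum(x * hypergeom_pf(x, red, blue, n) for x in range(n + 1))
mean_with = sum(x * binom_pf(x, n, p) for x in range(n + 1))

# Both means equal np, although the two distributions differ.
print(mean_without, mean_with, n * p)
print(hypergeom_pf(n, red, blue, n), binom_pf(n, n, p))
```

In this instance both means equal np = 1.5, while Pr(X = n) is smaller without replacement than with replacement.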


Expected Number of Matches. Suppose that a person types n letters, types the addresses
on n envelopes, and then places each letter in an envelope in a random
manner. Let X be the number of letters that are placed in the correct envelopes.
We shall find the mean of X. (In Sec. 1.10, we did a more difficult calculation with
this same example.)

For i = 1, ..., n, let X_i = 1 if the ith letter is placed in the correct envelope, and
let X_i = 0 otherwise. Then, for i = 1, ..., n,


Pr(X_i = 1) = 1/n and Pr(X_i = 0) = 1 − 1/n.


Therefore,


E(X_i) = 1/n for i = 1, ..., n.


Since X = X_1 + ··· + X_n, it follows that


E(X) = E(X_1) + ··· + E(X_n) = 1/n + ··· + 1/n = 1.


Thus, the expected value of the number of correct matches of letters and envelopes
is 1, regardless of the value of n. ◄
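

For small n, the result of Example 4.2.6 can be confirmed by brute force: list every way of placing the letters and average the number of matches. This is only a sketch; the function name `expected_matches` is hypothetical, and exhaustive enumeration is practical only for small n.

```python
from fractions import Fraction
from itertools import permutations


def expected_matches(n):
    """Exact mean number of letters in correct envelopes, found by
    enumerating all n! equally likely placements."""
    total_matches = 0
    placements = 0
    for perm in permutations(range(n)):
        # Letter i is matched when it lands in envelope i.
        total_matches += sum(1 for i, j in enumerate(perm) if i == j)
        placements += 1
    return Fraction(total_matches, placements)


for n in range(1, 7):
    print(n, expected_matches(n))  # the mean is 1 for every n
```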


Theorem
4.2.6 


Example 
4.2.7 


Example 
4.2.8 


Expectation of a Product of Independent Random Variables 


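If X_1, ..., X_n are n independent random variables such that each expectation E(X_i)
is finite (i = 1, ..., n), then


\[ E\left( \prod_{i=1}^{n} X_i \right) = \prod_{i=1}^{n} E(X_i). \]


Proof We shall again assume, for convenience, that X_1, ..., X_n have a continuous
joint distribution for which the joint p.d.f. is f. Also, we shall let f_i denote the marginal
p.d.f. of X_i (i = 1, ..., n). Then, since the variables X_1, ..., X_n are independent,
it follows that at every point (x_1, ..., x_n) ∈ R^n,


\[ f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_i(x_i). \]


Therefore,


\[ \begin{aligned}
E\left( \prod_{i=1}^{n} X_i \right)
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left( \prod_{i=1}^{n} x_i \right) f(x_1, \ldots, x_n) \, dx_1 \cdots dx_n \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left[ \prod_{i=1}^{n} x_i f_i(x_i) \right] dx_1 \cdots dx_n \\
&= \prod_{i=1}^{n} \int_{-\infty}^{\infty} x_i f_i(x_i) \, dx_i = \prod_{i=1}^{n} E(X_i).
\end{aligned} \]


The proof for a discrete distribution is similar. ∎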


The difference between Theorem 4.2.4 and Theorem 4.2.6 should be emphasized. 
If it is assumed that each expectation is finite, the expectation of the sum of a group 
of random variables is always equal to the sum of their individual expectations. 
However, the expectation of the product of a group of random variables is not always 
equal to the product of their individual expectations. If the random variables are 
independent, then this equality will also hold. 
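

A small exact computation illustrates the contrast between sums and products. In the sketch below, X and Y are the faces shown by two fair dice; this setup and all variable names are assumptions introduced only for illustration. When the dice are rolled independently, E(XY) = E(X)E(Y); when Y is taken to be the same roll as X, the product rule fails although the sum rule still holds.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)                       # probability of each face
mean = sum(p * x for x in faces)         # E(X) = E(Y) = 7/2

# Independent rolls: the joint p.f. is the product of the marginals.
e_prod_indep = sum(p * p * x * y for x, y in product(faces, repeat=2))

# Fully dependent case Y = X: E(XY) becomes E(X^2).
e_prod_dep = sum(p * x * x for x in faces)
e_sum_dep = sum(p * (x + x) for x in faces)

print(e_prod_indep, mean * mean)  # 49/4 49/4
print(e_prod_dep)                 # 91/6, which differs from 49/4
print(e_sum_dep, mean + mean)     # 7 7
```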


Calculating the Expectation of a Combination of Random Variables. Suppose that X_1,
X_2, and X_3 are independent random variables such that E(X_i) = 0 and E(X_i^2) = 1 for
i = 1, 2, 3. We shall determine the value of E[X_1^2 (X_2 − 4X_3)^2].

Since X_1, X_2, and X_3 are independent, it follows that the two random variables
X_1^2 and (X_2 − 4X_3)^2 are also independent. Therefore,


\[ \begin{aligned}
E[X_1^2 (X_2 - 4X_3)^2] &= E(X_1^2) E[(X_2 - 4X_3)^2] \\
&= E(X_2^2 - 8X_2 X_3 + 16X_3^2) \\
&= E(X_2^2) - 8E(X_2 X_3) + 16E(X_3^2) \\
&= 1 - 8E(X_2)E(X_3) + 16 \\
&= 17.
\end{aligned} \]
◄


Repeated Filtering. A filtration process removes a random proportion of particulates
in water to which it is applied. Suppose that a sample of water is subjected to this
process twice. Let X_1 be the proportion of the particulates that are removed by
the first pass. Let X_2 be the proportion of what remains after the first pass that
is removed by the second pass. Assume that X_1 and X_2 are independent random
variables with common p.d.f. f(x) = 4x^3 for 0 < x < 1 and f(x) = 0 otherwise. Let
Y be the proportion of the original particulates that remain in the sample after two
passes. Then Y = (1 − X_1)(1 − X_2). Because X_1 and X_2 are independent, so too are
1 − X_1 and 1 − X_2. Since 1 − X_1 and 1 − X_2 have the same distribution, they have the
same mean, call it μ. It follows that Y has mean μ^2. We can find μ as


\[ \mu = E(1 - X_1) = \int_0^1 (1 - x_1) 4x_1^3 \, dx_1 = 1 - \frac{4}{5} = 0.2. \]


It follows that E(Y) = 0.2^2 = 0.04. ◄


Expectation for Nonnegative Distributions 


Theorem 
4.2.7 


Example 
4.2.9 


Integer-Valued Random Variables. Let X be a random variable that can take only the
values 0, 1, 2, .... Then


\[ E(X) = \sum_{n=1}^{\infty} \Pr(X \ge n). \qquad (4.2.7) \]


Proof First, we can write


\[ E(X) = \sum_{n=0}^{\infty} n \Pr(X = n) = \sum_{n=1}^{\infty} n \Pr(X = n). \qquad (4.2.8) \]


Next, consider the following triangular array of probabilities:


Pr(X = 1)   Pr(X = 2)   Pr(X = 3)   ···
            Pr(X = 2)   Pr(X = 3)   ···
                        Pr(X = 3)   ···


We can compute the sum of all the elements in this array in two different ways
because all of the summands are nonnegative. First, we can add the elements in each
column of the array and then add these column totals. Thus, we obtain the value
\sum_{n=1}^{\infty} n Pr(X = n). Second, we can add the elements in each row of the array and then
add these row totals. In this way we obtain the value \sum_{n=1}^{\infty} Pr(X ≥ n). Therefore,


\[ \sum_{n=1}^{\infty} n \Pr(X = n) = \sum_{n=1}^{\infty} \Pr(X \ge n). \]


Eq. (4.2.7) now follows from Eq. (4.2.8). ∎


Expected Number of Trials. Suppose that a person repeatedly tries to perform a certain
task until he is successful. Suppose also that the probability of success on each given
trial is p (0 < p < 1) and that all trials are independent. If X denotes the number
of the trial on which the first success is obtained, then E(X) can be determined as
follows.

Since at least one trial is always required, Pr(X ≥ 1) = 1. Also, for n = 2, 3, ...,
at least n trials will be required if and only if none of the first n − 1 trials results in
success. Therefore,


Pr(X ≥ n) = (1 − p)^{n−1}.


By Eq. (4.2.7), it follows that


\[ E(X) = 1 + (1 - p) + (1 - p)^2 + \cdots = \frac{1}{1 - (1 - p)} = \frac{1}{p}. \]
◄
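

Eq. (4.2.7) can be checked numerically for Example 4.2.9 by summing the tail probabilities and comparing the result with the usual sum of n Pr(X = n). The sketch below truncates both infinite sums; the function names and the number of terms are assumptions, and the truncation error is negligible for the value of p used here.

```python
def mean_from_tail_sum(p, terms=10_000):
    """Truncated sum of Pr(X >= n) = (1 - p)**(n - 1) for n = 1, 2, ..."""
    return sum((1 - p) ** (n - 1) for n in range(1, terms + 1))


def mean_from_pf(p, terms=10_000):
    """Truncated sum of n * Pr(X = n), with Pr(X = n) = p * (1 - p)**(n - 1)."""
    return sum(n * p * (1 - p) ** (n - 1) for n in range(1, terms + 1))


p = 0.2
print(mean_from_tail_sum(p), mean_from_pf(p), 1 / p)  # all close to 5
```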


Theorem 4.2.7 has a more general version that applies to all nonnegative random 
variables. 


Theorem
4.2.8


General Nonnegative Random Variable. Let X be a nonnegative random variable with
c.d.f. F. Then


\[ E(X) = \int_0^{\infty} [1 - F(x)] \, dx. \qquad (4.2.9) \]
∎

The proof of Theorem 4.2.8 is left to the reader in Exercises 1 and 2 in Sec. 4.9. 

Example
4.2.10


Expected Waiting Time. Let X be the time that a customer spends waiting for service
in a queue. Suppose that the c.d.f. of X is


\[ F(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1 - e^{-2x} & \text{if } x \ge 0. \end{cases} \]


Then the mean of X is


\[ E(X) = \int_0^{\infty} e^{-2x} \, dx = \frac{1}{2}. \]
◄
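

Eq. (4.2.9) also lends itself to a quick numerical check. The sketch below integrates 1 − F(x) for the c.d.f. of Example 4.2.10 with a midpoint rule on a truncated range; the function name `mean_from_cdf`, the upper limit, and the step count are assumptions chosen so that the truncation and discretization errors are small.

```python
from math import exp


def mean_from_cdf(cdf, upper=20.0, steps=20_000):
    """Approximate the integral of 1 - F(x) over [0, upper] by a midpoint rule."""
    h = upper / steps
    return sum(1.0 - cdf((k + 0.5) * h) for k in range(steps)) * h


def waiting_time_cdf(x):
    """c.d.f. of the waiting time in Example 4.2.10."""
    return 1.0 - exp(-2.0 * x) if x >= 0 else 0.0


print(mean_from_cdf(waiting_time_cdf))  # close to 0.5
```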




Summary 


The mean of a linear function of a random vector is the linear function of the mean.
In particular, the mean of a sum is the sum of the means. As an example, the mean of
the binomial distribution with parameters n and p is np. No such relationship holds
in general for nonlinear functions. For independent random variables, the mean of
the product is the product of the means.


Exercises 


1. Suppose that the return R (in dollars per share) of a 
stock has the uniform distribution on the interval [—3, 7]. 
Suppose also, that each share of the stock costs $1.50. 
Let Y be the net return (total return minus cost) on an 
investment of 10 shares of the stock. Compute E(Y). 


2. Suppose that three random variables X_1, X_2, X_3 form
a random sample from a distribution for which the mean
is 5. Determine the value of


E(2X_1 − 3X_2 + X_3 − 4).


3. Suppose that three random variables X_1, X_2, X_3 form
a random sample from the uniform distribution on the
interval [0, 1]. Determine the value of


E[(X_1 − 2X_2 + X_3)^2].


4. Suppose that the random variable X has the uniform 
distribution on the interval [0, 1], that the random vari- 
able Y has the uniform distribution on the interval [5, 9], 
and that X and Y are independent. Suppose also that a 
rectangle is to be constructed for which the lengths of two 
adjacent sides are X and Y. Determine the expected value 
of the area of the rectangle. 


5. Suppose that the variables X_1, ..., X_n form a random
sample of size n from a given continuous distribution on
the real line for which the p.d.f. is f. Find the expectation
of the number of observations in the sample that fall
within a specified interval a < x < b.


6. Suppose that a particle starts at the origin of the real 
line and moves along the line in jumps of one unit. For 
each jump, the probability is p (0 < p < 1) that the particle 
will jump one unit to the left and the probability is 1 — p 
that the particle will jump one unit to the right. Find the 
expected value of the position of the particle after n jumps. 


7. Suppose that on each play of a certain game a gambler 
is equally likely to win or to lose. Suppose that when he 
wins, his fortune is doubled, and that when he loses, his 
fortune is cut in half. If he begins playing with a given 
fortune c, what is the expected value of his fortune after 
n independent plays of the game? 


8. Suppose that a class contains 10 boys and 15 girls, and 
suppose that eight students are to be selected at random 
from the class without replacement. Let X denote the 
number of boys that are selected, and let Y denote the 
number of girls that are selected. Find E(X — Y). 


9. Suppose that the proportion of defective items in a 
large lot is p, and suppose that a random sample of n 
items is selected from the lot. Let X denote the number of 
defective items in the sample, and let Y denote the number 
of nondefective items. Find E(X — Y). 


10. Suppose that a fair coin is tossed repeatedly until a 
head is obtained for the first time. (a) What is the expected 
number of tosses that will be required? (b) What is the 
expected number of tails that will be obtained before the 
first head is obtained? 


11. Suppose that a fair coin is tossed repeatedly until exactly
k heads have been obtained. Determine the expected
number of tosses that will be required. Hint: Represent the
total number of tosses X in the form X = X_1 + ··· + X_k,
where X_i is the number of tosses required to obtain the
ith head after i − 1 heads have been obtained.


12. Suppose that the two return random variables R_1 and
R_2 in Examples 4.2.2 and 4.2.3 are independent. Consider
the portfolio at the end of Example 4.2.3 with s_1 = 54
shares of the first stock and s_2 = 110 shares of the second
stock.


a. Prove that the change in value X of the portfolio has
the p.d.f.

\[ f(x) = \begin{cases}
3.87 \times 10^{-7} (x + 1035) & \text{if } -1035 < x < 560, \\
6.1728 \times 10^{-4} & \text{if } 560 \le x \le 585, \\
3.87 \times 10^{-7} (2180 - x) & \text{if } 585 < x < 2180, \\
0 & \text{otherwise.}
\end{cases} \]


Hint: Look at Example 3.9.5.

b. Find the value at risk (VaR) at probability level 0.97
for the portfolio.


13. Prove the special case of Theorem 4.2.5 in which the
function g is twice continuously differentiable and X is
one-dimensional. You may assume that a twice continuously
differentiable convex function has nonnegative second
derivative. Hint: Expand g(X) around its mean using
Taylor's theorem with remainder. Taylor's theorem with
remainder says that if g(x) has two continuous derivatives
g' and g'' at x = x_0, then there exists y between x_0 and x
such that


\[ g(x) = g(x_0) + (x - x_0) g'(x_0) + \frac{(x - x_0)^2}{2} g''(y). \]


4.3 Variance


Although the mean of a distribution is a useful summary, it does not convey
very much information about the distribution. For example, a random variable 
X with mean 2 has the same mean as the constant random variable Y such that 
Pr(Y = 2) = 1 even if X is not constant. To distinguish the distribution of X from 
the distribution of Y in this case, it might be useful to give some measure of how 
spread out the distribution of X is. The variance of X is one such measure. The 
standard deviation of X is the square root of the variance. The variance also plays 
an important role in the approximation methods that arise in Chapter 6. 


Example 
4.3.1 


Stock Price Changes. Consider the prices A and B of two stocks at a time one month in 
the future. Assume that A has the uniform distribution on the interval [25, 35] and B 


has the uniform distribution on the interval [15, 45]. It is easy to see (from Exercise 1 
in Sec. 4.1) that both stocks have a mean price of 30. But the distributions are very 
different. For example, A will surely be worth at least 25 while Pr(B < 25) = 1/3. 
But B has more upside potential also. The p.d.f.’s of these two random variables are 


plotted in Fig. 4.5. 


< 


Figure 4.5 The p.d.f.'s of two uniform distributions in Example 4.3.1. Both
distributions have mean equal to 30, but they are spread out differently.


Definitions of the Variance and the Standard Deviation


Although the two random prices in Example 4.3.1 have the same mean, price B 
is more spread out than price A, and it would be good to have a summary of the 
distribution that makes this easy to see. 


Variance/Standard Deviation. Let X be a random variable with finite mean μ = E(X).
The variance of X, denoted by Var(X), is defined as follows:


Var(X) = E[(X − μ)^2]. (4.3.1)


If X has infinite mean or if the mean of X does not exist, we say that Var(X) does
not exist. The standard deviation of X is the nonnegative square root of Var(X) if the
variance exists.


If the expectation in Eq. (4.3.1) is infinite, we say that Var(X) and the standard
deviation of X are infinite.

When only one random variable is being discussed, it is common to denote its
standard deviation by the symbol σ, and the variance is denoted by σ^2. When more
than one random variable is being discussed, the name of the random variable is
included as a subscript to the symbol σ, e.g., σ_X would be the standard deviation of
X while σ_Y^2 would be the variance of Y.


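Stock Price Changes. Return to the two random variables A and B in Example 4.3.1.
Using Theorem 4.1.1, we can compute


\[ \mathrm{Var}(A) = \int_{25}^{35} (a - 30)^2 \frac{1}{10} \, da = \frac{1}{10} \int_{-5}^{5} x^2 \, dx
   = \frac{1}{10} \left. \frac{x^3}{3} \right|_{x=-5}^{5} = \frac{25}{3}, \]

\[ \mathrm{Var}(B) = \int_{15}^{45} (b - 30)^2 \frac{1}{30} \, db = \frac{1}{30} \int_{-15}^{15} y^2 \, dy
   = \frac{1}{30} \left. \frac{y^3}{3} \right|_{y=-15}^{15} = 75. \]


So, Var(B) is nine times as large as Var(A). The standard deviations of A and B are
σ_A = 2.89 and σ_B = 8.66. ◄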


Note: Variance Depends Only on the Distribution. The variance and standard
deviation of a random variable X depend only on the distribution of X, just as
the expectation of X depends only on the distribution. Indeed, everything that can
be computed from the p.f. or p.d.f. depends only on the distribution. Two random
variables with the same distribution will have the same variance, even if they have
nothing to do with each other.


4.3.1


Example
4.3.4


Example 4.3.3 Variance and Standard Deviation of a Discrete Distribution. Suppose that a random
variable X can take each of the five values -2, 0, 1, 3, and 4 with equal probability.
We shall determine the variance and standard deviation of X.

In this example,

\[ E(X) = \frac{1}{5}(-2 + 0 + 1 + 3 + 4) = 1.2. \]

Let \(\mu = E(X) = 1.2\), and define \(W = (X - \mu)^2\). Then Var(X) = E(W). We can easily
compute the p.f. f of W:

x        -2       0       1       3       4
w     10.24    1.44    0.04    3.24    7.84
f(w)    1/5     1/5     1/5     1/5     1/5

It follows that

\[ \mathrm{Var}(X) = E(W) = \frac{1}{5}[10.24 + 1.44 + 0.04 + 3.24 + 7.84] = 4.56. \]

The standard deviation of X is the square root of the variance, namely, 2.135. ◀


There is an alternative method for calculating the variance of a distribution, 
which is often easier to use. 


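Theorem 4.3.1 Alternative Method for Calculating the Variance. For every random variable X,
\(\mathrm{Var}(X) = E(X^2) - [E(X)]^2\).

Proof Let \(E(X) = \mu\). Then

\[\begin{aligned}
\mathrm{Var}(X) &= E[(X - \mu)^2] \\
&= E(X^2 - 2\mu X + \mu^2) \\
&= E(X^2) - 2\mu E(X) + \mu^2 \\
&= E(X^2) - \mu^2.
\end{aligned}\]
■
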
Example 4.3.4 Variance of a Discrete Distribution. Once again, consider the random variable X in
Example 4.3.3, which takes each of the five values -2, 0, 1, 3, and 4 with equal
probability. We shall use Theorem 4.3.1 to compute Var(X). In Example 4.3.3, we
computed the mean of X as \(\mu = 1.2\). To use Theorem 4.3.1, we need

\[ E(X^2) = \frac{1}{5}[(-2)^2 + 0^2 + 1^2 + 3^2 + 4^2] = 6. \]

Because E(X) = 1.2, Theorem 4.3.1 says that

\[ \mathrm{Var}(X) = 6 - (1.2)^2 = 4.56, \]

which agrees with the calculation in Example 4.3.3. ◀

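
The agreement between the definition and the shortcut formula can also be checked mechanically. The following is a minimal Python sketch, not part of the original text; the helper name `expectation` is an illustrative choice, and exact fractions are used so that no rounding enters the comparison.

```python
from fractions import Fraction

# Values of X from Example 4.3.3, each taken with probability 1/5.
values = [-2, 0, 1, 3, 4]
prob = Fraction(1, len(values))


def expectation(g):
    """Return E[g(X)] for the equally likely values above."""
    return sum(prob * g(x) for x in values)


mean = expectation(lambda x: x)
# Definition (4.3.1): Var(X) = E[(X - mu)^2].
var_definition = expectation(lambda x: (x - mean) ** 2)
# Theorem 4.3.1: Var(X) = E(X^2) - [E(X)]^2.
var_shortcut = expectation(lambda x: x * x) - mean ** 2

print(float(mean), float(var_definition), float(var_shortcut))
```

Both routes return the same exact fraction, which is the content of Theorem 4.3.1 for this particular distribution.
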
The variance (as well as the standard deviation) of a distribution provides a
measure of the spread or dispersion of the distribution around its mean \(\mu\). A small value of
the variance indicates that the probability distribution is tightly concentrated around
\(\mu\); a large value of the variance typically indicates that the probability distribution
has a wide spread around \(\mu\). However, the variance of a distribution, as well as its
mean, can be made arbitrarily large by placing even a very small but positive amount
of probability far enough from the origin on the real line.


Example 4.3.5 Slight Modification of a Bernoulli Distribution. Let X be a discrete random variable
with the following p.f.:

\[ f(x) = \begin{cases} 0.5 & \text{if } x = 0, \\ 0.499 & \text{if } x = 1, \\ 0.001 & \text{if } x = 10{,}000, \\ 0 & \text{otherwise.} \end{cases} \]

There is a sense in which the distribution of X differs very little from the Bernoulli
distribution with parameter 0.5. However, the mean and variance of X are quite
different from the mean and variance of the Bernoulli distribution with parameter 0.5.
Let Y have the Bernoulli distribution with parameter 0.5. In Example 4.1.3,
we computed the mean of Y as E(Y) = 0.5. Since \(Y^2 = Y\), \(E(Y^2) = E(Y) = 0.5\), so
\(\mathrm{Var}(Y) = 0.5 - 0.5^2 = 0.25\). The means of X and \(X^2\) are also straightforward calculations:

\[ E(X) = 0.5 \times 0 + 0.499 \times 1 + 0.001 \times 10{,}000 = 10.499, \]
\[ E(X^2) = 0.5 \times 0^2 + 0.499 \times 1^2 + 0.001 \times 10{,}000^2 = 100{,}000.499. \]

So Var(X) = 99,890.27. The mean and variance of X are much larger than the mean
and variance of Y. ◀
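
To see how strongly the single remote value drives these numbers, the two distributions can be compared side by side. This is a small illustrative Python sketch; the function name `mean_and_variance` and the dictionary representation of a p.f. are assumptions made for the example, not notation from the text.

```python
from fractions import Fraction


def mean_and_variance(pf):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mean = sum(p * x for x, p in pf.items())
    second_moment = sum(p * x * x for x, p in pf.items())
    return mean, second_moment - mean ** 2


# Bernoulli distribution with parameter 0.5 (the variable Y in the example).
bernoulli = {0: Fraction(1, 2), 1: Fraction(1, 2)}
# The slightly modified distribution of X, with probability 0.001 moved to 10,000.
modified = {0: Fraction(500, 1000), 1: Fraction(499, 1000), 10_000: Fraction(1, 1000)}

mean_y, var_y = mean_and_variance(bernoulli)
mean_x, var_x = mean_and_variance(modified)
print(float(mean_y), float(var_y))
print(float(mean_x), float(var_x))
```

Moving only 0.001 of the probability changes the variance by several orders of magnitude, which is the point of the example.
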


Properties of the Variance 


We shall now present several theorems that state basic properties of the variance. In 
these theorems we shall assume that the variances of all the random variables exist. 
The first theorem concerns the possible values of the variance. 


Theorem 4.3.2 For each X, \(\mathrm{Var}(X) \ge 0\). If X is a bounded random variable, then Var(X) must exist
and be finite.


Proof Because Var(X) is the mean of a nonnegative random variable \((X - \mu)^2\), it
must be nonnegative according to Theorem 4.2.2. If X is bounded, then the mean
exists, and hence the variance exists. Furthermore, if X is bounded, then so too is
\((X - \mu)^2\), so the variance must be finite. ■


The next theorem shows that the variance of a random variable X cannot be 0 unless 
the entire probability distribution of X is concentrated at a single point. 


Theorem 4.3.3 Var(X) = 0 if and only if there exists a constant c such that Pr(X = c) = 1.


Proof Suppose first that there exists a constant c such that Pr(X = c) = 1. Then
E(X) = c, and \(\Pr[(X - c)^2 = 0] = 1\). Therefore,

\[ \mathrm{Var}(X) = E[(X - c)^2] = 0. \]

Conversely, suppose that Var(X) = 0. Then \(\Pr[(X - \mu)^2 \ge 0] = 1\) but
\(E[(X - \mu)^2] = 0\). It follows from Theorem 4.2.3 that

\[ \Pr[(X - \mu)^2 = 0] = 1. \]

Hence, \(\Pr(X = \mu) = 1\). ■


Figure 4.6 The p.d.f. of a random variable X together with the p.d.f.'s of X + 3 and
-X. Note that the spreads of all three distributions appear the same.

Theorem 4.3.4 For constants a and b, let Y = aX + b. Then

\[ \mathrm{Var}(Y) = a^2\,\mathrm{Var}(X), \]

and \(\sigma_Y = |a|\sigma_X\).

Proof If \(E(X) = \mu\), then \(E(Y) = a\mu + b\) by Theorem 4.2.1. Therefore,

\[ \mathrm{Var}(Y) = E[(aX + b - a\mu - b)^2] = E[(aX - a\mu)^2] = a^2 E[(X - \mu)^2] = a^2\,\mathrm{Var}(X). \]

Taking the square root of Var(Y) yields \(|a|\sigma_X\). ■

It follows from Theorem 4.3.4 that Var(X + b) = Var(X) for every constant b. 
This result is intuitively plausible, since shifting the entire distribution of X a distance 
of b units along the real line will change the mean of the distribution by b units but 
the shift will not affect the dispersion of the distribution around its mean. Figure 4.6 
shows the p.d.f. of a random variable X together with the p.d.f. of X + 3 to illustrate
how a shift of the distribution does not affect the spread. 

Similarly, it follows from Theorem 4.3.4 that Var(—X) = Var(X). This result also 
is intuitively plausible, since reflecting the entire distribution of X with respect to the 
origin of the real line will result in a new distribution that is the mirror image of the 
original one. The mean will be changed from \(\mu\) to \(-\mu\), but the total dispersion of
the distribution around its mean will not be affected. Figure 4.6 shows the p.d.f. of a 
random variable X together with the p.d.f. of —X to illustrate how a reflection of the 
distribution does not affect the spread. 


Example 4.3.6 Calculating the Variance and Standard Deviation of a Linear Function. Consider the same
random variable X as in Example 4.3.3, which takes each of the five values -2, 0, 1, 3,
and 4 with equal probability. We shall determine the variance and standard deviation
of Y = 4X - 7.

In Example 4.3.3, we computed the mean of X as \(\mu = 1.2\) and the variance as
4.56. By Theorem 4.3.4,

\[ \mathrm{Var}(Y) = 16\,\mathrm{Var}(X) = 72.96. \]


Also, the standard deviation \(\sigma_Y\) of Y is

\[ \sigma_Y = 4\sigma_X = 4(4.56)^{1/2} = 8.54. \]
◀

The next theorem provides an alternative method for calculating the variance of
a sum of independent random variables.


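Theorem 4.3.5 If \(X_1, \ldots, X_n\) are independent random variables with finite means, then

\[ \mathrm{Var}(X_1 + \cdots + X_n) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n). \]

Proof Suppose first that n = 2. If \(E(X_1) = \mu_1\) and \(E(X_2) = \mu_2\), then

\[ E(X_1 + X_2) = \mu_1 + \mu_2. \]

Therefore,

\[\begin{aligned}
\mathrm{Var}(X_1 + X_2) &= E[(X_1 + X_2 - \mu_1 - \mu_2)^2] \\
&= E[(X_1 - \mu_1)^2 + (X_2 - \mu_2)^2 + 2(X_1 - \mu_1)(X_2 - \mu_2)] \\
&= \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + 2E[(X_1 - \mu_1)(X_2 - \mu_2)].
\end{aligned}\]

Since \(X_1\) and \(X_2\) are independent,

\[\begin{aligned}
E[(X_1 - \mu_1)(X_2 - \mu_2)] &= E(X_1 - \mu_1)\,E(X_2 - \mu_2) \\
&= (\mu_1 - \mu_1)(\mu_2 - \mu_2) \\
&= 0.
\end{aligned}\]

It follows, therefore, that

\[ \mathrm{Var}(X_1 + X_2) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2). \]

The theorem can now be established for each positive integer n by an induction
argument. ■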


It should be emphasized that the random variables in Theorem 4.3.5 must be 
independent. The variance of the sum of random variables that are not independent 
will be discussed in Sec. 4.6. By combining Theorems 4.3.4 and 4.3.5, we can now 
obtain the following corollary. 


Corollary 4.3.1 If \(X_1, \ldots, X_n\) are independent random variables with finite means, and if \(a_1, \ldots, a_n\)
and b are arbitrary constants, then

\[ \mathrm{Var}(a_1X_1 + \cdots + a_nX_n + b) = a_1^2\,\mathrm{Var}(X_1) + \cdots + a_n^2\,\mathrm{Var}(X_n). \]
■
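
The corollary can be checked exactly for small discrete distributions by building the p.f. of the linear combination from the product of the marginal p.f.'s. The sketch below is illustrative only: the second distribution, the constants `a1`, `a2`, `b`, and the helper `moments` are hypothetical choices made for the example.

```python
from fractions import Fraction
from itertools import product


def moments(pf):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mean = sum(p * x for x, p in pf.items())
    var = sum(p * (x - mean) ** 2 for x, p in pf.items())
    return mean, var


# X1 is the variable of Example 4.3.3; X2 is a hypothetical Bernoulli-type variable.
fifth = Fraction(1, 5)
pf_x1 = {-2: fifth, 0: fifth, 1: fifth, 3: fifth, 4: fifth}
pf_x2 = {0: Fraction(1, 4), 1: Fraction(3, 4)}
a1, a2, b = 4, -3, 7  # arbitrary constants chosen for illustration

# p.f. of a1*X1 + a2*X2 + b; independence means the joint probabilities multiply.
pf_sum = {}
for (x1, p1), (x2, p2) in product(pf_x1.items(), pf_x2.items()):
    y = a1 * x1 + a2 * x2 + b
    pf_sum[y] = pf_sum.get(y, 0) + p1 * p2

var_direct = moments(pf_sum)[1]
var_formula = a1 ** 2 * moments(pf_x1)[1] + a2 ** 2 * moments(pf_x2)[1]
print(var_direct, var_formula)
```

The constant b drops out and each coefficient enters squared, so the sign of `a2` has no effect on the result.
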


Example 4.3.7 Investment Portfolio. An investor with $100,000 to invest wishes to construct a
portfolio consisting of shares of one or both of two available stocks and possibly some
fixed-rate investments. Suppose that the two stocks have random rates of return \(R_1\)
and \(R_2\) per share for a period of one year. Suppose that \(R_1\) has a distribution with
mean 6 and variance 55, while \(R_2\) has mean 4 and variance 28. Suppose that the first
stock costs $60 per share and the second costs $48 per share. Suppose that money
can also be invested at a fixed rate of 3.6 percent per year. The portfolio will consist
of \(s_1\) shares of the first stock, \(s_2\) shares of the second stock, and all remaining money
(\(s_3\) dollars) invested at the fixed rate. The return on this portfolio will be

\[ s_1R_1 + s_2R_2 + 0.036 s_3, \]

where the coefficients are constrained by

\[ 60s_1 + 48s_2 + s_3 = 100{,}000, \tag{4.3.2} \]

as well as \(s_1, s_2, s_3 \ge 0\). For now, we shall assume that \(R_1\) and \(R_2\) are independent. The
mean and the variance of the return on the portfolio will be

\[ E(s_1R_1 + s_2R_2 + 0.036 s_3) = 6s_1 + 4s_2 + 0.036 s_3, \]
\[ \mathrm{Var}(s_1R_1 + s_2R_2 + 0.036 s_3) = 55s_1^2 + 28s_2^2. \]

Figure 4.7 The set of all means and variances of investment portfolios in
Example 4.3.7. The solid vertical line shows the range of possible variances for
portfolios with a mean of 7000.



One method for comparing a class of portfolios is to say that portfolio A is at least 
as good as portfolio B if the mean return for A is at least as large as the mean return 
for B and if the variance for A is no larger than the variance of B. (See Markowitz, 
1987, for a classic treatment of such methods.) The reason for preferring smaller 
variance is that large variance is associated with large deviations from the mean, 
and for portfolios with a common mean, some of the large deviations are going to 
have to be below the mean, leading to the risk of large losses. Figure 4.7 is a plot 
of the pairs (mean, variance) for all of the possible portfolios in this example. That 
is, for each \((s_1, s_2, s_3)\) that satisfy (4.3.2), there is a point in the outlined region of
Fig. 4.7. The points to the right and toward the bottom are those that have the largest
mean return for a fixed variance, and the ones that have the smallest variance for
a fixed mean return. These portfolios are called efficient. For example, suppose that
the investor would like a mean return of 7000. The vertical line segment above 7000
on the horizontal axis in Fig. 4.7 indicates the possible variances of all portfolios with
mean return of 7000. Among these, the portfolio with the smallest variance is efficient
and is indicated in Fig. 4.7. This portfolio has \(s_1 = 524.7\), \(s_2 = 609.7\), \(s_3 = 39{,}250\), and
variance \(2.55 \times 10^7\). So, every portfolio with mean return greater than 7000 must have
variance larger than \(2.55 \times 10^7\), and every portfolio with variance less than \(2.55 \times 10^7\)
must have mean return smaller than 7000. ◀
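
The efficient portfolio quoted in the example can be recovered numerically. The text does not say how it was found; the sketch below is one possible route, assuming the nonnegativity constraints are not binding (the result confirms this) and using a standard Lagrange-multiplier argument. All variable names are illustrative.

```python
# Data from Example 4.3.7.
mean_r1, var_r1, price1 = 6.0, 55.0, 60.0
mean_r2, var_r2, price2 = 4.0, 28.0, 48.0
fixed_rate, budget, target_mean = 0.036, 100_000.0, 7000.0

# Substituting s3 = budget - price1*s1 - price2*s2 into the mean return gives
#   mean = fixed_rate*budget + c1*s1 + c2*s2.
c1 = mean_r1 - fixed_rate * price1
c2 = mean_r2 - fixed_rate * price2
excess = target_mean - fixed_rate * budget

# Minimizing var_r1*s1^2 + var_r2*s2^2 subject to c1*s1 + c2*s2 = excess
# makes s_i proportional to c_i / var_i (assumed unconstrained solution).
scale = excess / (c1 ** 2 / var_r1 + c2 ** 2 / var_r2)
s1 = scale * c1 / var_r1
s2 = scale * c2 / var_r2
s3 = budget - price1 * s1 - price2 * s2
variance = var_r1 * s1 ** 2 + var_r2 * s2 ** 2

print(round(s1, 1), round(s2, 1), round(s3), f"{variance:.3g}")
```

The computed values agree with the rounded figures in the example, and all three holdings come out nonnegative, so ignoring the inequality constraints was harmless here.
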


The Variance of a Binomial Distribution 


We shall now consider again the method of generating a binomial distribution pre- 
sented in Sec. 4.2. Suppose that a box contains red balls and blue balls, and that the 
proportion of red balls is p (0 < p < 1). Suppose also that a random sample of n balls 
is selected from the box with replacement. For \(i = 1, \ldots, n\), let \(X_i = 1\) if the ith ball
that is selected is red, and let \(X_i = 0\) otherwise. If X denotes the total number of red
balls in the sample, then \(X = X_1 + \cdots + X_n\) and X will have the binomial distribution
with parameters n and p.

Figure 4.8 Two binomial distributions with the same mean (2.5) but different variances.

Since \(X_1, \ldots, X_n\) are independent, it follows from Theorem 4.3.5 that

\[ \mathrm{Var}(X) = \sum_{i=1}^{n} \mathrm{Var}(X_i). \]

According to Example 4.1.3, \(E(X_i) = p\) for \(i = 1, \ldots, n\). Since \(X_i^2 = X_i\) for each i,
\(E(X_i^2) = E(X_i) = p\). Therefore, by Theorem 4.3.1,

\[ \mathrm{Var}(X_i) = E(X_i^2) - [E(X_i)]^2 = p - p^2 = p(1 - p). \]

It now follows that

\[ \mathrm{Var}(X) = np(1 - p). \tag{4.3.3} \]


Figure 4.8 compares two different binomial distributions with the same mean 
(2.5) but different variances (1.25 and 1.875). One can see how the p.f. of the distri- 
bution with the larger variance (n = 10, p = 0.25) is higher at more extreme values 
and lower at more central values than is the p.f. of the distribution with the smaller 
variance (n = 5, p = 0.5). Similarly, Fig. 4.5 compares two uniform distributions with 
the same mean (30) and different variances (8.33 and 75). The same pattern appears, 
namely that the distribution with larger variance has higher p.d.f. at more extreme
values and lower p.d.f. at more central values. 
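
Equation (4.3.3) can be confirmed directly from the binomial p.f. for the two distributions compared in Fig. 4.8. This is a minimal Python sketch using only the standard library; the function names are illustrative.

```python
from math import comb


def binomial_pf(n, p):
    """List of Pr(X = k) for k = 0, ..., n under the binomial distribution."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]


def mean_and_variance(pf):
    """Mean and variance of a distribution on 0, 1, ..., len(pf) - 1."""
    mean = sum(k * q for k, q in enumerate(pf))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(pf))
    return mean, var


for n, p in [(5, 0.5), (10, 0.25)]:
    mean, var = mean_and_variance(binomial_pf(n, p))
    # Last column is n p (1 - p) from Eq. (4.3.3), for comparison.
    print(n, p, round(mean, 6), round(var, 6), n * p * (1 - p))
```
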


Interquartile Range 


Example 4.3.8 The Cauchy Distribution. In Example 4.1.8, we saw a distribution (the Cauchy
distribution) whose mean did not exist, and hence its variance does not exist. But, we
might still want to describe how spread out such a distribution is. For example, if X
has the Cauchy distribution and Y = 2X, it stands to reason that Y is twice as spread
out as X is, but how do we quantify this? ◀


There is a measure of spread that exists for every distribution, regardless of 
whether or not the distribution has a mean or variance. Recall from Definition 3.3.2 
that the quantile function for a random variable is the inverse of the c.d.f., and it is
defined for every random variable. 


Definition 4.3.2 Interquartile Range (IQR). Let X be a random variable with quantile function \(F^{-1}(p)\)
for 0 < p < 1. The interquartile range (IQR) is defined to be \(F^{-1}(0.75) - F^{-1}(0.25)\).


In words, the IQR is the length of the interval that contains the middle half of the
distribution.


Example 4.3.9 The Cauchy Distribution. Let X have the Cauchy distribution. The c.d.f. F of X can
be found using a trigonometric substitution in the following integral:

\[ F(x) = \int_{-\infty}^{x} \frac{dy}{\pi(1 + y^2)} = \frac{1}{2} + \frac{\tan^{-1}(x)}{\pi}, \]

where \(\tan^{-1}(x)\) is the principal inverse of the tangent function, taking values from
\(-\pi/2\) to \(\pi/2\) as x runs from \(-\infty\) to \(\infty\). The quantile function of X is then
\(F^{-1}(p) = \tan[\pi(p - 1/2)]\) for 0 < p < 1. The IQR is

\[ F^{-1}(0.75) - F^{-1}(0.25) = \tan(\pi/4) - \tan(-\pi/4) = 2. \]

It is not difficult to show that, if Y = 2X, then the IQR of Y is 4. (See Exercise 14.) ◀
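
The two IQR values in this example can be reproduced from the quantile function. The sketch below is illustrative; the helper names `cauchy_quantile` and `iqr` are not from the text.

```python
from math import pi, tan


def cauchy_quantile(p):
    """Quantile function F^{-1}(p) = tan[pi (p - 1/2)] of the Cauchy distribution."""
    return tan(pi * (p - 0.5))


def iqr(quantile):
    """Interquartile range computed from a quantile function."""
    return quantile(0.75) - quantile(0.25)


iqr_x = iqr(cauchy_quantile)
# If Y = 2X, then Pr(Y <= y) = Pr(X <= y/2), so the quantile function of Y is 2 F^{-1}(p).
iqr_y = iqr(lambda p: 2 * cauchy_quantile(p))
print(iqr_x, iqr_y)
```

Because floating-point evaluation of the tangent is not exact, the printed values may differ from 2 and 4 in the last digit.
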


Summary 


The variance of X, denoted by Var(X), is the mean of \([X - E(X)]^2\) and measures how
spread out the distribution of X is. The variance also equals \(E(X^2) - [E(X)]^2\). The
standard deviation is the square root of the variance. The variance of aX + b, where
a and b are constants, is \(a^2\,\mathrm{Var}(X)\). The variance of the sum of independent random
variables is the sum of the variances. As an example, the variance of the binomial 
distribution with parameters n and p is np(1 — p). The interquartile range (IQR) is 
the difference between the 0.75 and 0.25 quantiles. The IQR is a measure of spread 
that exists for every distribution. 


Exercises 


1. Suppose that X has the uniform distribution on the 
interval [0, 1]. Compute the variance of X. 


2. Suppose that one word is selected at random from the 
sentence THE GIRL PUT ON HER BEAUTIFUL RED HAT. If X 
denotes the number of letters in the word that is selected, 
what is the value of Var(X)? 


3. For all numbers a and b such that a < b, find the vari- 
ance of the uniform distribution on the interval [a, b]. 


4. Suppose that X is a random variable for which \(E(X) = \mu\)
and \(\mathrm{Var}(X) = \sigma^2\). Show that

\[ E[X(X - 1)] = \mu(\mu - 1) + \sigma^2. \]


5. Let X be a random variable for which \(E(X) = \mu\) and
\(\mathrm{Var}(X) = \sigma^2\), and let c be an arbitrary constant. Show that

\[ E[(X - c)^2] = (\mu - c)^2 + \sigma^2. \]


6. Suppose that X and Y are independent random variables
whose variances exist and such that E(X) = E(Y).
Show that

\[ E[(X - Y)^2] = \mathrm{Var}(X) + \mathrm{Var}(Y). \]


7. Suppose that X and Y are independent random vari- 
ables for which Var(X) = Var(Y) = 3. Find the values of 
(a) Var(X — Y) and (b) Var(2X — 3Y + 1). 


8. Construct an example of a distribution for which the 
mean is finite but the variance is infinite. 


9. Let X have the discrete uniform distribution on the
integers 1, ..., n. Compute the variance of X. Hint: You
may wish to use the formula \(\sum_{k=1}^{n} k^2 = n(n + 1)(2n + 1)/6\).


10. Consider the example efficient portfolio at the end of
Example 4.3.7. Suppose that \(R_i\) has the uniform distribution
on the interval \([a_i, b_i]\) for i = 1, 2.
a. Find the two intervals \([a_1, b_1]\) and \([a_2, b_2]\). Hint: The
intervals are determined by the means and variances.
b. Find the value at risk (VaR) for the example portfolio
at probability level 0.97. Hint: Review Example 3.9.5
to see how to find the p.d.f. of the sum of two uniform
random variables.


11. Let X have the uniform distribution on the interval
[0, 1]. Find the IQR of X.


12. Let X have the p.d.f. \(f(x) = \exp(-x)\) for \(x \ge 0\), and
f(x) = 0 for x < 0. Find the IQR of X.


13. Let X have the binomial distribution with parameters
5 and 0.3. Find the IQR of X. Hint: Return to Example 3.3.9
and Table 3.1.


14. Let X be a random variable whose interquartile range
is \(\eta\). Let Y = 2X. Prove that the interquartile range of Y is
\(2\eta\).


4.4 Moments 


For a random variable X, the means of powers \(X^k\) (called moments) for k > 2
have useful theoretical properties, and some of them are used for additional
summaries of a distribution. The moment generating function is a related tool 
that aids in deriving distributions of sums of independent random variables and 
limiting properties of distributions. 


Existence of Moments 


For each random variable X and every positive integer k, the expectation \(E(X^k)\) is
called the kth moment of X. In particular, in accordance with this terminology, the
mean of X is the first moment of X.

It is said that the kth moment exists if and only if \(E(|X|^k) < \infty\). If the random
variable X is bounded, that is, if there are finite numbers a and b such that
\(\Pr(a \le X \le b) = 1\), then all moments of X must necessarily exist. It is possible, however, that
all moments of X exist even though X is not bounded. It is shown in the next theorem
that if the kth moment of X exists, then all moments of lower order must also exist.


Theorem 4.4.1 If \(E(|X|^k) < \infty\) for some positive integer k, then \(E(|X|^j) < \infty\) for every positive
integer j such that j < k.


Proof We shall assume, for convenience, that the distribution of X is continuous and
the p.d.f. is f. Then

\[\begin{aligned}
E(|X|^j) &= \int_{-\infty}^{\infty} |x|^j f(x)\,dx \\
&= \int_{|x| \le 1} |x|^j f(x)\,dx + \int_{|x| > 1} |x|^j f(x)\,dx \\
&\le \int_{|x| \le 1} 1 \cdot f(x)\,dx + \int_{|x| > 1} |x|^k f(x)\,dx \\
&\le \Pr(|X| \le 1) + E(|X|^k).
\end{aligned}\]

By hypothesis, \(E(|X|^k) < \infty\). It therefore follows that \(E(|X|^j) < \infty\). A similar proof
holds for a discrete or a more general type of distribution. ■


In particular, it follows from Theorem 4.4.1 that if \(E(X^2) < \infty\), then both the
mean of X and the variance of X exist. Theorem 4.4.1 extends to the case in which
j and k are arbitrary positive numbers rather than just integers. (See Exercise 15 in
this section.) We will not make use of such a result in this text, however.


Central Moments Suppose that X is a random variable for which \(E(X) = \mu\). For
every positive integer k, the expectation \(E[(X - \mu)^k]\) is called the kth central moment
of X or the kth moment of X about the mean. In particular, in accordance with this
terminology, the variance of X is the second central moment of X.

For every distribution, the first central moment must be 0 because

\[ E(X - \mu) = \mu - \mu = 0. \]

Furthermore, if the distribution of X is symmetric with respect to its mean \(\mu\), and if
the central moment \(E[(X - \mu)^k]\) exists for a given odd integer k, then the value of
\(E[(X - \mu)^k]\) will be 0 because the positive and negative terms in this expectation will
cancel one another.


Example 4.4.1 A Symmetric p.d.f. Suppose that X has a continuous distribution for which the p.d.f.
has the following form:

\[ f(x) = c e^{-(x-3)^2/2} \quad \text{for } -\infty < x < \infty. \]

We shall determine the mean of X and all the central moments.
It can be shown that for every positive integer k,

\[ \int_{-\infty}^{\infty} |x|^k e^{-(x-3)^2/2}\,dx < \infty. \]

Hence, all the moments of X exist. Furthermore, since f(x) is symmetric with respect
to the point x = 3, then E(X) = 3. Because of this symmetry, it also follows that
\(E[(X - 3)^k] = 0\) for every odd positive integer k. For even k = 2n, we can find a
recursive formula for the sequence of central moments. First, let \(y = x - \mu\) in all
the integral formulas. Then, for \(n \ge 1\), the 2nth central moment is

\[ m_{2n} = \int_{-\infty}^{\infty} y^{2n} c e^{-y^2/2}\,dy. \]

Use integration by parts with \(u = y^{2n-1}\) and \(dv = c y e^{-y^2/2}\,dy\). It follows that
\(du = (2n - 1) y^{2n-2}\,dy\) and \(v = -c e^{-y^2/2}\). So,

\[\begin{aligned}
m_{2n} &= \int_{-\infty}^{\infty} u\,dv = uv\Big|_{y=-\infty}^{\infty} - \int_{-\infty}^{\infty} v\,du \\
&= -y^{2n-1} c e^{-y^2/2}\Big|_{y=-\infty}^{\infty} + (2n - 1)\int_{-\infty}^{\infty} y^{2n-2} c e^{-y^2/2}\,dy \\
&= (2n - 1)\,m_{2(n-1)}.
\end{aligned}\]

Because \(y^0 = 1\), \(m_0\) is just the integral of the p.d.f.; hence, \(m_0 = 1\). It follows that
\(m_{2n} = \prod_{i=1}^{n} (2i - 1)\) for n = 1, 2, .... So, for example, \(m_2 = 1\), \(m_4 = 3\), \(m_6 = 15\), and so
on. ◀
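
The recursion for the even central moments is easy to unroll by machine. The following Python sketch is illustrative only; the function name `central_moment` is hypothetical, and the closed form used for comparison, \((2n)!/(2^n n!)\), is the standard expression for the product of the first n odd integers rather than something stated in the text.

```python
from math import factorial


def central_moment(k):
    """kth central moment of the distribution in Example 4.4.1."""
    if k % 2 == 1:
        return 0  # odd central moments vanish by symmetry
    if k == 0:
        return 1  # m_0 is the integral of the p.d.f.
    return (k - 1) * central_moment(k - 2)  # m_{2n} = (2n - 1) m_{2(n-1)}


recursive = [central_moment(2 * n) for n in range(5)]
# Product of the first n odd integers written as (2n)! / (2^n n!).
closed_form = [factorial(2 * n) // (2 ** n * factorial(n)) for n in range(5)]
print(recursive, closed_form)
```
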


Skewness In Example 4.4.1, we saw that the odd central moments are all 0 for a 
distribution that is symmetric. This leads to the following distributional summary that 
is used to measure lack of symmetry. 


Definition 4.4.1 Skewness. Let X be a random variable with mean \(\mu\), standard deviation \(\sigma\), and finite
third moment. The skewness of X is defined to be \(E[(X - \mu)^3]/\sigma^3\).


The reason for dividing the third central moment by \(\sigma^3\) is to make the skewness
measure only the lack of symmetry rather than the spread of the distribution.


Example 4.4.2 Skewness of Binomial Distributions. Let X have the binomial distribution with
parameters 10 and 0.25. The p.f. of this distribution appears in Fig. 4.8. It is not difficult to
see that the p.f. is not symmetric. The skewness can be computed as follows: First,
note that the mean is \(\mu = 10 \times 0.25 = 2.5\) and that the standard deviation is

\[ \sigma = (10 \times 0.25 \times 0.75)^{1/2} = 1.369. \]

Second, compute

\[ E[(X - 2.5)^3] = (0 - 2.5)^3 \binom{10}{0} 0.25^{0}\,0.75^{10} + \cdots + (10 - 2.5)^3 \binom{10}{10} 0.25^{10}\,0.75^{0} = 0.9375. \]

Finally, the skewness is

\[ \frac{0.9375}{1.369^3} = 0.3652. \]

For comparison, the skewness of the binomial distribution with parameters 10 and 0.2
is 0.4743, and the skewness of the binomial distribution with parameters 10 and 0.3
is 0.2761. The absolute value of the skewness increases as the probability of success
moves away from 0.5. It is straightforward to show that the skewness of the binomial
distribution with parameters n and p is the negative of the skewness of the binomial
distribution with parameters n and 1 - p. (See Exercise 16 in this section.) ◀
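
The skewness calculation can be repeated for any binomial distribution by summing over the p.f. The sketch below is a minimal illustration with a hypothetical function name; because the example works with a rounded standard deviation, the last printed digit may differ slightly from the figures quoted above.

```python
from math import comb, sqrt


def binomial_skewness(n, p):
    """Skewness E[(X - mu)^3] / sigma^3 computed directly from the binomial p.f."""
    pf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pf))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(pf))
    third = sum((k - mean) ** 3 * q for k, q in enumerate(pf))
    return third / sqrt(var) ** 3


for p in (0.2, 0.25, 0.3, 0.75):
    print(p, round(binomial_skewness(10, p), 4))
```

Including p = 0.75 alongside p = 0.25 shows the sign reversal described at the end of the example.
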


Moment Generating Functions 


We shall now consider a different way to characterize the distribution of a random 
variable that is more closely related to its moments than to where its probability is 
distributed. 


Definition 4.4.2 Moment Generating Function. Let X be a random variable. For each real number t,
define

\[ \psi(t) = E(e^{tX}). \tag{4.4.1} \]

The function \(\psi(t)\) is called the moment generating function (abbreviated m.g.f.) of X.


Note: The Moment Generating Function of X Depends Only on the Distribution
of X. Since the m.g.f. is the expected value of a function of X, it must depend only
on the distribution of X. If X and Y have the same distribution, they must have the
same m.g.f.

If the random variable X is bounded, then the expectation in Eq. (4.4.1) must
be finite for all values of t. In this case, therefore, the m.g.f. of X will be finite for all
values of t. On the other hand, if X is not bounded, then the m.g.f. might be finite for
some values of t and might not be finite for others. It can be seen from Eq. (4.4.1),
however, that for every random variable X, the m.g.f. \(\psi(t)\) must be finite at the point
t = 0 and at that point its value must be \(\psi(0) = E(1) = 1\).

The next result explains how the name “moment generating function” arose. 


Theorem 4.4.2 Let X be a random variable whose m.g.f. \(\psi(t)\) is finite for all values of t in some open
interval around the point t = 0. Then, for each integer n > 0, the nth moment of X,
\(E(X^n)\), is finite and equals the nth derivative \(\psi^{(n)}(t)\) at t = 0. That is, \(E(X^n) = \psi^{(n)}(0)\)
for n = 1, 2, ....


We sketch the proof at the end of this section.


Example 4.4.3 Calculating an m.g.f. Suppose that X is a random variable for which the p.d.f. is as
follows:

\[ f(x) = \begin{cases} e^{-x} & \text{for } x > 0, \\ 0 & \text{otherwise.} \end{cases} \]

We shall determine the m.g.f. of X and also Var(X).
For each real number t,

\[ \psi(t) = E(e^{tX}) = \int_0^{\infty} e^{tx} e^{-x}\,dx = \int_0^{\infty} e^{(t-1)x}\,dx. \]

The final integral in this equation will be finite if and only if t < 1. Therefore, \(\psi(t)\) is
finite only for t < 1. For each such value of t,

\[ \psi(t) = \frac{1}{1 - t}. \]

Since \(\psi(t)\) is finite for all values of t in an open interval around the point t = 0,
all moments of X exist. The first two derivatives of \(\psi\) are

\[ \psi'(t) = \frac{1}{(1 - t)^2} \quad \text{and} \quad \psi''(t) = \frac{2}{(1 - t)^3}. \]

Therefore, \(E(X) = \psi'(0) = 1\) and \(E(X^2) = \psi''(0) = 2\). It now follows that

\[ \mathrm{Var}(X) = \psi''(0) - [\psi'(0)]^2 = 1. \]
◀
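
The derivatives at t = 0 can also be approximated numerically, which gives a quick check on the algebra. The sketch below uses central finite differences; the step size `h` is an arbitrary illustrative choice, and the results are approximations rather than exact values.

```python
def mgf(t):
    """m.g.f. from Example 4.4.3, psi(t) = 1 / (1 - t), finite only for t < 1."""
    if t >= 1:
        raise ValueError("psi(t) is finite only for t < 1")
    return 1.0 / (1.0 - t)


h = 1e-3  # finite-difference step (assumed small enough for illustration)
first = (mgf(h) - mgf(-h)) / (2 * h)                  # approximates psi'(0) = E(X)
second = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h ** 2   # approximates psi''(0) = E(X^2)
variance = second - first ** 2
print(first, second, variance)
```
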


Properties of Moment Generating Functions 


We shall now present three basic theorems pertaining to moment generating func- 
tions. 


Theorem 4.4.3 Let X be a random variable for which the m.g.f. is \(\psi_1\); let Y = aX + b, where a and b
are given constants; and let \(\psi_2\) denote the m.g.f. of Y. Then for every value of t such
that \(\psi_1(at)\) is finite,

\[ \psi_2(t) = e^{bt}\psi_1(at). \tag{4.4.2} \]

Proof By the definition of an m.g.f.,

\[ \psi_2(t) = E(e^{tY}) = E[e^{t(aX + b)}] = e^{bt}E(e^{atX}) = e^{bt}\psi_1(at). \]
■


Example 4.4.4 Calculating the m.g.f. of a Linear Function. Suppose that the distribution of X is as
specified in Example 4.4.3. We saw that the m.g.f. of X for t < 1 is

\[ \psi_1(t) = \frac{1}{1 - t}. \]

If Y = 3 - 2X, then the m.g.f. of Y is finite for t > -1/2 and will have the value

\[ \psi_2(t) = e^{3t}\psi_1(-2t) = \frac{e^{3t}}{1 + 2t}. \]


The next theorem shows that the m.g.f. of the sum of an arbitrary number of 
independent random variables has a very simple form. Because of this property, the 
m.g.f. is an important tool in the study of such sums. 


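Theorem 4.4.4 Suppose that \(X_1, \ldots, X_n\) are n independent random variables; and for \(i = 1, \ldots, n\),
let \(\psi_i\) denote the m.g.f. of \(X_i\). Let \(Y = X_1 + \cdots + X_n\), and let the m.g.f. of Y be denoted
by \(\psi\). Then for every value of t such that \(\psi_i(t)\) is finite for \(i = 1, \ldots, n\),

\[ \psi(t) = \prod_{i=1}^{n} \psi_i(t). \tag{4.4.3} \]

Proof By definition,

\[ \psi(t) = E(e^{tY}) = E[e^{t(X_1 + \cdots + X_n)}] = E\left(\prod_{i=1}^{n} e^{tX_i}\right). \]

Since the random variables \(X_1, \ldots, X_n\) are independent, it follows from
Theorem 4.2.6 that

\[ E\left(\prod_{i=1}^{n} e^{tX_i}\right) = \prod_{i=1}^{n} E(e^{tX_i}). \]

Hence,

\[ \psi(t) = \prod_{i=1}^{n} \psi_i(t). \]
■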

The Moment Generating Function for the Binomial Distribution Suppose that
a random variable X has the binomial distribution with parameters n and p. In
Sections 4.2 and 4.3, the mean and the variance of X were determined by representing
X as the sum of n independent random variables \(X_1, \ldots, X_n\). In this representation,
the distribution of each variable \(X_i\) is as follows:

\[ \Pr(X_i = 1) = p \quad \text{and} \quad \Pr(X_i = 0) = 1 - p. \]

We shall now use this representation to determine the m.g.f. of \(X = X_1 + \cdots + X_n\).
Since each of the random variables \(X_1, \ldots, X_n\) has the same distribution, the
m.g.f. of each variable will be the same. For \(i = 1, \ldots, n\), the m.g.f. of \(X_i\) is

\[ \psi_i(t) = E(e^{tX_i}) = (e^t)\Pr(X_i = 1) + (1)\Pr(X_i = 0) = pe^t + 1 - p. \]

It follows from Theorem 4.4.4 that the m.g.f. of X in this case is

\[ \psi(t) = (pe^t + 1 - p)^n. \tag{4.4.4} \]
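
Equation (4.4.4) can be compared against the definition of the m.g.f. evaluated term by term from the binomial p.f. The sketch below is illustrative; the function names and the particular values of n, p, and t are arbitrary choices.

```python
from math import comb, exp


def mgf_from_pf(n, p, t):
    """E(e^{tX}) computed directly by summing over the binomial p.f."""
    return sum(exp(t * k) * comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1))


def mgf_closed_form(n, p, t):
    """Eq. (4.4.4): (p e^t + 1 - p)^n."""
    return (p * exp(t) + 1 - p) ** n


for t in (-1.0, 0.0, 0.5):
    print(t, mgf_from_pf(10, 0.25, t), mgf_closed_form(10, 0.25, t))
```
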


Uniqueness of Moment Generating Functions We shall now state one more im- 
portant property of the m.g.f. The proof of this property is beyond the scope of this 
book and is omitted. 


Theorem 4.4.5 If the m.g.f.'s of two random variables \(X_1\) and \(X_2\) are finite and identical for all values
of t in an open interval around the point t = 0, then the probability distributions of
\(X_1\) and \(X_2\) must be identical. ■


Theorem 4.4.5 is the justification for the claim made at the start of this discussion,
namely, that the m.g.f. is another way to characterize the distribution of a random
variable.


The Additive Property of the Binomial Distribution Moment generating functions 
provide a simple way to derive the distribution of the sum of two independent 
binomial random variables with the same second parameter. 


Theorem 4.4.6 If \(X_1\) and \(X_2\) are independent random variables, and if \(X_i\) has the binomial
distribution with parameters \(n_i\) and p (i = 1, 2), then \(X_1 + X_2\) has the binomial distribution
with parameters \(n_1 + n_2\) and p.

Proof Let \(\psi_i\) denote the m.g.f. of \(X_i\) for i = 1, 2. It follows from Eq. (4.4.4) that

\[ \psi_i(t) = (pe^t + 1 - p)^{n_i}. \]

Let \(\psi\) denote the m.g.f. of \(X_1 + X_2\). Then, by Theorem 4.4.4,

\[ \psi(t) = (pe^t + 1 - p)^{n_1 + n_2}. \]

It can be seen from Eq. (4.4.4) that this function \(\psi\) is the m.g.f. of the binomial
distribution with parameters \(n_1 + n_2\) and p. Hence, by Theorem 4.4.5, the distribution
of \(X_1 + X_2\) must be that binomial distribution. ■
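
The additive property can also be observed without m.g.f.'s by convolving the two p.f.'s numerically. This sketch is only an illustration of the statement, not a proof; the parameters `n1`, `n2`, `p` and the helper names are hypothetical.

```python
from math import comb


def binomial_pf(n, p):
    """List of Pr(X = k) for k = 0, ..., n under the binomial distribution."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]


def convolve(pf1, pf2):
    """p.f. of the sum of two independent nonnegative-integer random variables."""
    out = [0.0] * (len(pf1) + len(pf2) - 1)
    for i, q1 in enumerate(pf1):
        for j, q2 in enumerate(pf2):
            out[i + j] += q1 * q2
    return out


n1, n2, p = 3, 4, 0.3  # illustrative parameters
pf_sum = convolve(binomial_pf(n1, p), binomial_pf(n2, p))
pf_direct = binomial_pf(n1 + n2, p)
print(max(abs(a - b) for a, b in zip(pf_sum, pf_direct)))
```

The printed maximum discrepancy is at the level of floating-point rounding, as Theorem 4.4.6 predicts. The check relies on both variables sharing the same p.
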


Sketch of the Proof of Theorem 4.4.2 


First, we indicate why all moments of X are finite. Let t > 0 be such that both \(\psi(t)\)
and \(\psi(-t)\) are finite. Define \(g(x) = e^{tx} + e^{-tx}\). Notice that

\[ E[g(X)] = \psi(t) + \psi(-t) < \infty. \tag{4.4.5} \]

On every bounded interval of x values, g(x) is bounded. For each integer n > 0, as
\(|x| \to \infty\), g(x) is eventually larger than \(|x|^n\). It follows from these facts and (4.4.5)
that \(E|X^n| < \infty\).

Although it is beyond the scope of this book, it can be shown that the derivative
\(\psi'(t)\) exists at the point t = 0, and that at t = 0, the derivative of the expectation in
Eq. (4.4.1) must be equal to the expectation of the derivative. Thus,

\[ \psi'(0) = \left[\frac{d}{dt}E(e^{tX})\right]_{t=0} = E\left[\left(\frac{d}{dt}e^{tX}\right)_{t=0}\right]. \]

But

\[ \left(\frac{d}{dt}e^{tX}\right)_{t=0} = (Xe^{tX})_{t=0} = X. \]

It follows that

\[ \psi'(0) = E(X). \]

In other words, the derivative of the m.g.f. \(\psi(t)\) at t = 0 is the mean of X.

Furthermore, it can be shown that it is possible to differentiate \(\psi(t)\) an arbitrary
number of times at the point t = 0. For n = 1, 2, ..., the nth derivative \(\psi^{(n)}(0)\) at
t = 0 will satisfy the following relation:

\[ \psi^{(n)}(0) = \left[\frac{d^n}{dt^n}E(e^{tX})\right]_{t=0} = E\left[\left(\frac{d^n}{dt^n}e^{tX}\right)_{t=0}\right] = E[(X^n e^{tX})_{t=0}] = E(X^n). \]

Thus, \(\psi'(0) = E(X)\), \(\psi''(0) = E(X^2)\), \(\psi'''(0) = E(X^3)\), and so on. Hence, we see that
the m.g.f., if it is finite in an open interval around t = 0, can be used to generate all
of the moments of the distribution by taking derivatives at t = 0.


Summary 


If the kth moment of a random variable exists, then so does the jth moment for every
j < k. The moment generating function of X, \(\psi(t) = E(e^{tX})\), if it is finite for t in a
neighborhood of 0, can be used to find moments of X. The kth derivative of \(\psi(t)\) at
t = 0 is \(E(X^k)\). The m.g.f. characterizes the distribution in the sense that all random
variables that have the same m.g.f. have the same distribution.


Exercises 


1. If X has the uniform distribution on the interval [a, b], 
what is the value of the fifth central moment of X? 


2. If X has the uniform distribution on the interval [a, b], 
write a formula for every even central moment of X. 


3. Suppose that X is a random variable for which E(X) =
1, \(E(X^2) = 2\), and \(E(X^3) = 5\). Find the value of the third
central moment of X.


4. Suppose that X is a random variable such that \(E(X^2)\)
is finite. (a) Show that \(E(X^2) \ge [E(X)]^2\). (b) Show that
\(E(X^2) = [E(X)]^2\) if and only if there exists a constant c
such that Pr(X = c) = 1. Hint: \(\mathrm{Var}(X) \ge 0\).


5. Suppose that X is a random variable with mean \(\mu\) and
variance \(\sigma^2\), and that the fourth moment of X is finite.
Show that

\[ E[(X - \mu)^4] \ge \sigma^4. \]


6. Suppose that X has the uniform distribution on the 
interval [a, b]. Determine the m.g.f. of X. 


7. Suppose that X is a random variable for which the m.g.f.
is as follows:

\[ \psi(t) = \frac{1}{4}(3e^{t} + e^{-t}) \quad \text{for } -\infty < t < \infty. \]

Find the mean and the variance of X.


8. Suppose that X is a random variable for which the m.g.f.
is as follows:

\[ \psi(t) = e^{t^2 + 3t} \quad \text{for } -\infty < t < \infty. \]

Find the mean and the variance of X.


9. Let X be a random variable with mean \(\mu\) and variance
\(\sigma^2\), and let \(\psi_1(t)\) denote the m.g.f. of X for \(-\infty < t < \infty\).
Let c be a given positive constant, and let Y be a random
variable for which the m.g.f. is

\[ \psi_2(t) = e^{c[\psi_1(t) - 1]} \quad \text{for } -\infty < t < \infty. \]

Find expressions for the mean and the variance of Y in
terms of the mean and the variance of X.


10. Suppose that the random variables X and Y are i.i.d. 
and that the m.g.f. of each is 


v(t) = ef +3" for —00 <t < 00. 

Find the m.g.f. of Z =2X — 3Y +4. 
11. Suppose that X is a random variable for which the 
m.g.f. is as follows: 

w(t)= ze + =e + =e for —0co <f <ox. 
Find the probability distribution of X. Hint: It is a simple 
discrete distribution. 
12. Suppose that X is a random variable for which the 
m.g.f. is as follows: 


VW(t)= a+ e+e) for —c <t<o. 


Find the probability distribution of X. 


13. Let X have the Cauchy distribution (see Example
4.1.8). Prove that the m.g.f. $\psi(t)$ is finite only for $t = 0$.


14. Let X have p.d.f. 


ios) eel 

0 otherwise. 

Prove that the m.g.f. $\psi(t)$ is finite for all $t \le 0$ but for no
$t > 0$.


15. Prove the following extension of Theorem 4.4.1: If
$E(|X|^a) < \infty$ for some positive number a, then $E(|X|^b) < \infty$
for every positive number b < a. Give the proof for the
case in which X has a discrete distribution.


16. Let X have the binomial distribution with parameters
n and p. Let Y have the binomial distribution with parameters
n and 1 − p. Prove that the skewness of Y is the
negative of the skewness of X. Hint: Let Z = n − X and
show that Z has the same distribution as Y.


17. Find the skewness of the distribution in Example 4.4.3. 


4.5 The Mean and the Median 


Although the mean of a distribution is a measure of central location, the median 
(see Definition 3.3.3) is also a measure of central location for a distribution. 
This section presents some comparisons and contrasts between these two location 
summaries of a distribution. 


The Median 


It was mentioned in Sec. 4.1 that the mean of a probability distribution on the real 
line will be at the center of gravity of that distribution. In this sense, the mean of a 
distribution can be regarded as the center of the distribution. There is another point 
on the line that might also be regarded as the center of the distribution. Suppose 
that there is a point $m_0$ that divides the total probability into two equal parts, that
is, the probability to the left of $m_0$ is 1/2, and the probability to the right of $m_0$ is
also 1/2. For a continuous distribution, the median of the distribution introduced
in Definition 3.3.3 is such a number. If there is such an $m_0$, it could legitimately be
called a center of the distribution. It should be noted, however, that for some discrete 
distributions there will not be any point at which the total probability is divided into 
two parts that are exactly equal. Moreover, for other distributions, which may be 
either discrete or continuous, there will be more than one such point. Therefore, the 
formal definition of a median, which will now be given, must be general enough to 
include these possibilities. 


Definition 4.5.1
Median. Let X be a random variable. Every number m with the following property
is called a median of the distribution of X:

\[ \Pr(X \le m) \ge 1/2 \quad \text{and} \quad \Pr(X \ge m) \ge 1/2. \]

Another way to understand this definition is that a median is a point m that
satisfies the following two requirements: First, if m is included with the values of X
to the left of m, then

\[ \Pr(X \le m) \ge \Pr(X > m). \]

Second, if m is included with the values of X to the right of m, then

\[ \Pr(X \ge m) \ge \Pr(X < m). \]


If there is a number m such that Pr(X < m) = Pr(X > m), that is, if the number m 
does actually divide the total probability into two equal parts, then m will of course 
be a median of the distribution of X (see Exercise 16). 


Note: Multiple Medians. One can prove that every distribution must have at least
one median. Indeed, the 1/2 quantile from Definition 3.3.2 is a median. (See Exercise 1.)
For some distributions, every number in some interval is a median. In such
cases, the 1/2 quantile is the minimum of the set of all medians. When a whole interval
of numbers are medians of a distribution, some writers refer to the midpoint of the
interval as the median.


The Median of a Discrete Distribution. Suppose that X has the following discrete 
distribution: 


\[ \Pr(X = 1) = 0.1, \quad \Pr(X = 2) = 0.2, \quad \Pr(X = 3) = 0.3, \quad \Pr(X = 4) = 0.4. \]

The value 3 is a median of this distribution because $\Pr(X \le 3) = 0.6$, which is greater
than 1/2, and $\Pr(X \ge 3) = 0.7$, which is also greater than 1/2. Furthermore, 3 is the
unique median of this distribution. ◄


A Discrete Distribution for Which the Median Is Not Unique. Suppose that X has the 
following discrete distribution: 

\[ \Pr(X = 1) = 0.1, \quad \Pr(X = 2) = 0.4, \quad \Pr(X = 3) = 0.3, \quad \Pr(X = 4) = 0.2. \]

Here, $\Pr(X \le 2) = 1/2$, and $\Pr(X \ge 3) = 1/2$. Therefore, every value of m in the closed
interval $2 \le m \le 3$ will be a median of this distribution. The most popular choice of
median of this distribution would be the midpoint 2.5. ◄
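
The two discrete examples above can be checked directly against the definition of a median. The sketch below is illustrative only: the helper name `is_median` is an assumption, and exact fractions are used so that the comparisons with 1/2 are not affected by rounding.

```python
from fractions import Fraction as F

def is_median(pmf, m):
    """Return True if Pr(X <= m) >= 1/2 and Pr(X >= m) >= 1/2 for a finite p.f."""
    left = sum(p for x, p in pmf.items() if x <= m)
    right = sum(p for x, p in pmf.items() if x >= m)
    return left >= F(1, 2) and right >= F(1, 2)

# Distribution with a unique median.
unique = {1: F(1, 10), 2: F(2, 10), 3: F(3, 10), 4: F(4, 10)}
# Distribution for which every point of [2, 3] is a median.
tied = {1: F(1, 10), 2: F(4, 10), 3: F(3, 10), 4: F(2, 10)}

print([m for m in (1, 2, 3, 4) if is_median(unique, m)])
print([m for m in (F(2), F(5, 2), F(3), F(7, 2)) if is_median(tied, m)])
```

In the second distribution the candidates 2, 5/2, and 3 all pass the check, while 7/2 fails because Pr(X >= 7/2) = 0.2.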


The Median of a Continuous Distribution. Suppose that X has a continuous distribution 
for which the p.d.f. is as follows: 


\[ f(x) = \begin{cases} 4x^3 & \text{for } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases} \]

The unique median of this distribution will be the number m such that

\[ \int_0^m 4x^3 \, dx = \int_m^1 4x^3 \, dx = \frac{1}{2}. \]

This number is $m = 1/2^{1/4}$. ◄
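
When the equation F(m) = 1/2 has no convenient closed form, the median of a continuous distribution can be located numerically. The following sketch applies bisection to the c.d.f. F(x) = x^4 implied by the p.d.f. 4x^3 and compares the result with the closed form above. The function names and the tolerance are assumptions made for this illustration.

```python
def cdf(x):
    """c.d.f. for the p.d.f. 4x^3 on (0, 1): F(x) = x^4 on that interval."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    return x ** 4

def solve_median(cdf, lo, hi, tol=1e-12):
    """Bisection for F(m) = 1/2, assuming F is continuous and increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

m = solve_median(cdf, 0.0, 1.0)
closed_form = 0.5 ** 0.25   # 1 / 2^(1/4)
print(m, closed_form)
```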


A Continuous Distribution for Which the Median Is Not Unique. Suppose that X has a 
continuous distribution for which the p.d.f. is as follows: 


\[ f(x) = \begin{cases} 1/2 & \text{for } 0 \le x \le 1, \\ 1 & \text{for } 2.5 \le x \le 3, \\ 0 & \text{otherwise.} \end{cases} \]

Here, for every value of m in the closed interval $1 \le m \le 2.5$, $\Pr(X \le m) = \Pr(X \ge m) = 1/2$.
Therefore, every value of m in the interval $1 \le m \le 2.5$ is a median of this
distribution. ◄


Comparison of the Mean and the Median 


Last Lottery Number. In a state lottery game, a three-digit number from 000 to 999 
is drawn each day. After several years, all but one of the 1000 possible numbers has 
been drawn. A lottery official would like to predict how much longer it will be until 
that missing number is finally drawn. Let X be the number of days (X = 1 being 
tomorrow) until that number appears. It is not difficult to determine the distribution 
of X, assuming that all 1000 numbers are equally likely to be drawn each day and
that the draws are independent. Let $A_x$ stand for the event that the missing number
is drawn on day x for x = 1, 2, .... Then $\{X = 1\} = A_1$, and for x > 1,

\[ \{X = x\} = A_1^c \cap \cdots \cap A_{x-1}^c \cap A_x. \]

Since the $A_x$ events are independent and all have probability 0.001, it is easy to see
that the p.f. of X is

\[ f(x) = \begin{cases} 0.001(0.999)^{x-1} & \text{for } x = 1, 2, \ldots, \\ 0 & \text{otherwise.} \end{cases} \]

But the lottery official wants to give a single-number prediction for when the number
will be drawn. What summary of the distribution would be appropriate for this
prediction? ◄


The lottery official in Example 4.5.5 wants some sort of “average” or “middle” 
number to summarize the distribution of the number of days until the last number 
appears. Presumably she wants a prediction that is neither excessively large nor too 
small. Either the mean or a median of X can be used as such a summary of the 
distribution. Some important properties of the mean have already been described in 
this chapter, and several more properties will be given later in the book. However, for 
many purposes the median is a more useful measure of the middle of the distribution 
than is the mean. For example, every distribution has a median, but not every 
distribution has a mean. As illustrated in Example 4.3.5, the mean of a distribution 
can be made very large by removing a small but positive amount of probability from 
any part of the distribution and assigning this amount to a sufficiently large value of x. 
On the other hand, the median may be unaffected by a similar change in probabilities. 
If any amount of probability is removed from a value of x larger than the median 
and assigned to an arbitrarily large value of x, the median of the new distribution 
will be the same as that of the original distribution. In Example 4.3.5, all numbers in 
the interval [0, 1] are medians of both random variables X and Y despite the large 
difference in their means. 


Annual Incomes. Suppose that the mean annual income among the families in a 
certain community is $30,000. It is possible that only a few families in the community 
actually have an income as large as $30,000, but those few families have incomes that 
are very much larger than $30,000. As an extreme example, suppose that there are 
100 families and 99 of them have income of $1,000 while the other one has income 
of $2,901,000. If, however, the median annual income among the families is $30,000, 
then at least one-half of the families must have incomes of $30,000 or more. ◄
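
The extreme case in this example is easy to reproduce. The sketch below uses Python's standard `statistics` module on the 100 incomes described above; the second list, in which the single large income is made ten times larger, is a hypothetical variation added to show that the median does not move.

```python
from statistics import mean, median

# 99 families with income $1,000 and one family with income $2,901,000.
incomes = [1_000] * 99 + [2_901_000]
print(mean(incomes))     # mean annual income
print(median(incomes))   # median annual income

# Hypothetical variation: make the one large income ten times larger.
richer = [1_000] * 99 + [29_010_000]
print(mean(richer), median(richer))
```

The mean follows the single large value, while the median stays at the income shared by 99 of the 100 families.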


The median has one convenient property that the mean does not have. 


One-to-One Function. Let X be a random variable that takes values in an interval I
of real numbers. Let r be a one-to-one function defined on the interval I. If m is a
median of X, then r(m) is a median of r(X).


Proof Let Y = r(X). We need to show that $\Pr(Y \ge r(m)) \ge 1/2$ and $\Pr(Y \le r(m)) \ge 1/2$.
Since r is one-to-one on the interval I, it must be either increasing or decreasing
over the interval I. If r is increasing, then $Y \ge r(m)$ if and only if $X \ge m$, so
$\Pr(Y \ge r(m)) = \Pr(X \ge m) \ge 1/2$. Similarly, $Y \le r(m)$ if and only if $X \le m$, and
$\Pr(Y \le r(m)) \ge 1/2$ also. If r is decreasing, then $Y \ge r(m)$ if and only if $X \le m$. The
remainder of the proof is then similar to the preceding. ■
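
A small numerical check can make this property concrete. The sketch below takes a discrete distribution whose unique median is 3, applies one increasing and one decreasing one-to-one function, and confirms that r(3) is a median of r(X) in both cases. It also shows that the mean does not share this property for the exponential function. The distribution and the helper names (`is_median`, `transform`, `mean`, `negate`) are assumptions chosen for illustration.

```python
import math
from fractions import Fraction as F

# Hypothetical p.f. with unique median 3.
pmf = {1: F(1, 10), 2: F(2, 10), 3: F(3, 10), 4: F(4, 10)}

def is_median(pmf, m):
    """Check Pr(X <= m) >= 1/2 and Pr(X >= m) >= 1/2."""
    left = sum(p for x, p in pmf.items() if x <= m)
    right = sum(p for x, p in pmf.items() if x >= m)
    return left >= F(1, 2) and right >= F(1, 2)

def transform(pmf, r):
    """p.f. of r(X) when r is one-to-one on the support of X."""
    return {r(x): p for x, p in pmf.items()}

def mean(pmf):
    return sum(p * x for x, p in pmf.items())

def negate(x):
    return -x

median_ok_increasing = is_median(transform(pmf, math.exp), math.exp(3))
median_ok_decreasing = is_median(transform(pmf, negate), negate(3))
print(median_ok_increasing, median_ok_decreasing)

# The mean does not commute with r in general: E(e^X) differs from e^{E(X)}.
mean_of_exp = mean(transform(pmf, math.exp))
exp_of_mean = math.exp(mean(pmf))
print(mean_of_exp, exp_of_mean)
```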
We shall now consider two specific criteria by which the prediction of a random
variable X can be judged. By the first criterion, the optimal prediction that can be 
made is the mean. By the second criterion, the optimal prediction is the median. 


Minimizing the Mean Squared Error 


Suppose that X is a random variable with mean $\mu$ and variance $\sigma^2$. Suppose also that
the value of X is to be observed in some experiment, but this value must be predicted 
before the observation can be made. One basis for making the prediction is to select 
some number d for which the expected value of the square of the error X — d will be 
a minimum. 


Mean Squared Error/M.S.E. The number $E[(X - d)^2]$ is called the mean squared error
(M.S.E.) of the prediction d.


The next result shows that the number d for which the M.S.E. is minimized is 
E(X). 


Let X be a random variable with finite variance $\sigma^2$, and let $\mu = E(X)$. For every
number d,

\[ E[(X - \mu)^2] \le E[(X - d)^2]. \tag{4.5.1} \]

Furthermore, there will be equality in the relation (4.5.1) if and only if $d = \mu$.


Proof For every value of d,

\begin{align*}
E[(X - d)^2] &= E(X^2 - 2dX + d^2) \\
&= E(X^2) - 2d\mu + d^2. \tag{4.5.2}
\end{align*}

The final expression in Eq. (4.5.2) is simply a quadratic function of d. By elementary
differentiation it will be found that the minimum value of this function is attained
when $d = \mu$. Hence, in order to minimize the M.S.E., the predicted value of X
should be its mean $\mu$. Furthermore, when this prediction is used, the M.S.E. is simply
$E[(X - \mu)^2] = \sigma^2$. ■
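
The same conclusion can also be reached without calculus by completing the square. The short derivation below is a supplementary sketch, not the argument used in the proof; it relies only on $\mu = E(X)$ and $\sigma^2 = E[(X - \mu)^2]$.

```latex
\begin{align*}
E[(X - d)^2] &= E\bigl[\bigl((X - \mu) + (\mu - d)\bigr)^2\bigr] \\
             &= E[(X - \mu)^2] + 2(\mu - d)\,E(X - \mu) + (\mu - d)^2 \\
             &= \sigma^2 + (\mu - d)^2 .
\end{align*}
```

Because $E(X - \mu) = 0$, the cross term vanishes. The M.S.E. of a prediction d therefore exceeds $\sigma^2$ by exactly $(\mu - d)^2$, which is zero only when $d = \mu$.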


Last Lottery Number. In Example 4.5.5, we discussed a state lottery in which one 
number had never yet been drawn. Let X stand for the number of days until that 
last number is eventually drawn. The p.f. of X was computed in Example 4.5.5 as 


\[ f(x) = \begin{cases} 0.001(0.999)^{x-1} & \text{for } x = 1, 2, \ldots, \\ 0 & \text{otherwise.} \end{cases} \]

We can compute the mean of X as

\[ E(X) = \sum_{x=1}^{\infty} x \, 0.001(0.999)^{x-1} = 0.001 \sum_{x=1}^{\infty} x (0.999)^{x-1}. \tag{4.5.3} \]

At first, this sum does not look like one that is easy to compute. However, it is closely
related to the general sum

\[ g(y) = \sum_{x=0}^{\infty} y^x = \frac{1}{1 - y}, \]

if 0 < y < 1. Using properties of power series from calculus, we know that the derivative
of g(y) can be found by differentiating the individual terms of the power series.
That is,

\[ g'(y) = \sum_{x=0}^{\infty} x y^{x-1} = \sum_{x=1}^{\infty} x y^{x-1}, \]

for 0 < y < 1. But we also know that $g'(y) = 1/(1 - y)^2$. The last sum in Eq. (4.5.3) is
$g'(0.999) = 1/(0.001)^2$. It follows that

\[ E(X) = 0.001 \, \frac{1}{(0.001)^2} = 1000. \]
◄
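
The value E(X) = 1000 can be confirmed by summing the series numerically. The sketch below truncates the sum after a large number of terms; the function name `truncated_mean` and the cutoff of 50,000 terms are assumptions, chosen so that the neglected tail is far below floating-point precision.

```python
def truncated_mean(p=0.001, terms=50_000):
    """Partial sum of x * p * (1 - p)^(x - 1) for x = 1, ..., terms."""
    q = 1.0 - p
    total = 0.0
    power = 1.0  # holds q^(x - 1) for the current x
    for x in range(1, terms + 1):
        total += x * p * power
        power *= q
    return total

approx = truncated_mean()
print(approx)
```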


Minimizing the Mean Absolute Error 


Another possible basis for predicting the value of a random variable X is to choose 
some number d for which E(|X — d|) will be a minimum. 


Mean Absolute Error/M.A.E. The number $E(|X - d|)$ is called the mean absolute error
(M.A.E.) of the prediction d.


We shall now show that the M.A.E. is minimized when the chosen value of d is a 
median of the distribution of X. 


Let X be a random variable with finite mean, and let m be a median of the distribution
of X. For every number d,

\[ E(|X - m|) \le E(|X - d|). \tag{4.5.4} \]

Furthermore, there will be equality in the relation (4.5.4) if and only if d is also a
median of the distribution of X.


Proof For convenience, we shall assume that X has a continuous distribution for 
which the p.d.f. is f. The proof for any other type of distribution is similar. Suppose 
first that d > m. Then 


\begin{align*}
E(|X - d|) - E(|X - m|) &= \int_{-\infty}^{\infty} (|x - d| - |x - m|) f(x) \, dx \\
&= \int_{-\infty}^{m} (d - m) f(x) \, dx + \int_{m}^{d} (d + m - 2x) f(x) \, dx + \int_{d}^{\infty} (m - d) f(x) \, dx \\
&\ge \int_{-\infty}^{m} (d - m) f(x) \, dx + \int_{m}^{d} (m - d) f(x) \, dx + \int_{d}^{\infty} (m - d) f(x) \, dx \\
&= (d - m)[\Pr(X \le m) - \Pr(X > m)]. \tag{4.5.5}
\end{align*}

Since m is a median of the distribution of X, it follows that

\[ \Pr(X \le m) \ge 1/2 \ge \Pr(X > m). \tag{4.5.6} \]

The final difference in the relation (4.5.5) is therefore nonnegative. Hence,

\[ E(|X - d|) \ge E(|X - m|). \tag{4.5.7} \]

Furthermore, there can be equality in the relation (4.5.7) only if the inequalities in
relations (4.5.5) and (4.5.6) are actually equalities. A careful analysis shows that these
inequalities will be equalities only if d is also a median of the distribution of X.

The proof for every value of d such that d < m is similar. ■


Last Lottery Number. In Example 4.5.5, in order to compute the median of X, we must
find the smallest number x such that the c.d.f. $F(x) \ge 0.5$. For integer x, we have

\[ F(x) = \sum_{n=1}^{x} 0.001(0.999)^{n-1}. \]

We can use the popular formula

\[ \sum_{n=0}^{x} y^n = \frac{1 - y^{x+1}}{1 - y} \]

to see that, for integer $x \ge 1$,

\[ F(x) = 0.001 \, \frac{1 - (0.999)^x}{1 - 0.999} = 1 - (0.999)^x. \]

Setting this equal to 0.5 and solving for x gives x = 692.8; hence, the median of X is
693. The median is unique because F(x) never takes the exact value 0.5 for any integer
x. The median of X is much smaller than the mean of 1000 found in Example 4.5.7.
◄
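
The computation of this median can be reproduced in a few lines. The sketch below solves 1 − (0.999)^x = 0.5 for real x, rounds up to the next integer, and checks that the c.d.f. crosses 0.5 between the two neighboring integers. The variable names are assumptions made for illustration.

```python
import math

p = 0.001  # probability that the missing number is drawn on a given day

def cdf(x):
    """F(x) = 1 - (1 - p)^x for integer x >= 1."""
    return 1.0 - (1.0 - p) ** x

# Real-valued solution of F(x) = 0.5, then the smallest integer at or above it.
x_star = math.log(0.5) / math.log(1.0 - p)
median = math.ceil(x_star)

print(round(x_star, 1), median)
print(cdf(median - 1), cdf(median))
```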


The reason that the mean is so much larger than the median in Examples 4.5.7 
and 4.5.8 is that the distribution has probability at arbitrarily large values but is 
bounded below. The probability at these large values pulls the mean up because there 
is no probability at equally small values to balance. The median is not affected by 
how the upper half of the probability is distributed. The following example involves 
a symmetric distribution. Here, the mean and median(s) are more similar. 


Predicting a Discrete Uniform Random Variable. Suppose that the probability is 1/6 
that a random variable X will take each of the following six values: 1, 2, 3, 4, 5, 6. We 
shall determine the prediction for which the M.S.E. is minimum and the prediction 
for which the M.A.E. is minimum. 

In this example, 


\[ E(X) = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5. \]

Therefore, the M.S.E. will be minimized by the unique value d = 3.5.

Also, every number m in the closed interval $3 \le m \le 4$ is a median of the given
distribution. Therefore, the M.A.E. will be minimized by every value of d such that
$3 \le d \le 4$ and only by such a value of d. Because the distribution of X is symmetric,
the mean of X is also a median of X. ◄
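
Both conclusions of this example can be verified by brute force over a grid of candidate predictions. The sketch below evaluates the M.S.E. and the M.A.E. exactly, using fractions, at d = 1.0, 1.1, ..., 6.0. The grid spacing and the function names are assumptions chosen for illustration.

```python
from fractions import Fraction as F

values = range(1, 7)  # each value has probability 1/6

def mse(d):
    """E[(X - d)^2] for the discrete uniform distribution on 1, ..., 6."""
    return sum((x - d) ** 2 for x in values) / F(6)

def mae(d):
    """E(|X - d|) for the same distribution."""
    return sum(abs(x - d) for x in values) / F(6)

grid = [F(i, 10) for i in range(10, 61)]

best_mse = min(grid, key=mse)
min_mae = min(mae(d) for d in grid)
mae_minimizers = [d for d in grid if mae(d) == min_mae]

print(float(best_mse), float(mse(best_mse)))
print(float(mae_minimizers[0]), float(mae_minimizers[-1]), float(min_mae))
```

On this grid the M.S.E. has a single minimizer, while the M.A.E. is constant across every grid point between 3 and 4 inclusive.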


Note: When the M.A.E. and M.S.E. Are Finite. We noted that the median exists for 
every distribution, but the M.A.E. is finite if and only if the distribution has a finite 
mean. Similarly, the M.S.E. is finite if and only if the distribution has a finite variance. 


Summary 


A median of X is any number m such that $\Pr(X \le m) \ge 1/2$ and $\Pr(X \ge m) \ge 1/2$.
To minimize $E(|X - d|)$ by choice of d, one must choose d to be a median of X. To
minimize $E[(X - d)^2]$ by choice of d, one must choose d = E(X).


Exercises 


1. Prove that the 1/2 quantile as defined in Definition 3.3.2 
is a median as defined in Definition 4.5.1. 


2. Suppose that a random variable X has a discrete distri- 
bution for which the p.f. is as follows: 


\[ f(x) = \begin{cases} cx & \text{for } x = 1, 2, 3, 4, 5, 6, \\ 0 & \text{otherwise.} \end{cases} \]


Determine all the medians of this distribution. 


3. Suppose that a random variable X has a continuous 
distribution for which the p.d.f. is as follows: 


\[ f(x) = \begin{cases} e^{-x} & \text{for } x > 0, \\ 0 & \text{otherwise.} \end{cases} \]

Determine all the medians of this distribution. 


4. In a small community consisting of 153 families, the 
number of families that have k children (k =0, 1, 2, ...) 
is given in the following table: 


Number of     Number of
children      families
0             21
1             40
2             42
3             27
4 or more     23


Determine the mean and the median of the number of 
children per family. (For the mean, assume that all families 
with four or more children have only four children. Why 
doesn’t this point matter for the median?) 


5. Suppose that an observed value of X is equally likely to
come from a continuous distribution for which the p.d.f.
is f or from one for which the p.d.f. is g. Suppose that
f(x) > 0 for 0 < x < 1 and f(x) = 0 otherwise, and suppose
also that g(x) > 0 for 2 < x < 4 and g(x) = 0 otherwise.
Determine: (a) the mean and (b) the median of the distribution of X.


6. Suppose that a random variable X has a continuous 
distribution for which the p.d.f. f is as follows: 


\[ f(x) = \begin{cases} 2x & \text{for } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases} \]

Determine the value of d that minimizes
(a) $E[(X - d)^2]$ and (b) $E(|X - d|)$.


7. Suppose that a person’s score X on a certain examination
will be a number in the interval 0 < X < 1 and that
X has a continuous distribution for which the p.d.f. is as
follows:


1 
peya [3 for0<x <1, 


0 otherwise. 


Determine the prediction of X that minimizes (a) the 
M.S.E. and (b) the M.A.E. 


8. Suppose that the distribution of a random variable 
X is symmetric with respect to the point x = 0 and that 
E(X*) < oo. Show that E[(X — d)*] is minimized by the 
value d = 0. 


9. Suppose that a fire can occur at any one of five points 
along a road. These points are located at —3, —1, 0, 1, and 
2 in Fig. 4.9. Suppose also that the probability that each of 
these points will be the location of the next fire that occurs 
along the road is as specified in Fig. 4.9. 


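[Figure 4.9 Probabilities for Exercise 9. The figure marks the points −3, −1, 0, 1, and 2
along the road and plots the probability of a fire at each point; the plotted values
are 0.1, 0.1, 0.2, 0.2, and 0.4.]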


a. At what point along the road should a fire engine 
wait in order to minimize the expected value of the 
square of the distance that it must travel to the next 
fire? 


b. Where should the fire engine wait to minimize the 
expected value of the distance that it must travel to 
the next fire? 


10. If n houses are located at various points along a 
straight road, at what point along the road should a store 
be located in order to minimize the sum of the distances 
from the n houses to the store? 


11. Let X be a random variable having the binomial dis- 
tribution with parameters n = 7 and p = 1/4, and let Y be 
a random variable having the binomial distribution with 
parameters n = 5 and p = 1/2. Which of these two random 
variables can be predicted with the smaller M.S.E.? 


12. Consider a coin for which the probability of obtaining 
a head on each given toss is 0.3. Suppose that the coin is to 
be tossed 15 times, and let X denote the number of heads 
that will be obtained. 

a. What prediction of X has the smallest M.S.E.? 


b. What prediction of X has the smallest M.A.E.? 


13. Suppose that the distribution of X is symmetric
around a point m. Prove that m is a median of X.


14. Find the median of the Cauchy distribution defined in
Example 4.1.8.


15. Let X be a random variable with c.d.f. F. Suppose that
a < b are numbers such that both a and b are medians of
X.


a. Prove that F(a) = 1/2.


b. Prove that there exist a smallest $c \le a$ and a largest
$d \ge b$ such that every number in the closed interval
[c, d] is a median of X.


c. If X has a discrete distribution, prove that F(d) > 
1/2. 


16. Let X be a random variable. Suppose that there exists 
a number m such that Pr(X < m) = Pr(X > m). Prove that 
m is a median of the distribution of X. 


17. Let X be a random variable. Suppose that there exists 
a number m such that Pr(X < m) < 1/2 and Pr(X > m) < 
1/2. Prove that m is the unique median of the distribution 
of X. 


18. Prove the following extension of Theorem 4.5.1. Let
m be the p quantile of the random variable X. (See Definition 3.3.2.)
If r is a strictly increasing function, then r(m)
is the p quantile of r(X).


4.6 Covariance and Correlation 


When we are interested in the joint distribution of two random variables, it is useful 
to have a summary of how much the two random variables depend on each other. 
The covariance and correlation are attempts to measure that dependence, but they 
only capture a particular type of dependence, namely linear dependence. 


Covariance 


Example 
4.6.1 


Test Scores. When applying for college, high school students often take a number of 
standardized tests. Consider a particular student who will take both a verbal and a 


quantitative test. Let X be this student’s score on the verbal test, and let Y be the 
same student’s score on the quantitative test. Although there are students who do
much better on one test than the other, it might still be reasonable to expect a
student who does very well on one test to do at least a little better than average on
the other. We would like to find a numerical summary of the joint distribution of X
and Y that reflects the degree to which we believe a high or low score on one test will
be accompanied by a high or low score on the other test. ◄


When we consider the joint distribution of two random variables, the means, the 
medians, and the variances of the variables provide useful information about their 
marginal distributions. However, these values do not provide any information about 
the relationship between the two variables or about their tendency to vary together 
rather than independently. In this section and the next one, we shall introduce 
summaries of a joint distribution that enable us to measure the association between 
two random variables, determine the variance of the sum of an arbitrary number of 
dependent random variables, and predict the value of one random variable by using 
the observed value of some other related variable. 


Definition
4.6.1


Covariance. Let X and Y be random variables having finite means. Let $E(X) = \mu_X$
and $E(Y) = \mu_Y$. The covariance of X and Y, which is denoted by Cov(X, Y), is defined
as

\[ \mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)], \tag{4.6.1} \]

if the expectation in Eq. (4.6.1) exists.

It can be shown (see Exercise 2 at the end of this section) that if both X and Y
have finite variance, then the expectation in Eq. (4.6.1) will exist and Cov(X, Y) will 
be finite. However, the value of Cov(X, Y) can be positive, negative, or zero. 


Test Scores. Let X and Y be the test scores in Example 4.6.1, and suppose that they
have the joint p.d.f.

\[ f(x, y) = \begin{cases} 2xy + 0.5 & \text{for } 0 \le x \le 1 \text{ and } 0 \le y \le 1, \\ 0 & \text{otherwise.} \end{cases} \]

We shall compute the covariance Cov(X, Y). First, we shall compute the means $\mu_X$
and $\mu_Y$ of X and Y, respectively. The symmetry in the joint p.d.f. means that X and
Y have the same marginal distribution; hence, $\mu_X = \mu_Y$. We see that

\[ \mu_X = \int_0^1 \int_0^1 [2x^2 y + 0.5x] \, dy \, dx = \int_0^1 [x^2 + 0.5x] \, dx = \frac{1}{3} + \frac{1}{4} = \frac{7}{12}, \]

so that $\mu_Y = 7/12$ as well. The covariance can be computed using Theorem 4.1.2.
Specifically, we must evaluate the integral

\[ \int_0^1 \int_0^1 \left(x - \frac{7}{12}\right) \left(y - \frac{7}{12}\right) (2xy + 0.5) \, dy \, dx. \]

This integral is straightforward, albeit tedious, to compute, and the result is
Cov(X, Y) = 1/144. ◄


The following result often simplifies the calculation of a covariance. 


For all random variables X and Y such that $\sigma_X^2 < \infty$ and $\sigma_Y^2 < \infty$,

\[ \mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y). \tag{4.6.2} \]


Proof It follows from Eq. (4.6.1) that

\begin{align*}
\mathrm{Cov}(X, Y) &= E(XY - \mu_X Y - \mu_Y X + \mu_X \mu_Y) \\
&= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X \mu_Y.
\end{align*}

Since $E(X) = \mu_X$ and $E(Y) = \mu_Y$, Eq. (4.6.2) is obtained. ■
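
For the joint p.d.f. 2xy + 0.5 on the unit square used in the test-score example, Eq. (4.6.2) reduces the covariance to three moments of a polynomial, each of which can be integrated term by term. The sketch below does this with exact fractions. The dictionary representation of the p.d.f. and the helper name `moment` are assumptions made for this illustration.

```python
from fractions import Fraction as F

# Joint p.d.f. 2xy + 1/2 on the unit square, stored as
# {(i, j): coefficient of x^i y^j}.
PDF = {(1, 1): F(2), (0, 0): F(1, 2)}

def moment(a, b):
    """E(X^a Y^b): integrate x^a y^b f(x, y) over [0, 1]^2 term by term."""
    return sum(c / ((i + a + 1) * (j + b + 1)) for (i, j), c in PDF.items())

total_probability = moment(0, 0)   # sanity check: should equal 1
mu_x = moment(1, 0)
mu_y = moment(0, 1)
cov = moment(1, 1) - mu_x * mu_y   # Eq. (4.6.2)

print(total_probability, mu_x, mu_y, cov)
```

The exact result agrees with the value 1/144 obtained from the double integral in the example, without expanding the product of the two centered factors.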


The covariance between X and Y is intended to measure the degree to which
X and Y tend to be large at the same time or the degree to which one tends to be
large while the other is small. Some intuition about this interpretation can be gathered
from a careful look at Eq. (4.6.1). For example, suppose that Cov(X, Y) is positive.
Then $X > \mu_X$ and $Y > \mu_Y$ must occur together and/or $X < \mu_X$ and $Y < \mu_Y$ must occur
together to a larger extent than $X < \mu_X$ occurs with $Y > \mu_Y$ and $X > \mu_X$ occurs with
$Y < \mu_Y$. Otherwise, the mean would be negative. Similarly, if Cov(X, Y) is negative,
then $X > \mu_X$ and $Y < \mu_Y$ must occur together and/or $X < \mu_X$ and $Y > \mu_Y$ must occur
together to a larger extent than the other two inequalities. If Cov(X, Y) = 0, then the
extent to which X and Y are on the same sides of their respective means exactly
balances the extent to which they are on opposite sides of their means.
Correlation


Although Cov(X, Y) gives a numerical measure of the degree to which X and Y vary 
together, the magnitude of Cov(X, Y) is also influenced by the overall magnitudes of 
X and Y. For example, in Exercise 5 in this section, you can prove that Cov(2X, Y) = 
2 Cov(X, Y). In order to obtain a measure of association between X and Y that is 
not driven by arbitrary changes in the scales of one or the other random variable, we 
define a slightly different quantity next. 

Correlation. Let X and Y be random variables with finite variances $\sigma_X^2$ and $\sigma_Y^2$,
respectively. Then the correlation of X and Y, which is denoted by $\rho(X, Y)$, is defined
as follows:

\[ \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}. \tag{4.6.3} \]


In order to determine the range of possible values of the correlation p(X, Y), we 
shall need the following result. 


Schwarz Inequality. For all random variables U and V such that E(UV) exists,

\[ [E(UV)]^2 \le E(U^2)E(V^2). \tag{4.6.4} \]

If, in addition, the right-hand side of Eq. (4.6.4) is finite, then the two sides of
Eq. (4.6.4) equal the same value if and only if there are nonzero constants a and
b such that aU + bV = 0 with probability 1.


Proof If $E(U^2) = 0$, then Pr(U = 0) = 1. Therefore, it must also be true that Pr(UV =
0) = 1. Hence, E(UV) = 0, and the relation (4.6.4) is satisfied. Similarly, if $E(V^2) = 0$,
then the relation (4.6.4) will be satisfied. Moreover, if either $E(U^2)$ or $E(V^2)$ is
infinite, then the right side of the relation (4.6.4) will be infinite. In this case, the
relation (4.6.4) will surely be satisfied.

For the rest of the proof, assume that $0 < E(U^2) < \infty$ and $0 < E(V^2) < \infty$. For
all numbers a and b,

\[ 0 \le E[(aU + bV)^2] = a^2 E(U^2) + b^2 E(V^2) + 2ab E(UV) \tag{4.6.5} \]

and

\[ 0 \le E[(aU - bV)^2] = a^2 E(U^2) + b^2 E(V^2) - 2ab E(UV). \tag{4.6.6} \]

If we let $a = [E(V^2)]^{1/2}$ and $b = [E(U^2)]^{1/2}$, then it follows from the relation (4.6.5)
that

\[ E(UV) \ge -[E(U^2)E(V^2)]^{1/2}. \tag{4.6.7} \]

It also follows from the relation (4.6.6) that

\[ E(UV) \le [E(U^2)E(V^2)]^{1/2}. \tag{4.6.8} \]

These two relations together imply that the relation (4.6.4) is satisfied.

Finally, suppose that the right-hand side of Eq. (4.6.4) is finite. Both sides of
(4.6.4) equal the same value if and only if the same is true for either (4.6.7) or (4.6.8).
Both sides of (4.6.7) equal the same value if and only if the rightmost expression in
(4.6.5) is 0. This, in turn, is true if and only if $E[(aU + bV)^2] = 0$, which occurs if and
only if aU + bV = 0 with probability 1. The reader can easily check that both sides
of (4.6.8) equal the same value if and only if aU − bV = 0 with probability 1. ■

A slight variant on Theorem 4.6.2 is the result we want.


Cauchy-Schwarz Inequality. Let X and Y be random variables with finite variance.
Then

\[ [\mathrm{Cov}(X, Y)]^2 \le \sigma_X^2 \sigma_Y^2, \tag{4.6.9} \]

and

\[ -1 \le \rho(X, Y) \le 1. \tag{4.6.10} \]

Furthermore, the inequality in Eq. (4.6.9) is an equality if and only if there are
nonzero constants a and b and a constant c such that aX + bY = c with probability 1.


Proof Let $U = X - \mu_X$ and $V = Y - \mu_Y$. Eq. (4.6.9) now follows directly from Theorem 4.6.2.
In turn, it follows from Eq. (4.6.3) that $[\rho(X, Y)]^2 \le 1$ or, equivalently, that
Eq. (4.6.10) holds. The final claim follows easily from the similar claim at the end of
Theorem 4.6.2. ■


Positively/Negatively Correlated/Uncorrelated. It is said that X and Y are positively 
correlated if p(X, Y) > 0, that X and Y are negatively correlated if p(X, Y) < 0, and 
that X and Y are uncorrelated if p(X, Y) =0. 


It can be seen from Eq. (4.6.3) that Cov(X, Y) and p(X, Y) must have the same 
sign; that is, both are positive, or both are negative, or both are zero. 


Test Scores. For the two test scores in Example 4.6.2, we can compute the correlation
$\rho(X, Y)$. The variances of X and Y are both equal to 11/144, so the correlation is
$\rho(X, Y) = 1/11$. ◄


Properties of Covariance and Correlation 


We shall now present four theorems pertaining to the basic properties of covariance 
and correlation. 

The first theorem shows that independent random variables must be uncorre- 
lated. 


If X and Y are independent random variables with $0 < \sigma_X^2 < \infty$ and $0 < \sigma_Y^2 < \infty$, then

\[ \mathrm{Cov}(X, Y) = \rho(X, Y) = 0. \]


Proof If X and Y are independent, then E(XY) = E(X)E(Y). Therefore, by Eq.
(4.6.2), Cov(X, Y) = 0. Also, it follows that $\rho(X, Y) = 0$. ■


The converse of Theorem 4.6.4 is not true as a general rule. Two dependent 
random variables can be uncorrelated. Indeed, even though Y is an explicit function 
of X, it is possible that p(X, Y) = 0, as in the following examples. 


Dependent but Uncorrelated Random Variables. Suppose that the random variable X 
can take only the three values —1, 0, and 1, and that each of these three values has the 
same probability. Also, let the random variable Y be defined by the relation $Y = X^2$.
We shall show that X and Y are dependent but uncorrelated.

Figure 4.10 The shaded region is where the joint p.d.f. of (X, Y) is constant and
nonzero in Example 4.6.5. The vertical line indicates the values of Y that are possible
when X = 0.5.

In this example, X and Y are clearly dependent, since Y is not constant and the
value of Y is completely determined by the value of X. However,

\[ E(XY) = E(X^3) = E(X) = 0, \]

because $X^3$ is the same random variable as X. Since E(XY) = 0 and E(X)E(Y) = 0,
it follows from Theorem 4.6.1 that Cov(X, Y) = 0 and that X and Y are uncorrelated.
◄
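
The example in which X takes the values −1, 0, and 1 with equal probability and Y = X^2 is small enough to verify by enumeration. The sketch below computes the covariance exactly and then compares one joint probability with the product of the corresponding marginal probabilities to exhibit the dependence. The variable names are assumptions made for illustration.

```python
from fractions import Fraction as F

support = [-1, 0, 1]   # values of X, each with probability 1/3
p = F(1, 3)

e_x = sum(p * x for x in support)
e_y = sum(p * x**2 for x in support)          # Y = X^2
e_xy = sum(p * x * x**2 for x in support)     # XY = X^3
cov = e_xy - e_x * e_y

# Dependence: Y = 0 exactly when X = 0, so Pr(X = 0, Y = 0) = Pr(X = 0) = 1/3,
# whereas Pr(X = 0) * Pr(Y = 0) = (1/3)(1/3).
joint_00 = p
product_00 = p * p

print(cov, joint_00, product_00)
```

The covariance is zero, yet the joint probability differs from the product of the marginals, so the two random variables are not independent.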


Uniform Distribution Inside a Circle. Let (X, Y) have joint p.d.f. that is constant on
the interior of the unit circle, the shaded region in Fig. 4.10. The constant value of
the p.d.f. is one over the area of the circle, that is, $1/\pi$. It is clear that X and Y
are dependent since the region where the joint p.d.f. is nonzero is not a rectangle.
In particular, notice that the set of possible values for Y is the interval (−1, 1), but
when X = 0.5, the set of possible values for Y is the smaller interval (−0.866, 0.866).
The symmetry of the circle makes it clear that both X and Y have mean 0. Also, it is
not difficult to see that $E(XY) = \iint xy f(x, y) \, dx \, dy = 0$. To see this, notice that the
integral of xy over the top half of the circle is exactly the negative of the integral of xy
over the bottom half. Hence, Cov(X, Y) = 0, but the random variables are dependent.
◄


The next result shows that if Y is a linear function of X, then X and Y must be 
correlated and, in fact, |ρ(X, Y)| = 1. 


Suppose that X is a random variable such that 0 < σ_X² < ∞, and Y = aX + b for some 
constants a and b, where a ≠ 0. If a > 0, then ρ(X, Y) = 1. If a < 0, then ρ(X, Y) = −1. 


Proof If Y = aX + b, then μ_Y = aμ_X + b and Y − μ_Y = a(X − μ_X). Therefore, by 
Eq. (4.6.1), 
Cov(X, Y) = aE[(X − μ_X)²] = aσ_X². 


Since σ_Y = |a|σ_X, the theorem follows from Eq. (4.6.3). ■ 
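A quick numerical check of this result is sketched below. The discrete distribution of X and the constants a and b are hypothetical values chosen only for illustration; any distribution with positive finite variance would do.

```python
import math

def correlation_with_linear(values, probs, a, b):
    """rho(X, Y) for Y = a*X + b, where X has the given discrete distribution."""
    ys = [a * x + b for x in values]
    mean_x = sum(p * x for x, p in zip(values, probs))
    mean_y = sum(p * y for y, p in zip(ys, probs))
    var_x = sum(p * (x - mean_x) ** 2 for x, p in zip(values, probs))
    var_y = sum(p * (y - mean_y) ** 2 for y, p in zip(ys, probs))
    cov = sum(p * (x - mean_x) * (y - mean_y)
              for x, y, p in zip(values, ys, probs))
    return cov / math.sqrt(var_x * var_y)

# Hypothetical distribution of X (probabilities sum to 1).
values = [0, 1, 2, 5]
probs = [0.1, 0.4, 0.3, 0.2]

rho_pos = correlation_with_linear(values, probs, a=3.0, b=-2.0)   # a > 0
rho_neg = correlation_with_linear(values, probs, a=-0.5, b=7.0)   # a < 0
print(rho_pos, rho_neg)
```

Up to floating-point rounding, the two results are 1 and −1, matching the sign of a and not depending on b.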


There is a converse to Theorem 4.6.5. That is, |ρ(X, Y)| = 1 implies that X and 
Y are linearly related. (See Exercise 17.) In general, the value of ρ(X, Y) provides a 
measure of the extent to which two random variables X and Y are linearly related. If 
the joint distribution of X and Y is relatively concentrated around a straight line in 
the xy-plane that has a positive slope, then ρ(X, Y) will typically be close to 1. If the 
joint distribution is relatively concentrated around a straight line that has a negative 
slope, then ρ(X, Y) will typically be close to −1. We shall not discuss these concepts 
further here, but we shall consider them again when the bivariate normal distribution 
is introduced and studied in Sec. 5.10. 


Note: Correlation Measures Only Linear Relationship. A large value of |ρ(X, Y)| 
means that X and Y are close to being linearly related and hence are closely related. 
But a small value of |ρ(X, Y)| does not mean that X and Y are not close to being 
related. Indeed, Example 4.6.4 illustrates random variables that are functionally 
related but have 0 correlation. 

We shall now determine the variance of the sum of random variables that are 
not necessarily independent. 


If X and Y are random variables such that Var(X) < ∞ and Var(Y) < ∞, then 


Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y). (4.6.11) 


Proof Since E(X + Y) = μ_X + μ_Y, then 
Var(X + Y) = E[(X + Y − μ_X − μ_Y)²] 
= E[(X − μ_X)² + (Y − μ_Y)² + 2(X − μ_X)(Y − μ_Y)] 
= Var(X) + Var(Y) + 2 Cov(X, Y). ■ 


For all constants a and b, it can be shown that Cov(aX, bY) = ab Cov(X, Y) 
(see Exercise 5 at the end of this section). The following then follows easily from 
Theorem 4.6.6. 


Let a, b, and c be constants. Under the conditions of Theorem 4.6.6, 


Var(aX + bY + c) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y). (4.6.12) 
■ 


A particularly useful special case of Corollary 4.6.1 is 


Var(X — Y) = Var(X) + Var(Y) — 2 Cov(X, Y). (4.6.13) 
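Equations (4.6.12) and (4.6.13) can be verified exactly on a small example. The sketch below assumes a hypothetical joint p.f. on five points and hypothetical constants a, b, c; the helpers `expect`, `var`, and `cov` are introduced here for illustration and are not part of the text.

```python
from fractions import Fraction as F

# Hypothetical joint p.f. of (X, Y); the probabilities sum to 1.
joint = {(0, 0): F(1, 4), (0, 1): F(1, 8), (1, 0): F(1, 8),
         (1, 1): F(1, 4), (2, 1): F(1, 4)}

def expect(g):
    """E[g(X, Y)] under the joint p.f."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

def var(g):
    m = expect(g)
    return expect(lambda x, y: (g(x, y) - m) ** 2)

def cov(g, h):
    mg, mh = expect(g), expect(h)
    return expect(lambda x, y: (g(x, y) - mg) * (h(x, y) - mh))

X = lambda x, y: x
Y = lambda x, y: y
a, b, c = F(3), F(-2), F(5)   # hypothetical constants

# Eq. (4.6.12): variance of a linear function, computed two ways.
lhs = var(lambda x, y: a * x + b * y + c)
rhs = a ** 2 * var(X) + b ** 2 * var(Y) + 2 * a * b * cov(X, Y)

# Eq. (4.6.13): the special case a = 1, b = -1, c = 0.
var_diff = var(lambda x, y: x - y)
var_diff_formula = var(X) + var(Y) - 2 * cov(X, Y)
print(lhs, rhs, var_diff, var_diff_formula)
```

Because the arithmetic uses `Fraction`, the two sides agree exactly rather than only to rounding error, and the constant c has no effect on the variance.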


Investment Portfolio. Consider, once again, the investor in Example 4.3.7 on page 230 
trying to choose a portfolio with $100,000 to invest. We shall make the same assumptions 
about the returns on the two stocks, except that now we will suppose that the 
correlation between the two returns R_1 and R_2 is −0.3, reflecting a belief that the two 
stocks tend to react in opposite ways to common market forces. The variance of a 
portfolio of s_1 shares of the first stock, s_2 shares of the second stock, and s_3 dollars 
invested at 3.6% is now, by Corollary 4.6.1, 


Var(s_1R_1 + s_2R_2 + 0.036s_3) = 55s_1² + 28s_2² − 2(0.3)√(55 × 28) s_1s_2. 


We continue to assume that (4.3.2) holds. Figure 4.11 shows the relationship between 
the mean and variance of the efficient portfolios in this example and Example 4.3.7. 
Notice how the variances are smaller in this example than in Example 4.3.7. This is 
due to the fact that the negative correlation lowers the variance of a linear combination 
with positive coefficients. ◄ 


Theorem 4.6.6 can also be extended easily to the variance of the sum of n random 
variables, as follows. 

Figure 4.11 Mean and variance of efficient investment portfolios. 

If X_1, . . . , X_n are random variables such that Var(X_i) < ∞ for i = 1, . . . , n, then 

$$\operatorname{Var}\Bigl(\sum_{i=1}^{n} X_i\Bigr) = \sum_{i=1}^{n} \operatorname{Var}(X_i) + 2 \mathop{\sum\sum}_{i<j} \operatorname{Cov}(X_i, X_j). \tag{4.6.14}$$


Proof For every random variable Y, Cov(Y, Y) = Var(Y). Therefore, by using the 
result in Exercise 8 at the end of this section, we can obtain the following relation: 

$$\operatorname{Var}\Bigl(\sum_{i=1}^{n} X_i\Bigr) = \operatorname{Cov}\Bigl(\sum_{i=1}^{n} X_i, \sum_{j=1}^{n} X_j\Bigr) = \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{Cov}(X_i, X_j).$$

We shall separate the final sum in this relation into two sums: (i) the sum of those 
terms for which i = j and (ii) the sum of those terms for which i ≠ j. Then, if we use 
the fact that Cov(X_i, X_j) = Cov(X_j, X_i), we obtain the relation 

$$\operatorname{Var}\Bigl(\sum_{i=1}^{n} X_i\Bigr) = \sum_{i=1}^{n} \operatorname{Var}(X_i) + \mathop{\sum\sum}_{i \ne j} \operatorname{Cov}(X_i, X_j) = \sum_{i=1}^{n} \operatorname{Var}(X_i) + 2 \mathop{\sum\sum}_{i<j} \operatorname{Cov}(X_i, X_j).$$
■ 

The following is a simple corollary to Theorem 4.6.7. 


If X_1, . . . , X_n are uncorrelated random variables (that is, if X_i and X_j are 
uncorrelated whenever i ≠ j), then 

$$\operatorname{Var}\Bigl(\sum_{i=1}^{n} X_i\Bigr) = \sum_{i=1}^{n} \operatorname{Var}(X_i). \tag{4.6.15}$$


Corollary 4.6.2 extends Theorem 4.3.5 on page 230, which states that (4.6.15) holds 
if X_1, . . . , X_n are independent random variables. 


Note: In General, Variances Add Only for Uncorrelated Random Variables. The 
variance of a sum of random variables should be calculated using Theorem 4.6.7 in 
general. Corollary 4.6.2 applies only for uncorrelated random variables. 
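The difference between Theorem 4.6.7 and Corollary 4.6.2 can be seen in a small computation. The sketch below assumes a hypothetical covariance matrix for three random variables (the entries are made up for illustration) and evaluates Eq. (4.6.14), the double sum from the proof, and the sum of variances alone.

```python
# Hypothetical covariance matrix of (X_1, X_2, X_3): symmetric, with the
# variances on the diagonal and Cov(X_i, X_j) off the diagonal.
cov = [
    [4.0, 1.0, -0.5],
    [1.0, 2.0, 0.25],
    [-0.5, 0.25, 1.0],
]
n = len(cov)

# Eq. (4.6.14): sum of variances plus twice the sum of covariances over i < j.
var_sum = sum(cov[i][i] for i in range(n)) + 2 * sum(
    cov[i][j] for i in range(n) for j in range(i + 1, n)
)

# Double sum over all pairs (i, j), as in the proof of Theorem 4.6.7.
var_sum_double = sum(cov[i][j] for i in range(n) for j in range(n))

# Sum of the variances only; this equals Var(X_1 + X_2 + X_3) only when
# all of the off-diagonal covariances are 0 (Eq. 4.6.15).
sum_of_variances = sum(cov[i][i] for i in range(n))

print(var_sum, var_sum_double, sum_of_variances)
```

Here the covariances are not all zero, so the sum of the variances differs from the variance of the sum.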


Summary 

The covariance of X and Y is Cov(X, Y) = E{[X − E(X)][Y − E(Y)]}. The correlation 
is ρ(X, Y) = Cov(X, Y)/[Var(X) Var(Y)]^(1/2), and it measures the extent to which X 
and Y are linearly related. Indeed, X and Y are precisely linearly related if and only 
if |ρ(X, Y)| = 1. The variance of a sum of random variables can be expressed as the 
sum of the variances plus two times the sum of the covariances. The variance of a 
linear function is Var(aX + bY + c) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y). 


Exercises 


1. Suppose that the pair (X, Y) is uniformly distributed on 
the interior of a circle of radius 1. Compute p(X, Y). 


2. Prove that if Var(X) < ∞ and Var(Y) < ∞, then 
Cov(X, Y) is finite. Hint: By considering the relation 
[(X − μ_X) ± (Y − μ_Y)]² ≥ 0, show that 


|(X − μ_X)(Y − μ_Y)| ≤ (1/2)[(X − μ_X)² + (Y − μ_Y)²]. 


3. Suppose that X has the uniform distribution on the 
interval [−2, 2] and Y = X⁶. Show that X and Y are 
uncorrelated. 


4. Suppose that the distribution of a random variable X is 
symmetric with respect to the point x = 0, 0 < E(X⁴) < ∞, 
and Y = X². Show that X and Y are uncorrelated. 


5. For all random variables X and Y and all constants a, 
b,c, and d, show that 


Cov(aX +b, cY +d) =ac Cov(X, Y). 


6. Let X and Y be random variables such that 0 < σ_X² < ∞ 
and 0 < σ_Y² < ∞. Suppose that U = aX + b and V = cY + d, 
where a ≠ 0 and c ≠ 0. Show that ρ(U, V) = ρ(X, Y) if 
ac > 0, and ρ(U, V) = −ρ(X, Y) if ac < 0. 


7. Let X, Y, and Z be three random variables such that 
Cov(X, Z) and Cov(Y, Z) exist, and let a, b, and c be 
arbitrary given constants. Show that 


Cov(aX + bY +c, Z) =a Cov(X, Z) +b Cov(Y, Z). 


8. Suppose that X_1, . . . , X_m and Y_1, . . . , Y_n are random 
variables such that Cov(X_i, Y_j) exists for i = 1, . . . , m and 
j = 1, . . . , n, and suppose that a_1, . . . , a_m and b_1, . . . , b_n 
are constants. Show that 

$$\operatorname{Cov}\Bigl(\sum_{i=1}^{m} a_i X_i, \sum_{j=1}^{n} b_j Y_j\Bigr) = \sum_{i=1}^{m} \sum_{j=1}^{n} a_i b_j \operatorname{Cov}(X_i, Y_j).$$


9. Suppose that X and Y are two random variables, which 
may be dependent, and Var(X) = Var(Y). Assuming that 
0 < Var(X + Y) < ∞ and 0 < Var(X − Y) < ∞, show that 
the random variables X + Y and X − Y are uncorrelated. 


10. Suppose that X and Y are negatively correlated. Is 
Var(X + Y) larger or smaller than Var(X — Y)? 


11. Show that two random variables X and Y cannot possibly 
have the following properties: E(X) = 3, E(Y) = 2, 
E(X²) = 10, E(Y²) = 29, and E(XY) = 0. 


12. Suppose that X and Y have a continuous joint distribution 
for which the joint p.d.f. is as follows: 


f(x, y) = (1/3)(x + y) for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 2, 
f(x, y) = 0 otherwise. 


Determine the value of Var(2X − 3Y + 8). 


13. Suppose that X and Y are random variables such that 
Var(X) = 9, Var(Y) =4, and p(X, Y) = —1/6. Determine 
(a) Var(X + Y) and (b) Var(X — 3Y +4). 


14. Suppose that X, Y, and Z are three random variables 
such that Var(X) = 1, Var(Y) = 4, Var(Z) = 8, Cov(X, Y) 
= 1, Cov(X, Z) = —1, and Cov(Y, Z) = 2. Determine (a) 
Var(X + Y + Z) and (b) Var(3X — Y —2Z +1). 


15. Suppose that X_1, . . . , X_n are random variables such 
that the variance of each variable is 1 and the correlation 
between each pair of different variables is 1/4. Determine 
Var(X_1 + · · · + X_n). 


16. Consider the investor in Example 4.2.3 on page 220. 
Suppose that the returns R_1 and R_2 on the two stocks 
have correlation −1. A portfolio will consist of s_1 shares 
of the first stock and s_2 shares of the second stock where 
s_1, s_2 ≥ 0. Find a portfolio such that the total cost of the 
portfolio is $6000 and the variance of the return is 0. Why 
is this situation unrealistic? 


17. Let X and Y be random variables with finite variance. 
Prove that |ρ(X, Y)| = 1 implies that there exist constants 
a, b, and c such that aX + bY = c with probability 1. Hint: 
Use Theorem 4.6.2 with U = X − μ_X and V = Y − μ_Y. 


18. Let X and Y have a continuous distribution with joint 
p.d.f. 


f(x, y) = x + y for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, 
f(x, y) = 0 otherwise. 


Compute the covariance Cov(X, Y). 


4.7 Conditional Expectation 


Since expectations (including variances and covariances) are properties of distributions, 
there will exist conditional versions of all such distributional summaries 
as well as conditional versions of all theorems that we have proven or will later 
prove about expectations. In particular, suppose that we wish to predict one random 
variable Y using a function d(X) of another random variable X so as to 
minimize E{[Y − d(X)]²}. Then d(X) should be the conditional mean of Y given 
X. There is also a very useful theorem that is an extension to expectations of the 
law of total probability. 


Definition and Basic Properties 


Household Survey. A collection of households were surveyed, and each household re- 
ported the number of members and the number of automobiles owned. The reported 
numbers are in Table 4.1. 

Suppose that we were to sample a household at random from those households 
in the survey and learn the number of members. What would then be the expected 
number of automobiles that they own? ◄ 


The question at the end of Example 4.7.1 is closely related to the conditional 
distribution of one random variable given the other, as defined in Sec. 3.6. 


Conditional Expectation/Mean. Let X and Y be random variables such that the mean 
of Y exists and is finite. The conditional expectation (or conditional mean) of Y given 
X =x is denoted by E(Y|x) and is defined to be the expectation of the conditional 
distribution of Y given X = x. 


For example, if Y has a continuous conditional distribution given X = x with 
conditional p.d.f. g_2(y|x), then 

$$E(Y|x) = \int_{-\infty}^{\infty} y\, g_2(y|x)\, dy. \tag{4.7.1}$$

Similarly, if Y has a discrete conditional distribution given X = x with conditional p.f. 
g_2(y|x), then 

$$E(Y|x) = \sum_{\text{All } y} y\, g_2(y|x). \tag{4.7.2}$$


Table 4.1 Reported numbers of household members and automobiles in Example 4.7.1 

Number of                 Number of members 
automobiles     1     2     3     4     5     6     7     8 
     0         10     7     3     2     2     1     0     0 
     1         12    21    25    30    25    15     5     1 
     2          1     5    10    15    20    11     5     3 
     3          0     2     3     5     5     3     2     1 

The value of E(Y|x) will not be uniquely defined for those values of x such that 
the marginal p.f. or p.d.f. of X satisfies f_1(x) = 0. However, since these values of x 
form a set of points whose probability is 0, the definition of E(Y|x) at such a point 
is irrelevant. (See Exercise 11 in Sec. 3.6.) It is also possible that there will be some 
values of x such that the mean of the conditional distribution of Y given X = x is 
undefined for those x values. When the mean of Y exists and is finite, the set of x 
values for which the conditional mean is undefined has probability 0. 

The expressions in Eqs. (4.7.1) and (4.7.2) are functions of x. These functions of 
x can be computed before X is observed, and this idea leads to the following useful 
concept. 


Conditional Means as Random Variables. Let h(x) stand for the function of x that is 
denoted E(Y|x) in either (4.7.1) or (4.7.2). Define the symbol E(Y|X) to mean h(X) 
and call it the conditional mean of Y given X. 


In other words, E(Y|X) is a random variable (a function of X) whose value when 
X =x is E(Y|x). Obviously, we could define E(X|Y) and E(X|y) analogously. 


Household Survey. Consider the household survey in Example 4.7.1. Let X be the 
number of members in a randomly selected household from the survey, and let Y be 
the number of cars owned by that household. The 250 surveyed households are all 
equally likely to be selected, so Pr(X =x, Y = y) is the number of households with 
x members and y cars, divided by 250. Those probabilities are reported in Table 4.2. 
Suppose that the sampled household has X = 4 members. The conditional p.f. of Y 
given X = 4 is g_2(y|4) = f(4, y)/f_1(4), which is the x = 4 column of Table 4.2 divided 
by f_1(4) = 0.208, namely, 


g_2(0|4) = 0.0385, g_2(1|4) = 0.5769, g_2(2|4) = 0.2885, g_2(3|4) = 0.0962. 
The conditional mean of Y given X = 4 is then 
E(Y|4) = 0 × 0.0385 + 1 × 0.5769 + 2 × 0.2885 + 3 × 0.0962 = 1.442. 


Similarly, we can compute E(Y|x) for all eight values of x. They are 


x          1      2      3      4      5      6      7      8 
E(Y|x)   0.609  1.057  1.317  1.442  1.538  1.533  1.75   2 


Table 4.2 Joint p.f. f(x, y) of X and Y in Example 4.7.2 together with marginal 
p.f.'s f_1(x) and f_2(y) 

                                    x 
  y        1      2      3      4      5      6      7      8     f_2(y) 
  0      0.040  0.028  0.012  0.008  0.008  0.004  0      0      0.100 
  1      0.048  0.084  0.100  0.120  0.100  0.060  0.020  0.004  0.536 
  2      0.004  0.020  0.040  0.060  0.080  0.044  0.020  0.012  0.280 
  3      0      0.008  0.012  0.020  0.020  0.012  0.008  0.004  0.084 
f_1(x)   0.092  0.140  0.164  0.208  0.208  0.120  0.048  0.020 

The random variable that takes the value 0.609 when the sampled household has one 
member, takes the value 1.057 when the sampled household has two members, and 
so on, is the random variable E(Y|X). ◄ 
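The conditional means in this example can be reproduced directly from the joint p.f. in Table 4.2. The following is a minimal sketch; the function name `conditional_mean` and the list layout are choices made here for illustration.

```python
# Joint p.f. f(x, y) from Table 4.2: rows are y = 0..3, columns are x = 1..8.
joint = [
    [0.040, 0.028, 0.012, 0.008, 0.008, 0.004, 0.000, 0.000],
    [0.048, 0.084, 0.100, 0.120, 0.100, 0.060, 0.020, 0.004],
    [0.004, 0.020, 0.040, 0.060, 0.080, 0.044, 0.020, 0.012],
    [0.000, 0.008, 0.012, 0.020, 0.020, 0.012, 0.008, 0.004],
]

def conditional_mean(x):
    """E(Y | x) = sum over y of y * f(x, y) / f_1(x), for x in 1..8."""
    column = [row[x - 1] for row in joint]
    f1 = sum(column)  # marginal p.f. of X at x
    return sum(y * p for y, p in enumerate(column)) / f1

# h(x) = E(Y | x) for each x, rounded to three decimals as in the text.
cond_means = {x: round(conditional_mean(x), 3) for x in range(1, 9)}
print(cond_means)
```

Evaluating the same function at the observed value of X, rather than at a fixed x, gives the random variable E(Y|X).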


A Clinical Trial. Consider a clinical trial in which a number of patients will be treated 
and each patient will have one of two possible outcomes: success or failure. Let P 
be the proportion of successes in a very large collection of patients, and let X_i = 1 
if the ith patient is a success and X_i = 0 if not. Assume that the random variables 
X_1, X_2, . . . are conditionally independent given P = p with Pr(X_i = 1|P = p) = p. 
Let X = X_1 + · · · + X_n, which is the number of patients out of the first n who are 
successes. We now compute the conditional mean of X given P. The patients are 
independent and identically distributed conditional on P = p. Hence, the conditional 
distribution of X given P = p is the binomial distribution with parameters n and p. 
As we saw in Sec. 4.2, the mean of this binomial distribution is np, so E(X|p) = np 
and E(X|P) = nP. Later, we will show how to compute the conditional mean of P 
given X. This can be used to predict P after observing X. ◄ 


Note: The Conditional Mean of Y Given X Is a Random Variable. Because E(Y|X) 
is a function of the random variable X, it is itself a random variable with its own 
probability distribution, which can be derived from the distribution of X. On the 
other hand, h(x) = E(Y|x) is a function of x that can be manipulated like any other 
function. The connection between the two is that when one substitutes the random 
variable X for x in h(x), the result is h(X) = E(Y|X). 

We shall now show that the mean of the random variable E(Y|X) must be E(Y). 
A similar calculation shows that the mean of E(X|Y) must be E(X). 


Law of Total Probability for Expectations. Let X and Y be random variables such that 
Y has finite mean. Then 
E[E(Y|X)]= E(Y). (4.7.3) 


Proof We shall assume, for convenience, that X and Y have a continuous joint 
distribution. Then 

$$E[E(Y|X)] = \int_{-\infty}^{\infty} E(Y|x) f_1(x)\, dx = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} y\, g_2(y|x) f_1(x)\, dy\, dx.$$

Since g_2(y|x) = f(x, y)/f_1(x), it follows that 

$$E[E(Y|X)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} y f(x, y)\, dy\, dx = E(Y).$$

The proof for a discrete distribution or a more general type of distribution is similar. 
■ 


Household Survey. At the end of Example 4.7.2, we described the random variable 
E(Y|X). Its distribution can be constructed from that description. It has a discrete 
distribution that takes the eight values of E(Y|x) listed near the end of that example with 
corresponding probabilities f_1(x) for x = 1, . . . , 8. To be specific, let Z = E(Y|X); 
then Pr[Z = E(Y|x)] = f_1(x) for x = 1, . . . , 8. The specific values are 

z           0.609  1.057  1.317  1.442  1.538  1.533  1.75   2 
Pr(Z = z)   0.092  0.140  0.164  0.208  0.208  0.120  0.048  0.020 


We can compute E(Z) = 0.609 × 0.092 + · · · + 2 × 0.020 = 1.348. The reader can 
verify that E(Y) = 1.348 by using the values of f_2(y) in Table 4.2. ◄ 
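The equality E[E(Y|X)] = E(Y) in this example can be confirmed without rounding by working in exact fractions. The sketch below re-enters Table 4.2 in thousandths; the variable names are chosen here for illustration.

```python
from fractions import Fraction

# f(x, y) from Table 4.2 in thousandths: rows y = 0..3, columns x = 1..8.
thousandths = [
    [40, 28, 12, 8, 8, 4, 0, 0],
    [48, 84, 100, 120, 100, 60, 20, 4],
    [4, 20, 40, 60, 80, 44, 20, 12],
    [0, 8, 12, 20, 20, 12, 8, 4],
]
joint = [[Fraction(v, 1000) for v in row] for row in thousandths]

f1 = [sum(row[i] for row in joint) for i in range(8)]  # marginal p.f. of X
f2 = [sum(row) for row in joint]                       # marginal p.f. of Y

# h[i] = E(Y | x) for x = i + 1.
h = [sum(y * joint[y][i] for y in range(4)) / f1[i] for i in range(8)]

mean_of_cond_mean = sum(h[i] * f1[i] for i in range(8))  # E[E(Y|X)]
mean_y = sum(y * f2[y] for y in range(4))                # E(Y)
print(float(mean_of_cond_mean), float(mean_y))
```

Both quantities come out to the same exact fraction, which is the content of Theorem 4.7.1 for this discrete distribution.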


A Clinical Trial. In Example 4.7.3, we let X be the number of patients out of the 
first n who are successes. The conditional mean of X given P = p was computed as 
E(X|p) = np, where P is the proportion of successes in a large population of patients. 
If the distribution of P is uniform on the interval [0, 1], then the marginal expected 
value of X is E[E(X|P)] = E(nP) = n/2. We will see how to calculate E(P|X) in 
Example 4.7.8. ◄ 


Choosing Points from Uniform Distributions. Suppose that a point X is chosen in 
accordance with the uniform distribution on the interval [0, 1]. Also, suppose that 
after the value X = x has been observed (0 < x < 1), a point Y is chosen in accordance 
with a uniform distribution on the interval [x, 1]. We shall determine the value 
of E(Y). 

For each given value of x (0 < x < 1), E(Y|x) will be equal to the midpoint 
(1/2)(x + 1) of the interval [x, 1]. Therefore, E(Y|X) = (1/2)(X + 1) and 


E(Y) = E[E(Y|X)] = (1/2)[E(X) + 1] = (1/2)(1/2 + 1) = 3/4. ◄ 
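The two-stage sampling scheme in this example is easy to simulate, which gives an informal check on the value of E(Y). The sketch below is a Monte Carlo approximation, not part of the derivation; the function name, the seed, and the number of draws are arbitrary choices made here.

```python
import random

def simulate_mean_y(n_draws, seed=0):
    """Monte Carlo estimate of E(Y), where X ~ Uniform[0, 1] and,
    given X = x, Y ~ Uniform[x, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = rng.random()          # first stage: draw X
        y = rng.uniform(x, 1.0)   # second stage: draw Y given X = x
        total += y
    return total / n_draws

estimate = simulate_mean_y(200_000)
print(estimate)  # should be close to 3/4, up to simulation error
```

With this many draws the simulation error is small, so the estimate should agree with the exact value to about two decimal places.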


When manipulating the conditional distribution given X = x, it is safe to act as if 
X is the constant x. This fact, which can simplify the calculation of certain conditional 
means, is now stated without proof. 


Let X and Y be random variables, and let Z = r(X, Y) for some function r. The 
conditional distribution of Z given X = x is the same as the conditional distribution 
of r(x, Y) given X = x. ■ 


One consequence of Theorem 4.7.2 when X and Y have a continuous joint 
distribution is that 

$$E(Z|x) = E(r(x, Y)|x) = \int_{-\infty}^{\infty} r(x, y)\, g_2(y|x)\, dy.$$

Theorem 4.7.1 also implies that for two arbitrary random variables X and Y, 
E{E[r(X, Y)|X]} = E[r(X, Y)], (4.7.4) 


by letting Z = r(X, Y) and noting that E{E(Z|X)} = E(Z). 

We can define, in a similar manner, the conditional expectation of r(X, Y) given 
Y and the conditional expectation of a function r(X_1, . . . , X_n) of several random 
variables given one or more of the variables X_1, . . . , X_n. 


Linear Conditional Expectation. Suppose that E(Y|X) = aX + b for some constants a 
and b. We shall determine the value of E(XY) in terms of E(X) and E(X²). 

By Eq. (4.7.4), E(XY) = E[E(XY|X)]. Furthermore, since X is considered to be 
given and fixed in the conditional expectation, 


E(XY|X) = XE(Y|X) = X(aX + b) = aX² + bX. 

Therefore, 


E(XY) = E(aX² + bX) = aE(X²) + bE(X). ◄ 


The mean is not the only feature of a conditional distribution that is important 
enough to get its own name. 


Conditional Variance. For every given value x, let Var(Y |x) denote the variance of the 
conditional distribution of Y given that X = x. That is, 


Var(Y|x) = E{[Y − E(Y|x)]² | x}. (4.7.5) 


We call Var(Y |x) the conditional variance of Y given X =x. 


The expression in Eq. (4.7.5) is once again a function v(x). We shall define 
Var(Y |X) to be v(X) and call it the conditional variance of Y given X. 


Note: Other Conditional Quantities. In much the same way as in Definitions 4.7.1 
and 4.7.3, we could define any conditional summary of a distribution that we wish. For 
example, conditional quantiles of Y given X = x are the quantiles of the conditional 
distribution of Y given X = x. The conditional m.g.f. of Y given X = x is the m.g.f. of 
the conditional distribution of Y given X = x, etc. 


Prediction 


At the end of Example 4.7.3, we considered the problem of predicting the proportion 
P of successes in a large population of patients given the observed number X of 
successes in a sample of size n. In general, consider two arbitrary random variables X 
and Y that have a specified joint distribution and suppose that after the value of X 
has been observed, the value of Y must be predicted. In other words, the predicted 
value of Y can depend on the value of X. We shall assume that this predicted value 
d(X) must be chosen so as to minimize the mean squared error E{[Y − d(X)]²}. 


The prediction d(X) that minimizes E{[Y − d(X)]²} is d(X) = E(Y|X). 


Proof We shall prove the theorem in the case in which X has a continuous 
distribution, but the proof in the discrete case is virtually identical. Let d(X) = E(Y|X), 
and let d*(X) be an arbitrary predictor. We need only prove that E{[Y − d(X)]²} ≤ 
E{[Y − d*(X)]²}. It follows from Eq. (4.7.4) that 


E{[Y − d(X)]²} = E(E{[Y − d(X)]² | X}). (4.7.6) 


A similar equation holds for d*. Let Z = [Y − d(X)]², and let h(x) = E(Z|x). 
Similarly, let Z* = [Y − d*(X)]² and h*(x) = E(Z*|x). The right-hand side of (4.7.6) is 
∫ h(x) f_1(x) dx, and the corresponding expression using d* is ∫ h*(x) f_1(x) dx. So, the 
proof will be complete if we can prove that 


∫ h(x) f_1(x) dx ≤ ∫ h*(x) f_1(x) dx. (4.7.7) 


Clearly, Eq. (4.7.7) holds if we can show that h(x) ≤ h*(x) for all x. That is, the proof 
is complete if we can show that E{[Y − d(X)]² | x} ≤ E{[Y − d*(X)]² | x}. When we 
condition on X = x, we are allowed to treat X as if it were the constant x, so we need 
to show that E{[Y − d(x)]² | x} ≤ E{[Y − d*(x)]² | x}. These last expressions are nothing 
more than the M.S.E.'s for two different predictions d(x) and d*(x) of Y calculated 
using the conditional distribution of Y given X = x. As discussed in Sec. 4.5, the 
M.S.E. of such a prediction is smallest if the prediction is the mean of the distribution 
of Y. In this case, that mean is the mean of the conditional distribution of Y given 
X = x. Since d(x) is the mean of the conditional distribution of Y given X = x, its 
M.S.E. is no larger than that of every other prediction d*(x). Hence, h(x) ≤ h*(x) for all x. 
■ 


If the value X =x is observed and the value E(Y|x) is predicted for Y, then 
the M.S.E. of this predicted value will be Var(Y |x), from Definition 4.7.3. It follows 
from Eq. (4.7.6) that if the prediction is to be made by using the function d(X) = 
E(Y|X), then the overall M.S.E., averaged over all the possible values of X, will be 
E[Var(Y|X)]. 

If the value of Y must be predicted without any information about the value of 
X, then, as shown in Sec. 4.5, the best prediction is the mean E(Y) and the M.S.E. 
is Var(Y). However, if X can be observed before the prediction is made, the best 
prediction is d(X) = E(Y|X) and the M.S.E. is E[Var(Y|X)]. Thus, the reduction in 
the M.S.E. that can be achieved by using the observation X is 


Var(Y) — E[Var(Y|X)]. (4.7.8) 


This reduction provides a measure of the usefulness of X in predicting Y. It is shown 
in Exercise 11 at the end of this section that this reduction can also be expressed as 
Var[E(Y|X)]. 

It is important to distinguish carefully between the overall M.S.E., which is 
E[Var(Y|X)], and the M.S.E. of the particular prediction to be made when X = x, 
which is Var(Y|x). Before the value of X has been observed, the appropriate value 
for the M.S.E. of the complete process of observing X and then predicting Y is 
E[Var(Y|X)]. After a particular value x of X has been observed and the prediction 
E(Y|x) has been made, the appropriate measure of the M.S.E. of this prediction is 
Var(Y |x). A useful relationship between these values is given in the following result, 
whose proof is left to Exercise 11. 


Law of Total Probability for Variances. If X and Y are arbitrary random variables for 
which the necessary expectations and variances exist, then Var(Y) = E[Var(Y|X)] + 
Var[E(Y|X)]. ■ 
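The decomposition of Var(Y) into E[Var(Y|X)] and Var[E(Y|X)] can be illustrated with the household survey of Example 4.7.2. The sketch below is only a numerical illustration on the joint p.f. of Table 4.2, entered in thousandths so that the arithmetic is exact; it is not a proof, and the variable names are chosen here.

```python
from fractions import Fraction

# f(x, y) from Table 4.2 in thousandths: rows y = 0..3, columns x = 1..8.
thousandths = [
    [40, 28, 12, 8, 8, 4, 0, 0],
    [48, 84, 100, 120, 100, 60, 20, 4],
    [4, 20, 40, 60, 80, 44, 20, 12],
    [0, 8, 12, 20, 20, 12, 8, 4],
]
joint = [[Fraction(v, 1000) for v in row] for row in thousandths]
f1 = [sum(row[i] for row in joint) for i in range(8)]  # marginal p.f. of X
f2 = [sum(row) for row in joint]                       # marginal p.f. of Y

mean_y = sum(y * f2[y] for y in range(4))
var_y = sum((y - mean_y) ** 2 * f2[y] for y in range(4))  # M.S.E. without X

# Conditional mean and conditional variance of Y for each x = i + 1.
cond_mean = [sum(y * joint[y][i] for y in range(4)) / f1[i] for i in range(8)]
cond_var = [
    sum((y - cond_mean[i]) ** 2 * joint[y][i] for y in range(4)) / f1[i]
    for i in range(8)
]

overall_mse = sum(cond_var[i] * f1[i] for i in range(8))       # E[Var(Y|X)]
var_of_cond_mean = sum((cond_mean[i] - mean_y) ** 2 * f1[i]    # Var[E(Y|X)]
                       for i in range(8))
print(float(var_y), float(overall_mse), float(var_of_cond_mean))
```

The reduction in M.S.E. from observing X, expression (4.7.8), is the last of the three printed numbers.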


A Clinical Trial. In Example 4.7.3, let X be the number of patients out of the first 
40 in a clinical trial who have success as their outcome. Let P be the probability 
that an individual patient is a success. Suppose that P has the uniform distribution 
on the interval [0, 1] before the trial begins, and suppose that the outcomes of the 
patients are conditionally independent given P = p. As we saw in Example 4.7.3, X 
has the binomial distribution with parameters 40 and p given P = p. If we needed to 
minimize M.S.E. in predicting P before observing X, we would use the mean of P, 
namely, 1/2. The M.S.E. would be Var(P) = 1/12. However, we shall soon observe the 
value of X and then predict P. To do this, we shall need the conditional distribution 
of P given X = x. Bayes' theorem for random variables (3.6.13) tells us that the 
conditional p.d.f. of P given X = x is 

$$g_2(p|x) = \frac{g_1(x|p) f_2(p)}{f_1(x)}, \tag{4.7.9}$$

where g_1(x|p) is the conditional p.f. of X given P = p, namely, the binomial p.f. 
$g_1(x|p) = \binom{40}{x} p^x (1 - p)^{40-x}$ for x = 0, . . . , 40, f_2(p) = 1 for 0 < p < 1 is the marginal 
p.d.f. of P, and f_1(x) is the marginal p.f. of X obtained from the law of total probability 
for random variables (3.6.12): 

$$f_1(x) = \int_0^1 \binom{40}{x} p^x (1 - p)^{40-x}\, dp. \tag{4.7.10}$$

This last integral looks difficult to compute. However, there is a simple formula for 
integrals of this form, namely, 

$$\int_0^1 p^k (1 - p)^{\ell}\, dp = \frac{k!\, \ell!}{(k + \ell + 1)!}. \tag{4.7.11}$$

A proof of Eq. (4.7.11) is given in Sec. 5.8. Substituting (4.7.11) into (4.7.10) yields 

$$f_1(x) = \frac{40!}{x!(40 - x)!} \cdot \frac{x!(40 - x)!}{41!} = \frac{1}{41},$$

for x = 0, . . . , 40. Substituting this into Eq. (4.7.9) yields 

$$g_2(p|x) = \frac{41!}{x!(40 - x)!}\, p^x (1 - p)^{40-x}, \quad \text{for } 0 < p < 1.$$

For example, with x = 18, the observed number of successes in Table 2.1, a graph of 
g_2(p|18) is shown in Fig. 4.12. 

Figure 4.12 The conditional p.d.f. of P given X = 18 in Example 4.7.8. The marginal p.d.f. of P (prior to observing X) is also shown. 

If we want to minimize the M.S.E. when predicting P, we should use E(P|x), 
the conditional mean. We can compute E(P|x) using the conditional p.d.f. and 
Eq. (4.7.11): 

$$E(P|x) = \int_0^1 p\, \frac{41!}{x!(40 - x)!}\, p^x (1 - p)^{40-x}\, dp = \frac{41!}{x!(40 - x)!} \cdot \frac{(x + 1)!(40 - x)!}{42!} = \frac{x + 1}{42}. \tag{4.7.12}$$

So, after X = x is observed, we will predict P to be (x + 1)/42, which is very close to 
the proportion of the first 40 patients who are successes. The M.S.E. after observing 
X = x is the conditional variance Var(P|x). We can compute this using (4.7.12) and 

$$E(P^2|x) = \int_0^1 p^2\, \frac{41!}{x!(40 - x)!}\, p^x (1 - p)^{40-x}\, dp = \frac{41!}{x!(40 - x)!} \cdot \frac{(x + 2)!(40 - x)!}{43!} = \frac{(x + 1)(x + 2)}{42 \times 43}.$$

Using the fact that Var(P|x) = E(P²|x) − [E(P|x)]², we see that 

$$\operatorname{Var}(P|x) = \frac{(x + 1)(41 - x)}{42^2 \times 43}.$$

The overall M.S.E. of predicting P from X is the mean of the conditional M.S.E. 

$$E[\operatorname{Var}(P|X)] = E\left(\frac{(X + 1)(41 - X)}{42^2 \times 43}\right) = \frac{1}{75{,}852}\, E(-X^2 + 40X + 41)$$
$$= \frac{1}{75{,}852}\left(-\frac{1}{41}\sum_{x=0}^{40} x^2 + \frac{40}{41}\sum_{x=0}^{40} x + 41\right)$$
$$= \frac{1}{75{,}852}\left(-\frac{1}{41} \cdot \frac{40 \times 41 \times 81}{6} + \frac{40}{41} \cdot \frac{40 \times 41}{2} + 41\right) = \frac{301}{75{,}852} = 0.003968.$$

In this calculation, we used two popular formulas, 

$$\sum_{k=0}^{n} k = \frac{n(n + 1)}{2}, \tag{4.7.13}$$

$$\sum_{k=0}^{n} k^2 = \frac{n(n + 1)(2n + 1)}{6}. \tag{4.7.14}$$

The overall M.S.E. is quite a bit smaller than the value 1/12 = 0.08333, which we 
would have obtained before observing X. As an illustration, Fig. 4.12 shows how 
much more spread out the marginal distribution of P is compared to the conditional 
distribution of P after observing X = 18. ◄ 


It should be emphasized that for the conditions of Example 4.7.8, 0.003968 is the 
appropriate value of the overall M.S.E. when it is known that the value of X will be 
available for predicting P but before the explicit value of X has been determined. 
After the value of X = x has been determined, the appropriate value of the M.S.E. is 
Var(P|x) = (x + 1)(41 − x)/75,852. Notice that the largest possible value of Var(P|x) is 0.005814 
when x = 20 and is still much less than 1/12. 
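The numbers quoted for Example 4.7.8 can be reproduced with a few lines of exact arithmetic. The sketch below takes the formulas E(P|x) = (x + 1)/42 and Var(P|x) = (x + 1)(41 − x)/(42² × 43) from the example as given, along with the fact that f_1(x) = 1/41; the function names are chosen here for illustration.

```python
from fractions import Fraction

N = 40  # number of patients in Example 4.7.8

def posterior_mean(x):
    """E(P | x) = (x + 1) / 42 for a uniform prior and x successes out of 40."""
    return Fraction(x + 1, N + 2)

def posterior_var(x):
    """Var(P | x) = (x + 1)(41 - x) / (42^2 * 43)."""
    return Fraction((x + 1) * (N + 1 - x), (N + 2) ** 2 * (N + 3))

# Marginally, X is uniform on {0, ..., 40} because f_1(x) = 1/41, so the
# overall M.S.E. E[Var(P|X)] is the plain average of the conditional variances.
overall_mse = sum(posterior_var(x) for x in range(N + 1)) / (N + 1)

# Largest conditional M.S.E. over all possible observed values x.
worst_x = max(range(N + 1), key=posterior_var)
worst_case = posterior_var(worst_x)

print(float(posterior_mean(18)), float(overall_mse), worst_x, float(worst_case))
```

The overall M.S.E. is an average over the values X might take, while the worst case is the conditional M.S.E. at one particular observed value, which is why the two numbers differ.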

A result similar to Theorem 4.7.3 holds if we are trying to minimize the M.A.E. 
(mean absolute error) of our prediction rather than the M.S.E. In Exercise 16, you 
can prove that the predictor that minimizes M.A.E. is d(X) equal to the median of 
the conditional distribution of Y given X. 


Summary 


The conditional mean E(Y|x) of Y given X =x is the mean of the conditional 
distribution of Y given X = x. This conditional distribution was defined in Chapter 3. 
Likewise, the conditional variance Var(Y|x) of Y given X =x is the variance of 
the conditional distribution. The law of total probability for expectations says that 
E[E(Y|X)] = E(Y). If we will observe X and then need to predict Y, the predictor 
that leads to the smallest M.S.E. is the conditional mean E(Y|X). 


Exercises 


1. Consider again the situation described in Example 
4.7.8. Compute the M.S.E. when using E(P|x) to predict 
P after observing X = 18. How much smaller is this than 
the marginal M.S.E. 1/12? 


2. Suppose that 20 percent of the students who took a 
certain test were from school A and that the arithmetic 
average of their scores on the test was 80. Suppose also 
that 30 percent of the students were from school B and that 
the arithmetic average of their scores was 76. Suppose, 
finally, that the other 50 percent of the students were from 
school C and that the arithmetic average of their scores 
was 84. If a student is selected at random from the entire 
group that took the test, what is the expected value of her 
score? 


3. Suppose that 0 < Var(X) < ∞ and 0 < Var(Y) < ∞. 
Show that if E(X|Y) is constant for all values of Y, then X 
and Y are uncorrelated. 


4. Suppose that the distribution of X is symmetric with 
respect to the point x = 0, that all moments of X exist, and 
that E(Y|X) = aX + b, where a and b are given constants. 
Show that X^(2m) and Y are uncorrelated for m = 1, 2, . . . . 


5. Suppose that a point X_1 is chosen from the uniform 
distribution on the interval [0, 1], and that after the value 
X_1 = x_1 is observed, a point X_2 is chosen from a uniform 
distribution on the interval [x_1, 1]. Suppose further that 
additional variables X_3, X_4, . . . are generated in the same 
way. In general, for j = 1, 2, . . . , after the value X_j = x_j 
has been observed, X_(j+1) is chosen from a uniform 
distribution on the interval [x_j, 1]. Find the value of E(X_n). 


6. Suppose that the joint distribution of X and Y is the uniform 
distribution on the circle x² + y² < 1. Find E(X|Y). 


7. Suppose that X and Y have a continuous joint distribution 
for which the joint p.d.f. is as follows: 


f(x, y) = x + y for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, 
f(x, y) = 0 otherwise. 


Find E(Y|X) and Var(Y|X). 


8. Consider again the conditions of Exercise 7. (a) If it 
is observed that X = 1/2, what predicted value of Y will 
have the smallest M.S.E.? (b) What will be the value of 
this M.S.E.? 


9. Consider again the conditions of Exercise 7. If the value 
of Y is to be predicted from the value of X, what will be 
the minimum value of the overall M.S.E.? 


10. Suppose that, for the conditions in Exercises 7 and 9, 
a person either can pay a cost c for the opportunity of 
observing the value of X before predicting the value of Y 


or can simply predict the value of Y without first observing 
the value of X. If the person considers her total loss to be 
the cost c plus the M.S.E. of her predicted value, what is 
the maximum value of c that she should be willing to pay? 


11. Prove Theorem 4.7.4. 


12. Suppose that X and Y are random variables such that 
E(Y|X) =aX +b. Assuming that Cov(X, Y) exists and 
that 0 < Var(X) < oo, determine expressions for a and b 
in terms of E(X), E(Y), Var(X), and Cov(X, Y). 


13. Suppose that a person’s score X on a mathematics 
aptitude test is a number in the interval (0, 1) and that 
his score Y on a music aptitude test is also a number in 
the interval (0, 1). Suppose also that in the population of 
all college students in the United States, the scores X and 
Y are distributed in accordance with the following joint 
p.d.f.: 


\[
f(x, y) = \begin{cases} \frac{2}{5}(2x + 3y) & \text{for } 0 < x < 1 \text{ and } 0 < y < 1, \\ 0 & \text{otherwise.} \end{cases}
\]


a. If a college student is selected at random, what predicted
value of his score on the music test has the
smallest M.S.E.?


b. What predicted value of his score on the mathematics 
test has the smallest M.A.E.? 


14. Consider again the conditions of Exercise 13. Are the 
scores of college students on the mathematics test and the 
music test positively correlated, negatively correlated, or 
uncorrelated? 


15. Consider again the conditions of Exercise 13. (a) If a
student’s score on the mathematics test is 0.8, what predicted
value of his score on the music test has the smallest
M.S.E.? (b) If a student’s score on the music test is 1/3, 
what predicted value of his score on the mathematics test 
has the smallest M.A.E.? 


16. Define a conditional median of Y given X = x to be 
any median of the conditional distribution of Y given X = 
x. Suppose that we will get to observe X and then we will 
need to predict Y. Suppose that we wish to choose our 
prediction d(X) so as to minimize mean absolute error, 
$E(|Y - d(X)|)$. Prove that d(x) should be chosen to be
a conditional median of Y given X =x. Hint: You can 
modify the proof of Theorem 4.7.3 to handle this case. 


17. Prove Theorem 4.7.2 for the case in which X and Y 
have a discrete joint distribution. The key to the proof is 
to write all of the necessary conditional p.f.’s in terms of 
the joint p.f. of X and Y and the marginal p.f. of X. To 
facilitate this, for each x and z, give a name to the set of y 
values such that r(x, y) = z.


*4.8 Utility


Much of statistical inference consists of choosing between several available actions. 
Generally, we do not know for certain which choice will be best, because some 
important random variable has not yet been observed. For some values of that 
random variable one choice is best, and for other values some other choice is 
best. We can try to weigh the costs and benefits of the various choices against the 
probabilities that the various choices turn out to be best. Utility is one tool for 
assigning values to the costs and benefits of our choices. The expected value of the 
utility then balances the costs and benefits according to how likely the uncertain 
possibilities are. 


Utility Functions 


Choice of Gambles. Consider two gambles between which a gambler must choose. 
Each gamble will be expressed as a random variable for which positive values mean 
a gain to the gambler and negative values mean a loss to the gambler. The numerical 
values of each random variable tell the number of dollars that the gambler gains or 
loses. Let X have the p.f. 


\[
f(x) = \begin{cases} 0.5 & \text{if } x = 500 \text{ or } x = -350, \\ 0 & \text{otherwise,} \end{cases}
\]

and let Y have the p.f.

\[
g(y) = \begin{cases} 1/3 & \text{if } y = 40,\ y = 50, \text{ or } y = 60, \\ 0 & \text{otherwise.} \end{cases}
\]

It is simple to compute that E(X) = 75 and E(Y) = 50. How might a gambler choose
between these two gambles? Is X better than Y simply because it has higher expected
value?


In Example 4.8.1, a gambler who does not desire to risk losing 350 dollars for the 
chance of winning 500 dollars might prefer Y, which yields a certain gain of at least 
40 dollars. 

The theory of utility was developed during the 1930s and 1940s to describe a 
person’s preference among gambles like those in Example 4.8.1. According to that 
theory, a person will prefer a gamble X for which the expectation of a certain 
function U(X) is a maximum, rather than a gamble for which simply the expected 
gain E(X) is a maximum. 


Utility Function. A person’s utility function U is a function that assigns to each possible
amount $x$ ($-\infty < x < \infty$) a number $U(x)$ representing the actual worth to the
person of gaining the amount x.


Choice of Gambles. Suppose that a person’s utility function is U and that she must 
choose between the gambles X and Y in Example 4.8.1. Then 


\[
E[U(X)] = \tfrac{1}{2} U(500) + \tfrac{1}{2} U(-350) \tag{4.8.1}
\]

and

\[
E[U(Y)] = \tfrac{1}{3} U(60) + \tfrac{1}{3} U(50) + \tfrac{1}{3} U(40). \tag{4.8.2}
\]

[Figure 4.13: The utility function for Example 4.8.2.]

The person would prefer the gamble for which the expected utility of the gain, as
specified by Eq. (4.8.1) or Eq. (4.8.2), is larger.

As a specific example, consider the following utility function that penalizes losses 
to a much greater extent than it rewards gains: 


\[
U(x) = \begin{cases} 100 \log(x + 100) - 461 & \text{if } x \ge 0, \\ x & \text{if } x < 0. \end{cases} \tag{4.8.3}
\]

This function was chosen to be differentiable at x = 0, continuous everywhere, in- 
creasing, concave for x > 0, and linear for x < 0. A graph of U(x) is given in Fig. 4.13. 
Using this specific U, we compute 


\[
E[U(X)] = \tfrac{1}{2}[100 \log(600) - 461] + \tfrac{1}{2}(-350) = -85.4,
\]

\[
E[U(Y)] = \tfrac{1}{3}[100 \log(160) - 461] + \tfrac{1}{3}[100 \log(150) - 461] + \tfrac{1}{3}[100 \log(140) - 461] = 40.4.
\]

We see that a person with the utility function in Eq. (4.8.3) would prefer Y to X.
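The comparison in Example 4.8.2 can be checked numerically. The following is a minimal sketch, not part of the original example. The function names and the dictionary representation of a p.f. are our own choices. We also assume the unrounded constant 100 log 100 (about 460.5) in place of the printed 461, since that value makes U exactly continuous at 0 and appears to reproduce the printed results to one decimal place.

```python
import math

def utility(x, c=100 * math.log(100)):
    """Utility in the spirit of Eq. (4.8.3): concave for gains, linear for losses.

    Assumption: c = 100*log(100), about 460.5, which the text prints as 461.
    """
    if x >= 0:
        return 100 * math.log(x + 100) - c
    return x

def expected_utility(pf, u=utility):
    """Expected utility of a discrete gamble given as {gain: probability}."""
    return sum(p * u(x) for x, p in pf.items())

gamble_x = {500: 0.5, -350: 0.5}            # p.f. of X in Example 4.8.1
gamble_y = {40: 1 / 3, 50: 1 / 3, 60: 1 / 3}  # p.f. of Y in Example 4.8.1

eu_x = expected_utility(gamble_x)
eu_y = expected_utility(gamble_y)
print(round(eu_x, 1), round(eu_y, 1))
print("prefers Y" if eu_y > eu_x else "prefers X")
```

Passing `c=461` instead changes both values by a few tenths but leaves the preference for Y unchanged.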


Here, we formalize the principle that underlies the choice between gambles 
illustrated in Example 4.8.1. 


Maximizing Expected Utility. We say that a person chooses between gambles by 
maximizing expected utility if the following conditions hold. There is a utility function 
U, and when the person must choose between any two gambles X and Y, he will 
prefer X to Y if E[U(X)] > E[U(Y)] and will be indifferent between X and Y if
E[U(X)] = E[U(Y)].


In words, Definition 4.8.2 says that a person chooses between gambles by maximizing 
expected utility if he will choose a gamble X for which E[U(X)] is a maximum. 

If one adopts a utility function, then one can (at least in principle) make choices 
between gambles by maximizing expected utility. The computational algorithms nec- 
essary to perform the maximization often provide a practical challenge. Conversely, 
if one makes choices between gambles in such a way that certain reasonable criteria 
apply, then one can prove that there exists a utility function such that the choices
correspond to maximizing expected utility. We shall not consider this latter problem
in detail here; however, it is discussed by DeGroot (1970) and Schervish (1995,
chapter 3) along with other aspects of the theory of utility.


Examples of Utility Functions 


Since it is reasonable to assume that every person prefers a larger gain to a smaller 
gain, we shall assume that every utility function U(x) is an increasing function of 
the gain x. However, the shape of the function U(x) will vary from person to person 
and will depend on each person’s willingness to risk losses of various amounts in 
attempting to increase his gains. 

For example, consider two gambles X and Y for which the gains have the follow- 
ing probability distributions: 


\[
\Pr(X = -3) = 0.5, \quad \Pr(X = 2.5) = 0.4, \quad \Pr(X = 6) = 0.1 \tag{4.8.4}
\]
and
\[
\Pr(Y = -2) = 0.3, \quad \Pr(Y = 1) = 0.4, \quad \Pr(Y = 3) = 0.3. \tag{4.8.5}
\]


We shall assume that a person must choose one of the following three decisions: 
(i) accept gamble X, (ii) accept gamble Y, or (iii) do not accept either gamble. We 
shall now determine the decision that a person would choose for three different utility 
functions. 


Linear Utility Function. Suppose that U(x) = ax + b for some constants a and b, where
a > 0. In this case, for every gamble X, E[U(X)] = aE(X) + b. Hence, for every two
gambles X and Y, E[U(X)] > E[U(Y)] if and only if E(X) > E(Y). In other words, a
person who has a linear utility function will always choose a gamble for which the 
expected gain is a maximum. 

When the gambles X and Y are defined by Eqs. (4.8.4) and (4.8.5), 


\[
E(X) = (0.5)(-3) + (0.4)(2.5) + (0.1)(6) = 0.1
\]
and
\[
E(Y) = (0.3)(-2) + (0.4)(1) + (0.3)(3) = 0.7.
\]


Furthermore, since the gain from not accepting either of these gambles is 0, the
expected gain from choosing not to accept either gamble is clearly 0. Since E(Y) >
E(X) > 0, it follows that a person who has a linear utility function would choose to
accept gamble Y. If gamble Y were not available, then the person would prefer to
accept gamble X rather than not to gamble at all.


Cubic Utility Function. Suppose that a person’s utility function is $U(x) = x^3$ for
$-\infty < x < \infty$. Then for the gambles defined by Eqs. (4.8.4) and (4.8.5),

\[
E[U(X)] = (0.5)(-3)^3 + (0.4)(2.5)^3 + (0.1)(6)^3 = 14.35
\]
and
\[
E[U(Y)] = (0.3)(-2)^3 + (0.4)(1)^3 + (0.3)(3)^3 = 6.1.
\]


Furthermore, the utility of not accepting either gamble is $U(0) = 0^3 = 0$. Since
E[U(X)] > E[U(Y)] > 0, it follows that the person would choose to accept gamble X.
If gamble X were not available, the person would prefer to accept gamble Y rather
than not to gamble at all.


Logarithmic Utility Function. Suppose that a person’s utility function is $U(x) = \log(x + 4)$
for $x > -4$. Since $\lim_{x \to -4} \log(x + 4) = -\infty$, a person who has this utility function
cannot choose a gamble in which there is any possibility of her gain being $-4$ or less.
For the gambles X and Y defined by Eqs. (4.8.4) and (4.8.5),

\[
E[U(X)] = (0.5)(\log 1) + (0.4)(\log 6.5) + (0.1)(\log 10) = 0.9790
\]
and
\[
E[U(Y)] = (0.3)(\log 2) + (0.4)(\log 5) + (0.3)(\log 7) = 1.4355.
\]


Furthermore, the utility of not accepting either gamble is U(0) = log 4 = 1.3863. Since
E[U(Y)] > U(0) > E[U(X)], it follows that the person would choose to accept gamble
Y. If gamble Y were not available, the person would prefer not to gamble at all rather
than to accept gamble X.
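The three examples above follow one recipe: compute the expected utility of each available decision and pick the largest. The sketch below illustrates that recipe under our own assumptions. The names `expected_utility` and `best_decision` are hypothetical, and the linear utility is taken with a = 1 and b = 0 purely for illustration.

```python
import math

gamble_x = {-3: 0.5, 2.5: 0.4, 6: 0.1}   # Eq. (4.8.4)
gamble_y = {-2: 0.3, 1: 0.4, 3: 0.3}     # Eq. (4.8.5)
no_gamble = {0: 1.0}                      # certain gain of 0

utilities = {
    "linear": lambda x: x,                # a = 1, b = 0 (illustrative choice)
    "cubic": lambda x: x ** 3,
    "log": lambda x: math.log(x + 4),     # defined only for x > -4
}

def expected_utility(pf, u):
    """Expected utility of a discrete gamble given as {gain: probability}."""
    return sum(p * u(x) for x, p in pf.items())

def best_decision(u):
    """Return the decision with the largest expected utility."""
    options = {"X": gamble_x, "Y": gamble_y, "neither": no_gamble}
    return max(options, key=lambda name: expected_utility(options[name], u))

for name, u in utilities.items():
    print(name, best_decision(u))
```

Under these assumptions the choices agree with the text: Y for the linear and logarithmic utilities, X for the cubic utility.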


Selling a Lottery Ticket 


Suppose that a person has a lottery ticket from which she will receive a random gain 
of X dollars, where X has a specified probability distribution. We shall determine the 
number of dollars for which the person would be willing to sell this lottery ticket. 
Let U denote the person’s utility function. Then the expected utility of her gain 
from the lottery ticket is E[U(X)]. If she sells the lottery ticket for $x_0$ dollars, then her
gain is $x_0$ dollars, and the utility of this gain is $U(x_0)$. The person would prefer to accept
$x_0$ dollars as a certain gain rather than accept the random gain X from the lottery
ticket if and only if $U(x_0) > E[U(X)]$. Hence, the person would be willing to sell the
lottery ticket for any amount $x_0$ such that $U(x_0) > E[U(X)]$. If $U(x_0) = E[U(X)]$, she
would be equally willing to either sell the lottery ticket or accept the random gain X.


Quadratic Utility Function. Suppose that $U(x) = x^2$ for $x \ge 0$, and suppose that the
person has a lottery ticket from which she will win either 36 dollars with probability
1/4 or 0 dollars with probability 3/4. For how many dollars $x_0$ would she be willing to
sell this lottery ticket?

The expected utility of the gain from the lottery ticket is

\[
E[U(X)] = \tfrac{1}{4} U(36) + \tfrac{3}{4} U(0) = \tfrac{1}{4}(36^2) + \tfrac{3}{4}(0) = 324.
\]

Therefore, the person would be willing to sell the lottery ticket for any amount $x_0$
such that $U(x_0) = x_0^2 > 324$. Hence, $x_0 > 18$. In other words, although the expected
gain from the lottery ticket in this example is only 9 dollars, the person would not
sell the ticket for less than 18 dollars.


Square Root Utility Function. Suppose now that $U(x) = x^{1/2}$ for $x \ge 0$, and consider
again the lottery ticket described in Example 4.8.6. The expected utility of the gain
from the lottery ticket in this case is

\[
E[U(X)] = \tfrac{1}{4} U(36) + \tfrac{3}{4} U(0) = \tfrac{1}{4}(6) + \tfrac{3}{4}(0) = 1.5.
\]

Therefore, the person would be willing to sell the lottery ticket for any amount $x_0$
such that $U(x_0) = x_0^{1/2} > 1.5$. Hence, $x_0 > 2.25$. In other words, although the expected
gain from the lottery ticket in this example is 9 dollars, the person would be willing
to sell the ticket for as little as 2.25 dollars.
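When U is increasing and has a known inverse, the price at which the person is indifferent between selling and keeping the ticket solves $U(x_0) = E[U(X)]$. The sketch below computes that price for the two utility functions just discussed. The helper names are hypothetical, and the inverse functions are supplied by hand rather than derived.

```python
def expected_utility(pf, u):
    """Expected utility of a discrete gamble given as {gain: probability}."""
    return sum(p * u(x) for x, p in pf.items())

def indifference_price(pf, u, u_inverse):
    """Price x0 solving U(x0) = E[U(X)]; any larger offer would be accepted."""
    return u_inverse(expected_utility(pf, u))

# Lottery ticket of Example 4.8.6: 36 dollars w.p. 1/4, 0 dollars w.p. 3/4.
ticket = {36: 0.25, 0: 0.75}

price_quadratic = indifference_price(ticket, lambda x: x ** 2, lambda v: v ** 0.5)
price_sqrt = indifference_price(ticket, lambda x: x ** 0.5, lambda v: v ** 2)
expected_gain = sum(p * x for x, p in ticket.items())

print(price_quadratic, price_sqrt, expected_gain)
```

The convex utility puts the indifference price above the expected gain of 9 dollars, and the concave utility puts it below.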


Some Statistical Decision Problems


Much of the theory of statistical inference (the subject of Chapters 7-11 of this 
text) deals with problems in which one has to make one of several available choices. 
Generally, which choice is best depends on some random variable that has not yet 
been observed. One example was already discussed in Sec. 4.5, where we introduced 
the mean squared error (M.S.E.) and mean absolute error (M.A.E.) criteria for 
predicting a random variable. In these cases, we have to choose a number d for our 
prediction of a random variable Y. Which prediction will be best depends on the 
value of Y that we do not yet know. Random variables like $-|Y - d|$ and $-(Y - d)^2$
are gambles, and the choice of gamble that minimizes M.A.E. or M.S.E. is the choice 
that maximizes an expected utility. 


Predicting a Random Variable. Suppose that Y is a random variable that we need 
to predict. For each possible prediction d, there is a gamble $X_d = -|Y - d|$ that
specifies our gain when we are being judged by absolute error. Alternatively, if we
are being judged by squared error, the appropriate gamble to consider would be
$Z_d = -(Y - d)^2$. Notice that these gambles are always negative, meaning that our
gain is negative because we lose according to how far Y is from the prediction d. If our
utility U is linear, then maximizing $E[U(X_d)]$ by choice of d is the same as minimizing
M.A.E. Also, maximizing $E[U(Z_d)]$ by choice of d is the same as minimizing M.S.E.
The equivalence between maximizing expected utility and minimizing the mean error
would continue to hold if the prediction were allowed to depend on another random
variable W that we could observe before predicting. That is, our prediction would be
a function d(W), and $X_d = -|Y - d(W)|$ or $Z_d = -[Y - d(W)]^2$ would be the gamble
whose expected utility we would want to compute.
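To see the equivalence concretely, one can search over candidate predictions d and maximize the expected gain directly. The sketch below does this for a small discrete distribution of Y that we made up for illustration, with the linear utility U(x) = x assumed. The grid search is a crude stand-in for exact optimization.

```python
def expected_gain(d, values, probs, gamble):
    """E[gamble(Y, d)] for a discrete Y with the given values and probabilities."""
    return sum(p * gamble(y, d) for y, p in zip(values, probs))

def abs_gamble(y, d):
    return -abs(y - d)       # X_d = -|Y - d|

def sq_gamble(y, d):
    return -(y - d) ** 2     # Z_d = -(Y - d)^2

# Hypothetical distribution of Y, chosen only for illustration.
values = [0, 1, 2, 10]
probs = [0.2, 0.2, 0.3, 0.3]

# Candidate predictions 0.0, 0.1, ..., 10.0.
grid = [k / 10 for k in range(0, 101)]

best_abs = max(grid, key=lambda d: expected_gain(d, values, probs, abs_gamble))
best_sq = max(grid, key=lambda d: expected_gain(d, values, probs, sq_gamble))
mean = sum(p * y for y, p in zip(values, probs))

print(best_abs, best_sq, mean)
```

For this distribution the median is 2 and the mean is 3.8, so the two maximizers coincide with the M.A.E. and M.S.E. predictions of Sec. 4.5.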


Bounding a Random Variable. Suppose that Y is a random variable and that we are 
interested in whether or not $Y \le c$ for some constant c. For example, Y could be
the random variable P in our clinical trial Example 4.7.3. We might be interested in
whether or not $P \le p_0$, where $p_0$ is the probability that a patient will be a success
without any help from the treatment being studied. Suppose that we have to make
one of two available decisions:


(t) continue to promote the treatment, or
(a) abandon the treatment.


If we choose t, suppose that we stand to gain

\[
X_t = \begin{cases} 10^6 & \text{if } P > p_0, \\ -10^6 & \text{if } P \le p_0. \end{cases}
\]

If we choose a, our gain will be $X_a = 0$. If our utility function is U, then the expected
utility for choosing t is $E[U(X_t)]$, and t would be the better choice if this value is
greater than U(0). For example, suppose that our utility is

\[
U(x) = \begin{cases} x^{0.8} & \text{if } x \ge 0, \\ x & \text{if } x < 0. \end{cases} \tag{4.8.6}
\]

Then U(0) = 0 and

\[
\begin{aligned}
E[U(X_t)] &= -10^6 \Pr(P \le p_0) + [10^6]^{0.8} \Pr(P > p_0) \\
&= 10^{4.8} - (10^6 + 10^{4.8}) \Pr(P \le p_0).
\end{aligned}
\]

So, $E[U(X_t)] > 0$ if $\Pr(P \le p_0) < 10^{4.8}/(10^6 + 10^{4.8}) = 0.0594$. It makes sense that t is
better than a if $\Pr(P \le p_0)$ is small. The reason is that the utility of choosing t over a
is only positive when $P > p_0$. This example is in the spirit of hypothesis testing, which
will be the subject of Chapter 9.
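The threshold 0.0594 can be verified with a few lines of arithmetic. The sketch below is only a check of the calculation above. The function name and the argument q, standing for $\Pr(P \le p_0)$, are our own notation.

```python
def expected_utility_promote(q, stake=1e6, power=0.8):
    """E[U(X_t)] under Eq. (4.8.6), where q = Pr(P <= p0).

    A loss of `stake` has utility -stake; a gain of `stake` has utility stake**power.
    """
    return -stake * q + stake ** power * (1 - q)

# Promoting beats abandoning (utility 0) exactly when q is below this value.
threshold = 1e6 ** 0.8 / (1e6 + 1e6 ** 0.8)
print(round(threshold, 4))

for q in (0.05, 0.07):
    better = "promote" if expected_utility_promote(q) > 0 else "abandon"
    print(q, better)
```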


Investment. In Example 4.2.2, we compared two possible stock purchases based 
on their expected returns and value at risk, VaR. Suppose that the investor has a 
nonlinear utility function for dollars. To be specific, suppose that the utility of a return 
of x would equal U(x) given in Eq. (4.8.6). We can calculate the expected utility of 
the return from each of the two possible stock purchases in Example 4.2.2 to decide 
which is more favorable. If R is the return per share and we buy s shares, then the 
return is X = sR, and the expected utility of the return is

\[
E[U(sR)] = \int_{-\infty}^{0} s r f(r)\, dr + \int_{0}^{\infty} (sr)^{0.8} f(r)\, dr, \tag{4.8.7}
\]

where f is the p.d.f. of R. For the first stock, the return per share is $R_1$ distributed
uniformly on the interval $[-10, 20]$, and the number of shares would be $s_1 = 120$. This
makes (4.8.7) equal to

\[
E[U(120 R_1)] = \int_{-10}^{0} \frac{120 r}{30}\, dr + \int_{0}^{20} \frac{(120 r)^{0.8}}{30}\, dr = -12.6.
\]

For the second stock, the return per share is $R_2$ distributed uniformly on the interval
$[-4.5, 10]$, and the number of shares would be $s_2 = 200$. This makes (4.8.7) equal to

\[
E[U(200 R_2)] = \int_{-4.5}^{0} \frac{200 r}{14.5}\, dr + \int_{0}^{10} \frac{(200 r)^{0.8}}{14.5}\, dr = 27.9.
\]

With this utility function, the expected utility of the first stock purchase is actually
negative because the big gains (up to $120 \times 20 = 2400$) add less to the utility ($2400^{0.8} = 506$)
than the big losses (up to $120 \times (-10) = -1200$) take away from the utility. The
second stock purchase has positive expected utility, so it would be the preferred
choice in this example.
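The two integrals above can be approximated numerically as a check. The sketch below uses a simple midpoint rule. The function names and the number of subintervals are arbitrary choices of ours, and any standard quadrature routine would serve equally well.

```python
def utility(x):
    """Utility of Eq. (4.8.6): x**0.8 for gains, x itself for losses."""
    return x ** 0.8 if x >= 0 else x

def expected_utility_uniform(shares, low, high, steps=100_000):
    """Midpoint-rule approximation of E[U(shares * R)], R uniform on [low, high]."""
    width = (high - low) / steps
    total = 0.0
    for i in range(steps):
        r = low + (i + 0.5) * width      # midpoint of the i-th subinterval
        total += utility(shares * r)
    # Average of U over equally likely midpoints approximates the integral
    # of U(shares * r) times the uniform density 1 / (high - low).
    return total / steps

first = expected_utility_uniform(120, -10.0, 20.0)
second = expected_utility_uniform(200, -4.5, 10.0)
print(round(first, 1), round(second, 1))
```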


Summary 


When we have to make choices in the face of uncertainty, we need to assess what our 
gains and losses will be under each of the uncertain possibilities. Utility is the value 
to us of those gains and losses. For example, if X represents the random gain from 
a possible choice, then U(X) is the value to us of the random gain we would receive 
if we were to make that choice. We should make the choice such that E[U(X)] is as 
large as possible. 


1. Let α > 0. A decision maker has a utility function for
money of the form

\[
U(x) = \begin{cases} x^{\alpha} & \text{if } x > 0, \\ x & \text{if } x \le 0. \end{cases}
\]

Suppose that this decision maker is trying to decide
whether or not to buy a lottery ticket for $1. The lottery
ticket pays $500 with probability 0.001, and it pays $0 with
probability 0.999. What would the values of α have to be
in order for this decision maker to prefer buying the ticket
to not buying it?


2. Consider three gambles X, Y, and Z for which the 
probability distributions of the gains are as follows: 


Pr(X = 5) = Pr(X = 25) = 1/2,
Pr(¥ = 10) = Pr(Y = 20) = 1/2, 
Pr(Z = 15) =1. 


Suppose that a person’s utility function has the form 
$U(x) = x^2$ for x > 0. Which of the three gambles would she
prefer? 


3. Determine which of the three gambles in Exercise 2 
would be preferred by a person whose utility function is 
$U(x) = x^{1/2}$ for x > 0.


4. Determine which of the three gambles in Exercise 2 
would be preferred by a person whose utility function 
has the form U(x) = ax + b, where a and b are constants 
(a> 0). 


5. Consider a utility function U for which U(0) = 0 and 
U(100) = 1. Suppose that a person who has this utility 
function is indifferent to either accepting a gamble from 
which his gain will be 0 dollars with probability 1/3 or 100 
dollars with probability 2/3 or accepting 50 dollars as a 
sure thing. What is the value of U(50)? 


6. Consider a utility function U for which U(0) = 5,
U(1) = 8, and U(2) = 10. Suppose that a person who has 
this utility function is indifferent to either of two gambles 
X and Y, for which the probability distributions of the 
gains are as follows: 


\[
\begin{aligned}
&\Pr(X = -1) = 0.6, \quad \Pr(X = 0) = 0.2, \quad \Pr(X = 2) = 0.2; \\
&\Pr(Y = 0) = 0.9, \quad \Pr(Y = 1) = 0.1.
\end{aligned}
\]


What is the value of U(—1)? 


7. Suppose that a person must accept a gamble X of the 
following form: 


\[
\Pr(X = a) = p \quad \text{and} \quad \Pr(X = 1 - a) = 1 - p,
\]


where p is a given number such that 0 < p < 1. Suppose
also that the person can choose and fix the value of a
(0 < a < 1) to be used in this gamble. Determine the value
of a that the person would choose if his utility function 
was U(x) = log x for x > 0. 


8. Determine the value of a that a person would choose in 
Exercise 7 if his utility function was $U(x) = x^{1/2}$ for x > 0.


9. Determine the value of a that a person would choose 
in Exercise 7 if his utility function was U(x) =x for x > 0. 


10. Consider four gambles $X_1$, $X_2$, $X_3$, and $X_4$, for which
the probability distributions of the gains are as follows:

\[
\begin{aligned}
&\Pr(X_1 = 0) = 0.2, \quad \Pr(X_1 = 1) = 0.5, \quad \Pr(X_1 = 2) = 0.3; \\
&\Pr(X_2 = 0) = 0.4, \quad \Pr(X_2 = 1) = 0.2, \quad \Pr(X_2 = 2) = 0.4; \\
&\Pr(X_3 = 0) = 0.3, \quad \Pr(X_3 = 1) = 0.3, \quad \Pr(X_3 = 2) = 0.4; \\
&\Pr(X_4 = 0) = \Pr(X_4 = 2) = 0.5.
\end{aligned}
\]


Suppose that a person’s utility function is such that she
prefers $X_1$ to $X_2$. If the person were forced to accept either
$X_3$ or $X_4$, which one would she choose?


11. Suppose that a person has a given fortune A > 0 and 
can bet any amount b of this fortune in a certain game 
(0 <b < A). If he wins the bet, then his fortune becomes 
A +b; if he loses the bet, then his fortune becomes A — b. 
In general, let X denote his fortune after he has won or 
lost. Assume that the probability of his winning is p (0 < 
p <1) and the probability of his losing is 1 — p. Assume 
also that his utility function, as a function of his final for- 
tune x, is U(x) = log x for x > 0. If the person wishes to 
bet an amount b for which the expected utility of his for- 
tune E[U(X)] will be a maximum, what amount b should 
he bet? 


12. Determine the amount b that the person should bet in 
Exercise 11 if his utility function is $U(x) = x^{1/2}$ for x > 0.


13. Determine the amount b that the person should bet in 
Exercise 11 if his utility function is U(x) = x for x > 0. 


14. Determine the amount b that the person should bet in 
Exercise 11 if his utility function is $U(x) = x^2$ for x > 0.


15. Suppose that a person has a lottery ticket from which 
she will win X dollars, where X has the uniform distribu- 
tion on the interval [0, 4]. Suppose also that the person’s 
utility function is $U(x) = x^{\alpha}$ for x > 0, where $\alpha$ is a given
positive constant. For how many dollars $x_0$ would the person
be willing to sell this lottery ticket?


16. Let Y be a random variable that we would like to predict.
Suppose that we must choose a single number d as the
prediction and that we will lose $(Y - d)^2$ dollars. Suppose
that our utility for dollars is a square root function:

\[
U(x) = \begin{cases} x^{1/2} & \text{if } x \ge 0, \\ -(-x)^{1/2} & \text{if } x < 0. \end{cases}
\]

Prove that the value of d that maximizes expected utility
is a median of the distribution of Y.

17. Reconsider the conditions of Example 4.8.9. This
time, suppose that $p_0 = 1/2$ and

\[
U(x) = \begin{cases} x^{0.9} & \text{if } x \ge 0, \\ x & \text{if } x < 0. \end{cases}
\]

Suppose also that P has p.d.f. $f(p) = 56 p^6 (1 - p)$ for 0 < p < 1.
Decide whether or not it is better to abandon the
treatment.


1. Suppose that the random variable X has a continuous
distribution with c.d.f. F(x) and p.d.f. f. Suppose also that
E(X) exists. Prove that

\[
\lim_{x \to \infty} x[1 - F(x)] = 0.
\]

Hint: Use the fact that if E(X) exists, then

\[
E(X) = \lim_{u \to \infty} \int_{-\infty}^{u} x f(x)\, dx.
\]


2. Suppose that the random variable X has a continuous 
distribution with c.d.f. F(x). Suppose also that Pr(X > 0) = 
1 and that E(X) exists. Show that 


\[
E(X) = \int_{0}^{\infty} [1 - F(x)]\, dx.
\]


Hint: You may use the result proven in Exercise 1. 


3. Consider again the conditions of Exercise 2, but sup- 
pose now that X has a discrete distribution with c.d.f. F(x), 
rather than a continuous distribution. Show that the con- 
clusion of Exercise 2 still holds. 


4. Suppose that X, Y, and Z are nonnegative random 
variables such that Pr(X + Y + Z < 1.3) = 1. Show that X,
Y, and Z cannot possibly have a joint distribution under 
which each of their marginal distributions is the uniform 
distribution on the interval [0, 1]. 


5. Suppose that the random variable X has mean $\mu$ and
variance $\sigma^2$, and that Y = aX + b. Determine the values
of a and b for which E(Y) =0 and Var(Y) = 1. 


6. Determine the expectation of the range of a random 
sample of size n from the uniform distribution on the 
interval [0, 1]. 


7. Suppose that an automobile dealer pays an amount X 
(in thousands of dollars) for a used car and then sells it for 
an amount Y. Suppose that the random variables X and Y 
have the following joint p.d.f.:


fe y= | for0O<x <y <6, 


0 otherwise. 
Determine the dealer’s expected gain from the sale. 


8. Suppose that $X_1, \ldots, X_n$ form a random sample of size
n from a continuous distribution with the following p.d.f.:

\[
f(x) = \begin{cases} 2x & \text{for } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases}
\]

Let $Y_n = \max\{X_1, \ldots, X_n\}$. Evaluate $E(Y_n)$.


9. If m is a median of the distribution of X, and if Y = r(X)
is either a nondecreasing or a nonincreasing function of X, 
show that r(m) is a median of the distribution of Y. 


10. Suppose that $X_1, \ldots, X_n$ are i.i.d. random variables,
each of which has a continuous distribution with median
m. Let $Y_n = \max\{X_1, \ldots, X_n\}$. Determine the value of
$\Pr(Y_n > m)$.


11. Suppose that you are going to sell cola at a football 
game and must decide in advance how much to order. 
Suppose that the demand for cola at the game, in liters, 
has a continuous distribution with p.d.f. f(x). Suppose that 
you make a profit of g cents on each liter that you sell at 
the game and suffer a loss of c cents on each liter that you 
order but do not sell. What is the optimal amount of cola 
for you to order so as to maximize your expected net gain? 


12. Suppose that the number of hours X for which a ma- 
chine will operate before it fails has a continuous distribu- 
tion with p.d.f. f(x). Suppose that at the time at which the 
machine begins operating you must decide when you will 
return to inspect it. If you return before the machine has 
failed, you incur a cost of b dollars for having wasted an 
inspection. If you return after the machine has failed, you 
incur a cost of c dollars per hour for the length of time dur- 
ing which the machine was not operating after its failure. 
What is the optimal number of hours to wait before you 
return for inspection in order to minimize your expected 
cost? 


13. Suppose that X and Y are random variables for which 
E(X) =3, E(Y) =1, Var(X) = 4, and Var(Y) = 9. Let Z = 
5X —Y +15. Find E(Z) and Var(Z) under each of the 
following conditions: (a) X and Y are independent; (b) 
X and Y are uncorrelated; (c) the correlation of X and Y
is 0.25. 


14. Suppose that $X_0, X_1, \ldots, X_n$ are independent random
variables, each having the same variance $\sigma^2$. Let
$Y_j = X_j - X_{j-1}$ for $j = 1, \ldots, n$, and let

\[
\bar{Y}_n = \frac{1}{n} \sum_{j=1}^{n} Y_j.
\]

Determine the value of $\mathrm{Var}(\bar{Y}_n)$.


15. Suppose that $X_1, \ldots, X_n$ are random variables for
which $\mathrm{Var}(X_i)$ has the same value $\sigma^2$ for $i = 1, \ldots, n$ and
$\rho(X_i, X_j)$ has the same value $\rho$ for every pair of values i
and j such that $i \ne j$. Prove that $\rho \ge -\frac{1}{n-1}$.


16. Suppose that the joint distribution of X and Y is the 
uniform distribution over a rectangle with sides parallel 
to the coordinate axes in the xy-plane. Determine the 
correlation of X and Y. 


17. Suppose that n letters are put at random into n en- 
velopes, as in the matching problem described in Sec. 1.10. 
Determine the variance of the number of letters that are 
placed in the correct envelopes. 


18. Suppose that the random variable X has mean $\mu$ and
variance $\sigma^2$. Show that the third central moment of X can
be expressed as $E(X^3) - 3\mu\sigma^2 - \mu^3$.


19. Suppose that X is a random variable with m.g.f. $\psi(t)$,
mean $\mu$, and variance $\sigma^2$; and let $c(t) = \log[\psi(t)]$. Prove
that $c'(0) = \mu$ and $c''(0) = \sigma^2$.


20. Suppose that X and Y have a joint distribution with
means $\mu_X$ and $\mu_Y$, standard deviations $\sigma_X$ and $\sigma_Y$, and
correlation $\rho$. Show that if E(Y|X) is a linear function of
X, then

\[
E(Y|X) = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(X - \mu_X).
\]


21. Suppose that X and Y are random variables such that 
E(Y|X) =7—(1/4)X and E(X|Y) =10 — Y. Determine 
the correlation of X and Y. 


22. Suppose that a stick having a length of 3 feet is broken 
into two pieces, and that the point at which the stick is 
broken is chosen in accordance with the p.d.f. f(x). What
is the correlation between the length of the longer piece 
and the length of the shorter piece? 


23. Suppose that X and Y have a joint distribution with
correlation $\rho > 1/2$ and that Var(X) = Var(Y) = 1. Show
that $b = -\frac{1}{2\rho}$ is the unique value of b such that the correlation
of X and X + bY is also $\rho$.


24. Suppose that four apartment buildings A, B,C, and D 
are located along a highway at the points 0, 1, 3, and 5, as 
shown in the following figure. Suppose also that 10 percent 
of the employees of a certain company live in building A, 
20 percent live in B, 30 percent live in C, and 40 percent 
live in D. 


a. Where should the company build its new office in or- 
der to minimize the total distance that its employees 
must travel? 


b. Where should the company build its new office in
order to minimize the sum of the squared distances
that its employees must travel?

[Figure: buildings A, B, C, and D located at the points 0, 1, 3, and 5 on a highway marked from 0 to 7.]


25. Suppose that X and Y have the following joint p.d.f.:


es pi for0<y<x <1, 


0 otherwise. 
Suppose also that the observed value of X is 0.2. 


a. What predicted value of Y has the smallest M.S.E.? 
b. What predicted value of Y has the smallest M.A.E.? 


26. For all random variables X, Y, and Z, let Cov(X, Y|z) 
denote the covariance of X and Y in their conditional joint 
distribution given Z = z. Prove that 


Cov(X, Y) = E[Cov(X, Y|Z)] 
+ Cov[E(X|Z), E(Y|Z)]. 


27. Consider the box of red and blue balls in Exam- 
ples 4.2.4 and 4.2.5. Suppose that we sample n > 1 balls 
with replacement, and let X be the number of red balls in 
the sample. Then we sample n balls without replacement, 
and we let Y be the number of red balls in the sample. 
Prove that Pr(X =n) > Pr(Y =n). 


28. Suppose that a person’s utility function is $U(x) = x^2$
for x > 0. Show that the person will always prefer to take
a gamble in which she will receive a random gain of X dollars
rather than receive the amount E(X) with certainty,
where Pr(X > 0) = 1 and $E(X) < \infty$.


29. A person is given m dollars, which he must allocate
between an event A and its complement $A^c$. Suppose that
he allocates a dollars to A and m − a dollars to $A^c$. The
person’s gain is then determined as follows: If A occurs,
his gain is $g_1 a$; if $A^c$ occurs, his gain is $g_2(m - a)$. Here,
$g_1$ and $g_2$ are given positive constants. Suppose also that
Pr(A) = p and the person’s utility function is U(x) = log x
for x > 0. Determine the amount a that will maximize the
person’s expected utility, and show that this amount does
not depend on the values of $g_1$ and $g_2$.


The Normal Distributions


5.1 Introduction 


In this chapter, we shall define and discuss several special families of distributions 
that are widely used in applications of probability and statistics. The distributions that 
will be presented here include discrete and continuous distributions of univariate, bi- 
variate, and multivariate types. The discrete univariate distributions are the families 
of Bernoulli, binomial, hypergeometric, Poisson, negative binomial, and geomet- 
ric distributions. The continuous univariate distributions are the families of normal, 
lognormal, gamma, exponential, and beta distributions. Other continuous univariate 
distributions (introduced in exercises and examples) are the families of Weibull and 
Pareto distributions. Also discussed is the multinomial family of multivariate discrete 
distributions, and the bivariate normal family of bivariate continuous distributions. 

We shall briefly describe how each of these families of distributions arises in
applied problems and show why each might be an appropriate probability model 
for some experiment. For each family, we shall present the form of the p.f. or the 
p.d.f. and discuss some of the basic properties of the distributions in the family. 

The list of distributions presented in this chapter, or in this entire text for that 
matter, is not intended to be exhaustive. These distributions are known to be useful in 
a wide variety of applied problems. In many real-world problems, however, one will 
need to consider other distributions not mentioned here. The tools that we develop 
for use with these distributions can be generalized for use with other distributions. 
Our purpose in providing in-depth presentations of the most popular distributions 
here is to give the reader a feel for how to use probability to model the variation and
uncertainty in applied problems as well as some of the tools that get used during 
probability modeling. 


5.2 The Bernoulli and Binomial Distributions 


The simplest type of experiment has only two possible outcomes, call them 0 and 
1. If X equals the outcome from such an experiment, then X has the simplest 
type of nondegenerate distribution, which is a member of the family of Bernoulli 
distributions. If n independent random variables X_1, ..., X_n all have the same
Bernoulli distribution, then their sum is equal to the number of the X_i's that equal 1,
and the distribution of the sum is a member of the binomial family.


The Bernoulli Distributions 


Example 5.2.1. A Clinical Trial. The treatment given to a particular patient in a clinical trial can
either succeed or fail. Let X = 0 if the treatment fails, and let X = 1 if the treatment
succeeds. All that is needed to specify the distribution of X is the value p = Pr(X = 1)
(or, equivalently, 1 - p = Pr(X = 0)). Each different p corresponds to a different
distribution for X. The collection of all such distributions corresponding to all 0 <
p < 1 forms the family of Bernoulli distributions. ◀


An experiment of a particularly simple type is one in which there are only two 
possible outcomes, such as head or tail, success or failure, defective or nondefective, 
patient recovers or does not recover. It is convenient to designate the two possible 
outcomes of such an experiment as 0 and 1, as in Example 5.2.1. The following recap 
of Definition 3.1.5 can then be applied to every experiment of this type. 


Definition 5.2.1. Bernoulli Distribution. A random variable X has the Bernoulli distribution with
parameter p (0 < p < 1) if X can take only the values 0 and 1 and the probabilities
are

\[ \Pr(X = 1) = p \quad\text{and}\quad \Pr(X = 0) = 1 - p. \tag{5.2.1} \]

The p.f. of X can be written as follows:

\[ f(x|p) = \begin{cases} p^x (1 - p)^{1-x} & \text{for } x = 0, 1, \\ 0 & \text{otherwise.} \end{cases} \tag{5.2.2} \]


To verify that this p.f. f(x|p) actually does represent the Bernoulli distribution 
specified by the probabilities (5.2.1), it is simply necessary to note that f(1|p) = p 
and f(0|p) = 1 - p.

If X has the Bernoulli distribution with parameter p, then X^2 and X are the same
random variable. It follows that

\[ E(X) = 1 \cdot p + 0 \cdot (1 - p) = p, \]
\[ E(X^2) = E(X) = p, \]
and
\[ \operatorname{Var}(X) = E(X^2) - [E(X)]^2 = p(1 - p). \]


Furthermore, the m.g.f. of X is 


\[ \psi(t) = E(e^{tX}) = p e^t + (1 - p) \quad\text{for } -\infty < t < \infty. \]


Definition 5.2.2. Bernoulli Trials/Process. If the random variables in a finite or infinite sequence X_1,
X_2, ... are i.i.d., and if each random variable X_i has the Bernoulli distribution with
parameter p, then it is said that X_1, X_2, ... are Bernoulli trials with parameter p. An
infinite sequence of Bernoulli trials is also called a Bernoulli process.


Example 5.2.2. Tossing a Coin. Suppose that a fair coin is tossed repeatedly. Let X_i = 1 if a head is
obtained on the ith toss, and let X_i = 0 if a tail is obtained (i = 1, 2, ...). Then the
random variables X_1, X_2, ... are Bernoulli trials with parameter p = 1/2. ◀


Example 5.2.3. Defective Parts. Suppose that 10 percent of the items produced by a certain machine
are defective and the parts are independent of each other. We will sample n items at
random and inspect them. Let X_i = 1 if the ith item is defective, and let X_i = 0 if it
is nondefective (i = 1, ..., n). Then the variables X_1, ..., X_n form Bernoulli trials
with parameter p = 1/10. ◀


Example 5.2.4. Clinical Trials. In the many clinical trial examples in earlier chapters (Example 4.7.8,
for instance), the random variables X_1, X_2, ..., indicating whether each patient is a
success, were conditionally Bernoulli trials with parameter p given P = p, where P
is the unknown proportion of patients in a very large population who recover. ◀


The Binomial Distributions 


Example 5.2.5. Defective Parts. In Example 5.2.3, let X = X_1 + ... + X_10, which equals the number
of defective parts among the 10 sampled parts. What is the distribution of X? ◀


As derived after Example 3.1.9, the distribution of X in Example 5.2.5 is the 
binomial distribution with parameters 10 and 1/10. We repeat the general definition 
of binomial distributions here. 


Definition 5.2.3. Binomial Distribution. A random variable X has the binomial distribution with
parameters n and p if X has a discrete distribution for which the p.f. is as follows:

\[ f(x|n, p) = \begin{cases} \binom{n}{x} p^x (1 - p)^{n-x} & \text{for } x = 0, 1, 2, \ldots, n, \\ 0 & \text{otherwise.} \end{cases} \tag{5.2.3} \]

In this distribution, n must be a positive integer, and p must lie in the interval
0 < p < 1.


Probabilities for various binomial distributions can be obtained from the table given 
at the end of this book and from many statistical software programs. 
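
As one illustration of the software route, the following minimal Python sketch evaluates the p.f. in Eq. (5.2.3) directly. The function name binomial_pf is our own choice for illustration (it is not taken from any particular statistical package), and the parameters 10 and 1/10 are those of Example 5.2.5.

```python
from math import comb


def binomial_pf(x, n, p):
    """p.f. f(x | n, p) of the binomial distribution, as in Eq. (5.2.3)."""
    if isinstance(x, int) and 0 <= x <= n:
        return comb(n, x) * p**x * (1 - p) ** (n - x)
    return 0.0  # every other value of x has probability 0


n, p = 10, 0.1  # parameters of Example 5.2.5
probs = [binomial_pf(x, n, p) for x in range(n + 1)]
total = sum(probs)                               # should be 1
mean = sum(x * f for x, f in enumerate(probs))   # should equal n * p
print(f"sum of p.f. = {total:.6f}, mean = {mean:.6f}")
```

Summing the p.f. over x = 0, ..., n and computing the mean from it is a quick sanity check that the function matches the definition.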

The binomial distributions are of fundamental importance in probability and 
statistics because of the following result, which was derived in Sec. 3.1 and which we 
restate here in the terminology of this chapter. 


Theorem 5.2.1. If the random variables X_1, ..., X_n form Bernoulli trials with parameter p, and if
X = X_1 + ... + X_n, then X has the binomial distribution with parameters n and p. ■


When X is represented as the sum of n Bernoulli trials as in Theorem 5.2.1, the 
values of the mean, variance, and m.g.f. of X can be derived very easily. These values, 
which were already obtained in Example 4.2.5 and on pages 231 and 238, are 


\[ E(X) = \sum_{i=1}^{n} E(X_i) = np, \]
\[ \operatorname{Var}(X) = \sum_{i=1}^{n} \operatorname{Var}(X_i) = np(1 - p), \]
and
\[ \psi(t) = E(e^{tX}) = \prod_{i=1}^{n} E(e^{tX_i}) = (pe^t + 1 - p)^n. \tag{5.2.4} \]


The reader can use the m.g.f. in Eq. (5.2.4) to establish the following simple 
extension of Theorem 4.4.6. 


Theorem 5.2.2. If X_1, ..., X_k are independent random variables, and if X_i has the binomial
distribution with parameters n_i and p (i = 1, ..., k), then the sum X_1 + ... + X_k has the
binomial distribution with parameters n = n_1 + ... + n_k and p. ■


Theorem 5.2.2 also follows easily if we represent each X_i as the sum of n_i
Bernoulli trials with parameter p. If n = n_1 + ... + n_k, and if all n trials are independent,
then the sum X_1 + ... + X_k will simply be the sum of n Bernoulli trials with
parameter p. Hence, this sum must have the binomial distribution with parameters
n and p.
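
The statement of Theorem 5.2.2 can also be checked numerically for a particular case. The sketch below, which is only an illustration and not a proof, convolves the p.f.'s of two independent binomial random variables with a common p and compares the result with the binomial p.f. with parameters n_1 + n_2 and p. The helper names and the values 3, 5, and 0.25 are arbitrary choices of ours.

```python
from math import comb


def binomial_pf(x, n, p):
    """Binomial p.f.; zero outside x = 0, ..., n."""
    return comb(n, x) * p**x * (1 - p) ** (n - x) if 0 <= x <= n else 0.0


def pf_of_sum(n1, n2, p):
    """p.f. of X1 + X2 for independent binomials, by direct convolution."""
    return [
        sum(binomial_pf(j, n1, p) * binomial_pf(s - j, n2, p) for j in range(s + 1))
        for s in range(n1 + n2 + 1)
    ]


n1, n2, p = 3, 5, 0.25  # arbitrary illustrative values
conv = pf_of_sum(n1, n2, p)
direct = [binomial_pf(s, n1 + n2, p) for s in range(n1 + n2 + 1)]
max_gap = max(abs(a - b) for a, b in zip(conv, direct))
print(f"largest difference between the two p.f.'s: {max_gap:.2e}")
```

The largest difference should be at the level of floating-point rounding error.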


Example 5.2.6. Castaneda v. Partida. Courts have used the binomial distributions to calculate
probabilities of jury compositions from populations with known racial and ethnic
compositions. In the case of Castaneda v. Partida, 430 U.S. 482 (1977), a local population was
79.1 percent Mexican American. During a 2.5-year period, there were 220 persons 
called to serve on grand juries, but only 100 were Mexican Americans. The claim 
was made that this was evidence of discrimination against Mexican Americans in the 
grand jury selection process. The court did a calculation under the assumption that 
grand jurors were drawn at random and independently from the population each 
with probability 0.791 of being Mexican American. Since the claim was that 100 was 
too small a number of Mexican Americans, the court calculated the probability that a 
binomial random variable X with parameters 220 and 0.791 would be 100 or less. The
probability is very small (less than 10^{-25}). Is this evidence of discrimination against
Mexican Americans? The small probability was calculated under the assumption that
X had the binomial distribution with parameters 220 and 0.791, which means that
the court was assuming that there was no discrimination against Mexican Americans
when performing the calculation. In other words, the small probability is the conditional
probability of observing X ≤ 100 given that there is no discrimination. What
should be more interesting to the court is the reverse conditional probability, namely,
the probability that there is no discrimination given that X = 100 (or given X ≤ 100).
This sounds like a case for Bayes' theorem. After we introduce the beta distributions
in Sec. 5.8, we shall show how to use Bayes' theorem to calculate this probability
(Examples 5.8.3 and 5.8.4). ◀
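
The tail probability in this example can be reproduced with exact rational arithmetic. The following sketch assumes only the model stated in the example (a binomial distribution with parameters 220 and 0.791); the variable names are ours, and exact fractions are used merely to avoid any doubt about floating-point underflow for such a small number.

```python
from fractions import Fraction
from math import comb

n = 220
p = Fraction(791, 1000)  # probability that a randomly drawn juror is Mexican American

# Pr(X <= 100) for X binomial with parameters n and p, summed exactly.
tail = sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(101))

print(f"Pr(X <= 100) = {float(tail):.3e}")
```

As the example stresses, this number is a probability computed under the assumption of no discrimination. It is not the probability of no discrimination given the data.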


Note: Bernoulli and Binomial Distributions. Every random variable that takes only 
the two values 0 and 1 must have a Bernoulli distribution. However, not every sum 
of Bernoulli random variables has a binomial distribution. There are two conditions 
needed to apply Theorem 5.2.1. The Bernoulli random variables must be mutually 
independent, and they must all have the same parameter. If either of these conditions 
fails, the distribution of the sum will not be a binomial distribution. When the court 
did a binomial calculation in Example 5.2.6, it was defining “no discrimination” to 
mean that jurors were selected independently and with the same probability 0.791 
of being Mexican American. If the court had defined “no discrimination” some 
other way, they would have needed to do a different, presumably more complicated, 
probability calculation. 

We conclude this section with an example that shows how Bernoulli and binomial 
calculations can improve efficiency when data collection is costly. 


Example 5.2.7. Group Testing. Military and other large organizations are often faced with the need
to test large numbers of members for rare diseases. Suppose that each test requires
a small amount of blood, and it is guaranteed to detect the disease if it is anywhere
in the blood. Suppose that 1000 people need to be tested for a disease that affects
1/5 of 1 percent of all people. Let X_j = 1 if person j has the disease and X_j = 0 if
not, for j = 1, ..., 1000. We model the X_j as i.i.d. Bernoulli random variables with
parameter 0.002 for j = 1, ..., 1000. The most naive approach would be to perform
1000 tests to see who has the disease. But if the tests are costly, there may be a more 
economical way to test. For example, one could divide the 1000 people into 10 groups 
of size 100 each. For each group, take a portion of the blood sample from each of 
the 100 people in the group and combine them into one sample. Then test each of 
the 10 combined samples. If none of the 10 combined samples has the disease, then 
nobody has the disease, and we needed only 10 tests instead of 1000. If only one of 
the combined samples has the disease, then we can test those 100 people separately, 
and we needed only 110 tests. 

In general, let Z_{1,i} be the number of people in group i who have the disease for
i = 1, ..., 10. Then each Z_{1,i} has the binomial distribution with parameters 100 and
0.002. Let Y_{1,i} = 1 if Z_{1,i} > 0 and Y_{1,i} = 0 if Z_{1,i} = 0. Then each Y_{1,i} has the Bernoulli
distribution with parameter

\[ \Pr(Z_{1,i} > 0) = 1 - \Pr(Z_{1,i} = 0) = 1 - 0.998^{100} = 0.181, \]

and they are independent. Then Y_1 = \sum_{i=1}^{10} Y_{1,i} is the number of groups whose
members we have to test individually. Also, Y_1 has the binomial distribution with
parameters 10 and 0.181. The number of people that we need to test individually is 100Y_1.
The mean of 100Y_1 is 100 × 10 × 0.181 = 181. So, the expected total number of tests is
10 + 181 = 191, rather than 1000. One can compute the entire distribution of the total
number of tests, 100Y_1 + 10. The maximum number of tests needed by this group
testing procedure is 1010, which would be the case if all 10 groups had at least one
person with the disease, but this has probability 3.84 × 10^{-8}. In all other cases, group
testing requires fewer than 1000 tests.

There are multiple-stage versions of group testing in which each of the groups 
that tests positive is split further into subgroups which are each tested together. If 
each of those subgroups is sufficiently large, they can be further subdivided into 
smaller sub-subgroups, etc. Finally, only the final-stage subgroups that have a positive 
result are tested individually. This can further reduce the expected number of tests. 
For example, consider the following two-stage version of the procedure described 
earlier. We could divide each of the 10 groups of 100 people into 10 subgroups of 
10 people each. Following the above notation, let Z_{2,i,k} be the number of people in
subgroup k of group i who have the disease, for i = 1, ..., 10 and k = 1, ..., 10. Then
each Z_{2,i,k} has the binomial distribution with parameters 10 and 0.002. Let Y_{2,i,k} = 1
if Z_{2,i,k} > 0 and Y_{2,i,k} = 0 otherwise. Notice that Y_{2,i,k} = 0 for k = 1, ..., 10 for every
i such that Y_{1,i} = 0. So, we only need to test individuals in those subgroups such that
Y_{2,i,k} = 1. Each Y_{2,i,k} has the Bernoulli distribution with parameter

\[ \Pr(Z_{2,i,k} > 0) = 1 - \Pr(Z_{2,i,k} = 0) = 1 - 0.998^{10} = 0.0198, \]

and they are independent. Then Y_2 = \sum_{i=1}^{10} \sum_{k=1}^{10} Y_{2,i,k} is the number of subgroups whose
members we have to test individually. Also, Y_2 has the binomial distribution with
parameters 100 and 0.0198. The number of people that we need to test individually is
10Y_2. The mean of 10Y_2 is 10 × 100 × 0.0198 = 19.82. The number of subgroups that
we need to test in the second stage is 10Y_1, because each first-stage group that tests
positive is split into 10 subgroups; its mean is 10 × 1.81 = 18.1. So, the expected total
number of tests is 10 + 18.1 + 19.82 = 47.92, which is even smaller than the 191 for
the one-stage procedure described earlier. ◀
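
The expected numbers of tests for the two procedures can be recomputed with a few lines of Python. This is a sketch under the assumptions of the example (1000 people, independent infections with probability 0.002, a test that always detects the disease). The function names are ours, and the two-stage count assumes that every first-stage group that tests positive has all 10 of its subgroups tested. Unrounded probabilities are used, so the results can differ in the last digit from hand calculations that round intermediate values.

```python
def expected_tests_one_stage(n_groups=10, group_size=100, p=0.002):
    """Pooled test per group, then individual tests in every positive group."""
    p_group = 1 - (1 - p) ** group_size          # Pr(a group tests positive)
    return n_groups + group_size * n_groups * p_group


def expected_tests_two_stage(n_groups=10, n_sub=10, sub_size=10, p=0.002):
    """Pooled group tests, pooled subgroup tests, then individual tests."""
    group_size = n_sub * sub_size
    p_group = 1 - (1 - p) ** group_size          # Pr(a group tests positive)
    p_sub = 1 - (1 - p) ** sub_size              # Pr(a subgroup tests positive)
    stage1 = n_groups                            # one test per group
    stage2 = n_sub * n_groups * p_group          # n_sub tests per positive group
    stage3 = sub_size * n_groups * n_sub * p_sub # sub_size tests per positive subgroup
    return stage1 + stage2 + stage3


one_stage = expected_tests_one_stage()
two_stage = expected_tests_two_stage()
print(f"one-stage: {one_stage:.2f} expected tests")
print(f"two-stage: {two_stage:.2f} expected tests")
```

Changing the default group sizes in the two functions gives a quick way to explore other designs, such as the one in Exercise 16 after adding a further stage.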
Summary


A random variable X has the Bernoulli distribution with parameter p if the p.f. of X
is f(x|p) = p^x (1 - p)^{1-x} for x = 0, 1 and 0 otherwise. If X_1, ..., X_n are i.i.d. random
variables all having the Bernoulli distribution with parameter p, then we refer to
X_1, ..., X_n as Bernoulli trials, and X = \sum_{i=1}^{n} X_i has the binomial distribution with
parameters n and p. Also, X is the number of successes in the n Bernoulli trials, where
success on trial i corresponds to X_i = 1 and failure corresponds to X_i = 0.


Exercises 


1. Suppose that X is a random variable such that E(X^k) =
1/3 for k = 1, 2, .... Assuming that there cannot be more
than one distribution with this same sequence of moments 
(see Exercise 14), determine the distribution of X. 


2. Suppose that a random variable X can take only the 
two values a and b with the following probabilities: 


Pr(X = a) = p and Pr(X = b) = 1 - p.


Express the p.f. of X in a form similar to that given in 
Eq. (5.2.2). 


3. Suppose that a fair coin (probability of heads equals 
1/2) is tossed independently 10 times. Use the table of the 
binomial distribution given at the end of this book to find 
the probability that strictly more heads are obtained than 
tails. 


4. Suppose that the probability that a certain experiment 
will be successful is 0.4, and let X denote the number 
of successes that are obtained in 15 independent perfor- 
mances of the experiment. Use the table of the binomial 
distribution given at the end of this book to determine the 
value of Pr(6 < X <9). 


5. A coin for which the probability of heads is 0.6 is tossed 
nine times. Use the table of the binomial distribution given 
at the end of this book to find the probability of obtaining 
an even number of heads. 


6. Three men A, B, and C shoot at a target. Suppose that 
A shoots three times and the probability that he will hit 
the target on any given shot is 1/8, B shoots five times and 
the probability that he will hit the target on any given shot 
is 1/4, and C shoots twice and the probability that he will 
hit the target on any given shot is 1/2. What is the expected 
number of times that the target will be hit? 


7. Under the conditions of Exercise 6, assume also that all 
shots at the target are independent. What is the variance 
of the number of times that the target will be hit? 


8. A certain electronic system contains 10 components.
Suppose that the probability that each individual component
will fail is 0.2 and that the components fail independently
of each other. Given that at least one of the
components has failed, what is the probability that at least
two of the components have failed?


9. Suppose that the random variables X_1, ..., X_n form n
Bernoulli trials with parameter p. Determine the conditional
probability that X_1 = 1, given that

\[ \sum_{i=1}^{n} X_i = k \quad (k = 1, \ldots, n). \]


10. The probability that each specific child in a given fam- 
ily will inherit a certain disease is p. If it is known that at 
least one child in a family of n children has inherited the 
disease, what is the expected number of children in the 
family who have inherited the disease? 


11. For 0 < p < 1, and n = 2, 3, ..., determine the value
of

\[ \sum_{x=2}^{n} x(x - 1) \binom{n}{x} p^x (1 - p)^{n-x}. \]


12. If a random variable X has a discrete distribution 
for which the p.f. is f(x), then the value of x for which 
f(x) is maximum is called the mode of the distribution. 
If this same maximum f(x) is attained at more than one 
value of x, then all such values of x are called modes of 
the distribution. Find the mode or modes of the binomial 
distribution with parameters n and p. Hint: Study the ratio 


f(x + 1|n, p)/f(x|n, p).


13. In a clinical trial with two treatment groups, the probability
of success in one treatment group is 0.5, and the
probability of success in the other is 0.6. Suppose that 
there are five patients in each group. Assume that the 
outcomes of all patients are independent. Calculate the 
probability that the first group will have at least as many 
successes as the second group. 


14. In Exercise 1, we assumed that there could be at
most one distribution with moments E(X^k) = 1/3 for
k = 1, 2, .... In this exercise, we shall prove that there
can be only one such distribution. Prove the following
facts and show that they imply that at most one distribution
has the given moments.
a. Pr(|X| ≤ 1) = 1. (If not, show that lim_{k→∞} E(X^{2k}) = ∞.)
b. Pr(X^2 ∈ {0, 1}) = 1. (If not, prove that E(X^4) < E(X^2).)
c. Pr(X = -1) = 0. (If not, prove that E(X) < E(X^2).)
15. In Example 5.2.7, suppose that we use the two-stage
version described at the end of the example. What is the
maximum number of tests that could possibly be needed
by this version? What is the probability that the maximum
number of tests would be required?


16. For the 1000 people in Example 5.2.7, suppose that 
we use the following three-stage group testing procedure. 
First, divide the 1000 people into five groups of size 200 
each. For each group that tests positive, further divide it 
into five subgroups of size 40 each. For each subgroup that 
tests positive, further divide it into five sub-subgroups of 
size 8 each. For each sub-subgroup that tests positive, test 
all eight people. Find the expected number and maximum 
number of tests. 


5.3 The Hypergeometric Distributions 


In this section, we consider dependent Bernoulli random variables. A common 
source of dependent Bernoulli random variables is sampling without replacement 
from a finite population. Suppose that a finite population consists of a known 
number of successes and failures. If we sample a fixed number of units from that 
population, the number of successes in our sample will have a distribution that is 
a member of the family of hypergeometric distributions. 


Definition and Examples 


Example 5.3.1. Sampling without Replacement. Suppose that a box contains A red balls and B blue
balls. Suppose also that n ≥ 0 balls are selected at random from the box without
replacement, and let X denote the number of red balls that are obtained. Clearly,
we must have n ≤ A + B or we would run out of balls. Also, if n = 0, then X = 0
because there are no balls, red or blue, drawn. For cases with n ≥ 1, we can let
X_i = 1 if the ith ball drawn is red and X_i = 0 if not. Then each X_i has a Bernoulli
distribution, but X_1, ..., X_n are not independent in general. To see this, assume
that both A > 0 and B > 0 as well as n ≥ 2. We will now show that Pr(X_2 = 1|X_1 =
0) ≠ Pr(X_2 = 1|X_1 = 1). If X_1 = 1, then when the second ball is drawn there are
only A - 1 red balls remaining out of a total of A + B - 1 available balls. Hence,
Pr(X_2 = 1|X_1 = 1) = (A - 1)/(A + B - 1). By the same reasoning,

\[ \Pr(X_2 = 1 | X_1 = 0) = \frac{A}{A + B - 1} > \frac{A - 1}{A + B - 1}. \]

Hence, X_2 is not independent of X_1, and we should not expect X to have a binomial
distribution. ◀


The problem described in Example 5.3.1 is a template for all cases of sampling 
without replacement from a finite population with only two types of objects. Any- 
thing that we learn about the random variable X in Example 5.3.1 will apply to every 
case of sampling without replacement from finite populations with only two types of 
objects. First, we derive the distribution of X. 
Theorem 5.3.1. Probability Function. The distribution of X in Example 5.3.1 has the p.f.

\[ f(x|A, B, n) = \frac{\binom{A}{x} \binom{B}{n-x}}{\binom{A+B}{n}} \tag{5.3.1} \]

for

\[ \max\{0, n - B\} \le x \le \min\{n, A\}, \tag{5.3.2} \]

and f(x|A, B, n) = 0 otherwise.


Proof Clearly, the value of X can neither exceed n nor exceed A. Therefore, it must
be true that X ≤ min{n, A}. Similarly, because the number of blue balls n - X that
are drawn cannot exceed B, the value of X must be at least n - B. Because the value
of X cannot be less than 0, it must be true that X ≥ max{0, n - B}. Hence, the value
of X must be an integer in the interval in (5.3.2).

We shall now find the p.f. of X using combinatorial arguments from Sec. 1.8. The
degenerate cases, those with A, B, and/or n equal to 0, are easy to prove because
\binom{k}{0} = 1 for all nonnegative k, including k = 0. For the cases in which all of A, B, and n
are strictly positive, there are \binom{A+B}{n} ways to choose n balls out of the A + B available
balls, and all of these choices are equally likely. For each integer x in the interval
(5.3.2), there are \binom{A}{x} ways to choose x red balls, and for each such choice there are
\binom{B}{n-x} ways to choose n - x blue balls. Hence, the probability of obtaining exactly x
red balls out of n is given by Eq. (5.3.1). Furthermore, f(x|A, B, n) must be 0 for all
other values of x, because all other values are impossible. ■


Definition 5.3.1. Hypergeometric Distribution. Let A, B, and n be nonnegative integers with n ≤ A + B.
If a random variable X has a discrete distribution with p.f. as in Eqs. (5.3.1) and
(5.3.2), then it is said that X has the hypergeometric distribution with parameters A,
B, and n.


Example 5.3.2. Sampling without Replacement from an Observed Data Set. Consider the patients in the
clinical trial whose results are tabulated in Table 2.1. We might need to reexamine a 
subset of the patients in the placebo group. Suppose that we need to sample 11 distinct 
patients from the 34 patients in that group. What is the distribution of the number of 
successes (no relapse) that we obtain in the subsample? Let X stand for the number 
of successes in the subsample. Table 2.1 indicates that there are 10 successes and 
24 failures in the placebo group. According to the definition of the hypergeometric 
distribution, X has the hypergeometric distribution with parameters A = 10, B = 24, 
and n = 11. In particular, the possible values of X are the integers from 0 to 10. Even 
though we sample 11 patients, we cannot observe 11 successes, since only 10 successes 
are available. ◀
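
The p.f. in this example is easy to tabulate. The sketch below is a minimal illustration, with a function name of our own choosing, that evaluates Eq. (5.3.1) on the interval (5.3.2) for A = 10, B = 24, and n = 11 and returns 0 elsewhere.

```python
from math import comb


def hypergeometric_pf(x, A, B, n):
    """p.f. f(x | A, B, n) of Eq. (5.3.1); zero outside the interval (5.3.2)."""
    if max(0, n - B) <= x <= min(n, A):
        return comb(A, x) * comb(B, n - x) / comb(A + B, n)
    return 0.0


A, B, n = 10, 24, 11  # successes, failures, and subsample size in the example
probs = {x: hypergeometric_pf(x, A, B, n) for x in range(n + 1)}
for x, f in probs.items():
    print(f"Pr(X = {x:2d}) = {f:.6f}")
```

The entry for x = 11 is 0, in agreement with the remark that 11 successes cannot be observed when only 10 are available.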


The Mean and Variance for a Hypergeometric Distribution 


Theorem 5.3.2. Mean and Variance. Let X have a hypergeometric distribution with strictly positive
parameters A, B, and n. Then

\[ E(X) = \frac{nA}{A + B}, \tag{5.3.3} \]
\[ \operatorname{Var}(X) = \frac{nAB}{(A + B)^2} \cdot \frac{A + B - n}{A + B - 1}. \tag{5.3.4} \]
Proof Assume that X is as defined in Example 5.3.1, the number of red balls drawn
when n balls are selected at random without replacement from a box containing A
red balls and B blue balls. For i = 1, ..., n, let X_i = 1 if the ith ball that is selected
is red, and let X_i = 0 if the ith ball is blue. As explained in Example 4.2.4, we can
imagine that the n balls are selected from the box by first arranging all the balls in the
box in some random order and then selecting the first n balls from this arrangement.
It can be seen from this interpretation that, for i = 1, ..., n,

\[ \Pr(X_i = 1) = \frac{A}{A + B} \quad\text{and}\quad \Pr(X_i = 0) = \frac{B}{A + B}. \]

Therefore, for i = 1, ..., n,

\[ E(X_i) = \frac{A}{A + B} \quad\text{and}\quad \operatorname{Var}(X_i) = \frac{AB}{(A + B)^2}. \tag{5.3.5} \]

Since X = X_1 + ... + X_n, the mean of X is the sum of the means of the X_i's, namely,
Eq. (5.3.3).

Next, use Theorem 4.6.7 to write

\[ \operatorname{Var}(X) = \sum_{i=1}^{n} \operatorname{Var}(X_i) + 2 \mathop{\sum\sum}_{i<j} \operatorname{Cov}(X_i, X_j). \tag{5.3.6} \]

Because of the symmetry among the random variables X_1, ..., X_n, every term
Cov(X_i, X_j) in the final summation in Eq. (5.3.6) will have the same value as
Cov(X_1, X_2). Since there are \binom{n}{2} terms in this summation, it follows from Eqs. (5.3.5)
and (5.3.6) that

\[ \operatorname{Var}(X) = \frac{nAB}{(A + B)^2} + n(n - 1) \operatorname{Cov}(X_1, X_2). \tag{5.3.7} \]

We could compute Cov(X_1, X_2) directly, but it is simpler to argue as follows. If
n = A + B, then Pr(X = A) = 1 because all the balls in the box will be selected without
replacement. Thus, for n = A + B, X is a constant random variable and Var(X) = 0.
Setting Eq. (5.3.7) to 0 and solving for Cov(X_1, X_2) gives

\[ \operatorname{Cov}(X_1, X_2) = -\frac{AB}{(A + B)^2 (A + B - 1)}. \]

Plugging this value back into Eq. (5.3.7) gives Eq. (5.3.4). ■
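
The formulas for the mean and variance can be checked for particular parameter values by computing the moments directly from the p.f. The sketch below does this with exact rational arithmetic. The function name and the values A = 10, B = 24, n = 11 (those of Example 5.3.2) are illustrative choices of ours, and agreement in one case is a sanity check rather than a proof.

```python
from fractions import Fraction
from math import comb


def moments(A, B, n):
    """Exact mean and variance of the hypergeometric distribution by enumeration."""
    total = comb(A + B, n)
    pf = {
        x: Fraction(comb(A, x) * comb(B, n - x), total)
        for x in range(max(0, n - B), min(n, A) + 1)
    }
    mean = sum(x * f for x, f in pf.items())
    var = sum(x * x * f for x, f in pf.items()) - mean**2
    return mean, var


A, B, n = 10, 24, 11
T = A + B
mean, var = moments(A, B, n)
formula_mean = Fraction(n * A, T)                                # Eq. (5.3.3)
formula_var = Fraction(n * A * B, T**2) * Fraction(T - n, T - 1) # Eq. (5.3.4)
print(mean == formula_mean, var == formula_var)
```

Because the arithmetic is exact, the two comparisons test equality of fractions rather than closeness of floating-point numbers.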


Comparison of Sampling Methods 


If we had sampled with replacement in Example 5.3.1, the number of red balls would 
have the binomial distribution with parameters n and A/(A + B). In that case, the 
mean number of red balls would still be nA/(A + B), but the variance would be 
different. To see how the variances from sampling with and without replacement are 
related, let T = A + B denote the total number of balls in the box, and let p = A/T 
denote the proportion of red balls in the box. Then Eq. (5.3.4) can be rewritten as
follows:

\[ \operatorname{Var}(X) = np(1 - p) \frac{T - n}{T - 1}. \tag{5.3.8} \]

The variance np(1 - p) of the binomial distribution is the variance of the number
of red balls when sampling with replacement. The factor α = (T - n)/(T - 1) in
Eq. (5.3.8) therefore represents the reduction in Var(X) caused by sampling without
replacement from a finite population. This α is called the finite population correction
in the theory of sampling from finite populations without replacement.

If n = 1, the value of this factor α is 1, because there is no distinction between
sampling with replacement and sampling without replacement when only one ball is
being selected. If n = T, then (as previously mentioned) α = 0 and Var(X) = 0. For
values of n between 1 and T, the value of α will be between 0 and 1.

For each fixed sample size n, it can be seen that α → 1 as T → ∞. This limit
reflects the fact that when the population size T is very large compared to the sample
size n, there is very little difference between sampling with replacement and sampling
without replacement. Theorem 5.3.4 expresses this idea more formally. The proof
relies on the following result which gets used several times in this text.


Theorem 5.3.3. Let a_n and c_n be sequences of real numbers such that a_n converges to 0, and c_n a_n^2
converges to 0. Then

\[ \lim_{n \to \infty} (1 + a_n)^{c_n} e^{-a_n c_n} = 1. \]

In particular, if a_n c_n converges to b, then (1 + a_n)^{c_n} converges to e^b. ■
The proof of Theorem 5.3.3 is left to the reader in Exercise 11.


Theorem 5.3.4. Closeness of Binomial and Hypergeometric Distributions. Let 0 < p < 1, and let n be
a positive integer. Let Y have the binomial distribution with parameters n and p.
For each positive integer T, let A_T and B_T be integers such that lim_{T→∞} A_T = ∞,
lim_{T→∞} B_T = ∞, and lim_{T→∞} A_T/(A_T + B_T) = p. Let X_T have the hypergeometric
distribution with parameters A_T, B_T, and n. For each fixed n and each x = 0, ..., n,

\[ \lim_{T \to \infty} \frac{\Pr(Y = x)}{\Pr(X_T = x)} = 1. \tag{5.3.9} \]

Proof Once A_T and B_T are both larger than n, the formula in (5.3.1) is Pr(X_T = x)
for all x = 0, ..., n. So, for large T, we have

\[ \Pr(X_T = x) = \binom{n}{x} \frac{A_T!\, B_T!\, (A_T + B_T - n)!}{(A_T - x)!\, (B_T - n + x)!\, (A_T + B_T)!}. \]

Apply Stirling's formula (Theorem 1.7.5) to each of the six factorials in the second
factor above. A little manipulation gives that

\[ \lim_{T \to \infty} \frac{\binom{n}{x} A_T^{A_T + 1/2} B_T^{B_T + 1/2} (A_T + B_T - n)^{A_T + B_T - n + 1/2}}{\Pr(X_T = x) (A_T - x)^{A_T - x + 1/2} (B_T - n + x)^{B_T - n + x + 1/2} (A_T + B_T)^{A_T + B_T + 1/2}} \tag{5.3.10} \]

equals 1. Each of the following limits follows from Theorem 5.3.3:

\[ \lim_{T \to \infty} \left( \frac{A_T}{A_T - x} \right)^{A_T - x + 1/2} = e^{x}, \]
\[ \lim_{T \to \infty} \left( \frac{B_T}{B_T - n + x} \right)^{B_T - n + x + 1/2} = e^{n - x}, \]
\[ \lim_{T \to \infty} \left( \frac{A_T + B_T - n}{A_T + B_T} \right)^{A_T + B_T - n + 1/2} = e^{-n}. \]

Inserting these limits in (5.3.10) yields

\[ \lim_{T \to \infty} \frac{\binom{n}{x} A_T^{x} B_T^{n - x}}{\Pr(X_T = x) (A_T + B_T)^{n}} = 1. \tag{5.3.11} \]

Since A_T/(A_T + B_T) converges to p, we have

\[ \lim_{T \to \infty} \frac{A_T^{x} B_T^{n - x}}{(A_T + B_T)^{n}} = p^x (1 - p)^{n - x}. \tag{5.3.12} \]

Together, (5.3.11) and (5.3.12) imply that

\[ \lim_{T \to \infty} \frac{\binom{n}{x} p^x (1 - p)^{n - x}}{\Pr(X_T = x)} = 1. \]

The numerator of this last expression is Pr(Y = x); hence, (5.3.9) holds. ■


In words, Theorem 5.3.4 says that if the sample size n represents a negligible fraction 
of the total population A + B, then the hypergeometric distribution with parameters 
A, B, and n will be very nearly the same as the binomial distribution with parameters 
n and p = A/(A + B).
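
The convergence in Theorem 5.3.4 can be observed numerically. In the sketch below, the choices n = 5, p = 0.3, x = 2, and the population sizes T are arbitrary illustrations of ours, with A_T = pT and B_T = T - A_T so that A_T/(A_T + B_T) = p for each T used. The ratio Pr(Y = x)/Pr(X_T = x) should approach 1 as T grows.

```python
from math import comb


def hypergeometric_pf(x, A, B, n):
    """Hypergeometric p.f. of Eq. (5.3.1); zero outside the interval (5.3.2)."""
    if max(0, n - B) <= x <= min(n, A):
        return comb(A, x) * comb(B, n - x) / comb(A + B, n)
    return 0.0


def binomial_pf(x, n, p):
    """Binomial p.f. for x = 0, ..., n."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p, x = 5, 0.3, 2
ratios = []
for T in (10, 100, 1000, 10000, 100000):
    A = round(p * T)  # number of successes in a population of size T
    B = T - A         # number of failures
    ratios.append(binomial_pf(x, n, p) / hypergeometric_pf(x, A, B, n))

for T, r in zip((10, 100, 1000, 10000, 100000), ratios):
    print(f"T = {T:6d}: Pr(Y = x) / Pr(X_T = x) = {r:.6f}")
```

For the smallest population the two probabilities differ noticeably, while for the largest the ratio is very close to 1.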


Example 5.3.3. Population of Unknown Composition. The hypergeometric distribution can arise as a
conditional distribution when sampling is done without replacement from a finite 
population of unknown composition. The simplest example would be to modify 
Example 5.3.1 so that we still know the value of T= A+ B but no longer know 
A and B. That is, we know how many balls are in the box, but we don’t know how 
many are red or blue. This makes P = A/T, the proportion of red balls, unknown. 
Let h(p) be the p.f. of P. Here P is a random variable whose possible values are 
0,1/T,...,(T —1)/T, 1. Conditional on P = p, we can behave as if we know that 
A= pT and B= (1-— p)T, and then the conditional distribution of X (the number 
of red balls in a sample of size n) is the hypergeometric distribution with parameters 
pT,(1— p)T, and n. 

Suppose now that T is so large that the difference is essentially negligible be- 
tween this hypergeometric distribution and the binomial distribution with parame- 
ters n and p. In this case, it is no longer necessary that we assume that T is known. 
This is the situation that we had in mind (in Examples 3.4.10 and 3.6.7, as well as 
their many variations and other examples) when we referred to P as the proportion 
of successes among all patients who might receive a treatment or the proportion of 
defectives among all parts produced by a machine. We think of T as essentially infi- 
nite so that conditional on the proportion A/T, which we call P, the individual draws 
become independent Bernoulli trials. If either A or T (or both) is unknown, it makes 
sense that P = A/T will be unknown. In the augmented experiment described on 
page 61, in which P can be computed from the experimental outcome, we have that 
P is a random variable. ◀


Note: Essentially Infinite Populations. The case in which T is essentially infinite 
in Example 5.3.3 is the motivation for using the binomial distributions as models 
for numbers of successes in samples from very large finite populations. Look at 
Example 5.2.6, for instance. The number of Mexican Americans available to be 
sampled for grand jury duty is finite, but it is huge relative to the number (220) of 
grand jurors selected during the 2.5-year period. Technically, it is impossible that the 
individual grand jurors are selected independently, but the difference is too small for 
even the best defense attorney to make anything out of it. In the future, we will often 
model Bernoulli random variables as independent when we imagine selecting them 
at random without replacement from a huge finite population. We shall be relying
on Theorem 5.3.4 in these cases without explicitly saying so. 


Extending the Definition of Binomial Coefficients 


There is an extension of the definition of a binomial coefficient given in Sec. 1.8 
that allows a simplification of the expression for the p.f. of the hypergeometric 
distribution. For all positive integers r and m, where r < m, the binomial coefficient 
("") was defined to be 


r 


m m! 
("") ~ Fin — nt (5.3.13) 


It can be seen that the value of (’”) specified by Eq. (5.3.13) can also be written 
in the form 


(5.3.14) 


(")-mendeenr) 


r r! 


For every real number m that is not necessarily a positive integer and every
positive integer r, the value of the right side of Eq. (5.3.14) is a well-defined number.
Therefore, for every real number m and every positive integer r, we can extend
the definition of the binomial coefficient $\binom{m}{r}$ by defining its value as that given by
Eq. (5.3.14).

The value of the binomial coefficient $\binom{m}{r}$ can be obtained from this definition
for all positive integers r and m. If r ≤ m, the value of $\binom{m}{r}$ is given by Eq. (5.3.13). If
r > m, one of the factors in the numerator of (5.3.14) will be 0 and $\binom{m}{r} = 0$. Finally,
for every real number m, we shall define the value of $\binom{m}{0}$ to be $\binom{m}{0} = 1$.

When this extended definition of a binomial coefficient is used, it can be seen
that the value of $\binom{A}{x}\binom{B}{n-x}$ is 0 for every integer x such that either x > A or n − x > B.
Therefore, we can write the p.f. of the hypergeometric distribution with parameters
A, B, and n as follows:

$$f(x|A, B, n) = \begin{cases} \dfrac{\binom{A}{x}\binom{B}{n-x}}{\binom{A+B}{n}} & \text{for } x = 0, 1, \ldots, n, \\ 0 & \text{otherwise.} \end{cases} \tag{5.3.15}$$

It then follows from Eq. (5.3.14) that f(x|A, B, n) > 0 if and only if x is an integer in
the interval (5.3.2).
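
The extended definition can be checked numerically. The following Python sketch is only an illustration: the helper names `ext_binom` and `hypergeom_pf` are not from the text, and the use of exact rational arithmetic is a choice made here for convenience. It evaluates Eq. (5.3.14) factor by factor and then uses it in the form of Eq. (5.3.15).

```python
from fractions import Fraction


def ext_binom(m, r):
    """Extended binomial coefficient of Eq. (5.3.14): m real, r a nonnegative integer."""
    if r < 0 or r != int(r):
        raise ValueError("r must be a nonnegative integer")
    value = Fraction(1)            # the empty product gives binom(m, 0) = 1
    for j in range(int(r)):
        value *= Fraction(m) - j   # numerator factors m, m - 1, ..., m - r + 1
        value /= j + 1             # divide by r! one factor at a time
    return value


def hypergeom_pf(x, A, B, n):
    """Hypergeometric p.f. written as in Eq. (5.3.15)."""
    if x != int(x) or x < 0 or x > n:
        return Fraction(0)
    return ext_binom(A, x) * ext_binom(B, n - x) / ext_binom(A + B, n)


print(ext_binom(5, 2))                # ordinary case with r <= m
print(ext_binom(2, 5))                # r > m: a zero factor appears in the numerator
print(ext_binom(Fraction(1, 2), 2))   # a non-integer value of m
print([hypergeom_pf(x, 5, 10, 7) for x in range(8)])
```

With A = 5, B = 10, and n = 7, the last line returns 0 automatically for x = 6 and x = 7, which is the point of the extension: no separate statement of the support is needed.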


Summary 


We introduced the family of hypergeometric distributions. Suppose that n units are 
drawn at random without replacement from a finite population consisting of T units 
of which A are successes and B = T — A are failures. Let X stand for the number of 
successes in the sample. Then the distribution of X is the hypergeometric distribution 
with parameters A, B, and n. We saw that the distinction between sampling from 
a finite population with and without replacement is negligible when the size of the 
population is huge relative to the size of the sample. We also generalized the binomial
coefficient notation so that $\binom{m}{r}$ is defined for all real numbers m and all positive
integers r.


Exercises 


1. In Example 5.3.2, compute the probability that all 10 
success patients appear in the subsample of size 11 from 
the Placebo group. 


2. Suppose that a box contains five red balls and ten blue 
balls. If seven balls are selected at random without re- 
placement, what is the probability that at least three red 
balls will be obtained? 


3. Suppose that seven balls are selected at random with- 
out replacement from a box containing five red balls and 
ten blue balls. If X denotes the proportion of red balls in 
the sample, what are the mean and the variance of X? 


4. If a random variable X has the hypergeometric distri- 
bution with parameters A = 8, B = 20, and n, for what 
value of n will Var(X) be a maximum? 


5. Suppose that n students are selected at random without 
replacement from a class containing T students, of whom 
A are boys and T − A are girls. Let X denote the number of
boys that are obtained. For what sample size n will Var(X) 
be a maximum? 


6. Suppose that $X_1$ and $X_2$ are independent random
variables, that $X_1$ has the binomial distribution with
parameters $n_1$ and p, and that $X_2$ has the binomial
distribution with parameters $n_2$ and p, where p is the
same for both $X_1$ and $X_2$. For each fixed value of k
($k = 1, 2, \ldots, n_1 + n_2$), prove that the conditional
distribution of $X_1$ given that $X_1 + X_2 = k$ is
hypergeometric with parameters $n_1$, $n_2$, and k.


7. Suppose that in a large lot containing T manufactured 
items, 30 percent of the items are defective and 70 per- 
cent are nondefective. Also, suppose that ten items are 
selected at random without replacement from the lot. De- 
termine (a) an exact expression for the probability that not 
more than one defective item will be obtained and (b) an 
approximate expression for this probability based on the 
binomial distribution. 


8. Consider a group of T persons, and let $a_1, \ldots, a_T$
denote the heights of these T persons. Suppose that n
persons are selected from this group at random without
replacement, and let X denote the sum of the heights of
these n persons. Determine the mean and variance of X.


9. Find the value of Co.


10. Show that for all positive integers n and k,

$$\binom{-n}{k} = (-1)^k \binom{n+k-1}{k}.$$

11. Prove Theorem 5.3.3. Hint: Prove that

$$\lim_{n \to \infty} c_n \log(1 + a_n) - a_n c_n = 0$$

by applying Taylor’s theorem with remainder (see Exercise 13
in Sec. 4.2) to the function f(x) = log(1 + x) around
x = 0.


5.4 The Poisson Distributions 


Many experiments consist of observing the occurrence times of random arrivals. 
Examples include arrivals of customers for service, arrivals of calls at a switch- 
board, occurrences of floods and other natural and man-made disasters, and so 
forth. The family of Poisson distributions is used to model the number of such 
arrivals that occur in a fixed time period. Poisson distributions are also useful 
approximations to binomial distributions with very small success probabilities. 


Definition and Properties of the Poisson Distributions 


Example 
5.4.1 


Customer Arrivals. A store owner believes that customers arrive at his store at a rate
of 4.5 customers per hour on average. He wants to find the distribution of the actual
number X of customers who will arrive during a particular one-hour period later in
the day. He models customer arrivals in different time periods as independent of each
other. As a first approximation, he divides the one-hour period into 3600 seconds and
thinks of the arrival rate as being 4.5/3600 = 0.00125 per second. He then says that
during each second either 0 or 1 customers will arrive, and the probability of an arrival
during any single second is 0.00125. He then tries to use the binomial distribution with
parameters n = 3600 and p = 0.00125 for the distribution of the number of customers
who arrive during the one-hour period later in the day.

He starts calculating f, the p.f. of this binomial distribution, and quickly discovers
how cumbersome the calculations are. However, he realizes that the successive values
of f(x) are closely related to each other because f(x) changes in a systematic way
as x increases. So he computes

$$\frac{f(x+1)}{f(x)} = \frac{\binom{n}{x+1} p^{x+1} (1-p)^{n-x-1}}{\binom{n}{x} p^{x} (1-p)^{n-x}} = \frac{(n-x)p}{(x+1)(1-p)} \approx \frac{np}{x+1},$$

where the reasoning for the approximation at the end is as follows: For the first 30
or so values of x, n − x is essentially the same as n and dividing by 1 − p has almost
no effect because p is so small. For example, for x = 30, the actual value is 0.1441,
while the approximation is 0.1452. This approximation suggests defining $\lambda = np$ and
approximating $f(x+1) \approx f(x)\lambda/(x+1)$ for all the values of x that matter. That is,

$$f(1) = f(0)\lambda,$$

$$f(2) = f(1)\frac{\lambda}{2} = f(0)\frac{\lambda^2}{2},$$

$$f(3) = f(2)\frac{\lambda}{3} = f(0)\frac{\lambda^3}{6}.$$

Continuing the pattern for all x yields $f(x) = f(0)\lambda^x/x!$ for all x. To obtain a p.f. for
X, he would need to make sure that $\sum_{x=0}^{\infty} f(x) = 1$. This is easily achieved by setting

$$f(0) = \frac{1}{\sum_{x=0}^{\infty} \lambda^x/x!} = e^{-\lambda},$$

where the last equality follows from the following well-known calculus result:

$$e^{\lambda} = \sum_{x=0}^{\infty} \frac{\lambda^x}{x!}, \tag{5.4.1}$$

for all $\lambda > 0$. Hence, $f(x) = e^{-\lambda}\lambda^x/x!$ for x = 0, 1, ... and f(x) = 0 otherwise is a p.f.
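
The store owner's reasoning can be retraced numerically. The Python sketch below is an illustration only; the names `binom_pf`, `exact_ratio`, and `approx_ratio` are introduced here and are not from the text. It compares the exact ratio f(x + 1)/f(x) with np/(x + 1) at x = 30 and then builds the approximating p.f. through the recursion, stopping at x = 60 as an arbitrary truncation point.

```python
import math

n, p = 3600, 0.00125
lam = n * p  # 4.5, the hourly arrival rate


def binom_pf(x, n, p):
    """Binomial p.f. evaluated directly from its definition."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


# Exact ratio f(x+1)/f(x) and its approximation np/(x+1) at x = 30.
x = 30
exact_ratio = (n - x) * p / ((x + 1) * (1 - p))
approx_ratio = lam / (x + 1)
print(round(exact_ratio, 4), round(approx_ratio, 4))  # 0.1441 0.1452

# The same ratio computed from the binomial p.f. itself.
print(binom_pf(x + 1, n, p) / binom_pf(x, n, p))

# Build f(x) = f(0) * lam**x / x! through f(x+1) = f(x) * lam / (x+1).
f = [math.exp(-lam)]
for k in range(60):
    f.append(f[-1] * lam / (k + 1))
print(sum(f))  # very close to 1; terms beyond x = 60 are negligible
```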




The approximation formula for the p.f. of a binomial distribution at the end 
of Example 5.4.1 is actually a useful p.f. that can model many phenomena of types 
similar to the arrivals of customers. 


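Definition
5.4.1


Poisson Distribution. Let $\lambda > 0$. A random variable X has the Poisson distribution
with mean $\lambda$ if the p.f. of X is as follows:

$$f(x|\lambda) = \begin{cases} \dfrac{e^{-\lambda}\lambda^x}{x!} & \text{for } x = 0, 1, 2, \ldots, \\ 0 & \text{otherwise.} \end{cases} \tag{5.4.2}$$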


At the end of Example 5.4.1, we proved that the function in Eq. (5.4.2) is indeed
a p.f. In order to justify the phrase “with mean $\lambda$” in the definition of the distribution,
we need to prove that the mean is indeed $\lambda$.


Theorem
5.4.1


Mean. The mean of the distribution with p.f. equal to (5.4.2) is $\lambda$.


Proof If X has the distribution with p.f. $f(x|\lambda)$, then E(X) is given by the following
infinite series:

$$E(X) = \sum_{x=0}^{\infty} x f(x|\lambda).$$

Since the term corresponding to x = 0 in this series is 0, we can omit this term and
can begin the summation with the term for x = 1. Therefore,

$$E(X) = \sum_{x=1}^{\infty} x f(x|\lambda) = \sum_{x=1}^{\infty} x \frac{e^{-\lambda}\lambda^{x}}{x!} = \lambda \sum_{x=1}^{\infty} \frac{e^{-\lambda}\lambda^{x-1}}{(x-1)!}.$$

If we now let y = x − 1 in this summation, we obtain

$$E(X) = \lambda \sum_{y=0}^{\infty} \frac{e^{-\lambda}\lambda^{y}}{y!}.$$

The sum of the series in this equation is the sum of $f(y|\lambda)$, which equals 1. Hence,
$E(X) = \lambda$.


Example
5.4.2


Customer Arrivals. In Example 5.4.1, the store owner was approximating the binomial
distribution with parameters 3600 and 0.00125 with a distribution that we now know
as the Poisson distribution with mean $\lambda = 3600 \times 0.00125 = 4.5$. For x = 0, ..., 9,
Table 5.1 has the binomial and corresponding Poisson probabilities.

The division of the one-hour period into 3600 seconds was somewhat arbitrary. 
The owner could have divided the hour into 7200 half-seconds or 14400 quarter- 
seconds, etc. Regardless of how finely the time is divided, the product of the number 
of time intervals and the rate in customers per time interval will always be 4.5 because 
they are all based on a rate of 4.5 customers per hour. Perhaps the store owner would 
do better simply modeling the number X of arrivals as a Poisson random variable with 
mean 4.5, rather than choosing an arbitrarily sized time interval to accommodate a 
tedious binomial calculation. The disadvantage to the Poisson model for X is that 
there is positive probability that a Poisson random variable will be arbitrarily large, 
whereas a binomial random variable with parameters n and p can never exceed n. 
However, the probability is essentially 0 that a Poisson random variable with mean 
4.5 will exceed 19.
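
The comparison reported in Table 5.1 can be recomputed directly. The Python sketch below is illustrative only (the function names are introduced here, not taken from the text). It prints the binomial and Poisson probabilities for x = 0, ..., 9 to five decimals and then evaluates the probability that a Poisson random variable with mean 4.5 exceeds 19; small last-digit differences from a printed table are possible because of rounding.

```python
import math

n, p = 3600, 0.00125
lam = n * p


def binom_pf(x):
    """Binomial p.f. with parameters n = 3600 and p = 0.00125."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pf(x):
    """Poisson p.f. with mean lam = 4.5."""
    return math.exp(-lam) * lam**x / math.factorial(x)


for x in range(10):
    print(x, f"{binom_pf(x):.5f}", f"{poisson_pf(x):.5f}")

# Probability that a Poisson random variable with mean 4.5 exceeds 19.
tail = 1 - sum(poisson_pf(x) for x in range(20))
print(tail)
```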


Table 5.1 Binomial and Poisson probabilities in Example 5.4.2


x            0        1        2        3        4
Binomial  0.01108  0.04991  0.11241  0.16874  0.18991
Poisson   0.01111  0.04999  0.11248  0.16872  0.18981

x            5        6        7        8        9
Binomial  0.17094  0.12819  0.08237  0.04630  0.02313
Poisson   0.17083  0.12812  0.08237  0.04633  0.02317


Theorem
5.4.2


Variance. The variance of the Poisson distribution with mean $\lambda$ is also $\lambda$.
Proof The variance can be found by a technique similar to the one used in the
proof of Theorem 5.4.1 to find the mean. We begin by considering the following
expectation:

$$E[X(X-1)] = \sum_{x=0}^{\infty} x(x-1) f(x|\lambda) = \sum_{x=2}^{\infty} x(x-1) f(x|\lambda)$$

$$= \sum_{x=2}^{\infty} x(x-1) \frac{e^{-\lambda}\lambda^{x}}{x!} = \lambda^2 \sum_{x=2}^{\infty} \frac{e^{-\lambda}\lambda^{x-2}}{(x-2)!}.$$

If we let y = x − 2, we obtain

$$E[X(X-1)] = \lambda^2 \sum_{y=0}^{\infty} \frac{e^{-\lambda}\lambda^{y}}{y!} = \lambda^2. \tag{5.4.3}$$

Since $E[X(X-1)] = E(X^2) - E(X) = E(X^2) - \lambda$, it follows from Eq. (5.4.3) that
$E(X^2) = \lambda^2 + \lambda$. Therefore,

$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \lambda. \tag{5.4.4}$$

Hence, the variance is also equal to $\lambda$.
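
Theorem
5.4.3


Moment Generating Function. The m.g.f. of the Poisson distribution with mean $\lambda$ is

$$\psi(t) = e^{\lambda(e^t - 1)}, \tag{5.4.5}$$

for all real t.

Proof For every value of t ($-\infty < t < \infty$),

$$\psi(t) = E(e^{tX}) = \sum_{x=0}^{\infty} \frac{e^{tx} e^{-\lambda}\lambda^{x}}{x!} = e^{-\lambda} \sum_{x=0}^{\infty} \frac{(\lambda e^{t})^{x}}{x!}.$$

It follows from Eq. (5.4.1) that, for $-\infty < t < \infty$,

$$\psi(t) = e^{-\lambda} e^{\lambda e^{t}} = e^{\lambda(e^t - 1)}.$$
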
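
As a numerical sanity check on Theorems 5.4.1 through 5.4.3, the mean, the variance, and the m.g.f. can be approximated by truncated sums. The Python sketch below is only an illustration: the value lam = 4.5, the point t = 0.3, and the truncation point N = 200 are arbitrary choices made here, not values prescribed by the text.

```python
import math

lam = 4.5   # example value of the mean; any positive number works
t = 0.3     # arbitrary point at which to evaluate the m.g.f.
N = 200     # truncation point for the infinite sums

# p.f. values built by recursion, which avoids large factorials
pf = [math.exp(-lam)]
for x in range(N):
    pf.append(pf[-1] * lam / (x + 1))

mean = sum(x * fx for x, fx in enumerate(pf))
second_moment = sum(x * x * fx for x, fx in enumerate(pf))
variance = second_moment - mean**2
mgf_series = sum(math.exp(t * x) * fx for x, fx in enumerate(pf))
mgf_closed = math.exp(lam * (math.exp(t) - 1))  # Eq. (5.4.5)

print(mean, variance)          # both should be close to lam
print(mgf_series, mgf_closed)  # the two values should agree closely
```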
The mean and the variance, as well as all other moments, can be determined 
from the m.g.f. given in Eq. (5.4.5). We shall not derive the values of any other 
moments here, but we shall use the m.g.f. to derive the following property of Poisson 
distributions. 

Theorem
5.4.4


If the random variables $X_1, \ldots, X_k$ are independent and if $X_i$ has the Poisson
distribution with mean $\lambda_i$ ($i = 1, \ldots, k$), then the sum $X_1 + \cdots + X_k$ has the Poisson
distribution with mean $\lambda_1 + \cdots + \lambda_k$.


Proof Let $\psi_i(t)$ denote the m.g.f. of $X_i$ for $i = 1, \ldots, k$, and let $\psi(t)$ denote the
m.g.f. of the sum $X_1 + \cdots + X_k$. Since $X_1, \ldots, X_k$ are independent, it follows that,
for $-\infty < t < \infty$,

$$\psi(t) = \prod_{i=1}^{k} \psi_i(t) = \prod_{i=1}^{k} e^{\lambda_i(e^t - 1)} = e^{(\lambda_1 + \cdots + \lambda_k)(e^t - 1)}.$$

It can be seen from Eq. (5.4.5) that this m.g.f. $\psi(t)$ is the m.g.f. of the Poisson
distribution with mean $\lambda_1 + \cdots + \lambda_k$. Hence, the distribution of $X_1 + \cdots + X_k$ must
be as stated in the theorem.


A table of probabilities for Poisson distributions with various values of the mean
$\lambda$ is given at the end of this book.


Example
5.4.3


Customer Arrivals. Suppose that the store owner in Examples 5.4.1 and 5.4.2 is
interested not only in the number of customers that arrive in the one-hour period,
but also in how many customers arrive in the next hour after that period. Let Y be
the number of customers that arrive in the second hour. By the reasoning at the
end of Example 5.4.2, the owner might model Y as a Poisson random variable with
mean 4.5. He would also say that X and Y are independent because he has been
assuming that arrivals in disjoint time intervals are independent. According to
Theorem 5.4.4, X + Y would have the Poisson distribution with mean 4.5 + 4.5 = 9. What
is the probability that at least 12 customers will arrive in the entire two-hour period?
We can use the table of Poisson probabilities in the back of this book by looking in
the $\lambda = 9$ column. Either add up the numbers corresponding to k = 0, ..., 11 and
subtract the total from 1, or add up those from k = 12 to the end. Either way, the
result is Pr(X + Y ≥ 12) = 0.1970.
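
The two-hour calculation can be reproduced without a table. The Python sketch below is illustrative only (the helper names `poisson_pf` and `sum_pf` are introduced here). It first checks Theorem 5.4.4 at a few points by convolving two Poisson p.f.'s with mean 4.5 and comparing with the Poisson p.f. with mean 9, and then computes the probability of at least 12 arrivals as 1 minus the sum over k = 0, ..., 11.

```python
import math


def poisson_pf(x, lam):
    """Poisson p.f. with mean lam, as in Eq. (5.4.2)."""
    return math.exp(-lam) * lam**x / math.factorial(x)


def sum_pf(k, lam1=4.5, lam2=4.5):
    """p.f. of X + Y at k, by direct convolution of the two p.f.'s."""
    return sum(poisson_pf(j, lam1) * poisson_pf(k - j, lam2) for j in range(k + 1))


# The convolution should match the Poisson p.f. with mean 9 (Theorem 5.4.4).
for k in (0, 5, 12):
    print(k, sum_pf(k), poisson_pf(k, 9.0))

# Probability of at least 12 arrivals in the two-hour period.
tail = 1 - sum(poisson_pf(k, 9.0) for k in range(12))
print(round(tail, 4))
```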


The Poisson Approximation to Binomial Distributions 


In Examples 5.4.1 and 5.4.2, we illustrated how close the Poisson distribution with 
mean 4.5 is to the binomial distribution with parameters 3600 and 0.00125. We shall 
now demonstrate a general version of that result, namely, that when the value of n 
is large and the value of p is close to 0, the binomial distribution with parameters n 
and p can be approximated by the Poisson distribution with mean np. 


Theorem
5.4.5


Closeness of Binomial and Poisson Distributions. For each integer n and each 0 < p < 1,
let f(x|n, p) denote the p.f. of the binomial distribution with parameters n and p.
Let $f(x|\lambda)$ denote the p.f. of the Poisson distribution with mean $\lambda$. Let $\{p_n\}_{n=1}^{\infty}$ be a
sequence of numbers between 0 and 1 such that $\lim_{n \to \infty} n p_n = \lambda$. Then

$$\lim_{n \to \infty} f(x|n, p_n) = f(x|\lambda),$$

for all x = 0, 1, ....


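Proof We begin by writing

$$f(x|n, p_n) = \frac{n(n-1)\cdots(n-x+1)}{x!}\, p_n^{x} (1 - p_n)^{n-x}.$$

Next, let $\lambda_n = n p_n$ so that $\lim_{n \to \infty} \lambda_n = \lambda$. Then $f(x|n, p_n)$ can be rewritten in the
following form:

$$f(x|n, p_n) = \frac{\lambda_n^{x}}{x!} \cdot \frac{n}{n} \cdot \frac{n-1}{n} \cdots \frac{n-x+1}{n} \left(1 - \frac{\lambda_n}{n}\right)^{n} \left(1 - \frac{\lambda_n}{n}\right)^{-x}. \tag{5.4.6}$$

For each x ≥ 0,

$$\lim_{n \to \infty} \lambda_n^{x} \cdot \frac{n}{n} \cdot \frac{n-1}{n} \cdots \frac{n-x+1}{n} \left(1 - \frac{\lambda_n}{n}\right)^{-x} = \lambda^{x}.$$

Furthermore, it follows from Theorem 5.3.3 that

$$\lim_{n \to \infty} \left(1 - \frac{\lambda_n}{n}\right)^{n} = e^{-\lambda}. \tag{5.4.7}$$

It now follows from Eq. (5.4.6) that for every x ≥ 0,

$$\lim_{n \to \infty} f(x|n, p_n) = \frac{e^{-\lambda}\lambda^{x}}{x!} = f(x|\lambda).$$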
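
The limit in Theorem 5.4.5 can be watched numerically. The Python sketch below is an illustration under assumptions chosen here: the mean lam = 4.5, the point x = 3, and the particular sequence p_n = lam/n (so that n p_n equals lam for every n) are all arbitrary choices, not part of the theorem.

```python
import math

lam, x = 4.5, 3  # illustrative choices of the limiting mean and the point x


def binom_pf(x, n, p):
    """Binomial p.f. with parameters n and p."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


poisson_value = math.exp(-lam) * lam**x / math.factorial(x)

# With p_n = lam / n we have n * p_n = lam exactly for every n.
errors = []
for n in (10, 100, 1000, 10000, 100000):
    value = binom_pf(x, n, lam / n)
    errors.append(abs(value - poisson_value))
    print(n, value, errors[-1])
```

The printed absolute differences shrink as n grows, which is the behavior the theorem describes for each fixed x.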


Example
5.4.4


Approximating a Probability. Suppose that in a large population the proportion of
people who have a certain disease is 0.01. We shall determine the probability that in
a random group of 200 people at least four people will have the disease.

In this example, we can assume that the exact distribution of the number of
people having the disease among the 200 people in the random group is the binomial
distribution with parameters n = 200 and p = 0.01. Therefore, this distribution can
be approximated by the Poisson distribution for which the mean is $\lambda = np = 2$. If X
denotes a random variable having this Poisson distribution, then it can be found from
the table of the Poisson distribution at the end of this book that Pr(X ≥ 4) = 0.1428.
Hence, the probability that at least four people will have the disease is approximately
0.1428. The actual value is 0.1420.
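
Both numbers in this example can be computed directly. The Python sketch below is illustrative only; the variable names are introduced here. Direct evaluation of the Poisson tail gives a value near 0.1429 rather than 0.1428; the small gap is presumably due to rounding in the printed table.

```python
import math

n, p = 200, 0.01
lam = n * p

# Exact binomial probability of at least four cases among 200 people.
exact = 1 - sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(4))

# Poisson approximation with mean lam = np = 2.
approx = 1 - sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(4))

print(round(exact, 4), round(approx, 4))
```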


Theorem 5.4.5 says that if n is large and p is small so that np is close to $\lambda$, then the
binomial distribution with parameters n and p is close to the Poisson distribution with
mean $\lambda$. Recall Theorem 5.3.4, which says that if A and B are large compared to n and
if A/(A + B) is close to p, then the hypergeometric distribution with parameters A, B,
and n is close to the binomial distribution with parameters n and p. These two results
can be combined into the following theorem, whose proof is left to Exercise 17.


Theorem
5.4.6


Closeness of Hypergeometric and Poisson Distributions. Let $\lambda > 0$. Let Y have the
Poisson distribution with mean $\lambda$. For each positive integer T, let $A_T$, $B_T$, and
$n_T$ be integers such that $\lim_{T \to \infty} A_T = \infty$, $\lim_{T \to \infty} B_T = \infty$, $\lim_{T \to \infty} n_T = \infty$, and
$\lim_{T \to \infty} n_T A_T/(A_T + B_T) = \lambda$. Let $X_T$ have the hypergeometric distribution with
parameters $A_T$, $B_T$, and $n_T$. For each fixed x = 0, 1, ...,

$$\lim_{T \to \infty} \frac{\Pr(Y = x)}{\Pr(X_T = x)} = 1.$$

Poisson Processes 


Example
5.4.5


Customer Arrivals. In Example 5.4.3, the store owner believes that the number of
customers that arrive in each one-hour period has the Poisson distribution with mean
4.5. What if the owner is interested in a half-hour period or a 4-hour and 15-minute
period? Is it safe to assume that the number of customers that arrive in a half-hour
period has the Poisson distribution with mean 2.25?


In order to be sure that all of the distributions for the various numbers of arrivals 
in Example 5.4.5 are consistent with each other, the store owner needs to think about 
the overall process of customer arrivals, not just a few isolated time periods. The 
following definition gives a model for the overall process of arrivals that will allow 
the store owner to construct distributions for all the counts of customer arrivals that 
interest him as well as other useful things. 


Definition
5.4.2


Poisson Process. A Poisson process with rate $\lambda$ per unit time is a process that satisfies
the following two properties:


i. The number of arrivals in every fixed interval of time of length t has the Poisson
distribution with mean $\lambda t$.


ii. The numbers of arrivals in every collection of disjoint time intervals are inde- 
pendent. 


The answer to the question at the end of Example 5.4.5 will be “yes” if the store 
owner makes the assumption that customers arrive according to a Poisson process 
with rate 4.5 per hour. Here is another example. 


Example
5.4.6


Radioactive Particles. Suppose that radioactive particles strike a certain target in
accordance with a Poisson process at an average rate of three particles per minute.
We shall determine the probability that 10 or more particles will strike the target in
a particular two-minute period.

In a Poisson process, the number of particles striking the target in any particular
one-minute period has the Poisson distribution with mean $\lambda$. Since the mean number
of strikes in any one-minute period is 3, it follows that $\lambda = 3$ in this example.
Therefore, the number of strikes X in any two-minute period will have the Poisson
distribution with mean 6. It can be found from the table of the Poisson distribution
at the end of this book that Pr(X ≥ 10) = 0.0838.
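
In a Poisson process, every count over an interval is handled the same way: multiply the rate by the interval length to get the mean, then use the Poisson p.f. The Python sketch below is an illustration with a helper name, `poisson_tail`, introduced here. Direct evaluation for this example gives about 0.0839; the last-digit difference from the value read from the table is presumably rounding.

```python
import math


def poisson_tail(k, mean):
    """Pr(N >= k) for a Poisson count N with the given mean."""
    return 1 - sum(math.exp(-mean) * mean**j / math.factorial(j) for j in range(k))


# Example 5.4.6: rate 3 per minute over 2 minutes, so the mean is 3 * 2 = 6.
rate, length = 3.0, 2.0
tail = poisson_tail(10, rate * length)
print(round(tail, 4))

# Example 5.4.5: rate 4.5 per hour; means for a half hour and for 4 hours 15 minutes.
for hours in (0.5, 4.25):
    print(hours, 4.5 * hours)
```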


Note: Generality of Poisson Processes. Although we have introduced Poisson pro- 
cesses in terms of counts of arrivals during time intervals, Poisson processes are 
actually more general. For example, a Poisson process can be used to model occur- 
rences in space as well as time. A Poisson process could be used to model telephone 
calls arriving at a switchboard, atomic particles emitted from a radioactive source, 
diseased trees in a forest, or defects on the surface of a manufactured product. The 
reason for the popularity of the Poisson process model is twofold. First, the model 
is computationally convenient. Second, there is a mathematical justification for the 
model if one makes three plausible assumptions about how the phenomena occur. 
We shall present the three assumptions in some detail after another example. 


Example
5.4.7


Cryptosporidium in Drinking Water. Cryptosporidium is a genus of protozoa that
occurs as small oocysts and can cause painful sickness and even death when ingested.
Occasionally, oocysts are detected in public drinking water supplies. A concentration 
as low as one oocyst per five liters can be enough to trigger a boil-water advisory. In 
April 1993, many thousands of people became ill during a cryptosporidiosis outbreak 
in Milwaukee, Wisconsin. Different water systems have different systems for moni- 
toring protozoa occurrence in drinking water. One problem with monitoring systems 
is that detection technology is not always very sensitive. One popular technique is to 
push a large amount of water through a very fine filter and then treat the material 
captured on the filter in a way that identifies Cryptosporidium oocysts. The number 
of oocysts is then counted and recorded. Even if there is an oocyst on the filter, the 
probability can be as low as 0.1 that it will get counted. 

Suppose that, in a particular water supply, oocysts occur according to a Poisson
process with rate $\lambda$ oocysts per liter. Suppose that the filtering system is capable of
capturing all oocysts in a sample, but that the counting system has probability p of
actually observing each oocyst that is on the filter. Assume that the counting system
observes or misses each oocyst on the filter independently. What is the distribution
of the number of counted oocysts from t liters of filtered water?

Let Y be the number of oocysts in the t liters (all of which make it onto the filter).
Then Y has the Poisson distribution with mean $\lambda t$. Let $X_i = 1$ if the ith oocyst on the
filter gets counted, and $X_i = 0$ if not. Let X be the counted number of oocysts so that
$X = X_1 + \cdots + X_y$ if Y = y. Conditional on Y = y, we have assumed that the $X_i$ are
independent Bernoulli random variables with parameter p, so X has the binomial
distribution with parameters y and p conditional on Y = y. We want the marginal
distribution of X. This can be found using the law of total probability for random
variables (3.6.11). For x = 0, 1, ...,

$$f_1(x) = \sum_{y=0}^{\infty} g_1(x|y) f_2(y)$$

$$= \sum_{y=x}^{\infty} \binom{y}{x} p^{x} (1-p)^{y-x} \frac{e^{-\lambda t}(\lambda t)^{y}}{y!}$$

$$= e^{-\lambda t} \frac{(p\lambda t)^{x}}{x!} \sum_{y=x}^{\infty} \frac{[\lambda t(1-p)]^{y-x}}{(y-x)!}$$

$$= e^{-\lambda t} \frac{(p\lambda t)^{x}}{x!} \sum_{u=0}^{\infty} \frac{[\lambda t(1-p)]^{u}}{u!}$$

$$= e^{-\lambda t} \frac{(p\lambda t)^{x}}{x!} e^{\lambda t(1-p)} = e^{-p\lambda t} \frac{(p\lambda t)^{x}}{x!}.$$

This is easily recognized as the p.f. of the Poisson distribution with mean $p\lambda t$. The
effect of losing a fraction 1 − p of the oocyst count is merely to lower the rate of the
Poisson process from $\lambda$ per liter to $p\lambda$ per liter.

Suppose that $\lambda = 0.2$ and p = 0.1. How much water must we filter in order
for there to be probability at least 0.9 that we will count at least one oocyst? The
probability of counting at least one oocyst is 1 minus the probability of counting
none, which is $e^{-p\lambda t} = e^{-0.02t}$. So, we need t large enough so that $1 - e^{-0.02t} \geq 0.9$,
that is, t ≥ 115. A typical procedure is to test 100 liters, which would have probability
$1 - e^{-0.02 \times 100} = 0.86$ of detecting at least one oocyst.
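
The marginal distribution derived in this example can be verified numerically. The Python sketch below is only an illustration: the helper names are introduced here, and the sum over y is truncated at y_max = 150, an arbitrary cutoff beyond which the Poisson probabilities with mean 20 are negligible. It also recomputes the two numbers at the end of the example.

```python
import math

lam, p, t = 0.2, 0.1, 100.0  # values used at the end of Example 5.4.7


def poisson_pf(x, mean):
    """Poisson p.f. with the given mean."""
    return math.exp(-mean) * mean**x / math.factorial(x)


def counted_pf(x, y_max=150):
    """Marginal p.f. of X via the law of total probability, truncated at y_max."""
    return sum(
        math.comb(y, x) * p**x * (1 - p) ** (y - x) * poisson_pf(y, lam * t)
        for y in range(x, y_max + 1)
    )


# The mixture should agree with the Poisson p.f. with mean p * lam * t.
for x in range(4):
    print(x, counted_pf(x), poisson_pf(x, p * lam * t))

# Volume needed for detection probability 0.9, and the probability at 100 liters.
t_needed = math.log(10) / (p * lam)
detect_100 = 1 - math.exp(-p * lam * 100)
print(t_needed, round(detect_100, 2))
```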


Assumptions Underlying the Poisson Process Model 


In what follows, we shall refer to time intervals, but the assumptions can be used 
equally well for subregions of two- or three-dimensional regions or sublengths of 
a linear distance. Indeed, a Poisson process can be used to model occurrences in 
any region that can be subdivided into arbitrarily small pieces. There are three 
assumptions that lead to the Poisson process model. 

The first assumption is that the numbers of occurrences in any collection of 
disjoint intervals of time must be mutually independent. For example, even though 
an unusually large number of telephone calls are received at a switchboard during 
a particular interval, the probability that at least one call will be received during a 
forthcoming interval remains unchanged. Similarly, even though no call has been 
received at the switchboard for an unusually long interval, the probability that a call 
will be received during the next short interval remains unchanged. 

The second assumption is that the probability of an occurrence during each
very short interval of time must be approximately proportional to the length of
that interval. To express this condition more formally, we shall use the standard
mathematical notation in which o(t) denotes any function of t having the property
that

$$\lim_{t \to 0} \frac{o(t)}{t} = 0. \tag{5.4.8}$$

According to (5.4.8), o(t) must be a function that approaches 0 as t → 0, and,
furthermore, this function must approach 0 at a rate faster than t itself. An example of
such a function is $o(t) = t^{\alpha}$, where $\alpha > 1$. It can be verified that this function satisfies
Eq. (5.4.8). The second assumption can now be expressed as follows: There exists a
constant $\lambda > 0$ such that for every time interval of length t, the probability of at least
one occurrence during that interval has the form $\lambda t + o(t)$. Thus, for every very small
value of t, the probability of at least one occurrence during an interval of length t is
equal to $\lambda t$ plus a quantity having a smaller order of magnitude.

One of the consequences of the second assumption is that the process being ob- 
served must be stationary over the entire period of observation; that is, the probability 
of an occurrence must be the same over the entire period. There can be neither busy 
intervals, during which we know in advance that occurrences are likely to be more 
frequent, nor quiet intervals, during which we know in advance that occurrences are 
likely to be less frequent. This condition is reflected in the fact that the same
constant $\lambda$ expresses the probability of an occurrence in every interval over the entire
period of observation. The second assumption can be relaxed at the cost of more 
complicated mathematics, but we shall not do so here. 

The third assumption is that, for each very short interval of time, the probability 
that there will be two or more occurrences in that interval must have a smaller order 
of magnitude than the probability that there will be just one occurrence. In symbols, 
the probability of two or more occurrences in a time interval of length t must be 
o(t). Thus, the probability of two or more occurrences in a small interval must be 
negligible in comparison with the probability of one occurrence in that interval. Of 
course, it follows from the second assumption that the probability of one occurrence 
in that same interval will itself be negligible in comparison with the probability of no 
occurrences. 

Under the preceding three assumptions, it can be shown that the process will
satisfy the definition of a Poisson process with rate $\lambda$. See Exercise 16 in this section
for one method of proof.
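
Going in the other direction, a Poisson process as in Definition 5.4.2 does satisfy the second and third assumptions, and this can be seen numerically. The Python sketch below is an illustration with an arbitrarily chosen rate lam = 4.5. For shrinking interval lengths t, it divides by t both the difference between the probability of at least one occurrence and lam * t, and the probability of two or more occurrences; both ratios should tend to 0, which is what being o(t) means.

```python
import math

lam = 4.5  # illustrative rate; any positive value behaves the same way

ratios = []
for t in (1e-1, 1e-2, 1e-3, 1e-4):
    mean = lam * t
    p_at_least_one = -math.expm1(-mean)                    # 1 - exp(-mean)
    p_two_or_more = p_at_least_one - mean * math.exp(-mean)
    # Both remainders divided by t should shrink toward 0 as t does.
    r1 = (p_at_least_one - lam * t) / t
    r2 = p_two_or_more / t
    ratios.append((r1, r2))
    print(t, r1, r2)
```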




Summary 


Poisson distributions are used to model data that arrive as counts. A Poisson process
with rate $\lambda$ is a model for random occurrences that have a constant expected rate $\lambda$
per unit time (or per unit area). We must assume that occurrences in disjoint time
intervals (or disjoint areas) are independent and that two or more occurrences cannot
happen at the same time (or place). The number of occurrences in an interval of
length (or area of size) t has the Poisson distribution with mean $t\lambda$. If n is large and
p is small, then the binomial distribution with parameters n and p is approximately
the same as the Poisson distribution with mean np.


Exercises


1. In Example 5.4.7, with $\lambda = 0.2$ and p = 0.1, compute
the probability that we would detect at least two oocysts 
after filtering 100 liters of water. 


2. Suppose that on a given weekend the number of acci- 
dents at a certain intersection has the Poisson distribution 
with mean 0.7. What is the probability that there will be at 
least three accidents at the intersection during the week- 
end? 


3. Suppose that the number of defects on a bolt of cloth 
produced by a certain process has the Poisson distribution 
with mean 0.4. If a random sample of five bolts of cloth is 
inspected, what is the probability that the total number of 
defects on the five bolts will be at least 6? 


4. Suppose that in a certain book there are on the average
$\lambda$ misprints per page and that misprints occurred
according to a Poisson process. What is the probability that a
particular page will contain no misprints?


5. Suppose that a book with n pages contains on the
average $\lambda$ misprints per page. What is the probability that
there will be at least m pages which contain more than k
misprints?


6. Suppose that a certain type of magnetic tape contains 
on the average three defects per 1000 feet. What is the 
probability that a roll of tape 1200 feet long contains no 
defects? 


7. Suppose that on the average a certain store serves 15 
customers per hour. What is the probability that the store 
will serve more than 20 customers in a particular two-hour 
period? 


8. Suppose that $X_1$ and $X_2$ are independent random
variables and that $X_i$ has the Poisson distribution with mean
$\lambda_i$ (i = 1, 2). For each fixed value of k (k = 1, 2, ...),
determine the conditional distribution of $X_1$ given that
$X_1 + X_2 = k$.


9. Suppose that the total number of items produced by
a certain machine has the Poisson distribution with mean
$\lambda$, all items are produced independently of one another,
and the probability that any given item produced by the 
machine will be defective is p. Determine the marginal 
distribution of the number of defective items produced by 
the machine. 


10. For the problem described in Exercise 9, let X denote 
the number of defective items produced by the machine, 
and let Y denote the number of nondefective items pro- 
duced by the machine. Show that X and Y are independent 
random variables. 


11. The mode of a discrete distribution was defined in 
Exercise 12 of Sec. 5.2. Determine the mode or modes of 
the Poisson distribution with mean $\lambda$.


12. Suppose that the proportion of colorblind people in 
a certain population is 0.005. What is the probability that 
there will not be more than one colorblind person in a 
randomly chosen group of 600 people? 


13. The probability of triplets in human births is approx- 
imately 0.001. What is the probability that there will be 
exactly one set of triplets among 700 births in a large hos- 
pital? 


14. An airline sells 200 tickets for a certain flight on an 
airplane that has only 198 seats because, on the average, 
1 percent of purchasers of airline tickets do not appear 
for the departure of their flight. Determine the probability 
that everyone who appears for the departure of this flight 
will have a seat. 


15. Suppose that internet users access a particular Web
site according to a Poisson process with rate $\lambda$ per hour,
but $\lambda$ is unknown. The Web site maintainer believes that
$\lambda$ has a continuous distribution with p.d.f.

$$f(\lambda) = \begin{cases} 2e^{-2\lambda} & \text{for } \lambda > 0, \\ 0 & \text{otherwise.} \end{cases}$$

Let X be the number of users who access the Web
site during a one-hour period. If X = 1 is observed, find
the conditional p.d.f. of $\lambda$ given X = 1.


16. In this exercise, we shall prove that the three
assumptions underlying the Poisson process model do indeed
imply that occurrences happen according to a Poisson
process. What we need to show is that, for each t, the
number of occurrences during a time interval of length t
has the Poisson distribution with mean $\lambda t$. Let X stand for
the number of occurrences during a particular time
interval of length t. Feel free to use the following extension of
Eq. (5.4.7): For all real a,

$$\lim_{u \to 0} \left(1 + au + o(u)\right)^{1/u} = e^{a}. \tag{5.4.9}$$

a. For each positive integer n, divide the time interval
into n disjoint subintervals of length t/n each. For
i = 1, ..., n, let $Y_i = 1$ if exactly one arrival occurs in
the ith subinterval, and let $A_i$ be the event that two or
more occurrences occur during the ith subinterval.
Let $W_n = \sum_{i=1}^{n} Y_i$. For each nonnegative integer k,
show that we can write $\Pr(X = k) = \Pr(W_n = k) + \Pr(B)$,
where B is a subset of $\bigcup_{i=1}^{n} A_i$.


b. Show that $\lim_{n \to \infty} \Pr\left(\bigcup_{i=1}^{n} A_i\right) = 0$. Hint: Show that
$\Pr\left(\bigcap_{i=1}^{n} A_i^c\right) = (1 + o(u))^{1/u}$, where u = 1/n.

c. Show that $\lim_{n \to \infty} \Pr(W_n = k) = e^{-\lambda t}(\lambda t)^k/k!$. Hint:
$\lim_{n \to \infty} n!/[n^k (n-k)!] = 1$.

d. Show that X has the Poisson distribution with mean
$\lambda t$.


17. Prove Theorem 5.4.6. One approach is to adapt the
proof of Theorem 5.3.4 by replacing n by $n_T$ in that proof.
The steps of the proof that are significantly different are
the following. (i) You will need to show that $B_T - n_T$ goes
to ∞. (ii) The three limits that depend on Theorem 5.3.3
need to be rewritten as ratios converging to 1. For example,
the second one is rewritten as

$$\lim_{T \to \infty} \left(\frac{B_T}{B_T - n_T + x}\right)^{B_T - n_T + x + 1/2} e^{-n_T + x} = 1.$$

You'll need a couple more such limits as well. (iii) Instead
of (5.3.12), prove that

$$\lim_{T \to \infty} \frac{n_T^{x} A_T^{x} B_T^{n_T - x}}{(A_T + B_T)^{n_T}} = \lambda^{x} e^{-\lambda}.$$


18. Let $A_T$, $B_T$, and $n_T$ be sequences, all three of which go
to ∞ as T → ∞. Prove that $\lim_{T \to \infty} n_T A_T/(A_T + B_T) = \lambda$
if and only if $\lim_{T \to \infty} n_T A_T/B_T = \lambda$.


Example
5.5.1


Theorem
5.5.1


5.5 The Negative Binomial Distributions 


Earlier we learned that, in n Bernoulli trials with probability of success p, the 
number of successes has the binomial distribution with parameters n and p. Instead 
of counting successes in a fixed number of trials, it is often necessary to observe 
the trials until we see a fixed number of successes. For example, while monitoring 
a piece of equipment to see when it needs maintenance, we might let it run until it 
produces a fixed number of errors and then repair it. The number of failures until 
a fixed number of successes has a distribution in the family of negative binomial 
distributions. 


Definition and Interpretation 


Defective Parts. Suppose that a machine produces parts that can be either good or 
defective. Let X; = 1 if the ith part is defective and X; = 0 otherwise. Assume that 
the parts are good or defective independently of each other with Pr(X; = 1) = p for 
all i. An inspector observes the parts produced by this machine until she sees four 
defectives. Let X be the number of good parts observed by the time that the fourth 
defective is observed. What is the distribution of X? < 


The problem described in Example 5.5.1 is typical of a general situation in which 
a sequence of Bernoulli trials can be observed. Suppose that an infinite sequence 
of Bernoulli trials is available. Call the two possible outcomes success and failure, 
with p being the probability of success. In this section, we shall study the distribution 
of the total number of failures that will occur before exactly r successes have been 
obtained, where r is a fixed positive integer. 


Sampling until a Fixed Number of Successes. Suppose that an infinite sequence of 
Bernoulli trials with probability of success p are available. The number X of failures 
that occur before the rth success has the following p.d_-f.: 


PX = ") rf] — nik = 
ae p= Pp forr=012-.., (554) 
0 otherwise. 
Proof For $n = r, r + 1, \ldots$, we shall let $A_n$ denote the event that the total
number of trials required to obtain exactly r successes is n. As explained in
Example 2.2.8, the event $A_n$ will occur if and only if exactly r − 1 successes occur
among the first n − 1 trials and the rth success is obtained on the nth trial. Since
all trials are independent, it follows that

$$\Pr(A_n) = \binom{n - 1}{r - 1} p^{r-1} (1 - p)^{(n-1)-(r-1)} \cdot p = \binom{n - 1}{r - 1} p^r (1 - p)^{n-r}. \tag{5.5.2}$$

For each value of x (x = 0, 1, 2, ...), the event that exactly x failures are obtained
before the rth success is obtained is the same as the event that the total number
of trials required to obtain r successes is r + x. In other words, if X denotes the
number of failures that will occur before the rth success is obtained, then
$\Pr(X = x) = \Pr(A_{r+x})$. Eq. (5.5.1) now follows from Eq. (5.5.2). ∎


Definition 5.5.1 Negative Binomial Distribution. A random variable X has the negative
binomial distribution with parameters r and p (r = 1, 2, ... and 0 < p < 1) if X has
a discrete distribution for which the p.f. f(x|r, p) is as specified by Eq. (5.5.1).


Example 5.5.2 Defective Parts. Example 5.5.1 is worded so that defective parts are
successes and good parts are failures. The distribution of the number X of good parts
observed by the time of the fourth defective is the negative binomial distribution
with parameters 4 and p. ◄


The Geometric Distributions


The most common special case of a negative binomial random variable is one for
which r = 1. This would be the number of failures until the first success.


Definition 5.5.2 Geometric Distribution. A random variable X has the geometric
distribution with parameter p (0 < p < 1) if X has a discrete distribution for which
the p.f. f(x|1, p) is as follows:

$$f(x \mid 1, p) = \begin{cases} p(1 - p)^x & \text{for } x = 0, 1, 2, \ldots, \\ 0 & \text{otherwise.} \end{cases} \tag{5.5.3}$$


Example 5.5.3 Triples in the Lottery. A common daily lottery game involves the
drawing of three digits from 0 to 9 independently with replacement and independently
from day to day. Lottery watchers often get excited when all three digits are the
same, an event called triples. If p is the probability of obtaining triples, and if X
is the number of days without triples before the first triple is observed, then X has
the geometric distribution with parameter p. In this case, it is easy to see that
p = 0.01, since there are 10 different triples among the 1000 equally likely daily
numbers. ◄


The relationship between geometric and negative binomial distributions goes
beyond the fact that the geometric distributions are special cases of negative binomial
distributions.


Theorem 5.5.2 If $X_1, \ldots, X_r$ are i.i.d. random variables and if each $X_i$ has the
geometric distribution with parameter p, then the sum $X_1 + \cdots + X_r$ has the
negative binomial distribution with parameters r and p.


Proof Consider an infinite sequence of Bernoulli trials with success probability p.
Let $X_1$ denote the number of failures that occur before the first success is obtained;
then $X_1$ will have the geometric distribution with parameter p.

Now continue observing the Bernoulli trials after the first success. For
$j = 2, 3, \ldots$, let $X_j$ denote the number of failures that occur after j − 1
successes have been obtained but before the jth success is obtained. Since all the
trials are independent and the probability of obtaining a success on each trial is p,
it follows that each random variable $X_j$ will have the geometric distribution with
parameter p and that the random variables $X_1, X_2, \ldots$ will be independent.
Furthermore, for r = 1, 2, ..., the sum $X_1 + \cdots + X_r$ will be equal to the total
number of failures that occur before exactly r successes have been obtained.
Therefore, this sum will have the negative binomial distribution with parameters
r and p. ∎


Properties of Negative Binomial and Geometric Distributions


Theorem 5.5.3 Moment Generating Function. If X has the negative binomial distribution
with parameters r and p, then the m.g.f. of X is as follows:

$$\psi(t) = \left(\frac{p}{1 - (1 - p)e^t}\right)^r \quad \text{for } t < \log\left(\frac{1}{1 - p}\right). \tag{5.5.4}$$

The m.g.f. of the geometric distribution with parameter p is the special case of
Eq. (5.5.4) with r = 1.


Proof Let $X_1, \ldots, X_r$ be a random sample of r geometric random variables each with
parameter p. We shall find the m.g.f. of $X_1$ and then apply Theorems 4.4.4 and 5.5.2
to find the m.g.f. of the negative binomial distribution with parameters r and p.
The m.g.f. $\psi_1(t)$ of $X_1$ is

$$\psi_1(t) = E(e^{tX_1}) = p \sum_{x=0}^{\infty} [(1 - p)e^t]^x. \tag{5.5.5}$$

The infinite series in Eq. (5.5.5) will have a finite sum for every value of t such that
$0 < (1 - p)e^t < 1$, that is, for $t < \log(1/[1 - p])$. It is known from elementary
calculus that for every number $\alpha$ ($0 < \alpha < 1$),

$$\sum_{x=0}^{\infty} \alpha^x = \frac{1}{1 - \alpha}.$$

Therefore, for $t < \log(1/[1 - p])$, the m.g.f. of the geometric distribution with
parameter p is

$$\psi_1(t) = \frac{p}{1 - (1 - p)e^t}. \tag{5.5.6}$$

Each of $X_1, \ldots, X_r$ has the same m.g.f., namely, $\psi_1$. According to Theorem 4.4.4,
the m.g.f. of $X = X_1 + \cdots + X_r$ is $\psi(t) = [\psi_1(t)]^r$. Theorem 5.5.2 says that X
has the negative binomial distribution with parameters r and p, and hence the m.g.f.
of X is $[\psi_1(t)]^r$, which is the same as Eq. (5.5.4). ∎


Theorem 5.5.4 Mean and Variance. If X has the negative binomial distribution with
parameters r and p, the mean and the variance of X must be

$$E(X) = \frac{r(1 - p)}{p} \quad \text{and} \quad \operatorname{Var}(X) = \frac{r(1 - p)}{p^2}. \tag{5.5.7}$$

The mean and variance of the geometric distribution with parameter p are the special
case of Eq. (5.5.7) with r = 1.

Proof Let $X_1$ have the geometric distribution with parameter p. We will find the
mean and variance by differentiating the m.g.f. Eq. (5.5.5):

$$E(X_1) = \psi_1'(0) = \frac{1 - p}{p}, \tag{5.5.8}$$

$$\operatorname{Var}(X_1) = \psi_1''(0) - [\psi_1'(0)]^2 = \frac{1 - p}{p^2}. \tag{5.5.9}$$

If X has the negative binomial distribution with parameters r and p, represent it as
the sum $X = X_1 + \cdots + X_r$ of r independent random variables, each having the same
distribution as $X_1$. Eq. (5.5.7) now follows from Eqs. (5.5.8) and (5.5.9). ∎
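The negative binomial p.f. of Eq. (5.5.1) and the moments in Eq. (5.5.7) can be checked numerically. The following is a minimal Python sketch, not part of the text: the function names, the truncation point `x_max`, and the parameter values r = 4, p = 0.3 are illustrative assumptions. It sums the p.f. over a long but finite range and compares the resulting mean and variance with the closed forms.

```python
from math import comb


def neg_binomial_pf(x: int, r: int, p: float) -> float:
    """p.f. of Eq. (5.5.1): probability of x failures before the r-th success."""
    if x < 0:
        return 0.0
    return comb(r + x - 1, x) * p**r * (1 - p) ** x


def truncated_moments(r: int, p: float, x_max: int = 500):
    """Approximate total mass, mean, and variance by summing x = 0..x_max.

    The truncation point x_max is an assumption; the neglected tail is
    negligible for the parameter values used below.
    """
    probs = [neg_binomial_pf(x, r, p) for x in range(x_max + 1)]
    total = sum(probs)
    mean = sum(x * q for x, q in enumerate(probs))
    second = sum(x * x * q for x, q in enumerate(probs))
    return total, mean, second - mean**2


# Hypothetical parameter values chosen only for illustration.
r, p = 4, 0.3
total, mean, var = truncated_moments(r, p)
print("total mass:", total)                           # should be close to 1
print("mean:", mean, "vs", r * (1 - p) / p)           # Eq. (5.5.7)
print("variance:", var, "vs", r * (1 - p) / p**2)     # Eq. (5.5.7)
```

Setting r = 1 in the same functions gives the geometric case of Eqs. (5.5.8) and (5.5.9).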


Example 5.5.4 Triples in the Lottery. In Example 5.5.3, the number X of daily draws
without a triple until we see a triple has the geometric distribution with parameter
p = 0.01. The total number of days until we see the first triple is then X + 1. So,
the expected number of days until we observe triples is E(X) + 1 = 100.

Now suppose that a lottery player has been waiting 120 days for triples to occur.
Such a player might conclude from the preceding calculation that triples are “due.”
The most straightforward way to address such a claim would be to start by calculating
the conditional distribution of X given that $X \ge 120$. ◄


The next result says that the lottery player at the end of Example 5.5.4 couldn’t 
be farther from correct. Regardless of how long he has waited for triples, the time 
remaining until triples occur has the same geometric distribution (and the same 
mean) as it had when he started waiting. The proof is simple and is left as Exercise 8. 


Theorem 5.5.5 Memoryless Property of Geometric Distributions. Let X have the geometric
distribution with parameter p, and let $k \ge 0$. Then for every integer $t \ge 0$,

$$\Pr(X = k + t \mid X \ge k) = \Pr(X = t).$$


The intuition behind Theorem 5.5.5 is the following: Think of X as the number of
failures until the first success in a sequence of Bernoulli trials. Let Y be the number
of failures starting with the (k + 1)st trial until the next success. Then Y has the same
distribution as X and is independent of the first k trials. Hence, conditioning on
anything that happened on the first k trials, such as no successes yet, doesn’t affect
the distribution of Y: it is still the same geometric distribution. A formal proof can
be given in Exercise 8. In Exercise 13, you can prove that the geometric distributions
are the only discrete distributions that have the memoryless property.
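The memoryless property can be confirmed numerically from the geometric p.f. and the tail probability $\Pr(X \ge k) = (1 - p)^k$ of Exercise 7. The Python sketch below is an illustration only; the helper names and the particular values of k and t are assumptions, and p = 0.01 is taken from the lottery example.

```python
import math


def geometric_pf(x: int, p: float) -> float:
    """p.f. of Eq. (5.5.3): probability of x failures before the first success."""
    return p * (1 - p) ** x if x >= 0 else 0.0


def geometric_tail(k: int, p: float) -> float:
    """Pr(X >= k) = (1 - p)^k, the result of Exercise 7."""
    return (1 - p) ** k


def conditional_pf(t: int, k: int, p: float) -> float:
    """Pr(X = k + t | X >= k), computed directly from the definition."""
    return geometric_pf(k + t, p) / geometric_tail(k, p)


p = 0.01  # probability of triples in the lottery example
pairs_checked = 0
for k in (0, 5, 120):
    for t in (0, 1, 10, 99):
        # The conditional p.f. should agree with the unconditional p.f. at t.
        assert math.isclose(conditional_pf(t, k, p), geometric_pf(t, p), rel_tol=1e-9)
        pairs_checked += 1

print("pairs checked:", pairs_checked)
print("expected further failures after any wait:", (1 - p) / p)
```

The last line reflects the point of the lottery discussion: the expected number of additional non-triple days is (1 − p)/p regardless of how many days have already passed.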


Triples in the Lottery. In Example 5.5.4, after the first 120 non-triples, the process 
essentially starts over again and we still have to wait a geometric amount of time 
until the first triple. 

At the beginning of the experiment, the expected number of failures (non- 
triples) that will occur before the first success (triples) is (1 — p)/p, as given by 
Eq. (5.5.8). If it is known that failures were obtained on the first 120 trials, then the 
conditional expected total number of failures before the first success (given the 120 
failures on the first 120 trials) is simply 120 + (1 − p)/p. ◄


Extension of Definition of Negative Binomial Distribution


By using the definition of binomial coefficients given in Eq. (5.3.14), the function
f(x|r, p) can be regarded as the p.f. of a discrete distribution for each number r > 0
(not necessarily an integer) and each number p in the interval 0 < p < 1. In other
words, it can be verified that for r > 0 and 0 < p < 1,

$$\sum_{x=0}^{\infty} \binom{r + x - 1}{x} p^r (1 - p)^x = 1. \tag{5.5.10}$$

Summary 


If we observe a sequence of independent Bernoulli trials with success probability p, 
the number of failures until the rth success has the negative binomial distribution 
with parameters r and p. The special case of r = 1 is the geometric distribution with 
parameter p. The sum of independent negative binomial random variables with the
same second parameter p has a negative binomial distribution.


Exercises 


1. Consider a daily lottery as described in Example 5.5.4. 


a. Compute the probability that two particular days in 
a row will both have triples. 


b. Suppose that we observe triples on a particular day. 
Compute the conditional probability that we observe 
triples again the next day. 


2. Suppose that a sequence of independent tosses are 
made with a coin for which the probability of obtaining a 
head on each given toss is 1/30. 


a. What is the expected number of tails that will be 
obtained before five heads have been obtained? 


b. What is the variance of the number of tails that will 
be obtained before five heads have been obtained? 


3. Consider the sequence of coin tosses described in Ex- 
ercise 2. 


a. What is the expected number of tosses that will be 
required in order to obtain five heads? 


b. What is the variance of the number of tosses that will 
be required in order to obtain five heads? 


4. Suppose that two players A and B are trying to throw a 
basketball through a hoop. The probability that player A 
will succeed on any given throw is p, and he throws until 
he has succeeded r times. The probability that player B 
will succeed on any given throw is mp, where m is a given
integer (m = 2, 3,...) such that mp < 1, and she throws
until she has succeeded mr times. 


a. For which player is the expected number of throws 
smaller? 


b. For which player is the variance of the number of 
throws smaller? 


5. Suppose that the random variables $X_1, \ldots, X_k$ are independent
and that $X_i$ has the negative binomial distribution
with parameters $r_i$ and p ($i = 1, \ldots, k$). Prove that the
sum $X_1 + \cdots + X_k$ has the negative binomial distribution
with parameters $r = r_1 + \cdots + r_k$ and p.


6. Suppose that X has the geometric distribution with 
parameter p. Determine the probability that the value of 
X will be one of the even integers 0,2, 4,.... 


7. Suppose that X has the geometric distribution with
parameter p. Show that for every nonnegative integer k,
$\Pr(X \ge k) = (1 - p)^k$.


8. Prove Theorem 5.5.5. 


9. Suppose that an electronic system contains n components
that function independently of each other, and suppose
that these components are connected in series, as
defined in Exercise 5 of Sec. 3.7. Suppose also that each
component will function properly for a certain number
of periods and then will fail. Finally, suppose that for
$i = 1, \ldots, n$, the number of periods for which component i
will function properly is a discrete random variable having
a geometric distribution with parameter $p_i$. Determine the
distribution of the number of periods for which the system
will function properly.


10. Let $f(x \mid r, p)$ denote the p.f. of the negative binomial
distribution with parameters r and p, and let $f(x \mid \lambda)$ denote
the p.f. of the Poisson distribution with mean $\lambda$, as
defined by Eq. (5.4.2). Suppose $r \to \infty$ and $p \to 1$ in such
a way that the value of r(1 − p) remains constant and is
equal to $\lambda$ throughout the process. Show that for each fixed
nonnegative integer x,

$$f(x \mid r, p) \to f(x \mid \lambda).$$


11. Prove that the p.f. of the negative binomial distribution
can be written in the following alternative form:

$$f(x \mid r, p) = \begin{cases} \binom{-r}{x} p^r (-[1 - p])^x & \text{for } x = 0, 1, 2, \ldots, \\ 0 & \text{otherwise.} \end{cases}$$

Hint: Use Exercise 10 in Sec. 5.3.


12. Suppose that a machine produces parts that are defective
with probability P, but P is unknown. Suppose that
P has a continuous distribution with p.d.f.

$$f(p) = \begin{cases} 10(1 - p)^9 & \text{if } 0 < p < 1, \\ 0 & \text{otherwise.} \end{cases}$$

Conditional on P = p, assume that all parts are independent
of each other. Let X be the number of nondefective
parts observed until the first defective part. If we observe
X = 12, compute the conditional p.d.f. of P given X = 12.


13. Let F be the c.d.f. of a discrete distribution that has
the memoryless property stated in Theorem 5.5.5. Define
$\ell(x) = \log[1 - F(x - 1)]$ for x = 1, 2, ....

a. Show that, for all integers t, h > 0,

$$1 - F(h - 1) = \frac{1 - F(t + h - 1)}{1 - F(t - 1)}.$$

b. Prove that $\ell(t + h) = \ell(t) + \ell(h)$ for all integers t, h > 0.

c. Prove that $\ell(t) = t\ell(1)$ for every integer t > 0.

d. Prove that F must be the c.d.f. of a geometric distribution.


5.6 The Normal Distributions 


The most widely used model for random variables with continuous distributions is 
the family of normal distributions. These distributions are the first ones we shall see 
whose p.d.f.’s cannot be integrated in closed form, and hence tables of the c.d.f. or 
computer programs are necessary in order to compute probabilities and quantiles
for normal distributions.


Importance of the Normal Distributions 


Example 5.6.1 Automobile Emissions. Automobile engines emit a number of undesirable
pollutants when they burn gasoline. Lorenzen (1980) studied the amounts of various
pollutants emitted by 46 automobile engines. One class of pollutants consists of the
oxides of nitrogen. Figure 5.1 shows a histogram of the 46 amounts of oxides of
nitrogen (in grams per mile) that are reported by Lorenzen (1980). The bars in the
histogram have areas that equal the proportions of the sample of 46 measurements that
lie between the points on the horizontal axis where the sides of the bars stand. For
example, the fourth bar (which runs from 1.0 to 1.2 on the horizontal axis) has
area 0.870 × 0.2 = 0.174, which equals 8/46 because there are eight observations
between 1.0 and 1.2. When we want to make statements about probabilities related
to emissions, we will need a distribution with which to model emissions. The family of
normal distributions introduced in this section will prove to be valuable in examples
such as this. ◄


Figure 5.1 Histogram of emissions of oxides of nitrogen for Example 5.6.1 in grams
per mile over a common driving regimen. (Horizontal axis: oxides of nitrogen;
vertical axis: proportion.)


The family of normal distributions, which will be defined and discussed in this
section, is by far the single most important collection of probability distributions
in statistics. There are three main reasons for this preeminent position of these
distributions.

The first reason is directly related to the mathematical properties of the normal
distributions. We shall demonstrate in this section and in several later sections of this
book that if a random sample is taken from a normal distribution, then the distributions
of various important functions of the observations in the sample can be derived
explicitly and will themselves have simple forms. Therefore, it is a mathematical
convenience to be able to assume that the distribution from which a random sample is
drawn is a normal distribution.

The second reason is that many scientists have observed that the random variables
studied in various physical experiments often have distributions that are
approximately normal. For example, a normal distribution will usually be a close
approximation to the distribution of the heights or weights of individuals in a
homogeneous population of people, corn stalks, or mice, or to the distribution of the
tensile strength of pieces of steel produced by a certain process. Sometimes, a simple
transformation of the observed random variables has a normal distribution.

The third reason for the preeminence of the normal distributions is the central
limit theorem, which will be stated and proved in Sec. 6.3. If a large random sample is
taken from some distribution, then even though this distribution is not itself
approximately normal, a consequence of the central limit theorem is that many important
functions of the observations in the sample will have distributions which are
approximately normal. In particular, for a large random sample from any distribution that
has a finite variance, the distribution of the average of the random sample will be
approximately normal. We shall return to this topic in the next chapter.


Properties of Normal Distributions


Definition 5.6.1 Definition and p.d.f. A random variable X has the normal distribution
with mean $\mu$ and variance $\sigma^2$ ($-\infty < \mu < \infty$ and $\sigma > 0$) if X has a
continuous distribution with the following p.d.f.:

$$f(x \mid \mu, \sigma^2) = \frac{1}{(2\pi)^{1/2}\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] \quad \text{for } -\infty < x < \infty. \tag{5.6.1}$$

We should first verify that the function defined in Eq. (5.6.1) is a p.d.f. Shortly
thereafter, we shall verify that the mean and variance of the distribution with p.d.f.
(5.6.1) are indeed $\mu$ and $\sigma^2$, respectively.


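The function defined in Eq. (5.6.1) is a p.d.f.


Proof Clearly, the function is nonnegative. We must also show that

$$\int_{-\infty}^{\infty} f(x \mid \mu, \sigma^2)\,dx = 1. \tag{5.6.2}$$

If we let $y = (x - \mu)/\sigma$, then

$$\int_{-\infty}^{\infty} f(x \mid \mu, \sigma^2)\,dx = \int_{-\infty}^{\infty} \frac{1}{(2\pi)^{1/2}} \exp\left(-\frac{1}{2}y^2\right) dy.$$

We shall now let

$$I = \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}y^2\right) dy. \tag{5.6.3}$$

Then we must show that $I = (2\pi)^{1/2}$.
From Eq. (5.6.3), it follows that

$$I^2 = I \cdot I = \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}y^2\right) dy \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}z^2\right) dz = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp\left[-\frac{1}{2}(y^2 + z^2)\right] dy\,dz.$$

We shall now change the variables in this integral from y and z to the polar
coordinates r and $\theta$ by letting $y = r\cos\theta$ and $z = r\sin\theta$. Then, since
$y^2 + z^2 = r^2$,

$$I^2 = \int_{0}^{2\pi}\int_{0}^{\infty} \exp\left(-\frac{1}{2}r^2\right) r\,dr\,d\theta = 2\pi, \tag{5.6.4}$$

where the inner integral in (5.6.4) is performed by substituting $v = r^2/2$ with
$dv = r\,dr$, so the inner integral is

$$\int_{0}^{\infty} \exp(-v)\,dv = 1,$$

and the outer integral is $2\pi$. Therefore, $I = (2\pi)^{1/2}$ and Eq. (5.6.2) has been
established. ∎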
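Equation (5.6.2) can also be checked numerically, which is a useful sanity test when coding the p.d.f. by hand. The Python sketch below is illustrative only: the function names, the midpoint rule, the integration range of ten standard deviations on each side of the mean, and the step count are all assumptions made for this example. The parameter values are those quoted later for the emissions data.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """The p.d.f. of Eq. (5.6.1)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)


def integrate_pdf(mu: float, sigma: float, width: float = 10.0, steps: int = 20000) -> float:
    """Midpoint-rule integral of the p.d.f. over [mu - width*sigma, mu + width*sigma].

    The probability outside this range is far below floating-point precision,
    so the result should be very close to 1.
    """
    a = mu - width * sigma
    h = 2 * width * sigma / steps
    return h * sum(normal_pdf(a + (i + 0.5) * h, mu, sigma) for i in range(steps))


area = integrate_pdf(mu=1.329, sigma=0.4844)
print("numerical integral of the p.d.f.:", area)
```

Changing `mu` and `sigma` should leave the result essentially unchanged, in line with the change of variables $y = (x - \mu)/\sigma$ used in the proof.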


Example 5.6.2 Automobile Emissions. Consider the automobile engines described in
Example 5.6.1. Figure 5.2 shows the histogram from Fig. 5.1 together with the normal
p.d.f. having mean and variance chosen to match the observed data. Although the p.d.f.
does not exactly match the shape of the histogram, it does correspond remarkably well. ◄


We could verify directly, using integration by parts, that the mean and variance
of the distribution with p.d.f. given by Eq. (5.6.1) are, respectively, $\mu$ and $\sigma^2$. (See
Exercise 26.) However, we need the moment generating function anyway, and then
we can just take two derivatives of the m.g.f. to find the first two moments.


Figure 5.2 Histogram of emissions of oxides of nitrogen for Example 5.6.2 together
with a matching normal p.d.f. (Horizontal axis: oxides of nitrogen; vertical axis:
proportion.)


Moment Generating Function. The m.g.f. of the distribution with p.d.f. given by
Eq. (5.6.1) is

$$\psi(t) = \exp\left(\mu t + \frac{1}{2}\sigma^2 t^2\right) \quad \text{for } -\infty < t < \infty. \tag{5.6.5}$$


Proof By the definition of an m.g.f.,

$$\psi(t) = E(e^{tX}) = \int_{-\infty}^{\infty} \frac{1}{(2\pi)^{1/2}\sigma} \exp\left[tx - \frac{(x - \mu)^2}{2\sigma^2}\right] dx.$$

By completing the square inside the brackets (see Exercise 24), we obtain the relation

$$tx - \frac{(x - \mu)^2}{2\sigma^2} = \mu t + \frac{1}{2}\sigma^2 t^2 - \frac{[x - (\mu + \sigma^2 t)]^2}{2\sigma^2}.$$

Therefore,

$$\psi(t) = C \exp\left(\mu t + \frac{1}{2}\sigma^2 t^2\right),$$

where

$$C = \int_{-\infty}^{\infty} \frac{1}{(2\pi)^{1/2}\sigma} \exp\left\{-\frac{[x - (\mu + \sigma^2 t)]^2}{2\sigma^2}\right\} dx.$$

If we now replace $\mu$ with $\mu + \sigma^2 t$ in Eq. (5.6.1), it follows from Eq. (5.6.2) that C = 1.
Hence, the m.g.f. of the normal distribution is given by Eq. (5.6.5). ∎


We are now ready to verify the mean and variance. 


Mean and Variance. The mean and variance of the distribution with p.d.f. given by
Eq. (5.6.1) are $\mu$ and $\sigma^2$, respectively.


Proof The first two derivatives of the m.g.f. in Eq. (5.6.5) are

$$\psi'(t) = (\mu + \sigma^2 t) \exp\left(\mu t + \frac{1}{2}\sigma^2 t^2\right),$$

$$\psi''(t) = \left([\mu + \sigma^2 t]^2 + \sigma^2\right) \exp\left(\mu t + \frac{1}{2}\sigma^2 t^2\right).$$

Plugging t = 0 into each of these derivatives yields

$$E(X) = \psi'(0) = \mu \quad \text{and} \quad \operatorname{Var}(X) = \psi''(0) - [\psi'(0)]^2 = \sigma^2.$$
∎


Since the m.g.f. $\psi(t)$ is finite for all values of t, all the moments $E(X^k)$
(k = 1, 2, ...) will also be finite.

Figure 5.3 The p.d.f. of a normal distribution.

Stock Price Changes. A popular model for the change in the price of a stock over a
period of time of length u is to say that the price after time u is $S_u = S_0 e^{Z_u}$,
where $Z_u$ has the normal distribution with mean $\mu u$ and variance $\sigma^2 u$. In this
formula, $S_0$ is the present price of the stock, and $\sigma$ is called the volatility of the
stock price. The expected value of $S_u$ can be computed from the m.g.f. $\psi$ of $Z_u$:

$$E(S_u) = S_0 E(e^{Z_u}) = S_0 \psi(1) = S_0 e^{\mu u + \sigma^2 u/2}.$$
◄
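The step $E(e^{Z}) = \psi(1)$ can be illustrated numerically. The Python sketch below is a hedged example rather than anything from the text: it works with a generic normal variable Z having mean `m` and variance `v`, compares the closed form $\exp(m + v/2)$ from Eq. (5.6.5) with a midpoint-rule integral of $e^z$ against the normal p.d.f., and then plugs in hypothetical stock parameters. Reading the model as m = μu and v = σ²u, and all numeric values, are assumptions made for illustration.

```python
import math


def mgf_at_one(m: float, v: float) -> float:
    """Closed form psi(1) = exp(m + v/2) for Z normal with mean m, variance v."""
    return math.exp(m + 0.5 * v)


def expected_exp_numeric(m: float, v: float, width: float = 12.0, steps: int = 40000) -> float:
    """Midpoint-rule approximation of E(e^Z) over m +/- width standard deviations."""
    s = math.sqrt(v)
    a = m - width * s
    h = 2 * width * s / steps
    total = 0.0
    for i in range(steps):
        z = a + (i + 0.5) * h
        pdf = math.exp(-0.5 * ((z - m) / s) ** 2) / (math.sqrt(2 * math.pi) * s)
        total += math.exp(z) * pdf
    return h * total


# Hypothetical values: present price, drift, volatility, and time horizon.
s0, mu, sigma, u = 100.0, 0.05, 0.2, 0.5
m, v = mu * u, sigma**2 * u  # assumed reading: mean mu*u, variance sigma^2*u

closed = s0 * mgf_at_one(m, v)
numeric = s0 * expected_exp_numeric(m, v)
print("closed form E(S_u):", closed)
print("numerical E(S_u):  ", numeric)
```

Note that the expected price exceeds $S_0 e^{m}$ because of the extra $v/2$ term in the exponent.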


The Shapes of Normal Distributions It can be seen from Eq. (5.6.1) that the p.d.f.
$f(x \mid \mu, \sigma^2)$ of the normal distribution with mean $\mu$ and variance $\sigma^2$ is symmetric
with respect to the point $x = \mu$. Therefore, $\mu$ is both the mean and the median
of the distribution. Furthermore, $\mu$ is also the mode of the distribution. In other
words, the p.d.f. $f(x \mid \mu, \sigma^2)$ attains its maximum value at the point $x = \mu$. Finally, by
differentiating $f(x \mid \mu, \sigma^2)$ twice, it can be found that there are points of inflection at
$x = \mu + \sigma$ and at $x = \mu - \sigma$.

The p.d.f. $f(x \mid \mu, \sigma^2)$ is sketched in Fig. 5.3. It is seen that the curve is
“bell-shaped.” However, it is not necessarily true that every arbitrary bell-shaped p.d.f.
can be approximated by the p.d.f. of a normal distribution. For example, the p.d.f. of
a Cauchy distribution, as sketched in Fig. 4.3, is a symmetric bell-shaped curve which
apparently resembles the p.d.f. sketched in Fig. 5.3. However, since no moment of
the Cauchy distribution (not even the mean) exists, the tails of the Cauchy p.d.f.
must be quite different from the tails of the normal p.d.f.


Linear Transformations We shall now show that if a random variable X has a normal
distribution, then every linear function of X will also have a normal distribution.


Theorem 5.6.4 If X has the normal distribution with mean $\mu$ and variance $\sigma^2$ and if
Y = aX + b, where a and b are given constants and $a \ne 0$, then Y has the normal
distribution with mean $a\mu + b$ and variance $a^2\sigma^2$.


Proof The m.g.f. $\psi$ of X is given by Eq. (5.6.5). If $\psi_Y$ denotes the m.g.f. of Y, then

$$\psi_Y(t) = e^{bt}\psi(at) = \exp\left[(a\mu + b)t + \frac{1}{2}a^2\sigma^2 t^2\right] \quad \text{for } -\infty < t < \infty.$$

By comparing this expression for $\psi_Y$ with the m.g.f. of a normal distribution given in
Eq. (5.6.5), we see that $\psi_Y$ is the m.g.f. of the normal distribution with mean $a\mu + b$
and variance $a^2\sigma^2$. Hence, Y must have this normal distribution. ∎


The Standard Normal Distribution 


Standard Normal Distribution. The normal distribution with mean 0 and variance 1 is
called the standard normal distribution. The p.d.f. of the standard normal distribution
is usually denoted by the symbol $\phi$, and the c.d.f. is denoted by the symbol $\Phi$. Thus,

$$\phi(x) = f(x \mid 0, 1) = \frac{1}{(2\pi)^{1/2}} \exp\left(-\frac{1}{2}x^2\right) \quad \text{for } -\infty < x < \infty \tag{5.6.6}$$

and

$$\Phi(x) = \int_{-\infty}^{x} \phi(u)\,du \quad \text{for } -\infty < x < \infty, \tag{5.6.7}$$

where the symbol u is used in Eq. (5.6.7) as a dummy variable of integration.


The c.d.f. $\Phi(x)$ cannot be expressed in closed form in terms of elementary
functions. Therefore, probabilities for the standard normal distribution or any other
normal distribution can be found only by numerical approximations or by using a
table of values of $\Phi(x)$ such as the one given at the end of this book. In that table, the
values of $\Phi(x)$ are given only for x > 0. Most computer packages that do statistical
analysis contain functions that compute the c.d.f. and the quantile function of the
standard normal distribution. Knowing the values of $\Phi(x)$ for x > 0 and $\Phi^{-1}(p)$ for
0.5 < p < 1 is sufficient for calculating the c.d.f. and the quantile function of any
normal distribution at any value, as the next two results show.


Consequences of Symmetry. For all x and all 0 < p < 1,

$$\Phi(-x) = 1 - \Phi(x) \quad \text{and} \quad \Phi^{-1}(p) = -\Phi^{-1}(1 - p). \tag{5.6.8}$$


Proof Since the p.d.f. of the standard normal distribution is symmetric with respect
to the point x = 0, it follows that $\Pr(X \le x) = \Pr(X \ge -x)$ for every number x
($-\infty < x < \infty$). Since $\Pr(X \le x) = \Phi(x)$ and $\Pr(X \ge -x) = 1 - \Phi(-x)$, we have the first
equation in Eq. (5.6.8). The second equation follows by letting $x = \Phi^{-1}(p)$ in the
first equation and then applying the function $\Phi^{-1}$ to both sides of the equation. ∎
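Both identities in Eq. (5.6.8) are easy to confirm with software. The sketch below uses `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later); the particular test points and variable names are arbitrary choices made for illustration.

```python
from statistics import NormalDist

std = NormalDist()  # standard normal: mean 0, standard deviation 1

# First identity: Phi(-x) = 1 - Phi(x).
cdf_gaps = [abs(std.cdf(-x) - (1 - std.cdf(x))) for x in (0.0, 0.5, 1.645, 3.0)]

# Second identity: Phi^{-1}(p) = -Phi^{-1}(1 - p).
quantile_gaps = [abs(std.inv_cdf(p) + std.inv_cdf(1 - p)) for p in (0.05, 0.25, 0.5, 0.9)]

print("largest gap in the c.d.f. identity:   ", max(cdf_gaps))
print("largest gap in the quantile identity: ", max(quantile_gaps))
```

Both gaps should be at the level of floating-point rounding error, which is why a table that covers only half of the range suffices.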


Converting Normal Distributions to Standard. Let X have the normal distribution with
mean $\mu$ and variance $\sigma^2$. Let F be the c.d.f. of X. Then $Z = (X - \mu)/\sigma$ has the
standard normal distribution, and, for all x and all 0 < p < 1,

$$F(x) = \Phi\left(\frac{x - \mu}{\sigma}\right), \tag{5.6.9}$$

$$F^{-1}(p) = \mu + \sigma\Phi^{-1}(p). \tag{5.6.10}$$


Proof It follows immediately from Theorem 5.6.4 that $Z = (X - \mu)/\sigma$ has the
standard normal distribution. Therefore,

$$F(x) = \Pr(X \le x) = \Pr\left(Z \le \frac{x - \mu}{\sigma}\right),$$

which establishes Eq. (5.6.9). For Eq. (5.6.10), let p = F(x) in Eq. (5.6.9) and then
solve for x in the resulting equation. ∎

Determining Probabilities for a Normal Distribution. Suppose that X has the normal
distribution with mean 5 and standard deviation 2. We shall determine the value of
Pr(1 < X < 8).

If we let Z = (X − 5)/2, then Z will have the standard normal distribution and

$$\Pr(1 < X < 8) = \Pr\left(\frac{1 - 5}{2} < \frac{X - 5}{2} < \frac{8 - 5}{2}\right) = \Pr(-2 < Z < 1.5).$$

Furthermore,

$$\begin{aligned} \Pr(-2 < Z < 1.5) &= \Pr(Z < 1.5) - \Pr(Z \le -2) \\ &= \Phi(1.5) - \Phi(-2) \\ &= \Phi(1.5) - [1 - \Phi(2)]. \end{aligned}$$

From the table at the end of this book, it is found that $\Phi(1.5) = 0.9332$ and
$\Phi(2) = 0.9773$. Therefore,

$$\Pr(1 < X < 8) = 0.9105.$$
◄
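The same probability can be computed directly with software instead of a table. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 and later); the variable names are illustrative. It evaluates the probability once from the distribution of X and once after standardizing, as in Eq. (5.6.9).

```python
from statistics import NormalDist

# X has mean 5 and standard deviation 2, as in the example.
x_dist = NormalDist(mu=5, sigma=2)
prob_direct = x_dist.cdf(8) - x_dist.cdf(1)

# Equivalent calculation after converting to the standard normal distribution.
std = NormalDist()
prob_standardized = std.cdf((8 - 5) / 2) - std.cdf((1 - 5) / 2)

print("Pr(1 < X < 8), direct:      ", round(prob_direct, 4))
print("Pr(1 < X < 8), standardized:", round(prob_standardized, 4))
```

The computed value is about 0.9104. It differs from the table-based 0.9105 in the last digit only because the table entries are themselves rounded to four decimal places.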


Quantiles of Normal Distributions. Suppose that the engineers who collected the
automobile emissions data in Example 5.6.1 are interested in finding out whether
most engines are serious polluters. For example, they could compute the 0.05 quantile
of the distribution of emissions and declare that 95 percent of the engines of the
type tested exceed this quantile. Let X be the average grams of oxides of nitrogen
per mile for a typical engine. Then the engineers modeled X as having a normal
distribution. The normal distribution plotted in Fig. 5.2 has mean 1.329 and standard
deviation 0.4844. The c.d.f. of X would then be $F(x) = \Phi([x - 1.329]/0.4844)$, and
the quantile function would be $F^{-1}(p) = 1.329 + 0.4844\,\Phi^{-1}(p)$, where $\Phi^{-1}$ is the
quantile function of the standard normal distribution, which can be evaluated using
a computer or from tables. To find $\Phi^{-1}(p)$ from the table of $\Phi$, find the closest value
to p in the $\Phi(x)$ column and read the inverse from the x column. Since the table only
has values of p > 0.5, we use Eq. (5.6.8) to conclude that $\Phi^{-1}(0.05) = -\Phi^{-1}(0.95)$. So,
look up 0.95 in the $\Phi(x)$ column (halfway between 0.9495 and 0.9505) to find x = 1.645
(halfway between 1.64 and 1.65) and conclude that $\Phi^{-1}(0.05) = -1.645$. The 0.05
quantile of X is then 1.329 + 0.4844 × (−1.645) = 0.5322. ◄
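The table lookup above can be reproduced with a quantile function from software. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 and later); the variable names are illustrative, and the mean and standard deviation are the values quoted in the example.

```python
from statistics import NormalDist

# Normal model for emissions, using the mean and standard deviation from the example.
emissions = NormalDist(mu=1.329, sigma=0.4844)

# 0.05 quantile computed directly from the distribution of X.
q05_direct = emissions.inv_cdf(0.05)

# The same quantile via Eq. (5.6.10): mu + sigma * Phi^{-1}(p).
z05 = NormalDist().inv_cdf(0.05)
q05_formula = 1.329 + 0.4844 * z05

print("standard normal 0.05 quantile:", round(z05, 4))
print("0.05 quantile of X (direct):  ", round(q05_direct, 4))
print("0.05 quantile of X (formula): ", round(q05_formula, 4))
```

Both routes give approximately 0.5322, in agreement with the value obtained from the table.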


Comparisons of Normal Distributions 


The p.d.f’s of three normal distributions are sketched in Fig. 5.4 for a fixed value of 
wand three different values of o (o = 1/2, 1, and 2). It can be seen from this figure 
that the p.d-f. of a normal distribution with a small value of o has a high peak and 
is very concentrated around the mean jz, whereas the p.d.f. of a normal distribution 
with a larger value of o is relatively flat and is spread out more widely over the real 
line. 

An important fact is that every normal distribution contains the same total 
amount of probability within one standard deviation of its mean, the same amount 
within two standard deviations of its mean, and the same amount within any other 
fixed number of standard deviations of its mean. In general, if X has the normal dis- 
tribution with mean jz and variance o”, and if Z has the standard normal distribution, 
then for k > 0, 


py = Pr(iX — wl ko) =Pr(|Z| <b). 


In Table 5.2, the values of this probability p, are given for various values of k. 
These probabilities can be computed from a table of ® or using computer programs. 
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Figure 5.4 The normal p.d.f. A 
for 4 =O ando =1/2, 1, 2. 


Table 5.2 Probabilities that normal 
random variables are within 
k standard deviations of 
their means 


k Pk 


0.6826 
0.9544 
0.9974 
0.99994 
1-6~x 1077 
10 1-2x 10-33 


na FB WN FR 


Although the p.d-f. of a normal distribution is positive over the entire real line, it can 
be seen from this table that the total amount of probability outside an interval of 
four standard deviations on each side of the mean is only 0.00006. 
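
The probabilities $p_k$ can also be computed directly rather than read from a table. The following is a minimal sketch, assuming only the Python standard library and using the identity $\Pr(|Z| \leq k) = \operatorname{erf}(k/2^{1/2})$; the function names are illustrative and not part of the text. The results agree with Table 5.2 up to the rounding of the four-digit table of $\Phi$.

```python
import math


def prob_within_k_sd(k: float) -> float:
    """Pr(|Z| <= k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2.0))


def prob_outside_k_sd(k: float) -> float:
    """Pr(|Z| > k); erfc avoids cancellation when the tail is tiny."""
    return math.erfc(k / math.sqrt(2.0))


if __name__ == "__main__":
    for k in (1, 2, 3, 4, 5, 10):
        print(k, prob_within_k_sd(k), prob_outside_k_sd(k))
```

Using the complementary error function for the tail matters for large $k$: subtracting $p_{10}$ from 1 in floating point would give exactly 0.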


Linear Combinations of Normally Distributed Variables 


In the next theorem and corollary, we shall prove the following important result: 
Every linear combination of random variables that are independent and normally 
distributed will also have a normal distribution. 


Theorem 5.6.7 If the random variables $X_1, \ldots, X_k$ are independent and if $X_i$ has the normal
distribution with mean $\mu_i$ and variance $\sigma_i^2$ ($i = 1, \ldots, k$), then the sum $X_1 + \cdots + X_k$ has
the normal distribution with mean $\mu_1 + \cdots + \mu_k$ and variance $\sigma_1^2 + \cdots + \sigma_k^2$.

Proof Let $\psi_i(t)$ denote the m.g.f. of $X_i$ for $i = 1, \ldots, k$, and let $\psi(t)$ denote the m.g.f.
of $X_1 + \cdots + X_k$. Since the variables $X_1, \ldots, X_k$ are independent, then

$$\psi(t) = \prod_{i=1}^{k} \psi_i(t) = \prod_{i=1}^{k} \exp\left(\mu_i t + \frac{1}{2}\sigma_i^2 t^2\right) = \exp\left[\left(\sum_{i=1}^{k} \mu_i\right) t + \frac{1}{2}\left(\sum_{i=1}^{k} \sigma_i^2\right) t^2\right] \quad \text{for } -\infty < t < \infty.$$

From Eq. (5.6.5), the m.g.f. $\psi(t)$ can be identified as the m.g.f. of the normal
distribution for which the mean is $\sum_{i=1}^{k} \mu_i$ and the variance is $\sum_{i=1}^{k} \sigma_i^2$. Hence, the
distribution of $X_1 + \cdots + X_k$ must be as stated in the theorem. ∎


The following corollary is now obtained by combining Theorems 5.6.4 and 5.6.7. 


Corollary 5.6.1 If the random variables $X_1, \ldots, X_k$ are independent, if $X_i$ has the normal distribution
with mean $\mu_i$ and variance $\sigma_i^2$ ($i = 1, \ldots, k$), and if $a_1, \ldots, a_k$ and $b$ are constants
for which at least one of the values $a_1, \ldots, a_k$ is different from 0, then the variable
$a_1 X_1 + \cdots + a_k X_k + b$ has the normal distribution with mean $a_1\mu_1 + \cdots + a_k\mu_k + b$
and variance $a_1^2\sigma_1^2 + \cdots + a_k^2\sigma_k^2$.

Example 5.6.6 Heights of Men and Women. Suppose that the heights, in inches, of the women
in a certain population follow the normal distribution with mean 65 and standard 
deviation 1, and that the heights of the men follow the normal distribution with mean 
68 and standard deviation 3. Suppose also that one woman is selected at random and, 
independently, one man is selected at random. We shall determine the probability 
that the woman will be taller than the man. 

Let $W$ denote the height of the selected woman, and let $M$ denote the height of
the selected man. Then the difference $W - M$ has the normal distribution with mean
$65 - 68 = -3$ and variance $1^2 + 3^2 = 10$. Therefore, if we let

$$Z = \frac{1}{10^{1/2}}(W - M + 3),$$

then $Z$ has the standard normal distribution. It follows that

$$\Pr(W > M) = \Pr(W - M > 0) = \Pr\left(Z > \frac{3}{10^{1/2}}\right) = \Pr(Z > 0.949) = 1 - \Phi(0.949) = 0.171.$$

Thus, the probability that the woman will be taller than the man is 0.171. ◄
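
The calculation in this example can be checked numerically. The sketch below assumes Python 3.8 or later, whose standard library provides `statistics.NormalDist`; the variable names are illustrative only.

```python
import math
from statistics import NormalDist

# W - M is normal with mean 65 - 68 and variance 1^2 + 3^2 (Corollary 5.6.1).
mean_diff = 65 - 68
var_diff = 1**2 + 3**2
difference = NormalDist(mu=mean_diff, sigma=math.sqrt(var_diff))

# Pr(W > M) = Pr(W - M > 0) = 1 - F(0), where F is the c.d.f. of W - M.
prob_woman_taller = 1.0 - difference.cdf(0.0)

if __name__ == "__main__":
    print(round(prob_woman_taller, 3))
```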


Averages of random samples of normal random variables figure prominently in 
many statistical calculations. To fix notation, we start with a general definition.


Sample Mean. Let $X_1, \ldots, X_n$ be random variables. The average of these $n$ random
variables, $\frac{1}{n}\sum_{i=1}^{n} X_i$, is called their sample mean and is commonly denoted $\overline{X}_n$.


The following simple corollary to Corollary 5.6.1 gives the distribution of the 
sample mean of a random sample of normal random variables. 

Corollary 5.6.2 Suppose that the random variables $X_1, \ldots, X_n$ form a random sample from the
normal distribution with mean $\mu$ and variance $\sigma^2$, and let $\overline{X}_n$ denote their sample
mean. Then $\overline{X}_n$ has the normal distribution with mean $\mu$ and variance $\sigma^2/n$.

Proof Since $\overline{X}_n = \sum_{i=1}^{n} (1/n) X_i$, it follows from Corollary 5.6.1 that the distribution
of $\overline{X}_n$ is normal with mean $\sum_{i=1}^{n} (1/n)\mu = \mu$ and variance $\sum_{i=1}^{n} (1/n)^2\sigma^2 = \sigma^2/n$. ∎


Example 5.6.7 Determining a Sample Size. Suppose that a random sample of size $n$ is to be taken
from the normal distribution with mean $\mu$ and variance 9. (The heights of men
in Example 5.6.6 have such a distribution with $\mu = 68$.) We shall determine the
minimum value of $n$ for which

$$\Pr(|\overline{X}_n - \mu| < 1) \geq 0.95.$$

It is known from Corollary 5.6.2 that the sample mean $\overline{X}_n$ will have the normal
distribution for which the mean is $\mu$ and the standard deviation is $3/n^{1/2}$. Therefore,
if we let

$$Z = \frac{n^{1/2}}{3}(\overline{X}_n - \mu),$$

then $Z$ will have the standard normal distribution. In this example, $n$ must be chosen
so that

$$\Pr(|\overline{X}_n - \mu| < 1) = \Pr\left(|Z| < \frac{n^{1/2}}{3}\right) \geq 0.95. \qquad (5.6.11)$$


For each positive number $x$, it will be true that $\Pr(|Z| < x) \geq 0.95$ if and only if
$1 - \Phi(x) = \Pr(Z > x) \leq 0.025$. From the table of the standard normal distribution at
the end of this book, it is found that $1 - \Phi(x) \leq 0.025$ if and only if $x \geq 1.96$. Therefore,
the inequality in relation (5.6.11) will be satisfied if and only if

$$\frac{n^{1/2}}{3} \geq 1.96,$$

that is, if and only if $n \geq (3 \times 1.96)^2 = 34.6$. Since the smallest permissible value of $n$
is 34.6, the sample size must be at least 35
in order that the specified relation will be satisfied. ◄
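
The same reasoning applies to any standard deviation, tolerance, and probability. The following sketch, assuming Python 3.8 or later, generalizes the calculation; the function `min_sample_size` and its parameter names are hypothetical and introduced here only for illustration.

```python
import math
from statistics import NormalDist


def min_sample_size(sigma: float, tolerance: float, probability: float) -> int:
    """Smallest n with Pr(|sample mean - mu| < tolerance) >= probability.

    Solves n**0.5 * tolerance / sigma >= z, where z is the standard
    normal quantile that leaves (1 - probability) / 2 in each tail.
    """
    z = NormalDist().inv_cdf(0.5 + probability / 2.0)
    return math.ceil((z * sigma / tolerance) ** 2)


if __name__ == "__main__":
    # The setting of the example: sigma = 3, tolerance 1, probability 0.95.
    print(min_sample_size(3.0, 1.0, 0.95))
```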


Example 5.6.8 Interval for Mean. Consider a population with a normal distribution such as the
heights of men in Example 5.6.6. Suppose that we are not willing to specify the
precise distribution as we did in that example, but rather only that the standard
deviation is 3, leaving the mean $\mu$ unspecified. If we sample a number of men from
this population, we could try to use their sampled heights to give us some idea what $\mu$
equals. A popular form of statistical inference that will be discussed in Sec. 8.5 finds
an interval that has a specified probability of containing $\mu$. To be specific, suppose
that we observe a random sample of size $n$ from the normal distribution with mean $\mu$
and standard deviation 3. Then, $\overline{X}_n$ has the normal distribution with mean $\mu$ and
standard deviation $3/n^{1/2}$ as in Example 5.6.7. Similarly, we can define

$$Z = \frac{n^{1/2}}{3}(\overline{X}_n - \mu),$$

which then has the standard normal distribution. Hence,

$$0.95 = \Pr(|Z| < 1.96) = \Pr\left(|\overline{X}_n - \mu| < 1.96\,\frac{3}{n^{1/2}}\right). \qquad (5.6.12)$$

Figure 5.5 Histogram of lifetimes of ball bearings and fitted lognormal p.d.f. for Example 5.6.9.


It is easy to verify that

$$|\overline{X}_n - \mu| < 1.96\,\frac{3}{n^{1/2}} \quad \text{if and only if} \quad \overline{X}_n - 1.96\,\frac{3}{n^{1/2}} < \mu < \overline{X}_n + 1.96\,\frac{3}{n^{1/2}}. \qquad (5.6.13)$$

The two inequalities in Eq. (5.6.13) hold if and only if the interval

$$\left(\overline{X}_n - 1.96\,\frac{3}{n^{1/2}},\; \overline{X}_n + 1.96\,\frac{3}{n^{1/2}}\right) \qquad (5.6.14)$$

contains the value of $\mu$. It follows from Eq. (5.6.12) that the probability is 0.95 that
the interval in (5.6.14) contains $\mu$. Now, suppose that the sample size is $n = 36$. Then
the half-width of the interval (5.6.14) is $1.96 \times 3/36^{1/2} = 0.98$. We will not know the
endpoints of the interval until after we observe $\overline{X}_n$. However, we know now that the
interval $(\overline{X}_n - 0.98, \overline{X}_n + 0.98)$ has probability 0.95 of containing $\mu$. ◄
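
Once a sample mean is observed, the endpoints of the interval (5.6.14) follow by arithmetic. The sketch below assumes Python 3.8 or later; the function name and the observed sample mean of 68.2 are hypothetical values chosen only to illustrate the computation.

```python
import math
from statistics import NormalDist


def mean_interval(sample_mean: float, sigma: float, n: int,
                  probability: float = 0.95) -> tuple:
    """Interval of the form (5.6.14) for a known standard deviation sigma."""
    z = NormalDist().inv_cdf(0.5 + probability / 2.0)  # about 1.96 for 0.95
    half_width = z * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Hypothetical observed sample mean; sigma = 3 and n = 36 as in the example.
low, high = mean_interval(sample_mean=68.2, sigma=3.0, n=36)

if __name__ == "__main__":
    print(round(low, 2), round(high, 2))
```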


The Lognormal Distributions 


It is very common to use normal distributions to model logarithms of random vari- 
ables. For this reason, a name is given to the distribution of the original random 
variables before transforming. 


Lognormal Distribution. If $\log(X)$ has the normal distribution with mean $\mu$ and variance
$\sigma^2$, we say that $X$ has the lognormal distribution with parameters $\mu$ and $\sigma^2$.


Example 5.6.9 Failure Times of Ball Bearings. Products that are subject to wear and tear are
generally tested for endurance in order to estimate their useful lifetimes. Lawless (1982,
example 5.2.2) describes data taken from Lieblein and Zelen (1956), which are mea- 
surements of the numbers of millions of revolutions before failure for 23 ball bearings. 
The lognormal distribution is one popular model for times until failure. Figure 5.5 
shows a histogram of the 23 lifetimes together with a lognormal p.d.f. with parame- 
ters chosen to match the observed data. The bars of the histogram in Fig. 5.5 have 
areas that equal the proportions of the sample that lie between the points on the 
horizontal axis where the sides of the bars stand. Suppose that the engineers are in- 
terested in knowing how long to wait until there is a 90 percent chance that a ball 
bearing will have failed. Then they want the 0.9 quantile of the distribution of
lifetimes. Let $X$ be the time to failure of a ball bearing. The lognormal distribution of
$X$ plotted in Fig. 5.5 has parameters 4.15 and $0.5334^2$. The c.d.f. of $X$ would then be
$F(x) = \Phi([\log(x) - 4.15]/0.5334)$, and the quantile function would be

$$F^{-1}(p) = e^{4.15 + 0.5334\,\Phi^{-1}(p)},$$

where $\Phi^{-1}$ is the quantile function of the standard normal distribution. With $p = 0.9$,
we get $\Phi^{-1}(0.9) = 1.28$ and $F^{-1}(0.9) = 125.6$. ◄
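
The c.d.f. and quantile function of a lognormal distribution are simple transformations of their standard normal counterparts, as this example shows. A minimal sketch, assuming Python 3.8 or later (the function names are illustrative):

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def lognormal_cdf(x: float, mu: float, sigma: float) -> float:
    """F(x) = Phi([log(x) - mu] / sigma) for x > 0."""
    return STANDARD_NORMAL.cdf((math.log(x) - mu) / sigma)


def lognormal_quantile(p: float, mu: float, sigma: float) -> float:
    """F^{-1}(p) = exp(mu + sigma * Phi^{-1}(p)) for 0 < p < 1."""
    return math.exp(mu + sigma * STANDARD_NORMAL.inv_cdf(p))


# Parameters of the fitted distribution in the ball bearing example.
q90 = lognormal_quantile(0.9, mu=4.15, sigma=0.5334)

if __name__ == "__main__":
    print(round(q90, 1))
```

The result differs slightly in the last digit from a hand calculation because the code does not round $\Phi^{-1}(0.9)$ to 1.28.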


The moments of a lognormal random variable are easy to compute based on the
m.g.f. of a normal distribution. If $Y = \log(X)$ has the normal distribution with mean
$\mu$ and variance $\sigma^2$, then the m.g.f. of $Y$ is $\psi(t) = \exp(\mu t + 0.5\sigma^2 t^2)$. However, the
definition of $\psi$ is $\psi(t) = E(e^{tY})$. Since $Y = \log(X)$, we have

$$\psi(t) = E(e^{tY}) = E(e^{t\log(X)}) = E(X^t).$$

It follows that $E(X^t) = \psi(t)$ for all real $t$. In particular, the mean and variance of $X$
are

$$E(X) = \psi(1) = \exp(\mu + 0.5\sigma^2), \qquad (5.6.15)$$
$$\operatorname{Var}(X) = \psi(2) - \psi(1)^2 = \exp(2\mu + \sigma^2)[\exp(\sigma^2) - 1].$$
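
The relation $E(X^t) = \psi(t)$ translates directly into a computation. The following sketch, with illustrative function names that are not part of the text, evaluates the mean and variance from the normal m.g.f. and compares the variance with the closed form above.

```python
import math


def normal_mgf(t: float, mu: float, sigma2: float) -> float:
    """m.g.f. of the normal distribution with mean mu and variance sigma2."""
    return math.exp(mu * t + 0.5 * sigma2 * t * t)


def lognormal_mean_var(mu: float, sigma2: float) -> tuple:
    """Mean and variance of X when log(X) is normal(mu, sigma2)."""
    mean = normal_mgf(1.0, mu, sigma2)            # E(X) = psi(1)
    second_moment = normal_mgf(2.0, mu, sigma2)   # E(X^2) = psi(2)
    return mean, second_moment - mean**2


if __name__ == "__main__":
    mu, sigma2 = 0.0, 1.0
    mean, var = lognormal_mean_var(mu, sigma2)
    closed_form = math.exp(2 * mu + sigma2) * (math.exp(sigma2) - 1.0)
    print(mean, var, closed_form)
```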


Example 5.6.10 Stock and Option Prices. Consider a stock like the one in Example 5.6.3 whose current
price is $S_0$. Suppose that the price at $u$ time units in the future is $S_u = S_0 e^{Z_u}$, where
$Z_u$ has the normal distribution with mean $\mu u$ and variance $\sigma^2 u$. Note that $S_0 e^{Z_u} =
e^{Z_u + \log(S_0)}$ and $Z_u + \log(S_0)$ has the normal distribution with mean $\mu u + \log(S_0)$ and
variance $\sigma^2 u$. So $S_u$ has the lognormal distribution with parameters $\mu u + \log(S_0)$ and
$\sigma^2 u$.

Black and Scholes (1973) developed a pricing scheme for options on stocks whose
prices follow a lognormal distribution. For the remainder of this example, we shall
consider a single time $u$ and write the stock price as $S_u = S_0 e^{\mu u + \sigma u^{1/2} Z}$, where $Z$ has
the standard normal distribution. Suppose that we need to price the option to buy
one share of the above stock for the price $q$ at a particular time $u$ in the future. As
in Example 4.1.14 on page 214, we shall use risk-neutral pricing. That is, we force
the present value of $E(S_u)$ to equal $S_0$. If $u$ is measured in years and the risk-free
interest rate is $r$ per year, then the present value of $E(S_u)$ is $e^{-ru} E(S_u)$. (This assumes
that compounding of interest is done continuously instead of just once as it was in
Example 4.1.14. The effect of continuous compounding is examined in Exercise 25.)
But $E(S_u) = S_0 e^{\mu u + \sigma^2 u/2}$. Setting $S_0$ equal to $e^{-ru} S_0 e^{\mu u + \sigma^2 u/2}$ yields $\mu = r - \sigma^2/2$
when doing risk-neutral pricing.

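Now we can determine a price for the specified option. The value of the option
at time $u$ will be $h(S_u)$, where

$$h(s) = \begin{cases} s - q & \text{if } s > q, \\ 0 & \text{otherwise.} \end{cases}$$

Set $\mu = r - \sigma^2/2$, and it is easy to see that $h(S_u) > 0$ if and only if

$$Z > \frac{\log\left(\frac{q}{S_0}\right) - (r - \sigma^2/2)u}{\sigma u^{1/2}}. \qquad (5.6.16)$$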

We shall refer to the constant on the right-hand side of Eq. (5.6.16) as $c$. The
risk-neutral price of the option is the present value of $E(h(S_u))$, which equals

$$e^{-ru} E[h(S_u)] = e^{-ru} \int_c^{\infty} \left[S_0 e^{[r - \sigma^2/2]u + \sigma u^{1/2} z} - q\right] \frac{1}{(2\pi)^{1/2}} e^{-z^2/2}\, dz. \qquad (5.6.17)$$

To compute the integral in Eq. (5.6.17), split the integrand into two parts at the $-q$.
The second integral is then just a constant times the integral of a normal p.d.f., namely,

$$-e^{-ru} q \int_c^{\infty} \frac{1}{(2\pi)^{1/2}} e^{-z^2/2}\, dz = -e^{-ru} q [1 - \Phi(c)].$$

The first integral in Eq. (5.6.17) is

$$e^{-\sigma^2 u/2} S_0 \int_c^{\infty} \frac{1}{(2\pi)^{1/2}} e^{-z^2/2 + \sigma u^{1/2} z}\, dz.$$

This can be converted into the integral of a normal p.d.f. times a constant by
completing the square (see Exercise 24). The result of completing the square is

$$e^{-\sigma^2 u/2} S_0 \int_c^{\infty} \frac{1}{(2\pi)^{1/2}} e^{-(z - \sigma u^{1/2})^2/2 + \sigma^2 u/2}\, dz = S_0 [1 - \Phi(c - \sigma u^{1/2})].$$

Finally, combine the two integrals into the option price, using the fact that $1 - \Phi(x) =
\Phi(-x)$:

$$S_0 \Phi(\sigma u^{1/2} - c) - q e^{-ru} \Phi(-c). \qquad (5.6.18)$$

This is the famous Black-Scholes formula for pricing options. As a simple
example, suppose that $q = S_0$, $r = 0.06$ (6 percent interest), $u = 1$ (one year wait), and
$\sigma = 0.1$. Then (5.6.18) says that the option price should be $0.0746 S_0$. If the distribution
of $S_u$ is different from the form used here, simulation techniques (see Chapter 12)
can be used to help price options. ◄
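
Formula (5.6.18) is straightforward to evaluate once $c$ has been computed from Eq. (5.6.16). The sketch below assumes Python 3.8 or later; the function name and argument names are illustrative, chosen to mirror the symbols $S_0$, $q$, $r$, $u$, and $\sigma$ of this example.

```python
import math
from statistics import NormalDist


def black_scholes_price(s0: float, q: float, r: float, u: float,
                        sigma: float) -> float:
    """Price (5.6.18) of the option to buy one share for q at time u."""
    phi = NormalDist().cdf
    root_u = math.sqrt(u)
    # The constant c on the right-hand side of Eq. (5.6.16).
    c = (math.log(q / s0) - (r - sigma**2 / 2.0) * u) / (sigma * root_u)
    return s0 * phi(sigma * root_u - c) - q * math.exp(-r * u) * phi(-c)


if __name__ == "__main__":
    # The numerical case in the text, with S_0 taken as 1 unit of currency.
    print(round(black_scholes_price(s0=1.0, q=1.0, r=0.06, u=1.0, sigma=0.1), 4))
```

Because the price is proportional to $S_0$ when $q = S_0$, taking $S_0 = 1$ gives the multiplier of $S_0$ directly.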


The p.d.f.'s of the lognormal distributions will be found in Exercise 17 of this
section. The c.d.f. of each lognormal distribution is easily constructed from the
standard normal c.d.f. $\Phi$. Let $X$ have the lognormal distribution with parameters
$\mu$ and $\sigma^2$. Then

$$\Pr(X \leq x) = \Pr(\log(X) \leq \log(x)) = \Phi\left(\frac{\log(x) - \mu}{\sigma}\right).$$


The results from earlier in this section about linear combinations of normal random 
variables translate into results about products of powers of lognormal random vari- 
ables. Results about sums of independent normal random variables translate into 
results about products of independent lognormal random variables. 


Summary 


We introduced the family of normal distributions. The parameters of each normal
distribution are its mean and variance. A linear combination of independent normal
random variables has the normal distribution with mean equal to the linear
combination of the means and variance determined by Corollary 4.3.1. In particular, if $X$
has the normal distribution with mean $\mu$ and variance $\sigma^2$, then $(X - \mu)/\sigma$ has the
standard normal distribution (mean 0 and variance 1). Probabilities and quantiles for
normal distributions can be obtained from tables or computer programs for standard
normal probabilities and quantiles. For example, if $X$ has the normal distribution with
mean $\mu$ and variance $\sigma^2$, then the c.d.f. of $X$ is $F(x) = \Phi([x - \mu]/\sigma)$ and the quantile
function of $X$ is $F^{-1}(p) = \mu + \Phi^{-1}(p)\sigma$, where $\Phi$ is the standard normal c.d.f.


Exercises 


1. Find the 0.5, 0.25, 0.75, 0.1, and 0.9 quantiles of the 
standard normal distribution. 


2. Suppose that X has the normal distribution for which 
the mean is 1 and the variance is 4. Find the value of each 
of the following probabilities: 


a. $\Pr(X < 3)$    b. $\Pr(X > 1.5)$

c. $\Pr(X = 1)$    d. $\Pr(2 < X < 5)$

e. $\Pr(X > 0)$    f. $\Pr(-1 < X < 0.5)$

g. $\Pr(|X| < 2)$    h. $\Pr(1 < -2X + 3 < 8)$


3. If the temperature in degrees Fahrenheit at a certain 
location is normally distributed with a mean of 68 degrees 
and a standard deviation of 4 degrees, what is the distri- 
bution of the temperature in degrees Celsius at the same 
location? 


4. Find the 0.25 and 0.75 quantiles of the Fahrenheit tem- 
perature at the location mentioned in Exercise 3. 


5. Let $X_1$, $X_2$, and $X_3$ be independent lifetimes of memory
chips. Suppose that each $X_i$ has the normal distribution
with mean 300 hours and standard deviation 10 hours. 
Compute the probability that at least one of the three 
chips lasts at least 290 hours. 


6. If the m.g.f. of a random variable $X$ is $\psi(t) = e^{t^2}$ for
$-\infty < t < \infty$, what is the distribution of $X$?


7. Suppose that the measured voltage in a certain electric 
circuit has the normal distribution with mean 120 and 
standard deviation 2. If three independent measurements 
of the voltage are made, what is the probability that all 
three measurements will lie between 116 and 118? 


8. Evaluate the integral $\int_0^{\infty} e^{-3x^2}\, dx$.


9. A straight rod is formed by connecting three sections 
A, B, and C, each of which is manufactured on a different 
machine. The length of section A, in inches, has the normal 
distribution with mean 20 and variance 0.04. The length of 
section B, in inches, has the normal distribution with mean 
14 and variance 0.01. The length of section C, in inches, has 
the normal distribution with mean 26 and variance 0.04. 
As indicated in Fig. 5.6, the three sections are joined so 
that there is an overlap of 2 inches at each connection. 
Suppose that the rod can be used in the construction of an 
airplane wing if its total length in inches is between 55.7 
and 56.3. What is the probability that the rod can be used?


Figure 5.6 Sections of the rod in Exercise 9.

10. If a random sample of 25 observations is taken from
the normal distribution with mean $\mu$ and standard deviation
2, what is the probability that the sample mean will
lie within one unit of $\mu$?


11. Suppose that a random sample of size $n$ is to be taken
from the normal distribution with mean $\mu$ and standard
deviation 2. Determine the smallest value of $n$ such that

$$\Pr(|\overline{X}_n - \mu| < 0.1) \geq 0.9.$$


12. 


a. Sketch the c.d.f. $\Phi$ of the standard normal distribution
from the values given in the table at the end of
this book.


b. From the sketch given in part (a) of this exercise, 
sketch the c.d.f. of the normal distribution for which 
the mean is —2 and the standard deviation is 3. 


13. Suppose that the diameters of the bolts in a large box 
follow a normal distribution with a mean of 2 centimeters 
and a standard deviation of 0.03 centimeter. Also, suppose 
that the diameters of the holes in the nuts in another large 
box follow the normal distribution with a mean of 2.02 
centimeters and a standard deviation of 0.04 centimeter. 
A bolt and a nut will fit together if the diameter of the 
hole in the nut is greater than the diameter of the bolt and 
the difference between these diameters is not greater than 
0.05 centimeter. If a bolt and a nut are selected at random, 
what is the probability that they will fit together? 


14. Suppose that on a certain examination in advanced 
mathematics, students from university A achieve scores 
that are normally distributed with a mean of 625 and a 
variance of 100, and students from university B achieve 
scores which are normally distributed with a mean of 600 
and a variance of 150. If two students from university A 
and three students from university B take this examina- 
tion, what is the probability that the average of the scores 
of the two students from university A will be greater than 
the average of the scores of the three students from univer- 
sity B? Hint: Determine the distribution of the difference 
between the two averages. 


15. Suppose that 10 percent of the people in a certain 
population have the eye disease glaucoma. For persons 
who have glaucoma, measurements of eye pressure X will 
be normally distributed with a mean of 25 and a variance 
of 1. For persons who do not have glaucoma, the pressure 
X will be normally distributed with a mean of 20 and a 
variance of 1. Suppose that a person is selected at random 
from the population and her eye pressure X is measured. 


a. Determine the conditional probability that the per- 
son has glaucoma given that X = x. 

b. For what values of x is the conditional probability in 
part (a) greater than 1/2? 

16. Suppose that the joint p.d.f. of two random variables
$X$ and $Y$ is

$$f(x, y) = \frac{1}{2\pi} e^{-(1/2)(x^2 + y^2)} \quad \text{for } -\infty < x < \infty \text{ and } -\infty < y < \infty.$$

Find $\Pr(-\sqrt{2} < X + Y < 2\sqrt{2})$.


17. Consider a random variable $X$ having the lognormal
distribution with parameters $\mu$ and $\sigma^2$. Determine the
p.d.f. of $X$.


18. Suppose that the random variables X and Y are inde- 
pendent and that each has the standard normal distribu- 
tion. Show that the quotient X/Y has the Cauchy distri- 
bution. 


19. Suppose that the measurement $X$ of pressure made by
a device in a particular system has the normal distribution
with mean $\mu$ and variance 1, where $\mu$ is the true pressure.
Suppose that the true pressure $\mu$ is unknown but has the
uniform distribution on the interval [5, 15]. If $X = 8$ is
observed, find the conditional p.d.f. of $\mu$ given $X = 8$.


20. Let X have the lognormal distribution with parame- 
ters 3 and 1.44. Find the probability that X < 6.05. 


21. Let X and Y be independent random variables such 
that log(X) has the normal distribution with mean 1.6 and 
variance 4.5 and log(Y) has the normal distribution with 
mean 3 and variance 6. Find the distribution of the product 
XY. 


22. Suppose that $X$ has the lognormal distribution with
parameters $\mu$ and $\sigma^2$. Find the distribution of $1/X$.


23. Suppose that $X$ has the lognormal distribution with
parameters 4.1 and 8. Find the distribution of $3X^{1/2}$.


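24. The method of completing the square is used several
times in this text. It is a useful method for combining
several quadratic and linear polynomials into a perfect
square plus a constant. Prove the following identity, which
is one general form of completing the square:

$$\sum_{i=1}^{n} a_i (x - b_i)^2 + cx = \left(\sum_{i=1}^{n} a_i\right) \left(x - \frac{\sum_{i=1}^{n} a_i b_i - c/2}{\sum_{i=1}^{n} a_i}\right)^2 + \sum_{i=1}^{n} a_i \left(b_i - \frac{\sum_{j=1}^{n} a_j b_j}{\sum_{j=1}^{n} a_j}\right)^2 + \left(\sum_{i=1}^{n} a_i\right)^{-1} \left[c \sum_{i=1}^{n} a_i b_i - \frac{c^2}{4}\right]$$

if $\sum_{i=1}^{n} a_i \neq 0$.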

25. In Example 5.6.10, we considered the effect of continuous
compounding of interest. Suppose that $S_0$ dollars
earn a rate of $r$ per year compounded continuously for $u$
years. Prove that the principal plus interest at the end of
this time equals $S_0 e^{ru}$. Hint: Suppose that interest is compounded
$n$ times at intervals of $u/n$ years each. At the end
of each of the $n$ intervals, the principal gets multiplied by
$1 + ru/n$. Take the limit of the result as $n \to \infty$.


26. Let $X$ have the normal distribution whose p.d.f. is
given by (5.6.6). Instead of using the m.g.f., derive the 
variance of X using integration by parts. 


5.7 The Gamma Distributions 


The family of gamma distributions is a popular model for random variables that 
are known to be positive. The family of exponential distributions is a subfamily of 
the gamma distributions. The times between successive occurrences in a Poisson 
process have an exponential distribution. The gamma function, related to the 
gamma distributions, is an extension of factorials from integers to all positive
numbers.


The Gamma Function 


Example 5.7.1 Mean and Variance of Lifetime of a Light Bulb. Suppose that we model the lifetime of
a light bulb as a continuous random variable with the following p.d.f.:

$$f(x) = \begin{cases} e^{-x} & \text{for } x > 0, \\ 0 & \text{otherwise.} \end{cases}$$

If we wish to compute the mean and variance of such a lifetime, we need to compute
the following integrals:

$$\int_0^{\infty} x e^{-x}\, dx \quad \text{and} \quad \int_0^{\infty} x^2 e^{-x}\, dx. \qquad (5.7.1)$$

These integrals are special cases of an important function that we examine next. ◄


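Definition 5.7.1 The Gamma Function. For each positive number $\alpha$, let the value $\Gamma(\alpha)$ be defined by
the following integral:

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\, dx. \qquad (5.7.2)$$

The function $\Gamma$ defined by Eq. (5.7.2) for $\alpha > 0$ is called the gamma function.
As an example,

$$\Gamma(1) = \int_0^{\infty} e^{-x}\, dx = 1. \qquad (5.7.3)$$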


The following result, together with Eq. (5.7.3), shows that $\Gamma(\alpha)$ is finite for every
value of $\alpha > 0$.


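Theorem 5.7.1 If $\alpha > 1$, then

$$\Gamma(\alpha) = (\alpha - 1)\Gamma(\alpha - 1). \qquad (5.7.4)$$

Proof We shall apply the method of integration by parts to the integral in Eq. (5.7.2).
If we let $u = x^{\alpha - 1}$ and $dv = e^{-x}\, dx$, then $du = (\alpha - 1)x^{\alpha - 2}\, dx$ and $v = -e^{-x}$. Therefore,

$$\Gamma(\alpha) = \int_0^{\infty} u\, dv = [uv]_0^{\infty} - \int_0^{\infty} v\, du = \left[-x^{\alpha - 1} e^{-x}\right]_{x=0}^{\infty} + (\alpha - 1) \int_0^{\infty} x^{\alpha - 2} e^{-x}\, dx = 0 + (\alpha - 1)\Gamma(\alpha - 1).$$ ∎

For integer values of $\alpha$, we have a simple expression for the gamma function.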


Theorem 5.7.2 For every positive integer $n$,

$$\Gamma(n) = (n - 1)!. \qquad (5.7.5)$$

Proof It follows from Theorem 5.7.1 that for every integer $n \geq 2$,

$$\Gamma(n) = (n - 1)\Gamma(n - 1) = (n - 1)(n - 2)\Gamma(n - 2) = (n - 1)(n - 2) \cdots 1 \cdot \Gamma(1) = (n - 1)!\,\Gamma(1).$$

Since $\Gamma(1) = 1 = 0!$ by Eq. (5.7.3), the proof is complete. ∎


Example 5.7.2 Mean and Variance of Lifetime of a Light Bulb. The two integrals in (5.7.1) are,
respectively, $\Gamma(2) = 1! = 1$ and $\Gamma(3) = 2! = 2$. It follows that the mean of each lifetime is 1,
and the variance is $2 - 1^2 = 1$. ◄


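In many statistical applications, $\Gamma(\alpha)$ must be evaluated when $\alpha$ is either a positive
integer or of the form $\alpha = n + (1/2)$ for some positive integer $n$. It follows from
Eq. (5.7.4) that for each positive integer $n$,

$$\Gamma\left(n + \frac{1}{2}\right) = \left(n - \frac{1}{2}\right)\left(n - \frac{3}{2}\right) \cdots \left(\frac{1}{2}\right) \Gamma\left(\frac{1}{2}\right). \qquad (5.7.6)$$

Hence, it will be possible to determine the value of $\Gamma\left(n + \frac{1}{2}\right)$ if we can evaluate
$\Gamma\left(\frac{1}{2}\right)$. From Eq. (5.7.2),

$$\Gamma\left(\frac{1}{2}\right) = \int_0^{\infty} x^{-1/2} e^{-x}\, dx.$$

If we let $x = (1/2)y^2$ in this integral, then $dx = y\, dy$ and

$$\Gamma\left(\frac{1}{2}\right) = 2^{1/2} \int_0^{\infty} \exp\left(-\frac{1}{2}y^2\right) dy. \qquad (5.7.7)$$

Because the integral of the p.d.f. of the standard normal distribution is equal to 1, it
follows that

$$\int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}y^2\right) dy = (2\pi)^{1/2}. \qquad (5.7.8)$$

Because the integrand in (5.7.8) is symmetric around $y = 0$,

$$\int_0^{\infty} \exp\left(-\frac{1}{2}y^2\right) dy = \frac{1}{2}(2\pi)^{1/2} = \left(\frac{\pi}{2}\right)^{1/2}.$$

It now follows from Eq. (5.7.7) that

$$\Gamma\left(\frac{1}{2}\right) = \pi^{1/2}. \qquad (5.7.9)$$

For example, it is found from Eqs. (5.7.6) and (5.7.9) that

$$\Gamma\left(\frac{7}{2}\right) = \left(\frac{5}{2}\right)\left(\frac{3}{2}\right)\left(\frac{1}{2}\right)\pi^{1/2} = \frac{15}{8}\pi^{1/2}.$$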
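
Equations (5.7.6) and (5.7.9) give a simple recipe for $\Gamma(n + 1/2)$. The sketch below applies that recipe and compares it with `math.gamma` from the Python standard library; the function name `gamma_half_integer` is illustrative only.

```python
import math


def gamma_half_integer(n: int) -> float:
    """Gamma(n + 1/2) as the product in Eq. (5.7.6) times Gamma(1/2)."""
    value = math.sqrt(math.pi)       # Gamma(1/2) = pi^{1/2}, Eq. (5.7.9)
    for j in range(n):
        value *= j + 0.5             # factors 1/2, 3/2, ..., n - 1/2
    return value


if __name__ == "__main__":
    # Gamma(7/2) computed two ways, plus the closed form (15/8) pi^{1/2}.
    print(gamma_half_integer(3), math.gamma(3.5), 15.0 / 8.0 * math.sqrt(math.pi))
```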


We present two final useful results before we introduce the gamma distributions. 


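Theorem 5.7.3 For each $\alpha > 0$ and each $\beta > 0$,

$$\int_0^{\infty} x^{\alpha - 1} \exp(-\beta x)\, dx = \frac{\Gamma(\alpha)}{\beta^{\alpha}}. \qquad (5.7.10)$$

Proof Make the change of variables $y = \beta x$ so that $x = y/\beta$ and $dx = dy/\beta$. The
result now follows easily from Eq. (5.7.2). ∎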


There is a version of Stirling’s formula (Theorem 1.7.5) for the gamma function, 
which we state without proof. 


Theorem 5.7.4 Stirling's Formula.

$$\lim_{x \to \infty} \frac{(2\pi)^{1/2} x^{x - 1/2} e^{-x}}{\Gamma(x)} = 1.$$ ∎

Example 5.7.3 Service Times in a Queue. For $i = 1, \ldots, n$, suppose that customer $i$ in a queue must
wait time $X_i$ for service once reaching the head of the queue. Let $Z$ be the rate at
which the average customer is served. A typical probability model for this situation
is to say that, conditional on $Z = z$, $X_1, \ldots, X_n$ are i.i.d. with a distribution having the
conditional p.d.f. $g_1(x_i|z) = z \exp(-z x_i)$ for $x_i > 0$. Suppose that $Z$ is also unknown
and has the p.d.f. $f_2(z) = 2 \exp(-2z)$ for $z > 0$. The joint p.d.f. of $X_1, \ldots, X_n, Z$ is
then

$$f(x_1, \ldots, x_n, z) = \prod_{i=1}^{n} g_1(x_i|z) f_2(z) = 2 z^n \exp(-z[2 + x_1 + \cdots + x_n]), \qquad (5.7.11)$$

if $z, x_1, \ldots, x_n > 0$ and 0 otherwise. In order to calculate the marginal joint distribution
of $X_1, \ldots, X_n$, we must integrate $z$ out of the joint p.d.f. above. We can apply
Theorem 5.7.3 with $\alpha = n + 1$ and $\beta = 2 + x_1 + \cdots + x_n$ together with Theorem 5.7.2
to integrate the function in Eq. (5.7.11). The result is

$$\int_0^{\infty} f(x_1, \ldots, x_n, z)\, dz = \frac{2(n!)}{\left(2 + \sum_{i=1}^{n} x_i\right)^{n+1}}, \qquad (5.7.12)$$

for all $x_i > 0$ and 0 otherwise. This is the same joint p.d.f. that was used in Example 3.7.5
on page 154. ◄
The Gamma Distributions 


Example 5.7.4 Service Times in a Queue. In Example 5.7.3, suppose that we observe the service times
of $n$ customers and want to find the conditional distribution of the rate $Z$. We can
easily find the conditional p.d.f. $g_2(z|x_1, \ldots, x_n)$ of $Z$ given $X_1 = x_1, \ldots, X_n = x_n$ by
dividing the joint p.d.f. of $X_1, \ldots, X_n, Z$ in Eq. (5.7.11) by the p.d.f. of $X_1, \ldots, X_n$ in
Eq. (5.7.12). The calculation is simplified by defining $y = 2 + \sum_{i=1}^{n} x_i$. We then obtain

$$g_2(z|x_1, \ldots, x_n) = \begin{cases} \dfrac{y^{n+1}}{n!} z^n e^{-yz} & \text{for } z > 0, \\ 0 & \text{otherwise.} \end{cases}$$ ◄


Distributions with p.d.f.’s like the one at the end of Example 5.7.4 are members 
of a commonly used family, which we now define. 


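Definition 5.7.2 Gamma Distributions. Let $\alpha$ and $\beta$ be positive numbers. A random variable $X$ has the
gamma distribution with parameters $\alpha$ and $\beta$ if $X$ has a continuous distribution for
which the p.d.f. is

$$f(x|\alpha, \beta) = \begin{cases} \dfrac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} & \text{for } x > 0, \\ 0 & \text{for } x \leq 0. \end{cases} \qquad (5.7.13)$$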


That the integral of the p.d.f. in Eq. (5.7.13) is 1 follows easily from Theorem 5.7.3. 


Example 5.7.5 Service Times in a Queue. In Example 5.7.4, we can easily recognize the conditional
p.d.f. as the p.d.f. of the gamma distribution with parameters $\alpha = n + 1$ and $\beta = y$. ◄


If $X$ has a gamma distribution, then the moments of $X$ are easily found from
Eqs. (5.7.13) and (5.7.10).

Figure 5.7 Graphs of the p.d.f.'s of several different gamma distributions with common mean of 1.


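Theorem 5.7.5 Moments. Let $X$ have the gamma distribution with parameters $\alpha$ and $\beta$. For $k =
1, 2, \ldots$,

$$E(X^k) = \frac{\Gamma(\alpha + k)}{\beta^k \Gamma(\alpha)} = \frac{\alpha(\alpha + 1) \cdots (\alpha + k - 1)}{\beta^k}.$$

In particular, $E(X) = \dfrac{\alpha}{\beta}$ and $\operatorname{Var}(X) = \dfrac{\alpha}{\beta^2}$.

Proof For $k = 1, 2, \ldots$,

$$E(X^k) = \int_0^{\infty} x^k f(x|\alpha, \beta)\, dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_0^{\infty} x^{\alpha + k - 1} e^{-\beta x}\, dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha + k)}{\beta^{\alpha + k}} = \frac{\Gamma(\alpha + k)}{\beta^k \Gamma(\alpha)}. \qquad (5.7.14)$$

The expression for $E(X)$ follows immediately from (5.7.14). The variance can be
computed as

$$\operatorname{Var}(X) = \frac{\alpha(\alpha + 1)}{\beta^2} - \left(\frac{\alpha}{\beta}\right)^2 = \frac{\alpha}{\beta^2}.$$ ∎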

Figure 5.7 shows several gamma distribution p.d.f.'s that all have mean equal to
1 but different values of $\alpha$ and $\beta$.


Example 5.7.6 Service Times in a Queue. In Example 5.7.5, the conditional mean service rate given
the observations $X_1 = x_1, \ldots, X_n = x_n$ is

$$E(Z|x_1, \ldots, x_n) = \frac{n + 1}{2 + \sum_{i=1}^{n} x_i}.$$

For large $n$, the conditional mean is approximately 1 over the sample average of
the service times. This makes sense since 1 over the average service time is what we
generally mean by service rate. ◄
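
The conditional distribution found in Examples 5.7.4 through 5.7.6 depends on the data only through $n$ and the sum of the service times. The sketch below makes that explicit; the function names and the three service times are hypothetical, chosen purely for illustration.

```python
def conditional_gamma_parameters(service_times):
    """Parameters (alpha, beta) of the conditional gamma distribution of Z.

    Follows Example 5.7.5: alpha = n + 1 and beta = y = 2 + sum of the x_i.
    """
    n = len(service_times)
    return n + 1, 2.0 + sum(service_times)


def conditional_mean_rate(service_times):
    """E(Z | x_1, ..., x_n) = alpha / beta, by Theorem 5.7.5."""
    alpha, beta = conditional_gamma_parameters(service_times)
    return alpha / beta


if __name__ == "__main__":
    times = [1.0, 2.0, 3.0]   # hypothetical observed service times
    print(conditional_gamma_parameters(times), conditional_mean_rate(times))
```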


The m.g.f. $\psi$ of $X$ can be obtained similarly.


Theorem 5.7.6 Moment Generating Function. Let $X$ have the gamma distribution with parameters $\alpha$
and $\beta$. The m.g.f. of $X$ is

$$\psi(t) = \left(\frac{\beta}{\beta - t}\right)^{\alpha} \quad \text{for } t < \beta. \qquad (5.7.15)$$

Proof The m.g.f. is

$$\psi(t) = \int_0^{\infty} e^{tx} f(x|\alpha, \beta)\, dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_0^{\infty} x^{\alpha - 1} e^{-(\beta - t)x}\, dx.$$

This integral will be finite for every value of $t$ such that $t < \beta$. Therefore, it follows
from Eq. (5.7.10) that, for $t < \beta$,

$$\psi(t) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha)}{(\beta - t)^{\alpha}} = \left(\frac{\beta}{\beta - t}\right)^{\alpha}.$$ ∎


We can now show that the sum of independent random variables that have
gamma distributions with a common value of the parameter $\beta$ will also have a gamma
distribution.


Theorem 5.7.7 If the random variables $X_1, \ldots, X_k$ are independent, and if $X_i$ has the gamma
distribution with parameters $\alpha_i$ and $\beta$ ($i = 1, \ldots, k$), then the sum $X_1 + \cdots + X_k$
has the gamma distribution with parameters $\alpha_1 + \cdots + \alpha_k$ and $\beta$.


Proof If $\psi_i$ denotes the m.g.f. of $X_i$, then it follows from Eq. (5.7.15) that for
$i = 1, \ldots, k$,

$$\psi_i(t) = \left(\frac{\beta}{\beta - t}\right)^{\alpha_i} \quad \text{for } t < \beta.$$

If $\psi$ denotes the m.g.f. of the sum $X_1 + \cdots + X_k$, then by Theorem 4.4.4,

$$\psi(t) = \prod_{i=1}^{k} \psi_i(t) = \left(\frac{\beta}{\beta - t}\right)^{\alpha_1 + \cdots + \alpha_k} \quad \text{for } t < \beta.$$

The m.g.f. $\psi$ can now be recognized as the m.g.f. of the gamma distribution with
parameters $\alpha_1 + \cdots + \alpha_k$ and $\beta$. Hence, the sum $X_1 + \cdots + X_k$ must have this gamma
distribution. ∎


The Exponential Distributions 


A special case of gamma distributions provides a common model for phenomena such
as waiting times. For instance, in Example 5.7.3, the conditional distribution of each
service time $X_i$ given $Z$ (the rate of service) is a member of the following family of
distributions.


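Definition 5.7.3 Exponential Distributions. Let $\beta > 0$. A random variable $X$ has the exponential
distribution with parameter $\beta$ if $X$ has a continuous distribution with the p.d.f.

$$f(x|\beta) = \begin{cases} \beta e^{-\beta x} & \text{for } x > 0, \\ 0 & \text{for } x \leq 0. \end{cases} \qquad (5.7.16)$$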


A comparison of the p.d.f’s for gamma and exponential distributions makes the 
following result obvious. 


Theorem 5.7.8 The exponential distribution with parameter $\beta$ is the same as the gamma distribution
with parameters $\alpha = 1$ and $\beta$. If $X$ has the exponential distribution with parameter
$\beta$, then

$$E(X) = \frac{1}{\beta} \quad \text{and} \quad \operatorname{Var}(X) = \frac{1}{\beta^2}, \qquad (5.7.17)$$

and the m.g.f. of $X$ is

$$\psi(t) = \frac{\beta}{\beta - t} \quad \text{for } t < \beta.$$ ∎


Exponential distributions have a memoryless property similar to that stated in 
Theorem 5.5.5 for geometric distributions. 


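Theorem 5.7.9 Memoryless Property of Exponential Distributions. Let $X$ have the exponential
distribution with parameter $\beta$, and let $t > 0$. Then for every number $h > 0$,

$$\Pr(X \geq t + h \mid X \geq t) = \Pr(X \geq h). \qquad (5.7.18)$$

Proof For each $t > 0$,

$$\Pr(X \geq t) = \int_t^{\infty} \beta e^{-\beta x}\, dx = e^{-\beta t}. \qquad (5.7.19)$$

Hence, for each $t > 0$ and each $h > 0$,

$$\Pr(X \geq t + h \mid X \geq t) = \frac{\Pr(X \geq t + h)}{\Pr(X \geq t)} = \frac{e^{-\beta(t + h)}}{e^{-\beta t}} = e^{-\beta h} = \Pr(X \geq h). \qquad (5.7.20)$$ ∎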
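
The memoryless property can be verified numerically from Eq. (5.7.19). The sketch below uses illustrative function names and arbitrary values of $t$, $h$, and $\beta$; it checks only that the ratio in Eq. (5.7.20) does not depend on $t$.

```python
import math


def exp_survival(t: float, beta: float) -> float:
    """Pr(X >= t) = exp(-beta * t) for t > 0, as in Eq. (5.7.19)."""
    return math.exp(-beta * t)


def conditional_survival(t: float, h: float, beta: float) -> float:
    """Pr(X >= t + h | X >= t), computed as a ratio of survival probabilities."""
    return exp_survival(t + h, beta) / exp_survival(t, beta)


if __name__ == "__main__":
    beta, h = 0.5, 2.0
    for t in (0.1, 1.0, 10.0):
        # Each line prints the same value up to rounding error.
        print(t, conditional_survival(t, h, beta), exp_survival(h, beta))
```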


You can prove (see Exercise 23) that the exponential distributions are the only 
continuous distributions with the memoryless property. 

To illustrate the memoryless property, we shall suppose that $X$ represents the
number of minutes that elapse before some event occurs. According to Eq. (5.7.20),
if the event has not occurred during the first $t$ minutes, then the probability that the
event will not occur during the next $h$ minutes is simply $e^{-\beta h}$. This is the same as the
probability that the event would not occur during an interval of $h$ minutes starting
from time 0. In other words, regardless of the length of time that has elapsed without
the occurrence of the event, the probability that the event will occur during the next
$h$ minutes always has the same value.

This memoryless property will not strictly be satisfied in all practical problems. 
For example, suppose that X is the length of time for which a light bulb will burn 
before it fails. The length of time for which the bulb can be expected to continue to 
burn in the future will depend on the length of time for which it has been burning 
in the past. Nevertheless, the exponential distribution has been used effectively as 
an approximate distribution for such variables as the lengths of the lives of various 
products. 


Life Tests 


Example 5.7.7. Light Bulbs. Suppose that n light bulbs are burning simultaneously in a test to
determine the lengths of their lives. We shall assume that the n bulbs burn independently of
one another and that the lifetime of each bulb has the exponential distribution with
parameter β. In other words, if X_i denotes the lifetime of bulb i, for i = 1, ..., n,
then it is assumed that the random variables X_1, ..., X_n are i.i.d. and that each has
the exponential distribution with parameter β. What is the distribution of the length
of time Y_1 until the first failure of one of the n bulbs? What is the distribution of the
length of time Y_2 after the first failure until a second bulb fails? ◄


The random variable Y_1 in Example 5.7.7 is the minimum of a random sample of
n exponential random variables. The distribution of Y_1 is easy to find.


Theorem 5.7.10. Suppose that the variables X_1, ..., X_n form a random sample from the
exponential distribution with parameter β. Then the distribution of Y_1 = min{X_1, ..., X_n}
will be the exponential distribution with parameter nβ.


Proof For every number t > 0,

$$
\begin{aligned}
\Pr(Y_1 > t) &= \Pr(X_1 > t, \ldots, X_n > t) \\
&= \Pr(X_1 > t) \cdots \Pr(X_n > t) \\
&= e^{-\beta t} \cdots e^{-\beta t} = e^{-n\beta t}.
\end{aligned}
$$

By comparing this result with Eq. (5.7.19), we see that the distribution of Y_1 must be
the exponential distribution with parameter nβ. ∎
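A small simulation can make this result concrete. The sketch below is only an illustration under assumed values (10 bulbs, each with mean lifetime 500 hours, so β = 0.002); it uses Python's standard `random.expovariate`, whose argument is the rate β, and compares the simulated average of the minimum with the mean 1/(nβ) of the exponential distribution with parameter nβ.

```python
import random

def simulate_first_failure(n: int, beta: float, trials: int, seed: int = 0) -> float:
    """Average of min(X_1, ..., X_n) over many simulated tests of n bulbs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each bulb lifetime is exponential with parameter (rate) beta.
        total += min(rng.expovariate(beta) for _ in range(n))
    return total / trials

n, beta = 10, 0.002          # hypothetical test: 10 bulbs, mean lifetime 500 hours
estimate = simulate_first_failure(n, beta, trials=50_000)
theoretical = 1 / (n * beta)  # mean of the exponential distribution with parameter n * beta
print(estimate, theoretical)
```

The simulated average should be close to, but not exactly equal to, the theoretical mean, because it is subject to sampling variability.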


The memoryless property of the exponential distributions allows us to answer
the second question at the end of Example 5.7.7, as well as similar questions about
later failures. After one bulb has failed, n − 1 bulbs are still burning. Furthermore,
regardless of the time at which the first bulb failed or which bulb failed first, it follows
from the memoryless property of the exponential distribution that the distribution
of the remaining lifetime of each of the other n − 1 bulbs is still the exponential
distribution with parameter β. In other words, the situation is the same as it would be
if we were starting the test over again from time t = 0 with n − 1 new bulbs. Therefore,
Y_2 will be equal to the smallest of n − 1 i.i.d. random variables, each of which has the
exponential distribution with parameter β. It follows from Theorem 5.7.10 that Y_2
will have the exponential distribution with parameter (n − 1)β. The next result deals
with the remaining waiting times between failures.


Theorem 5.7.11. Suppose that the variables X_1, ..., X_n form a random sample from the
exponential distribution with parameter β. Let Z_1 ≤ Z_2 ≤ ··· ≤ Z_n be the random variables
X_1, ..., X_n sorted from smallest to largest. For each k = 2, ..., n, let Y_k = Z_k − Z_{k−1}.
Then the distribution of Y_k is the exponential distribution with parameter (n + 1 − k)β.


Proof At the time Z_{k−1}, exactly k − 1 of the lifetimes have ended and there are
n + 1 − k lifetimes that have not yet ended. For each of the remaining lifetimes, the
conditional distribution of what remains of that lifetime given that it has lasted at
least Z_{k−1} is still exponential with parameter β by the memoryless property. So, Y_k =
Z_k − Z_{k−1} has the same distribution as the minimum lifetime from a random sample
of size n + 1 − k from the exponential distribution with parameter β. According to
Theorem 5.7.10, that distribution is exponential with parameter (n + 1 − k)β. ∎
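One practical consequence is that the expected time until the kth failure is a sum of means of exponential distributions, because Z_k = Y_1 + ··· + Y_k and each Y_j has mean 1/[(n + 1 − j)β]. The helper below is a hedged sketch of that bookkeeping; the function name and the example values are assumptions made for illustration.

```python
def expected_time_to_kth_failure(n: int, k: int, beta: float) -> float:
    """E(Z_k) as the sum of E(Y_j), j = 1..k, where Y_j is exponential with parameter (n + 1 - j) * beta."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return sum(1.0 / ((n + 1 - j) * beta) for j in range(1, k + 1))

# Illustrative values: 5 items with beta = 1, waiting for all 5 to fail.
# The sum is 1/5 + 1/4 + 1/3 + 1/2 + 1 = 137/60.
print(expected_time_to_kth_failure(5, 5, 1.0))
```

Only linearity of expectation is used here, so the sketch does not need the waiting times Y_1, ..., Y_k to be independent.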


Relation to the Poisson Process 


Example 5.7.8. Radioactive Particles. Suppose that radioactive particles strike a target
according to a Poisson process with rate β, as defined in Definition 5.4.2. Let Z_k be the
time until the kth particle strikes the target for k = 1, 2, .... What is the distribution of
Z_1? What is the distribution of Y_k = Z_k − Z_{k−1} for k ≥ 2? ◄


Although the random variables defined at the end of Example 5.7.8 look similar
to those in Theorem 5.7.11, there are major differences. In Theorem 5.7.11, we were
observing a fixed number n of lifetimes that all started simultaneously. The n lifetimes
are all labeled in advance, and each could be observed independently of the others.
In Example 5.7.8, there is no fixed number of particles being contemplated, and we
have no well-defined notion of when each particle “starts” toward the target. In fact,
we cannot even tell which particle is which until after they are observed. We merely
start observing at an arbitrary time and record each time a particle hits. Depending
on how long we observe the process, we could see an arbitrary number of particles
hit the target in Example 5.7.8, but we could never see more than n failures in the
setup of Theorem 5.7.11, no matter how long we observe. Theorem 5.7.12 gives the
distributions for the times between arrivals in Example 5.7.8, and one can see how
the distributions differ from those in Theorem 5.7.11.


Theorem 5.7.12. Times between Arrivals in a Poisson Process. Suppose that arrivals occur
according to a Poisson process with rate β. Let Z_k be the time until the kth arrival for
k = 1, 2, .... Define Y_1 = Z_1 and Y_k = Z_k − Z_{k−1} for k ≥ 2. Then Y_1, Y_2, ... are i.i.d.
and they each have the exponential distribution with parameter β.


Proof Let t > 0, and define X to be the number of arrivals from time 0 until time t.
It is easy to see that Y_1 ≤ t if and only if X ≥ 1. That is, the first particle strikes the
target by time t if and only if at least one particle strikes the target by time t. We
already know that X has the Poisson distribution with mean βt, where β is the rate
of the process. So, for t > 0,

$$
\Pr(Y_1 \le t) = \Pr(X \ge 1) = 1 - \Pr(X = 0) = 1 - e^{-\beta t}.
$$

Comparing this to Eq. (5.7.19), we see that $1 - e^{-\beta t}$ is the c.d.f. of the exponential
distribution with parameter β.

What happens in a Poisson process after time t is independent of what happens
up to time t. Hence, the conditional distribution given Y_1 = t of the gap from time
t until the next arrival at Z_2 is the same as the distribution of the time from time
0 until the first arrival. That is, the distribution of Y_2 = Z_2 − Z_1 given Y_1 = t (i.e.,
Z_1 = t) is the exponential distribution with parameter β no matter what t is. Hence,
Y_2 is independent of Y_1 and they have the same distribution. The same argument can
be applied to find the distributions for Y_3, Y_4, .... ∎


An exponential distribution is often used in a practical problem to represent 
the distribution of the time that elapses before the occurrence of some event. For 
example, this distribution has been used to represent such periods of time as the 
period for which a machine or an electronic component will operate without breaking 
down, the period required to take care of a customer at some service facility, and the 
period between the arrivals of two successive customers at a facility. 

If the events being considered occur in accordance with a Poisson process, then 
both the waiting time until an event occurs and the period of time between any two 
successive events will have exponential distributions. This fact provides theoretical 
support for the use of the exponential distribution in many types of problems. 

We can combine Theorem 5.7.12 with Theorem 5.7.7 to obtain the following. 


Corollary 5.7.1. Time until kth Arrival. In the situation of Theorem 5.7.12, the distribution
of Z_k is the gamma distribution with parameters k and β. ∎
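The result on the time until the kth arrival can be checked in two ways. Extending the argument used for Y_1 in the proof above, the kth arrival occurs by time t if and only if at least k arrivals occur in [0, t], so the c.d.f. of Z_k can be written with Poisson probabilities. Alternatively, Z_k can be simulated as a sum of k i.i.d. exponential gaps. The Python sketch below compares the two; the values k = 3, t = 2, and β = 1.5 are arbitrary illustrations.

```python
import math
import random

def kth_arrival_cdf(k: int, t: float, beta: float) -> float:
    """Pr(Z_k <= t) = Pr(at least k arrivals in [0, t]) for a Poisson process with rate beta."""
    mean = beta * t
    poisson_below_k = sum(math.exp(-mean) * mean**j / math.factorial(j) for j in range(k))
    return 1.0 - poisson_below_k

def simulate_kth_arrival_cdf(k: int, t: float, beta: float, trials: int, seed: int = 1) -> float:
    """Fraction of simulated runs in which the sum of k i.i.d. exponential gaps is at most t."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(trials)
        if sum(rng.expovariate(beta) for _ in range(k)) <= t
    )
    return hits / trials

k, t, beta = 3, 2.0, 1.5     # illustrative values only
exact = kth_arrival_cdf(k, t, beta)
approx = simulate_kth_arrival_cdf(k, t, beta, trials=40_000)
print(exact, approx)
```

With k = 1 the first function reduces to 1 − e^{−βt}, the exponential c.d.f. obtained in the proof of Theorem 5.7.12.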
Summary


The gamma function is defined by $\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\,dx$ and has the property that
Γ(n) = (n − 1)! for n = 1, 2, .... If X_1, ..., X_n are independent random variables
with gamma distributions all having the same second parameter β, then $\sum_{i=1}^{n} X_i$ has
the gamma distribution with first parameter equal to the sum of the first parameters
of X_1, ..., X_n and second parameter equal to β. The exponential distribution with
parameter β is the same as the gamma distribution with parameters 1 and β. Hence,
the sum of a random sample of n exponential random variables with parameter β
has the gamma distribution with parameters n and β. For a Poisson process with rate
β, the times between successive occurrences have the exponential distribution with
parameter β, and they are independent. The waiting time until the kth occurrence
has the gamma distribution with parameters k and β.


Exercises 


1. Suppose that X has the gamma distribution with parameters α and β, and c is a positive
constant. Show that cX has the gamma distribution with parameters α and β/c.


2. Compute the quantile function of the exponential distribution with parameter β.


3. Sketch the p.d.f. of the gamma distribution for each of the following pairs of values of
the parameters α and β: (a) α = 1/2 and β = 1, (b) α = 1 and β = 1, (c) α = 2 and β = 1.


4. Determine the mode of the gamma distribution with parameters α and β.


5. Sketch the p.d.f. of the exponential distribution for each of the following values of the
parameter β: (a) β = 1/2, (b) β = 1, and (c) β = 2.


6. Suppose that X_1, ..., X_n form a random sample of size n from the exponential
distribution with parameter β. Determine the distribution of the sample mean $\bar{X}_n$.


7. Let X_1, X_2, X_3 be a random sample from the exponential distribution with parameter β.
Find the probability that at least one of the random variables is greater than t, where
t > 0.


8. Suppose that the random variables X_1, ..., X_k are independent and X_i has the
exponential distribution with parameter β_i (i = 1, ..., k). Let Y = min{X_1, ..., X_k}.
Show that Y has the exponential distribution with parameter β_1 + ··· + β_k.


9. Suppose that a certain system contains three components that function independently of
each other and are connected in series, as defined in Exercise 5 of Sec. 3.7, so that the
system fails as soon as one of the components fails. Suppose that the length of life of the
first component, measured in hours, has the exponential distribution with parameter
β = 0.001, the length of life of the second component has the exponential distribution with
parameter β = 0.003, and the length of life of the third component has the exponential
distribution with parameter β = 0.006. Determine the probability that the system will not
fail before 100 hours.


10. Suppose that an electronic system contains n similar components that function
independently of each other and that are connected in series so that the system fails
as soon as one of the components fails. Suppose also that the length of life of each
component, measured in hours, has the exponential distribution with mean μ. Determine
the mean and the variance of the length of time until the system fails.


11. Suppose that n items are being tested simultaneously, the items are independent, and
the length of life of each item has the exponential distribution with parameter β.
Determine the expected length of time until three items have failed. Hint: The required
value is E(Y_1 + Y_2 + Y_3) in the notation of Theorem 5.7.11.


12. Consider again the electronic system described in Exercise 10, but suppose now that the
system will continue to operate until two components have failed. Determine the mean and
the variance of the length of time until the system fails.


13. Suppose that a certain examination is to be taken by five students independently of one
another, and the number of minutes required by any particular student to complete the
examination has the exponential distribution for which the mean is 80. Suppose that the
examination begins at 9:00 a.m. Determine the probability that at least one of the students
will complete the examination before 9:40 a.m.
14. Suppose again that the examination considered in Exercise 13 is taken by five students,
and the first student to complete the examination finishes at 9:25 a.m. Determine the
probability that at least one other student will complete the examination before 10:00 a.m.


15. Suppose again that the examination considered in Exercise 13 is taken by five students.
Determine the probability that no two students will complete the examination within 10
minutes of each other.


16. It is said that a random variable X has the Pareto distribution with parameters x_0 and α
(x_0 > 0 and α > 0) if X has a continuous distribution for which the p.d.f. f(x|x_0, α) is
as follows:

$$
f(x \mid x_0, \alpha) =
\begin{cases}
\dfrac{\alpha x_0^{\alpha}}{x^{\alpha+1}} & \text{for } x \ge x_0, \\
0 & \text{for } x < x_0.
\end{cases}
$$

Show that if X has this Pareto distribution, then the random variable log(X/x_0) has the
exponential distribution with parameter α.


17. Suppose that a random variable X has the normal distribution with mean μ and variance
σ². Determine the value of $E[(X - \mu)^{2n}]$ for n = 1, 2, ....


18. Consider a random variable X for which Pr(X > 0) = 1, the p.d.f. is f, and the c.d.f. is
F. Consider also the function h defined as follows:

$$
h(x) = \frac{f(x)}{1 - F(x)} \quad \text{for } x > 0.
$$

The function h is called the failure rate or the hazard function of X. Show that if X has an
exponential distribution, then the failure rate h(x) is constant for x > 0.


19. It is said that a random variable X has the Weibull distribution with parameters a and b
(a > 0 and b > 0) if X has a continuous distribution for which the p.d.f. f(x|a, b) is
as follows:

$$
f(x \mid a, b) =
\begin{cases}
\dfrac{b}{a^{b}}\, x^{b-1} e^{-(x/a)^{b}} & \text{for } x > 0, \\
0 & \text{for } x \le 0.
\end{cases}
$$

Show that if X has this Weibull distribution, then the random variable $X^b$ has the
exponential distribution with parameter $\beta = a^{-b}$.


20. It is said that a random variable X has an increasing failure rate if the failure rate h(x)
defined in Exercise 18 is an increasing function of x for x > 0, and it is said that X has a
decreasing failure rate if h(x) is a decreasing function of x for x > 0. Suppose that X has
the Weibull distribution with parameters a and b, as defined in Exercise 19. Show that X
has an increasing failure rate if b > 1, and X has a decreasing failure rate if b < 1.


21. Let X have the gamma distribution with parameters α > 2 and β > 0.

a. Prove that the mean of 1/X is β/(α − 1).

b. Prove that the variance of 1/X is β²/[(α − 1)²(α − 2)].


22. Consider the Poisson process of radioactive particle hits in Example 5.7.8. Suppose that
the rate β of the Poisson process is unknown and has the gamma distribution with
parameters α and γ. Let X be the number of particles that strike the target during t time
units. Prove that the conditional distribution of β given X = x is a gamma distribution, and
find the parameters of that gamma distribution.


23. Let F be a continuous c.d.f. satisfying F(0) = 0, and suppose that the distribution with
c.d.f. F has the memoryless property (5.7.18). Define ℓ(x) = log[1 − F(x)] for x > 0.

a. Show that for all t, h > 0,

$$
1 - F(h) = \frac{1 - F(t + h)}{1 - F(t)}.
$$

b. Prove that ℓ(t + h) = ℓ(t) + ℓ(h) for all t, h > 0.

c. Prove that for all t > 0 and all positive integers k and m, ℓ(kt/m) = (k/m)ℓ(t).

d. Prove that for all t, c > 0, ℓ(ct) = cℓ(t).

e. Prove that g(t) = ℓ(t)/t is constant for t > 0.

f. Prove that F must be the c.d.f. of an exponential distribution.


24. Review the derivation of the Black-Scholes formula (5.6.18). For this exercise, assume
that our stock price at time u in the future is $S_0 e^{\mu u + W_u}$, where W_u has the gamma
distribution with parameters αu and β with β > 1. Let r be the risk-free interest rate.

a. Prove that $e^{-ru} E(S_u) = S_0$ if and only if μ = r − α log(β/[β − 1]).

b. Assume that μ = r − α log(β/[β − 1]). Let R be 1 minus the c.d.f. of the gamma
distribution with parameters αu and 1. Prove that the risk-neutral price for the option to
buy one share of the stock for the price q at time u is $S_0 R(c[\beta - 1]) - q e^{-ru} R(c\beta)$, where

$$
c = \log\left(\frac{q}{S_0}\right) + \alpha u \log\left(\frac{\beta}{\beta - 1}\right) - ru.
$$

c. Find the price for the option being considered when u = 1, q = S_0, r = 0.06, α = 1,
and β = 10.




5.8 The Beta Distributions 


The family of beta distributions is a popular model for random variables that are 
known to take values in the interval [0, 1]. One common example of such a random 
variable is the unknown proportion of successes in a sequence of Bernoulli trials. 


The Beta Function 


Example 5.8.1. Defective Parts. A machine produces parts that are either defective or not, as
in Example 3.6.9 on page 148. Let P denote the proportion of defectives among all
parts that might be produced by this machine. Suppose that we observe n such parts,
and let X be the number of defectives among the n parts observed. If we assume that
the parts are conditionally independent given P, then we have the same situation as
in Example 3.6.9, where we computed the conditional p.d.f. of P given X = x as

$$
g_2(p \mid x) = \frac{p^{x}(1 - p)^{n-x}}{\int_0^1 q^{x}(1 - q)^{n-x}\,dq}, \quad \text{for } 0 < p < 1. \tag{5.8.1}
$$

We are now in a position to calculate the integral in the denominator of Eq. (5.8.1).
The distribution with the resulting p.d.f. is a member of a useful family that we shall
study in this section. ◄


Definition 5.8.1. The Beta Function. For each positive α and β, define

$$
B(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1 - x)^{\beta-1}\,dx.
$$

The function B is called the beta function.


We can show that the beta function B is finite for all α, β > 0. The proof of the
following result relies on the methods from the end of Sec. 3.9 and is given at the end
of this section.


Theorem 5.8.1. For all α, β > 0,

$$
B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}. \tag{5.8.2}
$$


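Example 5.8.2. Defective Parts. It follows from Theorem 5.8.1 that the integral in the
denominator of Eq. (5.8.1) is

$$
\int_0^1 q^{x}(1 - q)^{n-x}\,dq = \frac{\Gamma(x + 1)\Gamma(n - x + 1)}{\Gamma(n + 2)} = \frac{x!\,(n - x)!}{(n + 1)!}.
$$

The conditional p.d.f. of P given X = x is then

$$
g_2(p \mid x) = \frac{(n + 1)!}{x!\,(n - x)!}\, p^{x}(1 - p)^{n-x}, \quad \text{for } 0 < p < 1.
$$
◄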
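The closed form for this integral can be checked numerically. The sketch below is an illustration under assumed values (n = 10 parts, x = 3 defectives); it evaluates the beta function through Eq. (5.8.2) with Python's `math.gamma`, compares it with the factorial expression, and also approximates the integral directly with a simple midpoint rule.

```python
import math

def beta_function(a: float, b: float) -> float:
    """B(a, b) computed from Eq. (5.8.2) using the gamma function."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def midpoint_integral(x: int, n: int, steps: int = 20_000) -> float:
    """Midpoint-rule approximation of the integral of q^x (1 - q)^(n - x) over [0, 1]."""
    width = 1.0 / steps
    return width * sum(
        ((i + 0.5) * width) ** x * (1.0 - (i + 0.5) * width) ** (n - x)
        for i in range(steps)
    )

n, x = 10, 3   # illustrative: 3 defectives among 10 observed parts
closed_form = math.factorial(x) * math.factorial(n - x) / math.factorial(n + 1)
# All three printed numbers should agree to several significant digits.
print(beta_function(x + 1, n - x + 1), closed_form, midpoint_integral(x, n))
```

The factorial form applies only for integer x and n, while the gamma-function form of Eq. (5.8.2) works for all positive arguments.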
Definition of the Beta Distributions


The distribution in Example 5.8.2 is a special case of the following.


Definition 5.8.2. Beta Distributions. Let α, β > 0 and let X be a random variable with p.d.f.

$$
f(x \mid \alpha, \beta) =
\begin{cases}
\dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1 - x)^{\beta-1} & \text{for } 0 < x < 1, \\
0 & \text{otherwise.}
\end{cases}
\tag{5.8.3}
$$

Then X has the beta distribution with parameters α and β.


The conditional distribution of P given X = x in Example 5.8.2 is the beta
distribution with parameters x + 1 and n − x + 1. It can also be seen from Eq. (5.8.3)
that the beta distribution with parameters α = 1 and β = 1 is simply the uniform
distribution on the interval [0, 1].


Example 5.8.3. Castaneda v. Partida. In Example 5.2.6 on page 278, 220 grand jurors were
chosen from a population that is 79.1 percent Mexican American, but only 100 grand jurors
were Mexican American. The expected value of a binomial random variable X with
parameters 220 and 0.791 is E(X) = 220 × 0.791 = 174.02. This is much larger than
the observed value of X = 100. Of course, such a discrepancy could occur by chance.
After all, there is positive probability of X = x for all x = 0, ..., 220. Let P stand for
the proportion of Mexican Americans among all grand jurors that would be chosen
under the current system being used. The court assumed that X had the binomial
distribution with parameters n = 220 and p, conditional on P = p. We should then
be interested in whether P is substantially less than the value 0.791, which represents
impartial juror choice. For example, suppose that we define discrimination to mean
that P ≤ 0.8 × 0.791 = 0.6328. We would like to compute the conditional probability
of P ≤ 0.6328 given X = 100.

Suppose that the distribution of P prior to observing X was the beta distribution
with parameters α and β. Then the p.d.f. of P was

$$
f_2(p) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1}(1 - p)^{\beta-1}, \quad \text{for } 0 < p < 1.
$$

The conditional p.f. of X given P = p is the binomial p.f.

$$
g_1(x \mid p) = \binom{220}{x} p^{x}(1 - p)^{220-x}, \quad \text{for } x = 0, \ldots, 220.
$$

We can now apply Bayes’ theorem for random variables (3.6.13) to obtain the
conditional p.d.f. of P given X = 100:

$$
\begin{aligned}
g_2(p \mid 100) &= \frac{\binom{220}{100} p^{100}(1 - p)^{120}\, \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1}(1 - p)^{\beta-1}}{f_1(100)} \\
&= \frac{\binom{220}{100}\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta) f_1(100)}\, p^{\alpha+100-1}(1 - p)^{\beta+120-1},
\end{aligned}
\tag{5.8.4}
$$

for 0 < p < 1, where f_1(100) is the marginal p.f. of X at 100. As a function of
p the far right side of Eq. (5.8.4) is a constant times $p^{\alpha+100-1}(1 - p)^{\beta+120-1}$ for
0 < p < 1. As such, it is clearly the p.d.f. of a beta distribution. The parameters
of that beta distribution are α + 100 and β + 120. Hence, the constant must be
1/B(100 + α, 120 + β). That is,

$$
g_2(p \mid 100) = \frac{\Gamma(\alpha + \beta + 220)}{\Gamma(\alpha + 100)\Gamma(\beta + 120)}\, p^{\alpha+100-1}(1 - p)^{\beta+120-1}, \quad \text{for } 0 < p < 1. \tag{5.8.5}
$$

After choosing values of α and β, we could compute Pr(P ≤ 0.6328|X = 100) and
decide how likely it is that there was discrimination. We will see how to choose α and
β after we learn how to compute the expected value of a beta random variable. ◄


Note: Conditional Distribution of P after Observing X with Binomial Distribu- 
tion. The calculation of the conditional distribution of P given X = 100 in Exam- 
ple 5.8.3 is a special case of a useful general result. In fact, the proof of the following 
result is essentially given in Example 5.8.3, and will not be repeated. 


Theorem 5.8.2. Suppose that P has the beta distribution with parameters α and β, and the
conditional distribution of X given P = p is the binomial distribution with parameters n and
p. Then the conditional distribution of P given X = x is the beta distribution with
parameters α + x and β + n − x. ∎
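In computational terms, this result says that updating a beta distribution with binomial data amounts to adding the observed counts to the parameters. The short Python sketch below illustrates this; the function name is hypothetical, and the uniform prior (α = β = 1) is chosen here only as an example, not because the text uses it for this case.

```python
from typing import Tuple

def beta_binomial_update(alpha: float, beta: float, n: int, x: int) -> Tuple[float, float]:
    """Parameters of the conditional beta distribution of P given x successes in n trials."""
    if not 0 <= x <= n:
        raise ValueError("x must lie between 0 and n")
    # Successes are added to the first parameter, failures to the second.
    return alpha + x, beta + n - x

# Uniform prior (alpha = beta = 1) combined with 100 successes in 220 trials.
print(beta_binomial_update(1.0, 1.0, 220, 100))   # (101.0, 121.0)
```

Because the result is again a beta distribution, the same function can be applied repeatedly as further batches of observations arrive.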


Moments of Beta Distributions 


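Theorem 5.8.3. Moments. Suppose that X has the beta distribution with parameters α and β.
Then for each positive integer k,

$$
E(X^{k}) = \frac{\alpha(\alpha + 1)\cdots(\alpha + k - 1)}{(\alpha + \beta)(\alpha + \beta + 1)\cdots(\alpha + \beta + k - 1)}. \tag{5.8.6}
$$

In particular,

$$
E(X) = \frac{\alpha}{\alpha + \beta}, \qquad
\operatorname{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^{2}(\alpha + \beta + 1)}.
$$


Proof For k = 1, 2, ...,

$$
E(X^{k}) = \int_0^1 x^{k} f(x \mid \alpha, \beta)\,dx
= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 x^{\alpha+k-1}(1 - x)^{\beta-1}\,dx.
$$

Therefore, by Eq. (5.8.2),

$$
E(X^{k}) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \frac{\Gamma(\alpha + k)\Gamma(\beta)}{\Gamma(\alpha + k + \beta)},
$$

which simplifies to Eq. (5.8.6). The special case of the mean is simple, while the
variance follows easily from

$$
E(X^{2}) = \frac{\alpha(\alpha + 1)}{(\alpha + \beta)(\alpha + \beta + 1)}.
$$
∎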
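The product form in Eq. (5.8.6) is easy to evaluate term by term. The following Python sketch (with hypothetical function names and the illustrative parameters α = 2, β = 3) computes the kth moment as a running product and then obtains the variance as E(X²) − [E(X)]², which can be compared with the closed form αβ/[(α + β)²(α + β + 1)].

```python
def beta_moment(alpha: float, beta: float, k: int) -> float:
    """E(X^k) for the beta distribution, using the product form of Eq. (5.8.6)."""
    moment = 1.0
    for j in range(k):
        moment *= (alpha + j) / (alpha + beta + j)
    return moment

def beta_mean_and_variance(alpha: float, beta: float):
    """Mean and variance obtained from the first two moments."""
    mean = beta_moment(alpha, beta, 1)
    variance = beta_moment(alpha, beta, 2) - mean ** 2
    return mean, variance

mean, variance = beta_mean_and_variance(2.0, 3.0)
closed_form_variance = 2.0 * 3.0 / ((2.0 + 3.0) ** 2 * (2.0 + 3.0 + 1.0))
print(mean, variance, closed_form_variance)   # approximately 0.4, 0.04, 0.04
```

Multiplying the ratios one factor at a time avoids evaluating large gamma functions, which can overflow for large parameters.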


There are too many beta distributions to provide tables in the back of the
book. Any good statistical package will be able to calculate the c.d.f.'s of many beta
distributions, and some packages will also be able to calculate the quantile functions.
The next example illustrates the importance of being able to calculate means and
c.d.f.'s of beta distributions.


[Figure 5.8: Probability of discrimination as a function of β. Vertical axis: probability of P at most 0.6328.]


Example 5.8.4. Castaneda v. Partida. Continuing Example 5.8.3, we are now prepared to see
why, for every reasonable choice one makes for α and β, the probability of discrimination in
Castaneda v. Partida is quite large. To avoid bias either for or against the defendant,
we shall suppose that, before learning X, the probability that a Mexican American
juror would be selected on each draw from the pool was 0.791. Let Y = 1 if a Mexican
American juror is selected on a single draw, and let Y = 0 if not. Then Y has the
Bernoulli distribution with parameter p given P = p and E(Y|p) = p. So the law of
total probability for expectations, Theorem 4.7.1, says that

$$
\Pr(Y = 1) = E(Y) = E[E(Y \mid P)] = E(P).
$$

This means that we should choose α and β so that E(P) = 0.791. Because E(P) =
α/(α + β), this means that α = (0.791/0.209)β ≈ 3.7847β. The conditional distribution of P
given X = 100 is the beta distribution with parameters α + 100 and β + 120. For each value
of β > 0, we can compute Pr(P ≤ 0.6328|X = 100) using α = 3.7847β. Then, for each β we
can check whether or not that probability is small. A plot of Pr(P ≤ 0.6328|X = 100)
for various values of β is given in Fig. 5.8. From the figure, we see that Pr(P ≤
0.6328|X = 100) < 0.5 only for β ≥ 51.5. This makes α ≥ 194.9. We claim that the
beta distribution with parameters 194.9 and 51.5 as well as all others that make
Pr(P ≤ 0.6328|X = 100) < 0.5 are unreasonable because they are incredibly prejudiced
about the possibility of discrimination. For example, suppose that someone
actually believed, before observing X = 100, that the distribution of P was the beta
distribution with parameters 194.9 and 51.5. For this beta distribution, the probability
that there is discrimination would be Pr(P ≤ 0.6328) = 3.28 × 10⁻⁸, which is
essentially 0. All of the other priors with β ≥ 51.5 and α = 3.7847β have even smaller
probabilities of {P ≤ 0.6328}. Arguing from the other direction, we have the following:
Anyone who believed, before observing X = 100, that E(P) = 0.791 and the
probability of discrimination was greater than 3.28 × 10⁻⁸, would believe that the
probability of discrimination is at least 0.5 after learning X = 100. This is then fairly
convincing evidence that there was discrimination in this case. ◄
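The calculation behind this example can be sketched without a statistical package. The Python code below is a rough illustration, not the method used to produce Fig. 5.8: it evaluates the beta p.d.f. on the log scale with `math.lgamma` and approximates the c.d.f. by composite Simpson's rule. The function names, the number of Simpson intervals, and the particular values of β that are tried are all assumptions made here.

```python
import math

def beta_pdf(p: float, a: float, b: float) -> float:
    """Beta p.d.f. of Eq. (5.8.3), evaluated on the log scale to avoid overflow."""
    if p <= 0.0 or p >= 1.0:
        # The density is 0 at p = 0 whenever a > 1, which holds in every call below;
        # p = 1 is never reached because we integrate only up to 0.6328.
        return 0.0
    log_density = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                   + (a - 1.0) * math.log(p) + (b - 1.0) * math.log1p(-p))
    return math.exp(log_density)

def beta_cdf(c: float, a: float, b: float, intervals: int = 20_000) -> float:
    """Pr(P <= c) by composite Simpson's rule on [0, c]; intervals must be even."""
    h = c / intervals
    total = beta_pdf(0.0, a, b) + beta_pdf(c, a, b)
    for i in range(1, intervals):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3.0

ratio = 0.791 / (1.0 - 0.791)      # alpha / beta implied by E(P) = 0.791
for b in (1.0, 10.0, 51.5):
    a = ratio * b
    prior = beta_cdf(0.6328, a, b)                      # Pr(P <= 0.6328) before the data
    posterior = beta_cdf(0.6328, a + 100.0, b + 120.0)  # Pr(P <= 0.6328 | X = 100)
    print(f"beta={b:5.1f}  prior={prior:.3e}  posterior={posterior:.4f}")
```

Running the loop over a finer grid of β values would trace out a curve of the kind described for Fig. 5.8. A library routine for the incomplete beta function would be a more accurate choice in practice.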


Example 5.8.5. A Clinical Trial. Consider the clinical trial described in Example 2.1.4. Let
P be the proportion of all patients in a large group receiving imipramine who have no relapse
(called success). A popular model for P is that P has the beta distribution with
parameters α and β. Choosing α and β can be done based on expert opinion about the
chance of success and on the effect that data should have on the distribution of P after
observing the data. For example, suppose that the doctors running the clinical trial
think that the probability of success should be around 1/3. Let X_i = 1 if the ith patient
is a success and X_i = 0 if not. We are supposing that E(X_i|p) = Pr(X_i = 1|p) = p, so
the law of total probability for expectations (Theorem 4.7.1) says that

$$
\Pr(X_i = 1) = E(X_i) = E[E(X_i \mid P)] = E(P) = \frac{\alpha}{\alpha + \beta}.
$$

If we want Pr(X_i = 1) = 1/3, we need α/(α + β) = 1/3, so β = 2α. Of course, the
doctors will revise the probability of success after observing patients from the study.
The doctors can choose α and β based on how that revision will occur.

Assume that the random variables X_1, X_2, ... (the indicators of success) are
conditionally independent given P = p. Let X = X_1 + ··· + X_n be the number of patients
out of the first n who are successes. The conditional distribution of X given P = p
is the binomial distribution with parameters n and p, and the marginal distribution
of P is the beta distribution with parameters α and β. Theorem 5.8.2 tells us that
the conditional distribution of P given X = x is the beta distribution with parameters
α + x and β + n − x. Suppose that a sequence of 20 patients, all of whom are
successes, would raise the doctors’ probability of success from 1/3 up to 0.9. Then

$$
\frac{\alpha + 20}{\alpha + \beta + 20} = 0.9.
$$

This equation implies that α + 20 = 9β. Combining this with β = 2α, we get α = 1.18
and β = 2.35.

Finally, we can ask, what will be the distribution of P after observing some
patients in the study? Suppose that 40 patients are actually observed, and 22 of them
recover (as in Table 2.1). Then the conditional distribution of P given this observation
is the beta distribution with parameters 1.18 + 22 = 23.18 and 2.35 + 18 = 20.35. It
follows that

$$
E(P \mid X = 22) = \frac{23.18}{23.18 + 20.35} = 0.5325.
$$

Notice how much closer this is to the proportion of successes (0.55) than was E(P) =
1/3. ◄
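The two conditions used in this example determine α and β uniquely, and the algebra can be packaged as a small function. The sketch below is an illustration with a hypothetical function name; it solves the two mean conditions and then recomputes the mean of P after 22 successes among 40 patients. Because it keeps unrounded parameter values, its last digits may differ slightly from figures computed with α = 1.18 and β = 2.35.

```python
def elicit_beta_prior(prior_mean: float, successes: int, updated_mean: float):
    """Solve alpha / (alpha + beta) = prior_mean and
    (alpha + successes) / (alpha + beta + successes) = updated_mean for (alpha, beta)."""
    # Write s = alpha + beta, so alpha = prior_mean * s. The second condition becomes
    # prior_mean * s + successes = updated_mean * (s + successes), which is linear in s.
    s = successes * (1.0 - updated_mean) / (updated_mean - prior_mean)
    alpha = prior_mean * s
    return alpha, s - alpha

alpha, beta = elicit_beta_prior(1.0 / 3.0, 20, 0.9)
print(round(alpha, 2), round(beta, 2))            # 1.18 2.35

# Mean of P after observing 22 successes among 40 patients.
posterior_mean = (alpha + 22) / (alpha + beta + 40)
print(posterior_mean)
```

The formula requires the updated mean to exceed the prior mean; otherwise a run of successes could not produce the stated revision and no valid beta distribution exists.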


Proof of Theorem 5.8.1.


Theorem 5.8.1, i.e., Eq. (5.8.2), is part of the following useful result. The proof uses
Theorem 3.9.5 (multivariate transformation of random variables). If you did not
study Theorem 3.9.5, you will not be able to follow the proof of Theorem 5.8.4.


Theorem 5.8.4. Let U and V be independent random variables with U having the gamma
distribution with parameters α and 1 and V having the gamma distribution with parameters β
and 1. Then

• X = U/(U + V) and Y = U + V are independent,

• X has the beta distribution with parameters α and β, and

• Y has the gamma distribution with parameters α + β and 1.

Also, Eq. (5.8.2) holds.
Proof Because U and V are independent, the joint p.d.f. of U and V is the product
of their marginal p.d.f.'s, which are

$$
f_1(u) = \frac{u^{\alpha-1} e^{-u}}{\Gamma(\alpha)}, \quad \text{for } u > 0,
\qquad
f_2(v) = \frac{v^{\beta-1} e^{-v}}{\Gamma(\beta)}, \quad \text{for } v > 0.
$$

So, the joint p.d.f. is

$$
f(u, v) = \frac{u^{\alpha-1} v^{\beta-1} e^{-(u+v)}}{\Gamma(\alpha)\Gamma(\beta)},
$$

for u > 0 and v > 0. The transformation from (u, v) to (x, y) is

$$
x = r_1(u, v) = \frac{u}{u + v} \quad \text{and} \quad y = r_2(u, v) = u + v,
$$

and the inverse is

$$
u = s_1(x, y) = xy \quad \text{and} \quad v = s_2(x, y) = (1 - x)y.
$$

The Jacobian is the determinant of the matrix

$$
J = \begin{bmatrix} y & x \\ -y & 1 - x \end{bmatrix},
$$

which equals y. According to Theorem 3.9.5, the joint p.d.f. of (X, Y) is then

$$
g(x, y) = f(s_1(x, y), s_2(x, y))\, y
= \frac{x^{\alpha-1}(1 - x)^{\beta-1} y^{\alpha+\beta-1} e^{-y}}{\Gamma(\alpha)\Gamma(\beta)}, \tag{5.8.7}
$$

for 0 < x < 1 and y > 0. Notice that this joint p.d.f. factors into separate functions
of x and y, and hence X and Y are independent. The marginal distribution of Y is
available from Theorem 5.7.7. The marginal p.d.f. of X is obtained by integrating y
out of (5.8.7):

$$
\begin{aligned}
g_1(x) &= \int_0^\infty \frac{x^{\alpha-1}(1 - x)^{\beta-1} y^{\alpha+\beta-1} e^{-y}}{\Gamma(\alpha)\Gamma(\beta)}\,dy \\
&= \frac{x^{\alpha-1}(1 - x)^{\beta-1}}{\Gamma(\alpha)\Gamma(\beta)} \int_0^\infty y^{\alpha+\beta-1} e^{-y}\,dy \\
&= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1 - x)^{\beta-1},
\end{aligned}
\tag{5.8.8}
$$

where the last equation follows from (5.7.2). Because the far right side of (5.8.8) is
a p.d.f., it integrates to 1, which proves Eq. (5.8.2). Also, one can recognize the far
right side of (5.8.8) as the p.d.f. of the beta distribution with parameters α and β. ∎
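The conclusions of this result can be probed by simulation. The Python sketch below is an illustration under assumed parameters (α = 2, β = 5); it uses the standard library's `random.gammavariate`, whose second argument is a scale parameter, set to 1 here. It compares the sample means of X and Y with α/(α + β) and α + β, and prints the sample covariance. A covariance near 0 is consistent with independence but does not by itself prove it.

```python
import random

def simulate_ratio_and_sum(alpha: float, beta: float, trials: int, seed: int = 2):
    """Draw (X, Y) = (U / (U + V), U + V) with U ~ gamma(alpha, 1) and V ~ gamma(beta, 1)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(trials):
        u = rng.gammavariate(alpha, 1.0)   # second argument is the scale, set to 1
        v = rng.gammavariate(beta, 1.0)
        xs.append(u / (u + v))
        ys.append(u + v)
    return xs, ys

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

alpha, beta = 2.0, 5.0                     # illustrative parameters
xs, ys = simulate_ratio_and_sum(alpha, beta, trials=50_000)
x_bar, y_bar = mean(xs), mean(ys)
covariance = mean([(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)])
print(x_bar, alpha / (alpha + beta))       # sample mean of X vs. alpha / (alpha + beta)
print(y_bar, alpha + beta)                 # sample mean of Y vs. alpha + beta
print(covariance)                          # near 0, as independence would imply
```

This construction is also a simple way to generate beta random variables when only a gamma generator is available.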




Summary 


The family of beta distributions is a popular model for random variables that lie in
the interval (0, 1), such as unknown proportions of success for sequences of Bernoulli
trials. The mean of the beta distribution with parameters α and β is α/(α + β). If X
has the binomial distribution with parameters n and p conditional on P = p, and if
P has the beta distribution with parameters α and β, then, conditional on X = x, the
distribution of P is the beta distribution with parameters α + x and β + n − x.


Exercises 


1. Compute the quantile function of the beta distribution with parameters α > 0 and β = 1.


2. Determine the mode of the beta distribution with parameters α and β, assuming that
α > 1 and β > 1.


3. Sketch the p.d.f. of the beta distribution for each of the following pairs of values of the
parameters:
a. α = 1/2 and β = 1/2
b. α = 1/2 and β = 1
c. α = 1/2 and β = 2
d. α = 1 and β = 1
e. α = 1 and β = 2
f. α = 2 and β = 2
g. α = 25 and β = 100
h. α = 100 and β = 25


4. Suppose that X has the beta distribution with parameters α and β. Show that 1 − X has
the beta distribution with parameters β and α.


5. Suppose that X has the beta distribution with parameters α and β, and let r and s be given
positive integers. Determine the value of $E[X^{r}(1 - X)^{s}]$.


6. Suppose that X and Y are independent random variables, X has the gamma distribution
with parameters α_1 and β, and Y has the gamma distribution with parameters α_2 and β.
Let U = X/(X + Y) and V = X + Y. Show that (a) U has the beta distribution with
parameters α_1 and α_2, and (b) U and V are independent. Hint: Look at the steps in the
proof of Theorem 5.8.1.


7. Suppose that X_1 and X_2 form a random sample of two observed values from the
exponential distribution with parameter β. Show that X_1/(X_1 + X_2) has the uniform
distribution on the interval [0, 1].


8. Suppose that the proportion X of defective items in a large lot is unknown and that X has
the beta distribution with parameters α and β.

a. If one item is selected at random from the lot, what is the probability that it will be
defective?

b. If two items are selected at random from the lot, what is the probability that both will
be defective?


9. A manufacturer believes that an unknown proportion 
P of parts produced will be defective. She models P as 
having a beta distribution. The manufacturer thinks that P 
should be around 0.05, but if the first 10 observed products 
were all defective, the mean of P would rise from 0.05 to 
0.9. Find the beta distribution that has these properties. 


10. A marketer is interested in how many customers are 
likely to buy a particular product in a particular store. Let 
P be the proportion of all customers in the store who will 
buy the product. Let the distribution of P be uniform on 
the interval [0, 1] before observing any data. The marketer 
then observes 25 customers and only six buy the product. 
If the customers were conditionally independent given P, 
find the conditional distribution of P given the observed 
customers. 


5.9 The Multinomial Distributions 


Many times we observe data that can assume three or more possible values. The
family of multinomial distributions is an extension of the family of binomial
distributions to handle these cases. The multinomial distributions are multivariate
distributions.


Definition and Derivation of Multinomial Distributions 


Example 5.9.1


Blood Types. In Example 1.8.4 on page 34, we discussed human blood types, of which
there are four: O, A, B, and AB. If a number of people are chosen at random, we
might be interested in the probability of obtaining certain numbers of each blood
type. Such calculations are used in the courts during paternity suits.


In general, suppose that a population contains items of $k$ different types ($k \ge 2$)
and that the proportion of the items in the population that are of type $i$ is $p_i$
($i = 1, \ldots, k$). It is assumed that $p_i > 0$ for $i = 1, \ldots, k$, and
$\sum_{i=1}^{k} p_i = 1$. Let $\mathbf{p} = (p_1, \ldots, p_k)$ denote the vector of these probabilities.

Next, suppose that $n$ items are selected at random from the population, with
replacement, and let $X_i$ denote the number of selected items that are of type $i$
($i = 1, \ldots, k$). Because the $n$ items are selected from the population at random with
replacement, the selections will be independent of each other. Hence, the probability
that the first item will be of type $i_1$, the second item of type $i_2$, and so on, is simply
$p_{i_1} p_{i_2} \cdots p_{i_n}$. Therefore, the probability that the sequence of $n$ outcomes will consist
of exactly $x_1$ items of type 1, $x_2$ items of type 2, and so on, selected in a particular
prespecified order, is $p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$. It follows that the probability of obtaining exactly
$x_i$ items of type $i$ ($i = 1, \ldots, k$) is equal to the probability $p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$ multiplied by
the total number of different ways in which the order of the $n$ items can be specified.

From the discussion that led to the definition of multinomial coefficients (Definition 1.9.1),
it follows that the total number of different ways in which $n$ items can be
arranged when there are $x_i$ items of type $i$ ($i = 1, \ldots, k$) is given by the multinomial
coefficient

$$\binom{n}{x_1, \ldots, x_k} = \frac{n!}{x_1!\, x_2! \cdots x_k!}.$$

In the notation of multivariate distributions, let $\mathbf{X} = (X_1, \ldots, X_k)$ denote the random
vector of counts, and let $\mathbf{x} = (x_1, \ldots, x_k)$ denote a possible value for that random
vector. Finally, let $f(\mathbf{x} \mid n, \mathbf{p})$ denote the joint p.f. of $\mathbf{X}$. Then

$$f(\mathbf{x} \mid n, \mathbf{p}) = \Pr(\mathbf{X} = \mathbf{x}) = \Pr(X_1 = x_1, \ldots, X_k = x_k)
= \begin{cases} \dbinom{n}{x_1, \ldots, x_k} p_1^{x_1} \cdots p_k^{x_k} & \text{if } x_1 + \cdots + x_k = n, \\ 0 & \text{otherwise.} \end{cases} \tag{5.9.1}$$

Multinomial Distributions. A discrete random vector $\mathbf{X} = (X_1, \ldots, X_k)$ whose p.f.
is given by Eq. (5.9.1) has the multinomial distribution with parameters $n$ and
$\mathbf{p} = (p_1, \ldots, p_k)$.


Attendance at a Baseball Game. Suppose that 23 percent of the people attending a 
certain baseball game live within 10 miles of the stadium, 59 percent live between 
10 and 50 miles from the stadium, and 18 percent live more than 50 miles from 
the stadium. Suppose also that 20 people are selected at random from the crowd 
attending the game. We shall determine the probability that seven of the people 
selected live within 10 miles of the stadium, eight of them live between 10 and 50 
miles from the stadium, and five of them live more than 50 miles from the stadium. 

We shall assume that the crowd attending the game is so large that it is irrelevant 
whether the 20 people are selected with or without replacement. We can therefore 
assume that they were selected with replacement. It then follows from Eq. (5.9.1) 
that the required probability is 


$$\frac{20!}{7!\,8!\,5!}(0.23)^7(0.59)^8(0.18)^5 = 0.0094.$$
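
The calculation in the baseball example can be checked numerically. The following is a minimal sketch, assuming only the Python standard library; the function name `multinomial_pf` is introduced here for illustration and does not come from the text. It evaluates Eq. (5.9.1) as the multinomial coefficient times the probability of one prespecified ordering.

```python
from math import factorial, prod


def multinomial_pf(counts, probs):
    """Joint p.f. of Eq. (5.9.1) evaluated at x = counts, with n = sum(counts)."""
    if len(counts) != len(probs) or any(x < 0 for x in counts):
        return 0.0
    n = sum(counts)
    # Multinomial coefficient n! / (x_1! x_2! ... x_k!), computed exactly with integers.
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)
    # Probability p_1^{x_1} ... p_k^{x_k} of one prespecified ordering.
    return coefficient * prod(p ** x for p, x in zip(probs, counts))


# Seven people within 10 miles, eight between 10 and 50 miles, five beyond 50 miles.
baseball = multinomial_pf([7, 8, 5], [0.23, 0.59, 0.18])
print(round(baseball, 4))
```

Rounded to four decimal places, the printed value should agree with the 0.0094 reported in the example.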


Blood Types. Berry and Geisser (1986) estimate the probabilities of the four blood
types in Table 5.3 based on a sample of 6004 white Californians that was analyzed by
Grunbaum et al. (1978). Suppose that we will select two people at random from this
population and observe their blood types. What is the probability that they will both
have the same blood type? The event that the two people have the same blood type
is the union of four disjoint events, each of which is the event that the two people
both have one of the four different blood types. Each of these events has probability
$\binom{2}{2,0,0,0}$ times the square of one of the four probabilities. The probability that we want
is the sum of the probabilities of the four events:

$$\binom{2}{2,0,0,0}\left(0.360^2 + 0.123^2 + 0.038^2 + 0.479^2\right) = 0.376.$$


Table 5.3 Estimated probabilities of blood types for white Californians

A       B       AB      O
0.360   0.123   0.038   0.479


Relation between the Multinomial and Binomial Distributions 


When the population being sampled contains only two different types of items, 
that is, when k = 2, each multinomial distribution reduces to essentially a binomial 
distribution. The precise form of this relationship is as follows. 


Theorem 5.9.1


Suppose that the random vector $\mathbf{X} = (X_1, X_2)$ has the multinomial distribution with
parameters $n$ and $\mathbf{p} = (p_1, p_2)$. Then $X_1$ has the binomial distribution with parameters
$n$ and $p_1$, and $X_2 = n - X_1$.


Proof. It is clear from the definition of multinomial distributions that $X_2 = n - X_1$
and $p_2 = 1 - p_1$. Therefore, the random vector $\mathbf{X}$ is actually determined by the single
random variable $X_1$. From the derivation of the multinomial distribution, we see that
$X_1$ is the number of items of type 1 that are selected if $n$ items are selected from a
population consisting of two types of items. If we call items of type 1 “success,” then
$X_1$ is the number of successes in $n$ Bernoulli trials with probability of success on each
trial equal to $p_1$. It follows that $X_1$ has the binomial distribution with parameters $n$
and $p_1$.


The proof of Theorem 5.9.1 extends easily to the following result. 


Corollary 5.9.1


Suppose that the random vector $\mathbf{X} = (X_1, \ldots, X_k)$ has the multinomial distribution
with parameters $n$ and $\mathbf{p} = (p_1, \ldots, p_k)$. The marginal distribution of each variable
$X_i$ ($i = 1, \ldots, k$) is the binomial distribution with parameters $n$ and $p_i$.


Proof. Choose one $i$ from $1, \ldots, k$, and define success to be the selection of an item
of type $i$. Then $X_i$ is the number of successes in $n$ Bernoulli trials with probability of
success on each trial equal to $p_i$.


A further generalization of Corollary 5.9.1 is that the marginal distribution of the 
sum of some of the coordinates of a multinomial vector has a binomial distribution. 
The proof is left to Exercise 1 in this section. 


Corollary 5.9.2


Suppose that the random vector $\mathbf{X} = (X_1, \ldots, X_k)$ has the multinomial distribution
with parameters $n$ and $\mathbf{p} = (p_1, \ldots, p_k)$ with $k > 2$. Let $\ell < k$, and let $i_1, \ldots, i_\ell$ be
distinct elements of the set $\{1, \ldots, k\}$. The distribution of $Y = X_{i_1} + \cdots + X_{i_\ell}$ is the
binomial distribution with parameters $n$ and $p_{i_1} + \cdots + p_{i_\ell}$.


As a final note, the relationship between Bernoulli and binomial distributions
extends to multinomial distributions. The Bernoulli distribution with parameter $p$ is
the same as the binomial distribution with parameters 1 and $p$. However, there is no
separate name for a multinomial distribution with first parameter $n = 1$. A random
vector with such a distribution will consist of a single 1 in one of its coordinates and
$k - 1$ zeros in the other coordinates. The probability is $p_i$ that the $i$th coordinate is
the 1. A $k$-dimensional vector seems an unwieldy way to represent a random object
that can take only $k$ different values. A more common representation would be as
a single discrete random variable $X$ that takes one of the $k$ values $1, \ldots, k$ with
probabilities $p_1, \ldots, p_k$, respectively. The univariate distribution just described has
no famous name associated with it; however, we have just shown that it is closely
related to the multinomial distribution with parameters 1 and $(p_1, \ldots, p_k)$.


Means, Variances, and Covariances 


The means, variances, and covariances of the coordinates of a multinomial random
vector are given by the next result.


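Means, Variances, and Covariances. Let the random vector $\mathbf{X}$ have the multinomial
distribution with parameters $n$ and $\mathbf{p}$. The means and variances of the coordinates of
$\mathbf{X}$ are

$$E(X_i) = np_i \quad\text{and}\quad \operatorname{Var}(X_i) = np_i(1 - p_i) \quad\text{for } i = 1, \ldots, k. \tag{5.9.2}$$

Also, the covariances between the coordinates are

$$\operatorname{Cov}(X_i, X_j) = -np_i p_j \quad\text{for } i \ne j. \tag{5.9.3}$$


Proof. Corollary 5.9.1 says that the marginal distribution of each component $X_i$ is
the binomial distribution with parameters $n$ and $p_i$. Eq. (5.9.2) follows directly from
this fact.


Corollary 5.9.2 says that $X_i + X_j$ has the binomial distribution with parameters
$n$ and $p_i + p_j$. Hence,

$$\operatorname{Var}(X_i + X_j) = n(p_i + p_j)(1 - p_i - p_j). \tag{5.9.4}$$

According to Theorem 4.6.6, it is also true that

$$\operatorname{Var}(X_i + X_j) = \operatorname{Var}(X_i) + \operatorname{Var}(X_j) + 2\operatorname{Cov}(X_i, X_j)
= np_i(1 - p_i) + np_j(1 - p_j) + 2\operatorname{Cov}(X_i, X_j). \tag{5.9.5}$$

Equate the right sides of (5.9.4) and (5.9.5), and solve for $\operatorname{Cov}(X_i, X_j)$. Expanding
the right side of (5.9.4) gives $np_i(1 - p_i) + np_j(1 - p_j) - 2np_i p_j$, so the result is
(5.9.3).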
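
The formulas $E(X_i) = np_i$, $\operatorname{Var}(X_i) = np_i(1 - p_i)$, and $\operatorname{Cov}(X_i, X_j) = -np_i p_j$ can be checked exactly for a small case by summing over every possible outcome. The sketch below assumes $k = 3$ and uses illustrative values of $n$ and $\mathbf{p}$ that are not from the text; the helper names are hypothetical.

```python
from math import factorial


def multinomial_pf(counts, probs):
    """Multinomial p.f.: n!/(x_1! ... x_k!) * p_1^{x_1} ... p_k^{x_k}."""
    value = float(factorial(sum(counts)))
    for x, p in zip(counts, probs):
        value *= p ** x / factorial(x)
    return value


def exact_moments(n, probs):
    """Return E(X1), Var(X1), Cov(X1, X2) for k = 3 by enumerating all outcomes."""
    e1 = e2 = e11 = e12 = 0.0
    for x1 in range(n + 1):
        for x2 in range(n - x1 + 1):
            weight = multinomial_pf((x1, x2, n - x1 - x2), probs)
            e1 += x1 * weight
            e2 += x2 * weight
            e11 += x1 * x1 * weight
            e12 += x1 * x2 * weight
    return e1, e11 - e1 * e1, e12 - e1 * e2


n, probs = 5, (0.2, 0.3, 0.5)  # illustrative values, not from the text
mean1, var1, cov12 = exact_moments(n, probs)
print(mean1, var1, cov12)
```

Up to floating-point rounding, the three printed numbers should match $np_1 = 1$, $np_1(1 - p_1) = 0.8$, and $-np_1p_2 = -0.3$.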


Note: Negative Covariance Is Natural for Multinomial Distributions. The negative 
covariance between different coordinates of a multinomial vector is natural since 
there are only n selections to be distributed among the k coordinates of the vector. If 
one of the coordinates is large, at least some of the others have to be small because 
the sum of the coordinates is fixed at n. 


Summary 


Multinomial distributions extend binomial distributions to counts of more than two
possible outcomes. The $i$th coordinate of a vector having the multinomial distribution
with parameters $n$ and $\mathbf{p} = (p_1, \ldots, p_k)$ has the binomial distribution with parameters
$n$ and $p_i$ for $i = 1, \ldots, k$. Hence, the means and variances of the coordinates of
a multinomial vector are the same as those of a binomial random variable. The
covariance between the $i$th and $j$th coordinates is $-np_i p_j$.


Exercises 


1. Prove Corollary 5.9.2. 


2. Suppose that $F$ is a continuous c.d.f. on the real line,
and let $a_1$ and $a_2$ be numbers such that $F(a_1) = 0.3$ and
$F(a_2) = 0.8$. If 25 observations are selected at random
from the distribution for which the c.d.f. is $F$, what is the
probability that six of the observed values will be less than
$a_1$, 10 of the observed values will be between $a_1$ and $a_2$,
and nine of the observed values will be greater than $a_2$?


3. If five balanced dice are rolled, what is the probability 
that the number 1 and the number 4 will appear the same 
number of times? 


4. Suppose that a die is loaded so that each of the numbers
1, 2, 3, 4, 5, and 6 has a different probability of appearing
when the die is rolled. For $i = 1, \ldots, 6$, let $p_i$ denote
the probability that the number $i$ will be obtained, and
suppose that $p_1 = 0.11$, $p_2 = 0.30$, $p_3 = 0.22$, $p_4 = 0.05$,
$p_5 = 0.25$, and $p_6 = 0.07$. Suppose also that the die is to
be rolled 40 times. Let $X_1$ denote the number of rolls
for which an even number appears, and let $X_2$ denote
the number of rolls for which either the number 1 or
the number 3 appears. Find the value of $\Pr(X_1 = 20 \text{ and } X_2 = 15)$.


5. Suppose that 16 percent of the students in a certain 
high school are freshmen, 14 percent are sophomores, 38 
percent are juniors, and 32 percent are seniors. If 15 stu- 
dents are selected at random from the school, what is the 
probability that at least eight will be either freshmen or 
sophomores? 


6. In Exercise 5, let X3 denote the number of juniors 
in the random sample of 15 students, and let X4 denote 
the number of seniors in the sample. Find the value of 
E(X3 — X4) and the value of Var(X3 — X4). 


7. Suppose that the random variables $X_1, \ldots, X_k$ are independent
and that $X_i$ has the Poisson distribution with
mean $\lambda_i$ ($i = 1, \ldots, k$). Show that for each fixed positive
integer $n$, the conditional distribution of the random
vector $\mathbf{X} = (X_1, \ldots, X_k)$, given that $\sum_{i=1}^{k} X_i = n$,
is the multinomial distribution with parameters $n$ and
$\mathbf{p} = (p_1, \ldots, p_k)$, where

$$p_i = \frac{\lambda_i}{\sum_{j=1}^{k} \lambda_j} \quad\text{for } i = 1, \ldots, k.$$


8. Suppose that the parts produced by a machine can have
three different levels of functionality: working, impaired,
defective. Let $p_1$, $p_2$, and $p_3 = 1 - p_1 - p_2$ be the probabilities
that a part is working, impaired, and defective,
respectively. Suppose that the vector $\mathbf{p} = (p_1, p_2)$ is unknown
but has a joint distribution with p.d.f.

$$f(p_1, p_2) = \begin{cases} 12p_1^2 & \text{for } 0 < p_1, p_2 < 1 \text{ and } p_1 + p_2 < 1, \\ 0 & \text{otherwise.} \end{cases}$$

Suppose that we observe 10 parts that are conditionally
independent given $\mathbf{p}$, and among those 10 parts, eight
are working and two are impaired. Find the conditional
p.d.f. of $\mathbf{p}$ given the observed parts. Hint: You might find
Eq. (5.8.2) helpful.


5.10 The Bivariate Normal Distributions 


The first family of multivariate continuous distributions for which we have a name
is a generalization of the family of normal distributions to two coordinates. There
is more structure to a bivariate normal distribution than just a pair of normal
marginal distributions.


Definition and Derivation of Bivariate Normal Distributions 


Example 5.10.1


Thyroid Hormones. Production of rocket fuel produces a chemical, perchlorate, that
has found its way into drinking water supplies. Perchlorate is suspected of inhibiting
thyroid function. Experiments have been performed in which laboratory rats have
been dosed with perchlorate in their drinking water. After several weeks, rats were
sacrificed, and a number of thyroid hormones were measured. The levels of these hormones
were then compared to the levels of the same hormones in rats that received
no perchlorate in their water. Two hormones, TSH and T4, were of particular interest.
Experimenters were interested in the joint distribution of TSH and T4. Although
each of the hormones might be modeled with a normal distribution, a bivariate distribution
is needed in order to model the two hormone levels jointly. Knowledge of
thyroid activity suggests that the levels of these hormones will not be independent,
because one of them is actually used by the thyroid to stimulate production of the
other.


If researchers are comfortable using the family of normal distributions to model 
each of two random variables separately, such as the hormones in Example 5.10.1, 
then they need a bivariate generalization of the family of normal distributions that 
still has normal distributions for its marginals while allowing the two random vari- 
ables to be dependent. A simple way to create such a generalization is to make use 
of the result in Corollary 5.6.1. That result says that a linear combination of indepen- 
dent normal random variables has a normal distribution. If we create two different 
linear combinations $X_1$ and $X_2$ of the same independent normal random variables,
then $X_1$ and $X_2$ will each have a normal distribution and they might be dependent.
The following result formalizes this idea. 


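Theorem 5.10.1


Suppose that $Z_1$ and $Z_2$ are independent random variables, each of which has the
standard normal distribution. Let $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$ be constants such that
$-\infty < \mu_i < \infty$ ($i = 1, 2$), $\sigma_i > 0$ ($i = 1, 2$), and $-1 < \rho < 1$. Define two new random variables
$X_1$ and $X_2$ as follows:

$$\begin{aligned}
X_1 &= \sigma_1 Z_1 + \mu_1, \\
X_2 &= \sigma_2\left[\rho Z_1 + (1 - \rho^2)^{1/2} Z_2\right] + \mu_2.
\end{aligned} \tag{5.10.1}$$

The joint p.d.f. of $X_1$ and $X_2$ is

$$f(x_1, x_2) = \frac{1}{2\pi(1 - \rho^2)^{1/2}\sigma_1\sigma_2}
\exp\left\{-\frac{1}{2(1 - \rho^2)}\left[\left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2
- 2\rho\left(\frac{x_1 - \mu_1}{\sigma_1}\right)\left(\frac{x_2 - \mu_2}{\sigma_2}\right)
+ \left(\frac{x_2 - \mu_2}{\sigma_2}\right)^2\right]\right\}. \tag{5.10.2}$$
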
Proof. This proof relies on Theorem 3.9.5 (multivariate transformation of random
variables). If you did not study Theorem 3.9.5, you won’t be able to follow this proof.
The joint p.d.f. $g(z_1, z_2)$ of $Z_1$ and $Z_2$ is

$$g(z_1, z_2) = \frac{1}{2\pi}\exp\left[-\frac{1}{2}\left(z_1^2 + z_2^2\right)\right], \tag{5.10.3}$$

for all $z_1$ and $z_2$.
The inverse of the transformation (5.10.1) is $(Z_1, Z_2) = (s_1(X_1, X_2), s_2(X_1, X_2))$,
where

$$\begin{aligned}
s_1(x_1, x_2) &= \frac{x_1 - \mu_1}{\sigma_1}, \\
s_2(x_1, x_2) &= \frac{1}{(1 - \rho^2)^{1/2}}\left(\frac{x_2 - \mu_2}{\sigma_2} - \rho\,\frac{x_1 - \mu_1}{\sigma_1}\right).
\end{aligned} \tag{5.10.4}$$

The Jacobian $J$ of the transformation is

$$J = \det\begin{bmatrix} \dfrac{1}{\sigma_1} & 0 \\[2ex] \dfrac{-\rho}{\sigma_1(1 - \rho^2)^{1/2}} & \dfrac{1}{\sigma_2(1 - \rho^2)^{1/2}} \end{bmatrix}
= \frac{1}{(1 - \rho^2)^{1/2}\sigma_1\sigma_2}. \tag{5.10.5}$$

If one substitutes $s_i(x_1, x_2)$ for $z_i$ ($i = 1, 2$) in Eq. (5.10.3) and then multiplies by
$|J|$, one obtains Eq. (5.10.2), which is the joint p.d.f. of $(X_1, X_2)$ according to
Theorem 3.9.5.


Some simple properties of the distribution with p.d.f. in Eq. (5.10.2) are worth 
deriving before giving a name to the joint distribution. 


Theorem 5.10.2


Suppose that $X_1$ and $X_2$ have the joint distribution whose p.d.f. is given by Eq. (5.10.2).
Then there exist independent standard normal random variables $Z_1$ and $Z_2$ such
that Eqs. (5.10.1) hold. Also, the mean of $X_i$ is $\mu_i$ and the variance of $X_i$ is $\sigma_i^2$ for
$i = 1, 2$. Furthermore, the correlation between $X_1$ and $X_2$ is $\rho$. Finally, the marginal
distribution of $X_i$ is the normal distribution with mean $\mu_i$ and variance $\sigma_i^2$ for $i = 1, 2$.


Proof. Use the functions $s_1$ and $s_2$ defined in Eqs. (5.10.4) and define $Z_i = s_i(X_1, X_2)$
for $i = 1, 2$. By running the proof of Theorem 5.10.1 in reverse, we see that the joint
p.d.f. of $Z_1$ and $Z_2$ is Eq. (5.10.3). Hence, $Z_1$ and $Z_2$ are independent standard normal
random variables.

The values of the means and variances of $X_1$ and $X_2$ are easily obtained by applying
Corollary 5.6.1 to Eq. (5.10.1). If one applies the result in Exercise 8 of Sec. 4.6,
one obtains $\operatorname{Cov}(X_1, X_2) = \rho\sigma_1\sigma_2$. It now follows that $\rho$ is the correlation. The claim
about the marginal distributions of $X_1$ and $X_2$ is immediate from Corollary 5.6.1.
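
The construction in Eq. (5.10.1) doubles as a recipe for simulating a correlated pair from two independent standard normals. The following is a rough sketch under stated assumptions: it uses only the Python standard library, the parameter values are made up for illustration, and the helper names `bivariate_normal_pair` and `sample_correlation` are hypothetical.

```python
import random
from math import sqrt


def bivariate_normal_pair(mu1, mu2, sigma1, sigma2, rho, rng):
    """Draw (X1, X2) via Eq. (5.10.1) from two independent standard normals."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x1 = sigma1 * z1 + mu1
    x2 = sigma2 * (rho * z1 + sqrt(1.0 - rho ** 2) * z2) + mu2
    return x1, x2


def sample_correlation(pairs):
    """Ordinary sample correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    m1 = sum(x for x, _ in pairs) / n
    m2 = sum(y for _, y in pairs) / n
    s11 = sum((x - m1) ** 2 for x, _ in pairs)
    s22 = sum((y - m2) ** 2 for _, y in pairs)
    s12 = sum((x - m1) * (y - m2) for x, y in pairs)
    return s12 / sqrt(s11 * s22)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Illustrative parameter values (assumed for this sketch, not from the text).
pairs = [bivariate_normal_pair(10.0, -2.0, 3.0, 0.5, 0.7, rng) for _ in range(50_000)]
r = sample_correlation(pairs)
print(round(r, 2))
```

With this many draws, the sample correlation should land close to the chosen $\rho = 0.7$, which is what the correlation claim about the pair defined by Eq. (5.10.1) predicts.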


We are now ready to define the family of bivariate normal distributions. 


Bivariate Normal Distributions. When the joint p.d.f. of two random variables $X_1$ and
$X_2$ is of the form in Eq. (5.10.2), it is said that $X_1$ and $X_2$ have the bivariate normal
distribution with means $\mu_1$ and $\mu_2$, variances $\sigma_1^2$ and $\sigma_2^2$, and correlation $\rho$.


It was convenient for us to derive the bivariate normal distributions as the joint 
distributions of certain linear combinations of independent random variables hav- 
ing standard normal distributions. It should be emphasized, however, that bivariate 
normal distributions arise directly and naturally in many practical problems. For ex- 
ample, for many populations the joint distribution of two physical characteristics such 
as the heights and the weights of the individuals in the population will be approxi- 
mately a bivariate normal distribution. For other populations, the joint distribution 
of the scores of the individuals in the population on two related tests will be approx- 
imately a bivariate normal distribution. 


Example 5.10.2


Anthropometry of Flea Beetles. Lubischew (1962) reports the measurements of several
physical features of a variety of species of flea beetle. The investigation was concerned
with whether some combination of easily obtained measurements could be used to
distinguish the different species. Figure 5.9 shows a scatterplot of measurements of
the first joint in the first tarsus versus the second joint in the first tarsus for a sample of
31 from the species Chaetocnema heikertingeri. The plot also includes three ellipses
that correspond to a fitted bivariate normal distribution. The ellipses were chosen
to contain 25%, 50%, and 75% of the probability of the fitted bivariate normal
distribution. The fitted distribution is the bivariate normal distribution with means
201 and 119.3, variances 222.1 and 44.2, and correlation 0.64.


Figure 5.9 Scatterplot of flea beetle data with 25%, 50%, and 75% bivariate
normal ellipses for Example 5.10.2. (Horizontal axis: first tarsus joint; vertical
axis: second tarsus joint.)


Properties of Bivariate Normal Distributions 


For random variables with a bivariate normal distribution, we find that being inde- 
pendent is equivalent to being uncorrelated. 


Theorem 5.10.3


Independence and Correlation. Two random variables $X_1$ and $X_2$ that have a bivariate
normal distribution are independent if and only if they are uncorrelated.


Proof. The “only if” direction is already known from Theorem 4.6.4. For the “if”
direction, assume that $X_1$ and $X_2$ are uncorrelated. Then $\rho = 0$, and it can be seen
from Eq. (5.10.2) that the joint p.d.f. $f(x_1, x_2)$ factors into the product of the marginal
p.d.f. of $X_1$ and the marginal p.d.f. of $X_2$. Hence, $X_1$ and $X_2$ are independent.
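
The factorization used in the “if” direction can be illustrated numerically. The sketch below, which assumes only the Python standard library, codes the bivariate normal p.d.f. with means $\mu_1, \mu_2$, standard deviations $\sigma_1, \sigma_2$, and correlation $\rho$, and compares it at $\rho = 0$ with the product of the two normal marginal p.d.f.’s. The function name and the numerical values are illustrative assumptions.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def bivariate_normal_pdf(x1, x2, mu1, mu2, sigma1, sigma2, rho):
    """Joint p.d.f. of the bivariate normal distribution, as in Eq. (5.10.2)."""
    u = (x1 - mu1) / sigma1
    v = (x2 - mu2) / sigma2
    quadratic = (u * u - 2.0 * rho * u * v + v * v) / (1.0 - rho ** 2)
    normalizer = 2.0 * pi * sqrt(1.0 - rho ** 2) * sigma1 * sigma2
    return exp(-0.5 * quadratic) / normalizer


# Illustrative values (assumed for this sketch).
mu1, mu2, sigma1, sigma2 = 1.0, -2.0, 1.5, 0.5
x1, x2 = 0.3, -1.7

joint = bivariate_normal_pdf(x1, x2, mu1, mu2, sigma1, sigma2, rho=0.0)
product = NormalDist(mu1, sigma1).pdf(x1) * NormalDist(mu2, sigma2).pdf(x2)
print(abs(joint - product) < 1e-12)

# With a nonzero correlation the joint p.d.f. no longer equals the product.
dependent = bivariate_normal_pdf(x1, x2, mu1, mu2, sigma1, sigma2, rho=0.6)
print(abs(dependent - product) < 1e-12)
```

The first comparison checks the factorization at a single point only; it is a sanity check on the algebra, not a proof of independence.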


We have already seen in Example 4.6.4 that two random variables $X_1$ and $X_2$
with an arbitrary joint distribution can be uncorrelated without being independent.
Theorem 5.10.3 says that no such examples exist in which $X_1$ and $X_2$ have a bivariate
normal distribution.

When the correlation is not zero, Theorem 5.10.2 gives the marginal distributions
of bivariate normal random variables. Combining the marginal and joint distributions
allows us to find the conditional distributions of each $X_i$ given the other one. The next
theorem derives the conditional distributions using another technique.


Theorem 5.10.4


Conditional Distributions. Let $X_1$ and $X_2$ have the bivariate normal distribution whose
p.d.f. is Eq. (5.10.2). The conditional distribution of $X_2$ given that $X_1 = x_1$ is the normal
distribution with mean and variance given by

$$E(X_2 \mid x_1) = \mu_2 + \rho\sigma_2\left(\frac{x_1 - \mu_1}{\sigma_1}\right), \qquad
\operatorname{Var}(X_2 \mid x_1) = (1 - \rho^2)\sigma_2^2. \tag{5.10.6}$$


Proof. We will make liberal use of Theorem 5.10.2 and its notation in this proof.
Conditioning on $X_1 = x_1$ is the same as conditioning on $Z_1 = (x_1 - \mu_1)/\sigma_1$. When we want
to find the conditional distribution of $X_2$ given $Z_1 = (x_1 - \mu_1)/\sigma_1$, we can substitute
$(x_1 - \mu_1)/\sigma_1$ for $Z_1$ in the formula for $X_2$ in Eq. (5.10.1) and find the conditional
distribution for the rest of the formula. That is, the conditional distribution of $X_2$ given
that $X_1 = x_1$ is the same as the conditional distribution of

$$(1 - \rho^2)^{1/2}\sigma_2 Z_2 + \mu_2 + \rho\sigma_2\left(\frac{x_1 - \mu_1}{\sigma_1}\right) \tag{5.10.7}$$

given $Z_1 = (x_1 - \mu_1)/\sigma_1$. But $Z_2$ is the only random variable in Eq. (5.10.7), and $Z_2$
is independent of $Z_1$. Hence, the conditional distribution of $X_2$ given $X_1 = x_1$ is the
marginal distribution of Eq. (5.10.7), namely, the normal distribution with mean and
variance given by Eq. (5.10.6).


The conditional distribution of $X_1$ given that $X_2 = x_2$ cannot be derived so easily
from Eq. (5.10.1) because of the different ways in which $Z_1$ and $Z_2$ enter Eq. (5.10.1).
However, it is seen from Eq. (5.10.2) that the joint distribution of $X_2$ and $X_1$ is also
bivariate normal with all of the subscripts 1 and 2 switched on all of the parameters.
Hence, we can apply Theorem 5.10.4 to $X_2$ and $X_1$ to conclude that the conditional
distribution of $X_1$ given that $X_2 = x_2$ must be the normal distribution with mean and
variance

$$E(X_1 \mid x_2) = \mu_1 + \rho\sigma_1\left(\frac{x_2 - \mu_2}{\sigma_2}\right), \qquad
\operatorname{Var}(X_1 \mid x_2) = (1 - \rho^2)\sigma_1^2. \tag{5.10.8}$$

We have now shown that each marginal distribution and each conditional distri- 
bution of a bivariate normal distribution is a univariate normal distribution. 

Some particular features of the conditional distribution of $X_2$ given that $X_1 = x_1$
should be noted. If $\rho \ne 0$, then $E(X_2 \mid x_1)$ is a linear function of $x_1$. If $\rho > 0$,
the slope of this linear function is positive. If $\rho < 0$, the slope of the function is
negative. However, the variance of the conditional distribution of $X_2$ given that
$X_1 = x_1$ is $(1 - \rho^2)\sigma_2^2$, which does not depend on $x_1$. Furthermore, this variance of
the conditional distribution of $X_2$ is smaller than the variance $\sigma_2^2$ of the marginal
distribution of $X_2$.


Example 5.10.3


Predicting a Person’s Weight. Let $X_1$ denote the height of a person selected at random
from a certain population, and let $X_2$ denote the weight of the person. Suppose that
these random variables have the bivariate normal distribution for which the p.d.f. is
specified by Eq. (5.10.2) and that the person’s weight $X_2$ must be predicted. We shall
compare the smallest M.S.E. that can be attained if the person’s height $X_1$ is known
when her weight must be predicted with the smallest M.S.E. that can be attained if
her height is not known.

If the person’s height is not known, then the best prediction of her weight is the
mean $E(X_2) = \mu_2$, and the M.S.E. of this prediction is the variance $\sigma_2^2$. If it is known
that the person’s height is $x_1$, then the best prediction is the mean $E(X_2 \mid x_1)$ of the
conditional distribution of $X_2$ given that $X_1 = x_1$, and the M.S.E. of this prediction is
the variance $(1 - \rho^2)\sigma_2^2$ of that conditional distribution. Hence, when the value of $X_1$
is known, the M.S.E. is reduced from $\sigma_2^2$ to $(1 - \rho^2)\sigma_2^2$.


Since the variance of the conditional distribution in Example 5.10.3 is $(1 - \rho^2)\sigma_2^2$
regardless of the known height $x_1$ of the person, it follows that the difficulty of
predicting the person’s weight is the same for a tall person, a short person, or a
person of medium height. Furthermore, since the variance $(1 - \rho^2)\sigma_2^2$ decreases as
$|\rho|$ increases, it follows that it is easier to predict a person’s weight from her height
when the person is selected from a population in which height and weight are highly
correlated.
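
The two observations above (the M.S.E. does not depend on the observed height, and it shrinks as $|\rho|$ grows) can be seen in a small calculation. This is a minimal sketch: the function implements the conditional mean $\mu_2 + \rho\sigma_2(x_1 - \mu_1)/\sigma_1$ and conditional variance $(1 - \rho^2)\sigma_2^2$ of Eq. (5.10.6), and the height and weight parameters are invented for illustration, not taken from the text.

```python
def conditional_x2_given_x1(x1, mu1, mu2, sigma1, sigma2, rho):
    """Mean and variance of X2 given X1 = x1, per Eq. (5.10.6)."""
    mean = mu2 + rho * sigma2 * (x1 - mu1) / sigma1
    variance = (1.0 - rho ** 2) * sigma2 ** 2
    return mean, variance


# Assumed illustrative parameters: X1 = height in inches, X2 = weight in pounds.
mu1, mu2, sigma1, sigma2 = 68.0, 160.0, 3.0, 20.0

for rho in (0.0, 0.5, 0.9):
    for x1 in (62.0, 68.0, 74.0):
        mean, variance = conditional_x2_given_x1(x1, mu1, mu2, sigma1, sigma2, rho)
        # The prediction moves with x1, but the M.S.E. (the variance) does not.
        print(f"rho={rho:.1f}  x1={x1:.0f}  prediction={mean:.1f}  MSE={variance:.1f}")
```

Within each value of $\rho$ the M.S.E. column is constant across heights, and it falls from $\sigma_2^2 = 400$ at $\rho = 0$ toward zero as $|\rho|$ approaches 1.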
Determining a Marginal Distribution. Suppose that a random variable $X$ has the normal
distribution with mean $\mu$ and variance $\sigma^2$, and that for every number $x$, the
conditional distribution of another random variable $Y$ given that $X = x$ is the normal
distribution with mean $x$ and variance $\tau^2$. We shall determine the marginal distribution
of $Y$.

We know that the marginal distribution of $X$ is a normal distribution, and the
conditional distribution of $Y$ given that $X = x$ is a normal distribution, for which the
mean is a linear function of $x$ and the variance is constant. It follows that the joint
distribution of $X$ and $Y$ must be a bivariate normal distribution (see Exercise 14).
Hence, the marginal distribution of $Y$ is also a normal distribution. The mean and
the variance of $Y$ must be determined.

The mean of $Y$ is

$$E(Y) = E[E(Y \mid X)] = E(X) = \mu.$$

Furthermore, by Theorem 4.7.4,

$$\operatorname{Var}(Y) = E[\operatorname{Var}(Y \mid X)] + \operatorname{Var}[E(Y \mid X)]
= E(\tau^2) + \operatorname{Var}(X) = \tau^2 + \sigma^2.$$

Hence the distribution of $Y$ is the normal distribution with mean $\mu$ and variance
$\tau^2 + \sigma^2$.
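
A quick simulation can make the two-stage structure of this example concrete: draw $X$ from its normal distribution, then draw $Y$ from the normal distribution centered at the drawn value of $X$. The sketch below assumes only the Python standard library; the values of $\mu$, $\sigma$, and $\tau$ are illustrative and not from the text.

```python
import random


def sample_y(mu, sigma, tau, rng):
    """Draw X ~ N(mu, sigma^2), then Y given X = x ~ N(x, tau^2); return Y."""
    x = rng.gauss(mu, sigma)
    return rng.gauss(x, tau)


rng = random.Random(1)  # fixed seed so the run is repeatable
mu, sigma, tau = 5.0, 2.0, 3.0  # illustrative values, not from the text

ys = [sample_y(mu, sigma, tau, rng) for _ in range(100_000)]
mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / (len(ys) - 1)
print(round(mean_y, 1), round(var_y, 1))
```

The sample mean and variance should fall near $\mu = 5$ and $\tau^2 + \sigma^2 = 13$, up to simulation error. The simulation checks only the first two moments; normality of $Y$ comes from the bivariate normal argument in the example.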


Linear Combinations 


Example 5.10.5


Heights of Husbands and Wives. Suppose that a married couple is selected at random
from a certain population of married couples and that the joint distribution of the
height of the wife and the height of her husband is a bivariate normal distribution.
What is the probability that, in the randomly chosen couple, the wife is taller than
the husband?


The question asked at the end of Example 5.10.5 can be expressed in terms of 
the distribution of the difference between a wife’s and husband’s heights. This is a 
special case of a linear combination of a bivariate normal vector. 


Linear Combination of Bivariate Normals. Suppose that two random variables $X_1$ and
$X_2$ have a bivariate normal distribution, for which the p.d.f. is specified by Eq. (5.10.2).
Let $Y = a_1X_1 + a_2X_2 + b$, where $a_1$, $a_2$, and $b$ are arbitrary given constants. Then $Y$
has the normal distribution with mean $a_1\mu_1 + a_2\mu_2 + b$ and variance

$$a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + 2a_1a_2\rho\sigma_1\sigma_2. \tag{5.10.9}$$


Proof. According to Theorem 5.10.2, both $X_1$ and $X_2$ can be represented, as in
Eq. (5.10.1), as linear combinations of independent and normally distributed random
variables $Z_1$ and $Z_2$. Since $Y$ is a linear combination of $X_1$ and $X_2$, it follows that
$Y$ can also be represented as a linear combination of $Z_1$ and $Z_2$. Therefore, by
Corollary 5.6.1, the distribution of $Y$ will also be a normal distribution. It only remains
to compute the mean and variance of $Y$. The mean of $Y$ is

$$E(Y) = a_1E(X_1) + a_2E(X_2) + b = a_1\mu_1 + a_2\mu_2 + b.$$

It also follows from Corollary 4.6.1 that

$$\operatorname{Var}(Y) = a_1^2\operatorname{Var}(X_1) + a_2^2\operatorname{Var}(X_2) + 2a_1a_2\operatorname{Cov}(X_1, X_2).$$

That $\operatorname{Var}(Y)$ is given by Eq. (5.10.9) now follows easily.


Example 5.10.6


Heights of Husbands and Wives. Consider again Example 5.10.5. Suppose that the
heights of the wives have a mean of 66.8 inches and a standard deviation of 2 inches,
the heights of the husbands have a mean of 70 inches and a standard deviation of 2
inches, and the correlation between these two heights is 0.68. We shall determine the
probability that the wife will be taller than her husband.

If we let $X$ denote the height of the wife, and let $Y$ denote the height of her
husband, then we must determine the value of $\Pr(X - Y > 0)$. Since $X$ and $Y$ have
a bivariate normal distribution, it follows that the distribution of $X - Y$ will be the
normal distribution, with mean

$$E(X - Y) = 66.8 - 70 = -3.2$$

and variance

$$\operatorname{Var}(X - Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) - 2\operatorname{Cov}(X, Y)
= 4 + 4 - 2(0.68)(2)(2) = 2.56.$$

Hence, the standard deviation of $X - Y$ is 1.6.
The random variable $Z = (X - Y + 3.2)/1.6$ will have the standard normal
distribution. It can be found from the table given at the end of this book that

$$\Pr(X - Y > 0) = \Pr(Z > 2) = 1 - \Phi(2) = 0.0227.$$

Therefore, the probability that the wife will be taller than her husband is 0.0227.
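
The same calculation can be reproduced without a printed table. The following sketch assumes Python 3.8 or later, whose standard library provides `statistics.NormalDist`; the helper name `linear_combination` is introduced here for illustration. It applies the mean $a_1\mu_1 + a_2\mu_2 + b$ and the variance of Eq. (5.10.9) with $a_1 = 1$, $a_2 = -1$, and $b = 0$.

```python
from math import sqrt
from statistics import NormalDist


def linear_combination(a1, a2, b, mu1, mu2, sigma1, sigma2, rho):
    """Mean and variance of Y = a1*X1 + a2*X2 + b, per Eq. (5.10.9)."""
    mean = a1 * mu1 + a2 * mu2 + b
    variance = (a1 * sigma1) ** 2 + (a2 * sigma2) ** 2 + 2 * a1 * a2 * rho * sigma1 * sigma2
    return mean, variance


# X = wife's height, Y = husband's height; the difference X - Y uses a1 = 1, a2 = -1.
mean_d, var_d = linear_combination(1, -1, 0, 66.8, 70.0, 2.0, 2.0, 0.68)

# Pr(X - Y > 0) for a normal distribution with the mean and variance just computed.
prob_wife_taller = 1.0 - NormalDist(mean_d, sqrt(var_d)).cdf(0.0)
print(round(mean_d, 2), round(var_d, 2), round(prob_wife_taller, 4))
```

The computed probability is about 0.02275, which agrees with the table value 0.0227 quoted in the example to the table’s precision.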


Summary 


If a random vector (X, Y) has a bivariate normal distribution, then every linear 
combination aX + bY +c has a normal distribution. In particular, the marginal 
distributions of X and Y are normal. Also, the conditional distribution of X given 
Y = y is normal with the conditional mean being a linear function of y and the 
conditional variance being constant in y. (Similarly, for the conditional distribution 
of Y given X = x.) A more thorough treatment of the bivariate normal distributions
and higher-dimensional generalizations can be found in the book by D. F. Morrison 
(1990). 


Exercises 


1. Consider again the joint distribution of heights of hus- 
bands and wives in Example 5.10.6. Find the 0.95 quantile 
of the conditional distribution of the height of the wife 
given that the height of the husband is 72 inches. 


2. Suppose that two different tests A and B are to be given
to a student chosen at random from a certain population.
Suppose also that the mean score on test A is 85, and the
standard deviation is 10; the mean score on test B is 90,
and the standard deviation is 16; the scores on the two tests
have a bivariate normal distribution; and the correlation
of the two scores is 0.8. If the student’s score on test A is
80, what is the probability that her score on test B will be
higher than 90?


3. Consider again the two tests A and B described in Exercise 2.
If a student is chosen at random, what is the
probability that the sum of her scores on the two tests will
be greater than 200?


4. Consider again the two tests A and B described in Ex- 
ercise 2. If a student is chosen at random, what is the 
probability that her score on test A will be higher than 
her score on test B? 


5. Consider again the two tests A and B described in Ex- 
ercise 2. If a student is chosen at random, and her score 
on test B is 100, what predicted value of her score on test 
A has the smallest M.S.E., and what is the value of this 
minimum M.S.E.? 


6. Suppose that the random variables X; and Xz have 
a bivariate normal distribution, for which the joint p.d.f. 
is specified by Eq. (5.10.2). Determine the value of the 
constant b for which Var(X, + bX) will be a minimum. 


7. Suppose that $X_1$ and $X_2$ have a bivariate normal distribution
for which $E(X_1|X_2) = 3.7 - 0.15X_2$, $E(X_2|X_1) = 0.4 - 0.6X_1$,
and $\mathrm{Var}(X_2|X_1) = 3.64$. Find the mean and the
variance of $X_1$, the mean and the variance of $X_2$, and the
correlation of $X_1$ and $X_2$.


8. Let $f(x_1, x_2)$ denote the p.d.f. of the bivariate normal
distribution specified by Eq. (5.10.2). Show that the maximum
value of $f(x_1, x_2)$ is attained at the point at which
$x_1 = \mu_1$ and $x_2 = \mu_2$.


9. Let $f(x_1, x_2)$ denote the p.d.f. of the bivariate normal
distribution specified by Eq. (5.10.2), and let $k$ be a constant
such that

\[
0 < k < \frac{1}{2\pi (1 - \rho^2)^{1/2} \sigma_1 \sigma_2}.
\]

Show that the points $(x_1, x_2)$ such that $f(x_1, x_2) = k$ lie on a
circle if $\rho = 0$ and $\sigma_1 = \sigma_2$, and these points lie on an ellipse
otherwise.


10. Suppose that two random variables $X_1$ and $X_2$ have
a bivariate normal distribution, and two other random
variables $Y_1$ and $Y_2$ are defined as follows:

\[
\begin{aligned}
Y_1 &= a_{11}X_1 + a_{12}X_2 + b_1, \\
Y_2 &= a_{21}X_1 + a_{22}X_2 + b_2,
\end{aligned}
\]

where

\[
\begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} \ne 0.
\]

Show that $Y_1$ and $Y_2$ also have a bivariate normal distribution.


11. Suppose that two random variables $X_1$ and $X_2$ have
a bivariate normal distribution, and $\mathrm{Var}(X_1) = \mathrm{Var}(X_2)$.
Show that the sum $X_1 + X_2$ and the difference $X_1 - X_2$
are independent random variables.


12. Suppose that the two measurements from flea beetles
in Example 5.10.2 have the bivariate normal distribution
with $\mu_1 = 201$, $\mu_2 = 118$, $\sigma_1 = 15.2$, $\sigma_2 = 6.6$, and $\rho = 0.64$.
Suppose that the same two measurements from a second
species also have the bivariate normal distribution with
$\mu_1 = 187$, $\mu_2 = 131$, $\sigma_1 = 15.2$, $\sigma_2 = 6.6$, and $\rho = 0.64$. Let
$(X_1, X_2)$ be a pair of measurements on a flea beetle from
one of these two species. Let $a_1$, $a_2$ be constants.


a. For each of the two species, find the mean and standard
deviation of $a_1X_1 + a_2X_2$. (Note that the variances
for the two species will be the same. How do
you know that?)

b. Find $a_1$ and $a_2$ to maximize the ratio of the difference
between the two means found in part (a) to the standard
deviation found in part (a). There is a sense in
which this linear combination $a_1X_1 + a_2X_2$ does the
best job of distinguishing the two species among all
possible linear combinations.


13. Suppose that the joint p.d.f. of two random variables
$X$ and $Y$ is proportional, as a function of $(x, y)$, to

\[
\exp\left(-[ax^2 + by^2 + cxy + ex + gy + h]\right),
\]

where $a > 0$, $b > 0$, and $c$, $e$, $g$, and $h$ are all constants.
Assume that $ab > (c/2)^2$. Prove that $X$ and $Y$ have a bivariate
normal distribution, and find the means, variances,
and correlation.


14. Suppose that a random variable $X$ has a normal distribution,
and for every $x$, the conditional distribution of
another random variable $Y$ given that $X = x$ is a normal
distribution with mean $ax + b$ and variance $\tau^2$, where $a$,
$b$, and $\tau^2$ are constants. Prove that the joint distribution of
$X$ and $Y$ is a bivariate normal distribution.


15. Let $X_1, \ldots, X_n$ be i.i.d. random variables having the
normal distribution with mean $\mu$ and variance $\sigma^2$. Define
$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$, the sample mean. In this problem, we
shall find the conditional distribution of each $X_i$ given $\bar{X}_n$.


a. Show that $X_i$ and $\bar{X}_n$ have the bivariate normal distribution
with both means $\mu$, variances $\sigma^2$ and $\sigma^2/n$,
and correlation $1/\sqrt{n}$. Hint: Let $Y = \sum_{j \ne i} X_j$. Now
show that $Y$ and $X_i$ are independent normals and $\bar{X}_n$
and $X_i$ are linear combinations of $Y$ and $X_i$.

b. Show that the conditional distribution of $X_i$ given
$\bar{X}_n = \bar{x}_n$ is normal with mean $\bar{x}_n$ and variance $\sigma^2(1 - 1/n)$.


5.11 Supplementary Exercises


1. Let $X$ and $P$ be random variables. Suppose that the
conditional distribution of $X$ given $P = p$ is the binomial
distribution with parameters $n$ and $p$. Suppose that the
distribution of $P$ is the beta distribution with parameters
$\alpha = 1$ and $\beta = 1$. Find the marginal distribution of $X$.


2. Suppose that $X$, $Y$, and $Z$ are i.i.d. random variables
and each has the standard normal distribution. Evaluate
$\Pr(3X + 2Y < 6Z - 7)$.


3. Suppose that X and Y are independent Poisson random 
variables such that Var(X) + Var(Y) =5. Evaluate Pr(X + 
Y <2). 


4. Suppose that X has a normal distribution such that 
Pr(X < 116) = 0.20 and Pr(X < 328) = 0.90. Determine 
the mean and the variance of X. 


5. Suppose that a random sample of four observations is
drawn from the Poisson distribution with mean $\lambda$, and let
$\bar{X}$ denote the sample mean. Show that

\[
\Pr\left(\bar{X} < \frac{1}{2}\right) = (4\lambda + 1)e^{-4\lambda}.
\]


6. The lifetime X of an electronic component has the 
exponential distribution such that Pr(X < 1000) = 0.75. 
What is the expected lifetime of the component? 


7. Suppose that $X$ has the normal distribution with mean
$\mu$ and variance $\sigma^2$. Express $E(X^3)$ in terms of $\mu$ and $\sigma^2$.


8. Suppose that a random sample of 16 observations is
drawn from the normal distribution with mean $\mu$ and standard
deviation 12, and that independently another random
sample of 25 observations is drawn from the normal
distribution with the same mean $\mu$ and standard deviation 20.
Let $\bar{X}$ and $\bar{Y}$ denote the sample means of the two
samples. Evaluate $\Pr(|\bar{X} - \bar{Y}| < 5)$.


9. Suppose that men arrive at a ticket counter according
to a Poisson process at the rate of 120 per hour, and women
arrive according to an independent Poisson process at the
rate of 60 per hour. Determine the probability that four
or fewer people arrive in a one-minute period.


10. Suppose that $X_1, X_2, \ldots$ are i.i.d. random variables,
each of which has m.g.f. $\psi(t)$. Let $Y = X_1 + \cdots + X_N$,
where the number of terms $N$ in this sum is a random
variable having the Poisson distribution with mean $\lambda$.
Assume that $N$ and $X_1, X_2, \ldots$ are independent, and $Y = 0$
if $N = 0$. Determine the m.g.f. of $Y$.


11. Every Sunday morning, two children, Craig and Jill, 
independently try to launch their model airplanes. On 
each Sunday, Craig has probability 1/3 of a successful 
launch, and Jill has probability 1/5 of a successful launch. 
Determine the expected number of Sundays required until
at least one of the two children has a successful launch.


12. Suppose that a fair coin is tossed until at least one head
and at least one tail have been obtained. Let X denote the 
number of tosses that are required. Find the p.f. of X. 


13. Suppose that a pair of balanced dice are rolled 120 
times, and let X denote the number of rolls on which the 
sum of the two numbers is 12. Use the Poisson approxi- 
mation to approximate Pr(X = 3). 


14. Suppose that $X_1, \ldots, X_n$ form a random sample from
the uniform distribution on the interval [0, 1]. Let $Y_1 = \min\{X_1, \ldots, X_n\}$,
$Y_n = \max\{X_1, \ldots, X_n\}$, and $W = Y_n - Y_1$.
Show that each of the random variables $Y_1$, $Y_n$, and $W$
has a beta distribution.


15. Suppose that events occur in accordance with a Poisson
process at the rate of five events per hour.


a. Determine the distribution of the waiting time $T_1$
until the first event occurs.


b. Determine the distribution of the total waiting time
$T_k$ until $k$ events have occurred.


c. Determine the probability that none of the first $k$
events will occur within 20 minutes of one another.


16. Suppose that five components are functioning simultaneously,
that the lifetimes of the components are i.i.d.,
and that each lifetime has the exponential distribution
with parameter $\beta$. Let $T_1$ denote the time from the beginning
of the process until one of the components fails; and
let $T_5$ denote the total time until all five components have
failed. Evaluate $\mathrm{Cov}(T_1, T_5)$.


17. Suppose that $X_1$ and $X_2$ are independent random variables,
and $X_i$ has the exponential distribution with parameter
$\beta_i$ ($i = 1, 2$). Show that for each constant $k > 0$,

\[
\Pr(X_1 > kX_2) = \frac{\beta_2}{k\beta_1 + \beta_2}.
\]


18. Suppose that 15,000 people in a city with a population
of 500,000 are watching a certain television program. If
200 people in the city are contacted at random, what is
the approximate probability that fewer than four of them
are watching the program?


19. Suppose that it is desired to estimate the proportion of 
persons in a large population who have a certain charac- 
teristic. A random sample of 100 persons is selected from 
the population without replacement, and the proportion 
X of persons in the sample who have the characteristic is 
observed. Show that, no matter how large the population 
is, the standard deviation of X is at most 0.05. 


20. Suppose that $X$ has the binomial distribution with
parameters $n$ and $p$, and that $Y$ has the negative binomial
distribution with parameters $r$ and $p$, where $r$ is a positive
integer. Show that $\Pr(X < r) = \Pr(Y > n - r)$ by showing
that both the left side and the right side of this equation
can be regarded as the probability of the same event in a
sequence of Bernoulli trials with probability $p$ of success.


21. Suppose that $X$ has the Poisson distribution with mean
$\lambda t$, and that $Y$ has the gamma distribution with parameters
$\alpha = k$ and $\beta = \lambda$, where $k$ is a positive integer. Show that
$\Pr(X \ge k) = \Pr(Y \le t)$ by showing that both the left side
and the right side of this equation can be regarded as the
probability of the same event in a Poisson process in which
the expected number of occurrences per unit of time is $\lambda$.


22. Suppose that $X$ is a random variable having a continuous
distribution with p.d.f. $f(x)$ and c.d.f. $F(x)$, and for
which $\Pr(X > 0) = 1$. Let the failure rate $h(x)$ be as defined
in Exercise 18 of Sec. 5.7. Show that

\[
\exp\left[-\int_0^x h(t)\,dt\right] = 1 - F(x).
\]


23. Suppose that 40 percent of the students in a large population
are freshmen, 30 percent are sophomores, 20 percent
are juniors, and 10 percent are seniors. Suppose that
10 students are selected at random from the population,
and let $X_1, X_2, X_3, X_4$ denote, respectively, the numbers
of freshmen, sophomores, juniors, and seniors that are obtained.


a. Determine $\rho(X_i, X_j)$ for each pair of values $i$ and $j$
($i < j$).

b. For what values of $i$ and $j$ ($i < j$) is $\rho(X_i, X_j)$ most
negative?

c. For what values of $i$ and $j$ ($i < j$) is $\rho(X_i, X_j)$ closest
to 0?


24. Suppose that $X_1$ and $X_2$ have the bivariate normal
distribution with means $\mu_1$ and $\mu_2$, variances $\sigma_1^2$ and $\sigma_2^2$,
and correlation $\rho$. Determine the distribution of $X_1 - 3X_2$.


25. Suppose that $X$ has the standard normal distribution,
and the conditional distribution of $Y$ given $X$ is the normal
distribution with mean $2X - 3$ and variance 12. Determine
the marginal distribution of $Y$ and the value of $\rho(X, Y)$.


26. Suppose that $X_1$ and $X_2$ have a bivariate normal distribution
with $E(X_1) = 0$. Evaluate $E(X_1^2 X_2)$.


Chapter 6

LARGE RANDOM SAMPLES


6.1 Introduction
6.2 The Law of Large Numbers
6.3 The Central Limit Theorem
6.4 The Correction for Continuity
6.5 Supplementary Exercises


6.1 Introduction


In this chapter, we introduce a number of approximation results that simplify the
analysis of large random samples. In the first section, we give two examples to
illustrate the types of analyses that we might wish to perform and how additional
tools may be needed to be able to perform them.


Example 6.1.1. Proportion of Heads. If you draw a coin from your pocket, you might feel confident
that it is essentially fair. That is, the probability that it will land with head up when 
flipped is 1/2. However, if you were to flip the coin 10 times, you would not expect 
to see exactly 5 heads. If you were to flip it 100 times, you would be even less likely 
to see exactly 50 heads. Indeed, we can calculate the probabilities of each of these 
two results using the fact that the number of heads in n independent flips of a fair 
coin has the binomial distribution with parameters n and 1/2. So, if X is the number 
of heads in 10 independent flips, we know that 


\[
\Pr(X = 5) = \binom{10}{5}\left(\frac{1}{2}\right)^{5}\left(1 - \frac{1}{2}\right)^{5} = 0.2461.
\]


If Y is the number of heads in 100 independent flips, we have 


\[
\Pr(Y = 50) = \binom{100}{50}\left(\frac{1}{2}\right)^{50}\left(1 - \frac{1}{2}\right)^{50} = 0.0796.
\]


Even though the probability of exactly $n/2$ heads in $n$ flips is quite small, especially
for large $n$, you still expect the proportion of heads to be close to 1/2 if $n$ is large. For
example, if $n = 100$, the proportion of heads is $Y/100$. In this case, the probability
that the proportion is within 0.1 of 1/2 is

\[
\Pr\left(0.4 \le \frac{Y}{100} \le 0.6\right) = \Pr(40 \le Y \le 60)
= \sum_{i=40}^{60} \binom{100}{i}\left(\frac{1}{2}\right)^{i}\left(1 - \frac{1}{2}\right)^{100-i} = 0.9648.
\]

A similar calculation with $n = 10$ yields

\[
\Pr\left(0.4 \le \frac{X}{10} \le 0.6\right) = \Pr(4 \le X \le 6)
= \sum_{i=4}^{6} \binom{10}{i}\left(\frac{1}{2}\right)^{i}\left(1 - \frac{1}{2}\right)^{10-i} = 0.6563.
\]


Notice that the probability that the proportion of heads in n tosses is close to 1/2 is
larger for n = 100 than for n = 10 in this example. This is due in part to the fact that
we have defined “close to 1/2” to be the same for both cases, namely, between 0.4
and 0.6.
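The four probabilities in Example 6.1.1 can be reproduced with a few lines of code. The following is a minimal sketch in Python; the helper names `binom_pf` and `prob_between` are illustrative choices, not notation from the text, and the sums are evaluated term by term from the binomial p.f.

```python
from math import comb


def binom_pf(k: int, n: int, p: float = 0.5) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_between(lo: int, hi: int, n: int, p: float = 0.5) -> float:
    """Pr(lo <= number of successes <= hi), summing the p.f. term by term."""
    return sum(binom_pf(k, n, p) for k in range(lo, hi + 1))


p_exactly_5 = binom_pf(5, 10)            # Pr(X = 5) with n = 10
p_exactly_50 = binom_pf(50, 100)         # Pr(Y = 50) with n = 100
p_close_100 = prob_between(40, 60, 100)  # Pr(0.4 <= Y/100 <= 0.6)
p_close_10 = prob_between(4, 6, 10)      # Pr(0.4 <= X/10 <= 0.6)

for label, value in [
    ("Pr(X = 5)", p_exactly_5),
    ("Pr(Y = 50)", p_exactly_50),
    ("Pr(40 <= Y <= 60)", p_close_100),
    ("Pr(4 <= X <= 6)", p_close_10),
]:
    print(f"{label}: {value:.4f}")
```

Changing the arguments of `prob_between` gives the corresponding probability for any other number of flips.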


The calculations performed in Example 6.1.1 were simple enough because we 
have a formula for the probability function of the number of heads in any number 
of flips. For more complicated random variables, the situation is not so simple. 


Example 6.1.2. Average Waiting Time. A queue is serving customers, and the $i$th customer waits a
random time $X_i$ to be served. Suppose that $X_1, X_2, \ldots$ are i.i.d. random variables
having the uniform distribution on the interval [0, 1]. The mean waiting time is 0.5.
Intuition suggests that the average of a large number of waiting times should be
close to the mean waiting time. But the distribution of the average of $X_1, \ldots, X_n$ is
rather complicated for every $n > 1$. It may not be possible to calculate precisely the
probability that the sample average is close to 0.5 for large samples.


The law of large numbers (Theorem 6.2.4) will give a mathematical foundation
to the intuition that the average of a large sample of i.i.d. random variables, such as
the waiting times in Example 6.1.2, should be close to their mean. The central limit
theorem (Theorem 6.3.1) will give us a way to approximate the probability that the
sample average is close to the mean.


Exercises


1. The solution to Exercise 1 of Sec. 3.9 is the p.d.f. of $X_1 + X_2$
in Example 6.1.2. Find the p.d.f. of $\bar{X}_2 = (X_1 + X_2)/2$.
Compare the probabilities that $\bar{X}_2$ and $X_1$ are close to 0.5.
In particular, compute $\Pr(|\bar{X}_2 - 0.5| < 0.1)$ and $\Pr(|X_1 - 0.5| < 0.1)$.
What feature of the p.d.f. of $\bar{X}_2$ makes it clear
that the distribution is more concentrated near the mean?


2. Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables
having the normal distribution with mean $\mu$ and
variance $\sigma^2$. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the sample mean of
the first $n$ random variables in the sequence. Show that
$\Pr(|\bar{X}_n - \mu| \le c)$ converges to 1 as $n \to \infty$. Hint: Write the
probability in terms of the standard normal c.d.f. $\Phi$ and use
what you know about this c.d.f.


3. This problem requires a computer program because the
calculation is too tedious to do by hand. Extend the calculation
in Example 6.1.1 to the case of $n = 200$ flips. That
is, let $W$ be the number of heads in 200 flips of a fair coin,
and compute $\Pr\left(0.4 \le \frac{W}{200} \le 0.6\right)$. What do you think is
the continuation of the pattern of these probabilities as
the number of flips $n$ increases without bound?


6.2 The Law of Large Numbers 


The average of a random sample of i.i.d. random variables is called their sample 
mean. The sample mean is useful for summarizing the information in a random 
sample in much the same way that the mean of a probability distribution summa- 
rizes the information in the distribution. In this section, we present some results 
that illustrate the connection between the sample mean and the expected value of 
the individual random variables that comprise the random sample. 


The Markov and Chebyshev Inequalities 


We shall begin this section by presenting two simple and general results, known 
as the Markov inequality and the Chebyshev inequality. We shall then apply these 
inequalities to random samples. 


The Markov inequality is related to the claim made on page 211 about how the
mean of a distribution can be affected by moving a small amount of probability to an
arbitrarily large value. The Markov inequality puts a bound on how much probability
can be at arbitrarily large values once the mean is specified.


Theorem 6.2.1. Markov Inequality. Suppose that $X$ is a random variable such that $\Pr(X \ge 0) = 1$. Then
for every real number $t > 0$,

\[
\Pr(X \ge t) \le \frac{E(X)}{t}. \tag{6.2.1}
\]


Proof For convenience, we shall assume that $X$ has a discrete distribution for which
the p.f. is $f$. The proof for a continuous distribution or a more general type of
distribution is similar. For a discrete distribution,

\[
E(X) = \sum_{x} x f(x) = \sum_{x < t} x f(x) + \sum_{x \ge t} x f(x).
\]

Since $X$ can have only nonnegative values, all the terms in the summations are
nonnegative. Therefore,

\[
E(X) \ge \sum_{x \ge t} x f(x) \ge \sum_{x \ge t} t f(x) = t \Pr(X \ge t). \tag{6.2.2}
\]

Divide the extreme ends of (6.2.2) by $t > 0$ to obtain (6.2.1).


The Markov inequality is primarily of interest for large values of $t$. In fact, when
$t \le E(X)$, the inequality is of no interest whatsoever, since it is known that
$\Pr(X \ge t) \le 1$. However, it is found from the Markov inequality that for every nonnegative
random variable $X$ whose mean is 1, the maximum possible value of $\Pr(X \ge 100)$ is
0.01. Furthermore, it can be verified that this maximum value is actually attained by
every random variable $X$ for which $\Pr(X = 0) = 0.99$ and $\Pr(X = 100) = 0.01$.
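The two-point distribution just described can be checked directly. The sketch below is a minimal Python illustration using exact rational arithmetic; the dictionary representation of a p.f. and the function names are assumptions made for this example, and the fair die is included only as a contrasting case in which the bound is far from attained.

```python
from fractions import Fraction


def mean(pf):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in pf.items())


def tail_prob(pf, t):
    """Pr(X >= t) for a discrete distribution."""
    return sum(p for x, p in pf.items() if x >= t)


def markov_bound(pf, t):
    """Upper bound E(X)/t from the Markov inequality (X nonnegative, t > 0)."""
    return mean(pf) / t


# The distribution from the text: mean 1, and the bound at t = 100 is attained.
two_point = {0: Fraction(99, 100), 100: Fraction(1, 100)}
print(tail_prob(two_point, 100), markov_bound(two_point, 100))

# A fair die (illustrative): the bound holds but is not attained.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
print(tail_prob(fair_die, 5), markov_bound(fair_die, 5))
```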

The Chebyshev inequality is related to the idea that the variance of a random 
variable is a measure of how spread out its distribution is. The inequality says that the 
probability that X is far away from its mean is bounded by a quantity that increases 
as Var(X) increases. 


Theorem 6.2.2. Chebyshev Inequality. Let $X$ be a random variable for which $\mathrm{Var}(X)$ exists. Then for
every number $t > 0$,

\[
\Pr(|X - E(X)| \ge t) \le \frac{\mathrm{Var}(X)}{t^2}. \tag{6.2.3}
\]


Proof Let $Y = [X - E(X)]^2$. Then $\Pr(Y \ge 0) = 1$ and $E(Y) = \mathrm{Var}(X)$. By applying
the Markov inequality to $Y$, we obtain the following result:

\[
\Pr(|X - E(X)| \ge t) = \Pr(Y \ge t^2) \le \frac{\mathrm{Var}(X)}{t^2}.
\]

It can be seen from this proof that the Chebyshev inequality is simply a special
case of the Markov inequality. Therefore, the comments that were given following
the proof of the Markov inequality can be applied as well to the Chebyshev inequality.
Because of their generality, these inequalities are very useful. For example, if
$\mathrm{Var}(X) = \sigma^2$ and we let $t = 3\sigma$, then the Chebyshev inequality yields the result that

\[
\Pr(|X - E(X)| \ge 3\sigma) \le \frac{1}{9}.
\]

In words, the probability that any given random variable will differ from its mean by
more than 3 standard deviations cannot exceed 1/9. This probability will actually be
much smaller than 1/9 for many of the random variables and distributions that will
be discussed in this book. The Chebyshev inequality is useful because of the fact that
this probability must be 1/9 or less for every distribution. It can also be shown (see
Exercise 4 at the end of this section) that the upper bound in (6.2.3) is sharp in the
sense that it cannot be made any smaller and still hold for all distributions.
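To see how conservative the 1/9 bound can be, and that it can nevertheless be attained, the following Python sketch compares it with exact three-standard-deviation tail probabilities. The three distributions are illustrative choices made here: the standard normal, the exponential distribution with parameter 1, and a three-point distribution that is not necessarily the one intended in Exercise 4.

```python
from fractions import Fraction
from math import erfc, exp, sqrt

chebyshev_bound = Fraction(1, 9)  # Var(X) / (3 sigma)^2

# Standard normal: Pr(|Z| >= 3) = erfc(3 / sqrt(2)).
normal_tail = erfc(3 / sqrt(2))

# Exponential with parameter 1 (mean 1, standard deviation 1):
# |X - 1| >= 3 can only happen when X >= 4, so the probability is exp(-4).
exponential_tail = exp(-4)

# A three-point distribution with mean 0 and variance 1 (illustrative).
three_point = {-3: Fraction(1, 18), 0: Fraction(8, 9), 3: Fraction(1, 18)}
tp_mean = sum(x * p for x, p in three_point.items())
tp_var = sum((x - tp_mean) ** 2 * p for x, p in three_point.items())
# Because tp_var equals 1, three standard deviations is a distance of 3.
three_point_tail = sum(p for x, p in three_point.items() if abs(x - tp_mean) >= 3)

print(f"Chebyshev bound:   {float(chebyshev_bound):.4f}")
print(f"standard normal:   {normal_tail:.4f}")
print(f"exponential(1):    {exponential_tail:.4f}")
print(f"three-point:       {float(three_point_tail):.4f}")
```

The first two probabilities are far below 1/9, while the three-point distribution puts probability exactly 1/9 at distance three standard deviations from its mean.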


Properties of the Sample Mean 


In Definition 5.6.3, we defined the sample mean of $n$ random variables $X_1, \ldots, X_n$
to be their average,

\[
\bar{X}_n = \frac{1}{n}(X_1 + \cdots + X_n).
\]

The mean and the variance of $\bar{X}_n$ are easily computed.


Theorem 6.2.3. Mean and Variance of the Sample Mean. Let $X_1, \ldots, X_n$ be a random sample from
a distribution with mean $\mu$ and variance $\sigma^2$. Let $\bar{X}_n$ be the sample mean. Then
$E(\bar{X}_n) = \mu$ and $\mathrm{Var}(\bar{X}_n) = \sigma^2/n$.


Proof It follows from Theorems 4.2.1 and 4.2.4 that

\[
E(\bar{X}_n) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n} \cdot n\mu = \mu.
\]

Furthermore, since $X_1, \ldots, X_n$ are independent, Theorems 4.3.4 and 4.3.5 say that

\[
\mathrm{Var}(\bar{X}_n) = \frac{1}{n^2}\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right)
= \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}.
\]


In words, the mean of $\bar{X}_n$ is equal to the mean of the distribution from which the
random sample was drawn, but the variance of $\bar{X}_n$ is only $1/n$ times the variance
of that distribution. It follows that the probability distribution of $\bar{X}_n$ will be more
concentrated around the mean value $\mu$ than was the original distribution. In other
words, the sample mean $\bar{X}_n$ is more likely to be close to $\mu$ than is the value of just a
single observation $X_i$ from the given distribution.

These statements can be made more precise by applying the Chebyshev inequality
to $\bar{X}_n$. Since $E(\bar{X}_n) = \mu$ and $\mathrm{Var}(\bar{X}_n) = \sigma^2/n$, it follows from the relation (6.2.3)
that for every number $t > 0$,

\[
\Pr(|\bar{X}_n - \mu| \ge t) \le \frac{\sigma^2}{nt^2}. \tag{6.2.4}
\]


Example 6.2.1. Determining the Required Number of Observations. Suppose that a random sample is
to be taken from a distribution for which the value of the mean $\mu$ is not known, but for
which it is known that the standard deviation $\sigma$ is 2 units or less. We shall determine
how large the sample size must be in order to make the probability at least 0.99 that
$|\bar{X}_n - \mu|$ will be less than 1 unit.

Since $\sigma^2 \le 2^2 = 4$, it follows from the relation (6.2.4) that for every sample size $n$,

\[
\Pr(|\bar{X}_n - \mu| \ge 1) \le \frac{\sigma^2}{n} \le \frac{4}{n}.
\]

Since $n$ must be chosen so that $\Pr(|\bar{X}_n - \mu| < 1) \ge 0.99$, it follows that $n$ must be
chosen so that $4/n \le 0.01$. Hence, it is required that $n \ge 400$.
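The calculation in Example 6.2.1 generalizes to any bound on the standard deviation, any tolerance, and any target probability. The sketch below is one way to package it in Python; the function name `chebyshev_sample_size` and its arguments are hypothetical, and exact rational arithmetic is used so that rounding cannot push the answer up by one.

```python
from fractions import Fraction
from math import ceil


def chebyshev_sample_size(sigma_max, t, prob):
    """Smallest n with sigma_max^2 / (n t^2) <= 1 - prob.

    By relation (6.2.4), such an n guarantees
    Pr(|sample mean - mu| < t) >= prob whenever sigma <= sigma_max.
    Arguments may be ints, Fractions, or decimal strings such as "0.99".
    """
    sigma_max, t, prob = Fraction(sigma_max), Fraction(t), Fraction(prob)
    return ceil(sigma_max**2 / (t**2 * (1 - prob)))


n_required = chebyshev_sample_size(sigma_max=2, t=1, prob="0.99")
print(n_required)  # 400, as in Example 6.2.1
```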


Example 6.2.2. A Simulation. An environmental engineer believes that there are two contaminants
in a water supply, arsenic and lead. The actual concentrations of the two contaminants
are independent random variables $X$ and $Y$, measured in the same units. The
engineer is interested in what proportion of the contamination is lead on average.
That is, the engineer wants to know the mean of $R = Y/(X + Y)$. We suppose that it
is a simple matter to generate as many independent pseudo-random numbers with
the distributions of $X$ and $Y$ as we desire. A common way to obtain an approximation
to $E[Y/(X + Y)]$ would be the following: If we sample $n$ pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$
and compute $R_i = Y_i/(X_i + Y_i)$ for $i = 1, \ldots, n$, then $\bar{R}_n = \frac{1}{n}\sum_{i=1}^{n} R_i$ is a sensible
approximation to $E(R)$. To decide how large $n$ should be, we can argue as in Example 6.2.1.
Since it is known that $|R_i| \le 1$, it must be that $\mathrm{Var}(R_i) \le 1$. (Actually,
$\mathrm{Var}(R_i) \le 1/4$, but this is harder to prove. See Exercise 14 in this section for a way to
prove it in the discrete case.) According to Chebyshev’s inequality, for each $\varepsilon > 0$,

\[
\Pr(|\bar{R}_n - E(R)| \ge \varepsilon) \le \frac{1}{n\varepsilon^2}.
\]

So, if we want $|\bar{R}_n - E(R)| \le 0.005$ with probability 0.98 or more, then we should use
$n \ge 1/[0.02 \times 0.005^2] = 2{,}000{,}000$.
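A simulation of this kind might look like the following Python sketch. The example does not specify the distributions of $X$ and $Y$, so the code assumes, purely for illustration, that both are exponential with parameter 1; under that assumption $R$ is uniform on [0, 1] and $E(R) = 0.5$, which gives a known value to compare against. The function names are hypothetical, and a sample smaller than the Chebyshev requirement is used only to keep the run short.

```python
import random
from fractions import Fraction
from math import ceil


def required_n(var_bound, eps, prob):
    """Smallest n with var_bound / (n eps^2) <= 1 - prob (Chebyshev)."""
    var_bound, eps, prob = Fraction(var_bound), Fraction(eps), Fraction(prob)
    return ceil(var_bound / (eps**2 * (1 - prob)))


def estimate_mean_ratio(n, rng):
    """Average of R_i = Y_i / (X_i + Y_i) over n simulated pairs.

    ASSUMPTION: X and Y are exponential with parameter 1; the example in
    the text leaves their distributions unspecified.
    """
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(1.0)
        y = rng.expovariate(1.0)
        total += y / (x + y)
    return total / n


n_crude = required_n(1, "0.005", "0.98")         # uses Var(R_i) <= 1
n_sharper = required_n("0.25", "0.005", "0.98")  # uses Var(R_i) <= 1/4
print(n_crude, n_sharper)

rng = random.Random(2024)  # fixed seed so the run is reproducible
estimate = estimate_mean_ratio(200_000, rng)
print(f"estimate of E(R) from 200,000 pairs: {estimate:.4f}")
```

Using the sharper variance bound of 1/4 reduces the guaranteed sample size by a factor of four.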


It should be emphasized that the use of the Chebyshev inequality in Exam- 
ple 6.2.1 guarantees that a sample for which n = 400 will be large enough to meet the 
specified probability requirements, regardless of the particular type of distribution 
from which the sample is to be taken. If further information about this distribution 
is available, then it can often be shown that a smaller value for n will be sufficient. 
This property is illustrated in the next example. 


Example 6.2.3. Tossing a Coin. Suppose that a fair coin is to be tossed $n$ times independently. For
$i = 1, \ldots, n$, let $X_i = 1$ if a head is obtained on the $i$th toss, and let $X_i = 0$ if a tail
is obtained on the $i$th toss. Then the sample mean $\bar{X}_n$ will simply be equal to the
proportion of heads that are obtained on the $n$ tosses. We shall determine the number
of times the coin must be tossed in order to make $\Pr(0.4 \le \bar{X}_n \le 0.6) \ge 0.7$. We shall
determine this number in two ways: first, by using the Chebyshev inequality; second,
by using the exact probabilities for the binomial distribution of the total number of
heads.

Let $T = \sum_{i=1}^{n} X_i$ denote the total number of heads that are obtained when $n$
tosses are made. Then $T$ has the binomial distribution with parameters $n$ and $p = 1/2$.
Therefore, it follows from Eq. (4.2.5) on page 221 that $E(T) = n/2$, and it follows
from Eq. (4.3.3) on page 232 that $\mathrm{Var}(T) = n/4$. Because $\bar{X}_n = T/n$, we can obtain
the following relation from the Chebyshev inequality:

\[
\begin{aligned}
\Pr(0.4 \le \bar{X}_n \le 0.6) &= \Pr(0.4n \le T \le 0.6n) \\
&= \Pr\left(\left|T - \frac{n}{2}\right| \le 0.1n\right) \\
&\ge 1 - \frac{n}{4(0.1n)^2} = 1 - \frac{25}{n}.
\end{aligned}
\]

Hence, if $n \ge 84$, this probability will be at least 0.7, as required.


However, from the table of binomial distributions given at the end of this book,
it is found that for $n = 15$,

\[
\Pr(0.4 \le \bar{X}_n \le 0.6) = \Pr(6 \le T \le 9) = 0.70.
\]

Hence, 15 tosses would actually be sufficient to satisfy the specified probability
requirement.
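Both calculations in Example 6.2.3 can be checked with exact arithmetic. In the Python sketch below, the function names are hypothetical, and the interval endpoints are computed with integer arithmetic to avoid floating-point rounding at values such as $0.4 \times 15$.

```python
from fractions import Fraction
from math import ceil, comb


def chebyshev_tosses(half_width, prob):
    """Smallest n with 1 - 1/(4 n half_width^2) >= prob for a fair coin.

    Uses Var(sample mean) = 1/(4n) in the Chebyshev inequality.
    """
    half_width, prob = Fraction(half_width), Fraction(prob)
    return ceil(1 / (4 * half_width**2 * (1 - prob)))


def exact_prob(n):
    """Exact Pr(0.4 <= T/n <= 0.6) when T is binomial with parameters n and 1/2."""
    lo = -(-4 * n // 10)  # ceiling of 0.4 n
    hi = 6 * n // 10      # floor of 0.6 n
    return Fraction(sum(comb(n, k) for k in range(lo, hi + 1)), 2**n)


n_chebyshev = chebyshev_tosses("0.1", "0.7")
p_15 = exact_prob(15)
print(n_chebyshev)
print(p_15, float(p_15))
```

The exact value for $n = 15$ is 715/1024, or about 0.698, which agrees with the tabulated 0.70 after rounding to two decimal places.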


The Law of Large Numbers 


The discussion in Example 6.2.3 indicates that the Chebyshev inequality may not be 
a practical tool for determining the appropriate sample size in a particular problem, 
because it may specify a much greater sample size than is actually needed for the 
particular distribution from which the sample is being taken. However, the Cheby- 
shev inequality is a valuable theoretical tool, and it will be used here to prove an 
important result known as the law of large numbers. 

Suppose that $Z_1, Z_2, \ldots$ is a sequence of random variables. Roughly speaking, it
is said that this sequence converges to a given number $b$ if the probability distribution
of $Z_n$ becomes more and more concentrated around $b$ as $n \to \infty$. To be more precise,
we give the following definition.


Definition 6.2.1. Convergence in Probability. A sequence $Z_1, Z_2, \ldots$ of random variables converges to
$b$ in probability if for every number $\varepsilon > 0$,

\[
\lim_{n \to \infty} \Pr(|Z_n - b| < \varepsilon) = 1.
\]

This property is denoted by

\[
Z_n \xrightarrow{p} b,
\]

and is sometimes stated simply as $Z_n$ converges to $b$ in probability.

In other words, $Z_n$ converges to $b$ in probability if the probability that $Z_n$ lies in
each given interval around $b$, no matter how small this interval may be, approaches
1 as $n \to \infty$.

We shall now show that the sample mean of a random sample with finite variance
always converges in probability to the mean of the distribution from which the
random sample was taken.


Theorem 6.2.4. Law of Large Numbers. Suppose that $X_1, \ldots, X_n$ form a random sample from a
distribution for which the mean is $\mu$ and for which the variance is finite. Let $\bar{X}_n$ denote
the sample mean. Then

\[
\bar{X}_n \xrightarrow{p} \mu. \tag{6.2.5}
\]


Proof Let the variance of each $X_i$ be $\sigma^2$. It then follows from the Chebyshev inequality
that for every number $\varepsilon > 0$,

\[
\Pr(|\bar{X}_n - \mu| < \varepsilon) \ge 1 - \frac{\sigma^2}{n\varepsilon^2}.
\]

Hence,

\[
\lim_{n \to \infty} \Pr(|\bar{X}_n - \mu| < \varepsilon) = 1,
\]

which means that $\bar{X}_n \xrightarrow{p} \mu$.


It can also be shown that Eq. (6.2.5) is satisfied if the distribution from which the
random sample is taken has a finite mean $\mu$ but an infinite variance. However, the
proof for this case is beyond the scope of this book.

Since $\bar{X}_n$ converges to $\mu$ in probability, it follows that there is high probability that
$\bar{X}_n$ will be close to $\mu$ if the sample size $n$ is large. Hence, if a large random sample is
taken from a distribution for which the mean is unknown, then the arithmetic average
of the values in the sample will usually be a close estimate of the unknown mean.
This topic will be discussed again in Sec. 6.3, where we introduce the central limit
theorem. It will then be possible to present a more precise probability distribution
for the difference between $\bar{X}_n$ and $\mu$.
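The convergence described by the law of large numbers can be observed in a small simulation. The Python sketch below takes the uniform distribution on [0, 1] from Example 6.1.2 (mean 0.5, variance 1/12) and compares the Chebyshev lower bound $1 - \sigma^2/(n\varepsilon^2)$ with the observed frequency of $|\bar{X}_n - \mu| < \varepsilon$; the tolerance $\varepsilon = 0.1$, the number of replications, and the seed are arbitrary illustrative choices.

```python
import random


def empirical_prob_close(n, eps, reps, rng):
    """Fraction of simulated samples of size n whose mean is within eps of 0.5."""
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.random() for _ in range(n)) / n
        if abs(xbar - 0.5) < eps:
            hits += 1
    return hits / reps


def chebyshev_lower_bound(n, eps, variance=1 / 12):
    """Lower bound on Pr(|sample mean - mu| < eps) from the proof of Theorem 6.2.4."""
    return max(0.0, 1 - variance / (n * eps**2))


rng = random.Random(0)  # fixed seed so the run is reproducible
results = {}
for n in (10, 100, 1000):
    bound = chebyshev_lower_bound(n, 0.1)
    observed = empirical_prob_close(n, 0.1, 2000, rng)
    results[n] = (bound, observed)
    print(f"n = {n:4d}   Chebyshev bound {bound:.4f}   observed {observed:.4f}")
```

Both columns increase toward 1 as $n$ grows, and the observed frequencies stay well above the guaranteed bound.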

The following result can be useful if we observe random variables with mean $\mu$
but are interested in $\mu^2$ or $\log(\mu)$ or some other continuous function of $\mu$. The proof
is left for the reader (Exercise 15).


Theorem 6.2.5. Continuous Functions of Random Variables. If $Z_n \xrightarrow{p} b$, and if $g(z)$ is a function that
is continuous at $z = b$, then $g(Z_n) \xrightarrow{p} g(b)$.


Similarly, it is almost as easy to show that if $Z_n \xrightarrow{p} b$ and $Y_n \xrightarrow{p} c$, and if $g(z, y)$ is
continuous at $(z, y) = (b, c)$, then $g(Z_n, Y_n) \xrightarrow{p} g(b, c)$ (Exercise 16). Indeed,
Theorem 6.2.5 extends to any finite number $k$ of sequences that converge in probability
and a continuous function of $k$ variables.

The law of large numbers helps to explain why a histogram (Definition 3.7.9) can
be used as an approximation to a p.d.f.


Theorem 6.2.6. Histograms. Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables. Let $c_1 < c_2$ be
two constants. Define $Y_i = 1$ if $c_1 \le X_i < c_2$ and $Y_i = 0$ if not. Then $\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i$
is the proportion of $X_1, \ldots, X_n$ that lie in the interval $[c_1, c_2)$, and
$\bar{Y}_n \xrightarrow{p} \Pr(c_1 \le X_1 < c_2)$.


Proof By construction, $Y_1, Y_2, \ldots$ are i.i.d. Bernoulli random variables with parameter
$p = \Pr(c_1 \le X_1 < c_2)$. Theorem 6.2.4 says that $\bar{Y}_n \xrightarrow{p} p$.


In words, Theorem 6.2.6 says the following: If we draw a histogram with the area
of the bar over each subinterval being the proportion of a random sample that lies
in the corresponding subinterval, then the area of each bar converges in probability
to the probability that a random variable from the sequence lies in the subinterval.
If the sample is large, we would then expect the area of each bar to be close to the
probability. The same idea applies to a conditionally i.i.d. (given $Z = z$) sample, with
$\Pr(c_1 \le X_1 < c_2)$ replaced by $\Pr(c_1 \le X_1 < c_2 \mid Z = z)$.


Figure 6.1 Histogram of service times for Example 6.2.4 together with graph of the
conditional p.d.f. from which the service times were simulated. (Vertical axis: proportion.)


Example 6.2.4. Rate of Service. In Example 3.7.20, we drew a histogram of an observed sample of
$n = 100$ service times. The service times were actually simulated as an i.i.d. sample
from the exponential distribution with parameter 0.446. Figure 6.1 reproduces the
histogram overlaid with the graph of $g(x|z_0)$ where $z_0 = 0.446$. Because the width
of each bar is 1, the area of each bar equals the proportion of the sample that lies in the
corresponding interval. The area under the curve $g(x|z_0)$ is $\Pr(c_1 \le X_1 < c_2 \mid Z = z_0)$
for each interval $[c_1, c_2)$. Notice how closely the area under the conditional p.d.f.
matches the area of each bar.


The reason that the p.d.f. and the heights of the bars in the histogram in Fig. 6.1 
match so closely is that the area of each bar is converging in probability to the area
under the graph of the p.d.f. The sum of the areas of the bars is 1, which is the same 
as the area under the graph of the p.d.f. If we had chosen the heights of the bars in 
the histogram to represent counts, then the sum of the areas of the bars would have 
been n = 100, and the bars would have been about 100 times as high as the p.d.f. 

We could choose a different width for the subintervals in the histogram and still 
keep the areas equal to the proportions in the subintervals. 


Example 6.2.5. Rate of Service. In Example 6.2.4, we can choose 20 bars of width 0.5 instead of 10 bars
of width 1. To make the area of each bar represent the proportion in the subinterval,
the height of each bar should equal the proportion divided by 0.5. The probability of
an observation being in each interval $[c_1, c_2)$ would be

\[
\begin{aligned}
\Pr(c_1 \le X_1 < c_2 \mid Z = z) &= \int_{c_1}^{c_2} g(x|z)\,dx \approx (c_2 - c_1)\, g([c_1 + c_2]/2 \mid z) \\
&= 0.5 \times g([c_1 + c_2]/2 \mid z).
\end{aligned}
\tag{6.2.6}
\]

Recall that the probability in (6.2.6) should be close to the proportion of the sample
in the interval. If we divide both the probability and the proportion by 0.5, we see
that the height of the histogram bar should be close to $g([c_1 + c_2]/2 \mid z)$. Hence, the
graph of the p.d.f. should still be close to the heights of the histogram bars. What
we are doing here is choosing $r = n(b - a)/k$ in Definition 3.7.9. Figure 6.2 shows the
histogram with 20 intervals of length 0.5 together with the same p.d.f. from Fig. 6.1.
The bar heights are still similar to the p.d.f., but they are much more variable in
Fig. 6.2 compared to Fig. 6.1. Exercise 17 helps to explain why the bar heights are
more variable in this example.


Figure 6.2 Modified histogram of service times from Example 6.2.4 together with
graph of the conditional p.d.f. This time, the width of each interval is 0.5.


The reasoning used to construct Figures 6.1 and 6.2 applies even when the 
subintervals used to construct the histogram have different widths. In this case, each 
bar should have height equal to the raw count divided by both n (the sample size) 
and the width of the corresponding subinterval. 
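
The rule just stated can be sketched in a few lines of Python. This is only an illustration: the function name `density_histogram`, the sample values, and the subinterval endpoints are made up for the example and do not come from Example 6.2.4.

```python
from bisect import bisect_right

def density_histogram(sample, edges):
    """Bar heights = raw count / (n * width) for bins [edges[i], edges[i+1])."""
    n = len(sample)
    counts = [0] * (len(edges) - 1)
    for x in sample:
        if edges[0] <= x < edges[-1]:
            # bisect_right gives the index of the first edge strictly above x
            counts[bisect_right(edges, x) - 1] += 1
    return [c / (n * (hi - lo))
            for c, lo, hi in zip(counts, edges[:-1], edges[1:])]

# Hypothetical data and unequal-width subintervals (illustrative only).
sample = [0.2, 0.4, 0.7, 1.1, 1.3, 1.9, 2.5, 3.2, 3.8, 4.6]
edges = [0.0, 0.5, 1.0, 2.0, 5.0]

heights = density_histogram(sample, edges)
areas = [h * (hi - lo) for h, lo, hi in zip(heights, edges[:-1], edges[1:])]
print(heights)
print(sum(areas))
```

With this scaling the bar areas are the sample proportions, so they sum to 1 whenever every observation falls in some subinterval, which is what makes the bars comparable to a p.d.f.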


Weak Laws and Strong Laws 


There are other concepts of the convergence of a sequence of random variables, 
in addition to the concept of convergence in probability that has been presented 
above. For example, it is said that a sequence $Z_1, Z_2, \ldots$ converges to a constant $b$ 
with probability 1 if 

\[
\Pr\left(\lim_{n \to \infty} Z_n = b\right) = 1.
\]

A careful investigation of the concept of convergence with probability 1 is beyond 
the scope of this book. It can be shown that if a sequence $Z_1, Z_2, \ldots$ converges to 
b with probability 1, then the sequence will also converge to b in probability. For this 
reason, convergence with probability 1 is often called strong convergence, whereas 
convergence in probability is called weak convergence. In order to emphasize the 
distinction between these two concepts of convergence, the result that here has been 
called simply the law of large numbers is often called the weak law of large numbers. 
The strong law of large numbers can then be stated as follows: If $\bar{X}_n$ is the sample 
mean of a random sample of size $n$ from a distribution with mean $\mu$, then 

\[
\Pr\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1.
\]


The proof of this result will not be given here. There are examples of sequences of 
random variables that converge in probability but that do not converge with probability 1. 
Exercise 22 is one such example. Another type of convergence is convergence 
in quadratic mean, which is introduced in Exercises 10-13. 

Chernoff Bounds 


One way to think of the Chebyshev inequality is as an application of the Markov 
inequality to the random variable $(X - \mu)^2$. This idea generalizes to other functions 
and leads to a sharper bound on the probability in the tail of a distribution when the 
bound applies. Before giving the general result, we give a simple example to illustrate 
the potential improvement that it can provide. 


Example 
6.2.6 


Binomial Random Variable. Suppose that X has the binomial distribution with parameters 
n and 1/2. We would like a bound to the probability that X/n is far from its 
mean 1/2. To be specific, suppose that we would like a bound for 

\[
\Pr\left(\left|\frac{X}{n} - \frac{1}{2}\right| > \frac{1}{10}\right). \tag{6.2.7}
\]

The Chebyshev inequality gives the bound $\operatorname{Var}(X/n)/(1/10)^2$, which equals 25/n. 
Instead of applying the Chebyshev inequality, define $Y = X - n/2$ and rewrite 
the probability in (6.2.7) as the sum of the following two probabilities: 

\[
\Pr\left(\frac{X}{n} - \frac{1}{2} > \frac{1}{10}\right) = \Pr\left(Y > \frac{n}{10}\right)
\quad\text{and}\quad
\Pr\left(\frac{X}{n} - \frac{1}{2} < -\frac{1}{10}\right) = \Pr\left(-Y > \frac{n}{10}\right). \tag{6.2.8}
\]


For each s > 0, rewrite the first of the probabilities in (6.2.8) as 

\[
\Pr\left(Y > \frac{n}{10}\right) = \Pr\left[\exp(sY) > \exp\left(\frac{ns}{10}\right)\right]
\le \frac{E[\exp(sY)]}{\exp(ns/10)},
\]

where the inequality follows from the Markov inequality. This bound involves 
the moment generating function of Y, $\psi(s) = E[\exp(sY)]$. The m.g.f. of Y can be 
found by applying Theorem 4.4.3 with p = 1/2, a = 1, and b = -n/2 together with 
Equation (5.2.4). The result is 

\[
\psi(s) = \left(\tfrac{1}{2}[\exp(s) + 1]\exp(-s/2)\right)^n \tag{6.2.9}
\]

for all s. Let s = 1/2 in (6.2.9) to obtain the bound 

\[
\Pr\left(Y > \frac{n}{10}\right) \le \psi(1/2)\exp(-n/20)
= \exp(-n/20)\left(\tfrac{1}{2}[\exp(1/2) + 1]\exp(-1/4)\right)^n = 0.9811^n.
\]

Similarly, we can write the second probability in (6.2.8) as 

\[
\Pr\left(-Y > \frac{n}{10}\right) = \Pr\left[\exp(-sY) > \exp\left(\frac{ns}{10}\right)\right], \tag{6.2.10}
\]

where s > 0. The m.g.f. of $-Y$ is $\psi(-s)$. Let s = 1/2 in (6.2.10) and apply the Markov 
inequality to obtain the bound 

\[
\Pr\left(-Y > \frac{n}{10}\right) \le \psi(-1/2)\exp(-n/20)
= \exp(-n/20)\left(\tfrac{1}{2}[\exp(-1/2) + 1]\exp(1/4)\right)^n = 0.9811^n.
\]

Hence, we obtain the bound 

\[
\Pr\left(\left|\frac{X}{n} - \frac{1}{2}\right| > \frac{1}{10}\right) \le 2(0.9811)^n. \tag{6.2.11}
\]

The bound in (6.2.11) decreases exponentially fast as n increases, while the Chebyshev 
bound 25/n decreases proportionally to 1/n. For example, with n = 100, 200, 
300, the Chebyshev bounds are 0.25, 0.125, and 0.0833. The corresponding bounds 
from (6.2.11) are 0.2967, 0.0440, and 0.0065. 


The choice of s = 1/2 in Example 6.2.6 was arbitrary. Theorem 6.2.7 says that we 
can replace this arbitrary choice with the choice that leads to the smallest possible 
bound. The proof of Theorem 6.2.7 is a straightforward application of the Markov 
inequality. (See Exercise 18 in this section.) 
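
The numerical comparison in Example 6.2.6 can be checked with a short Python sketch. The function names are ours, and the choice s = 1/2 is the arbitrary one used in the example rather than the minimizing value. The results may differ from the text in the last digit because the text rounds the base to 0.9811 before raising it to the power n.

```python
import math

def chebyshev_bound(n):
    # Var(X/n) / (1/10)^2, with Var(X/n) = 1/(4n) for a binomial with p = 1/2
    return 25.0 / n

def mgf_bound(n, s=0.5):
    # Two-sided bound 2 * psi(s) * exp(-n*s/10), with psi taken from (6.2.9).
    # By symmetry of the binomial with p = 1/2, psi(-s) = psi(s).
    per_trial = 0.5 * (math.exp(s) + 1.0) * math.exp(-s / 2.0) * math.exp(-s / 10.0)
    return 2.0 * per_trial ** n

for n in (100, 200, 300):
    print(n, round(chebyshev_bound(n), 4), round(mgf_bound(n), 4))
```

For n = 100 the Chebyshev bound is still the smaller of the two; the exponential bound only wins once n is somewhat larger.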


Theorem 
6.2.7 


Chernoff Bounds. Let X be a random variable with moment generating function $\psi$. 
Then, for every real t, 

\[
\Pr(X > t) \le \min_{s > 0} \exp(-st)\,\psi(s).
\]


Theorem 6.2.7 is most useful when X is the sum of n i.i.d. random variables each 
with finite m.g.f. and when t = nu for a large value of n and some fixed u. This was 
the case in Example 6.2.6. 


Example 
6.2.7 


Average of Geometric Random Sample. Suppose that $X_1, X_2, \ldots$ are i.i.d. geometric 
random variables with parameter p. We would like a bound to the probability that 
$\bar{X}_n$ is far from the mean $(1 - p)/p$. To be specific, for each fixed u > 0, we would like 
a bound for 

\[
\Pr\left(\left|\bar{X}_n - \frac{1 - p}{p}\right| > u\right). \tag{6.2.12}
\]

Let $X = \sum_{i=1}^n X_i - n(1 - p)/p$. For each u > 0, Theorem 6.2.7 can be used to bound 
both 

\[
\Pr\left(\bar{X}_n > \frac{1 - p}{p} + u\right) = \Pr(X > nu), \quad\text{and}\quad
\Pr\left(\bar{X}_n < \frac{1 - p}{p} - u\right) = \Pr(-X > nu).
\]

Since (6.2.12) equals $\Pr(X > nu) + \Pr(-X > nu)$, the bound we seek is the sum of 
the two bounds that we get for $\Pr(X > nu)$ and $\Pr(-X > nu)$. 

The m.g.f. of X can be found by applying Theorem 4.4.3 with a = 1 and 
$b = -n(1 - p)/p$ together with Theorem 5.5.3. The result is 

\[
\psi(s) = \left(\frac{p \exp[-s(1 - p)/p]}{1 - (1 - p)\exp(s)}\right)^n. \tag{6.2.13}
\]

The m.g.f. of $-X$ is $\psi(-s)$. According to Theorem 6.2.7, 

\[
\Pr(X > nu) \le \min_{s > 0} \psi(s)\exp(-snu). \tag{6.2.14}
\]

We find the minimum of $\psi(s)\exp(-snu)$ by finding the minimum of its logarithm. 
Using (6.2.13), we get that 

\[
\log[\psi(s)\exp(-snu)] = n\left\{\log(p) - s\,\frac{1 - p}{p} - \log[1 - (1 - p)\exp(s)] - su\right\}.
\]

The derivative of this expression with respect to s equals 0 at 

\[
s = -\log\left(\frac{[(1 + u)p + 1 - p](1 - p)}{up + 1 - p}\right), \tag{6.2.15}
\]

and the second derivative is positive. If u > 0, then the value of s in (6.2.15) is positive 
and $\psi(s)$ is finite. Hence, the value of s in (6.2.15) provides the minimum in (6.2.14). 
That minimum can be expressed as $q^n$, where 

\[
q = [p(1 + u) + 1 - p]\left(\frac{[(1 + u)p + 1 - p](1 - p)}{up + 1 - p}\right)^{u + (1 - p)/p}, \tag{6.2.16}
\]

and 0 < q < 1. (See Exercise 19 for a proof.) Hence, $\Pr(X > nu) \le q^n$. 

For $\Pr(-X > nu)$, we notice first that $\Pr(-X > nu) = 0$ if $u \ge (1 - p)/p$ because 
$\sum_{i=1}^n X_i \ge 0$. If $u \ge (1 - p)/p$, then the overall bound on (6.2.12) is $q^n$. For 
$0 < u < (1 - p)/p$, the value of s that minimizes $\psi(-s)\exp(-snu)$ is 

\[
s = \log\left(\frac{[(1 - u)p + 1 - p](1 - p)}{1 - p - up}\right),
\]

which is positive when $0 < u < (1 - p)/p$. The value of $\min_{s > 0} \psi(-s)\exp(-snu)$ is 
$r^n$, where 

\[
r = [p(1 - u) + 1 - p]\left(\frac{[(1 - u)p + 1 - p](1 - p)}{1 - p - up}\right)^{-u + (1 - p)/p},
\]

and 0 < r < 1. Hence, the Chernoff bound is $q^n$ if $u \ge (1 - p)/p$ and is $q^n + r^n$ if 
$0 < u < (1 - p)/p$. As such, the bound decreases exponentially fast as n increases. 
This is a marked improvement over the Chebyshev bound, which decreases like a 
constant over n. 
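
As a sanity check on (6.2.15) and (6.2.16), the following Python sketch compares the closed-form value of q with a brute-force grid minimization of the per-observation exponent. The values p = 0.5 and u = 1 and the grid resolution are arbitrary choices made for illustration, and the function names are ours.

```python
import math

def log_objective(s, p, u):
    # (1/n) * log[psi(s) * exp(-s*n*u)] from (6.2.13); needs (1 - p) * exp(s) < 1
    return (math.log(p) - s * (1 - p) / p
            - math.log(1 - (1 - p) * math.exp(s)) - s * u)

def s_star(p, u):
    # Closed-form minimizer (6.2.15), written with (1 + u)p + 1 - p = 1 + u*p
    return -math.log((1 + u * p) * (1 - p) / (u * p + 1 - p))

def q_closed_form(p, u):
    # Closed-form minimum (6.2.16)
    base = (1 + u * p) * (1 - p) / (u * p + 1 - p)
    return (1 + u * p) * base ** (u + (1 - p) / p)

p, u = 0.5, 1.0                      # illustrative values only
s_max = -math.log(1 - p)             # the m.g.f. is finite only for s < s_max
grid = [s_max * k / 10000 for k in range(1, 10000)]
q_grid = math.exp(min(log_objective(s, p, u) for s in grid))
q = q_closed_form(p, u)
print(s_star(p, u), q, q_grid)
```

The grid minimum can only sit at or slightly above the true minimum, so agreement to several decimal places supports the algebra without replacing the proof requested in Exercise 19.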
Summary 


The law of large numbers says that the sample mean of a random sample converges 
in probability to the mean $\mu$ of the individual random variables, if the variance exists. 
This means that the sample mean will be close to $\mu$ if the size of the random sample 
is sufficiently large. The Chebyshev inequality provides a (crude) bound on how high 
the probability is that the sample mean will be close to $\mu$. Chernoff bounds can be 
sharper, but are harder to compute. 


1. For each integer n, let $X_n$ be a nonnegative random 
variable with finite mean $\mu_n$. Prove that if $\lim_{n \to \infty} \mu_n = 0$, 
then $X_n \xrightarrow{p} 0$. 


2. Suppose that X is a random variable for which 
$\Pr(X \ge 0) = 1$ and $\Pr(X \ge 10) = 1/5$. 
Prove that $E(X) \ge 2$. 


3. Suppose that X is a random variable for which E(X) = 
10, $\Pr(X \le 7) = 0.2$, and $\Pr(X \ge 13) = 0.3$. Prove that 
$\operatorname{Var}(X) \ge 9/2$. 


4. Let X be a random variable for which $E(X) = \mu$ and 
$\operatorname{Var}(X) = \sigma^2$. Construct a probability distribution for X 
such that 

\[
\Pr(|X - \mu| \ge 3\sigma) = 1/9.
\]


5. How large a random sample must be taken from a given 
distribution in order for the probability to be at least 0.99 
that the sample mean will be within 2 standard deviations 
of the mean of the distribution? 


6. Suppose that $X_1, \ldots, X_n$ form a random sample of size 
n from a distribution for which the mean is 6.5 and the 
variance is 4. Determine how large the value of n must be 
in order for the following relation to be satisfied: 

\[
\Pr(6 \le \bar{X}_n \le 7) \ge 0.8.
\]


7. Suppose that X is a random variable for which $E(X) = \mu$ 
and $E[(X - \mu)^4] = \beta_4$. Prove that 

\[
\Pr(|X - \mu| \ge t) \le \frac{\beta_4}{t^4}.
\]


8. Suppose that 30 percent of the items in a large manufactured 
lot are of poor quality. Suppose also that a random 
sample of n items is to be taken from the lot, and 
let $Q_n$ denote the proportion of the items in the sample 
that are of poor quality. Find a value of n such that 
$\Pr(0.2 \le Q_n \le 0.4) \ge 0.75$ by using (a) the Chebyshev 
inequality and (b) the tables of the binomial distribution at 
the end of this book. 


9. Let $Z_1, Z_2, \ldots$ be a sequence of random variables, and 
suppose that, for n = 1, 2, ..., the distribution of $Z_n$ is as 
follows: 

\[
\Pr(Z_n = n^2) = \frac{1}{n} \quad\text{and}\quad \Pr(Z_n = 0) = 1 - \frac{1}{n}.
\]

Show that 

\[
\lim_{n \to \infty} E(Z_n) = \infty \quad\text{but}\quad Z_n \xrightarrow{p} 0.
\]


10. It is said that a sequence of random variables $Z_1, 
Z_2, \ldots$ converges to a constant b in quadratic mean if 

\[
\lim_{n \to \infty} E[(Z_n - b)^2] = 0. \tag{6.2.17}
\]

Show that Eq. (6.2.17) is satisfied if and only if 

\[
\lim_{n \to \infty} E(Z_n) = b \quad\text{and}\quad \lim_{n \to \infty} \operatorname{Var}(Z_n) = 0.
\]

Hint: Use Exercise 5 of Sec. 4.3. 


11. Prove that if a sequence $Z_1, Z_2, \ldots$ converges to a 
constant b in quadratic mean, then the sequence also converges 
to b in probability. 


12. Let $\bar{X}_n$ be the sample mean of a random sample of 
size n from a distribution for which the mean is $\mu$ and the 
variance is $\sigma^2$, where $\sigma^2 < \infty$. Show that $\bar{X}_n$ converges to 
$\mu$ in quadratic mean as $n \to \infty$. 


13. Let $Z_1, Z_2, \ldots$ be a sequence of random variables, and 
suppose that for n = 2, 3, ..., the distribution of $Z_n$ is as 
follows: 

\[
\Pr\left(Z_n = \frac{1}{n}\right) = 1 - \frac{1}{n^2} \quad\text{and}\quad \Pr(Z_n = n) = \frac{1}{n^2}.
\]

a. Does there exist a constant c to which the sequence 
converges in probability? 


b. Does there exist a constant c to which the sequence 
converges in quadratic mean? 


14. Let f be a p.f. for a discrete distribution. Suppose 
that f(x) = 0 for $x \notin [0, 1]$. Prove that the variance of 
this distribution is at most 1/4. Hint: Prove that there is 
a distribution supported on just the two points {0, 1} that 
has variance at least as large as f does and then prove that 
the variance of a distribution supported on {0, 1} is at most 
1/4. 


15. Prove Theorem 6.2.5. 


16. Suppose that $Z_n \xrightarrow{p} b$, $Y_n \xrightarrow{p} c$, and g(z, y) is a 
function that is continuous at (z, y) = (b, c). Prove that 
$g(Z_n, Y_n)$ converges in probability to g(b, c). 


17. Let X have the binomial distribution with parameters 
n and p. Let Y have the binomial distribution with parameters 
n and p/k with k > 1. Let Z = kY. 


a. Show that X and Z have the same mean. 


b. Find the variances of X and Z. Show that, if p is small, 
then the variance of Z is approximately k times as 
large as the variance of X. 


c. Show why the results above explain the higher vari- 
ability in the bar heights in Fig. 6.2 compared to 
Fig. 6.1. 


18. Prove Theorem 6.2.7. 


19. Return to Example 6.2.7. 
a. Prove that $\min_{s > 0} \psi(s)\exp(-snu)$ equals $q^n$, 
where q is given in (6.2.16). 
b. Prove that 0 < q < 1. Hint: First, show that 0 < q < 1 
if u = 0. Next, let $x = up + 1 - p$ and show that log(q) 
is a decreasing function of x. 


20. Return to Example 6.2.6. Find the Chernoff bound for 
the probability in (6.2.7). 


21. Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables 
having the exponential distribution with parameter 
1. Let $Y_n = \sum_{i=1}^n X_i$ for each n = 1, 2, .... 
a. For each u > 1, compute the Chernoff bound on 
$\Pr(Y_n > nu)$. 
b. What goes wrong if we try to compute the Chernoff 
bound when u < 1? 


22. In this exercise, we construct an example of a sequence 
of random variables $Z_n$ such that $Z_n \xrightarrow{p} 0$ but 

\[
\Pr\left(\lim_{n \to \infty} Z_n = 0\right) = 0. \tag{6.2.18}
\]

That is, $Z_n$ converges in probability to 0, but $Z_n$ does not 
converge to 0 with probability 1. Indeed, $Z_n$ converges to 
0 with probability 0. 
Let X be a random variable having the uniform distribution 
on the interval [0, 1]. We will construct a sequence 
of functions $h_n(x)$ for n = 1, 2, ... and define $Z_n = h_n(X)$. 
Each function $h_n$ will take only two values, 0 and 1. The 
set of x where $h_n(x) = 1$ is determined by dividing the interval 
[0, 1] into k nonoverlapping subintervals of length 
1/k for k = 1, 2, ..., arranging these intervals in sequence, 
and letting $h_n(x) = 1$ on the nth interval in the sequence 
for n = 1, 2, .... For each k, there are k nonoverlapping 
subintervals, so the number of subintervals with lengths 
1, 1/2, 1/3, ..., 1/k is 

\[
1 + 2 + 3 + \cdots + k = \frac{k(k + 1)}{2}.
\]

The remainder of the construction is based on this formula. 
The first interval in the sequence has length 1, the 
next two have length 1/2, the next three have length 1/3, 
etc. 


a. For each n = 1, 2, ..., prove that there is a unique 
positive integer $k_n$ such that 

\[
\frac{(k_n - 1)k_n}{2} < n \le \frac{k_n(k_n + 1)}{2}.
\]


b. For each n = 1, 2, ..., let $j_n = n - (k_n - 1)k_n/2$. Show 
that $j_n$ takes the values $1, \ldots, k_n$ as n runs through 
$1 + (k_n - 1)k_n/2, \ldots, k_n(k_n + 1)/2$. 


c. Define 

\[
h_n(x) = \begin{cases} 1 & \text{if } (j_n - 1)/k_n \le x < j_n/k_n, \\ 0 & \text{if not.} \end{cases}
\]

Show that, for every $x \in [0, 1)$, $h_n(x) = 1$ for one 
and only one n among $1 + (k_n - 1)k_n/2, \ldots, k_n(k_n + 1)/2$. 

d. Show that $Z_n = h_n(X)$ takes the value 1 infinitely 
often with probability 1. 


e. Show that (6.2.18) holds. 


f. Show that $\Pr(Z_n = 0) = 1 - 1/k_n$ and $\lim_{n \to \infty} k_n = \infty$. 


g. Show that $Z_n \xrightarrow{p} 0$. 


23. Prove that the sequence of random variables $Z_n$ in 
Exercise 22 converges in quadratic mean (definition in 
Exercise 10) to 0. 


24. In this exercise, we construct an example of a sequence 
of random variables $Z_n$ such that $Z_n$ converges 
to 0 with probability 1, but $Z_n$ fails to converge to 0 in 
quadratic mean. Let X be a random variable having the 
uniform distribution on the interval [0, 1]. Define the sequence 
$Z_n$ by $Z_n = n^2$ if 0 < X < 1/n and $Z_n = 0$ otherwise. 


a. Prove that $Z_n$ converges to 0 with probability 1. 


b. Prove that $Z_n$ does not converge to 0 in quadratic 
mean. 


6.3 The Central Limit Theorem 


The sample mean of a large random sample of random variables with mean $\mu$ 
and finite variance $\sigma^2$ has approximately the normal distribution with mean $\mu$ 
and variance $\sigma^2/n$. This result helps to justify the use of the normal distribution 
as a model for many random variables that can be thought of as being made up 
of many independent parts. Another version of the central limit theorem is given 
that applies to independent random variables that are not identically distributed. 
We also introduce the delta method, which allows us to compute approximate 
distributions for functions of random variables. 


Statement of the Theorem 


Example 
6.3.1 


A Large Sample. A clinical trial has 100 patients who will receive a treatment. Patients 
who don’t receive the treatment survive for 18 months with probability 0.5 each. We 
assume that all patients are independent. The trial is to see whether the new treatment 
can increase the probability of survival significantly. Let X be the number of patients 
out of the 100 who survive for 18 months. If the probability of success were 0.5 for the 
patients on the treatment (the same as without the treatment), then X would have the 
binomial distribution with parameters n = 100 and p = 0.5. The p.f. of X is graphed 
as a bar chart with the solid line in Fig. 6.3. The shape of the bar chart is reminiscent 
of a bell-shaped curve. The normal p.d.f. with the same mean $\mu = 50$ and variance 
$\sigma^2 = 25$ as the binomial distribution is also graphed with the dotted line. 


Figure 6.3 Comparison of the binomial p.f. with parameters 100 and 0.5 to 
the normal p.d.f. with mean 50 and variance 25. 

In Examples 5.4.1 and 5.4.2, we illustrated how the Poisson distribution provides 
a good approximation to a binomial distribution with a large n and small p. 
Example 6.3.1 shows how a normal distribution can be a good approximation to a 
binomial distribution with a large n and not so small p. The central limit theorem 
(Theorem 6.3.1) is a formal statement of how normal distributions can approximate 
distributions of general sums or averages of i.i.d. random variables. 

In Corollary 5.6.2, we saw that if a random sample of size n is taken from the 
normal distribution with mean $\mu$ and variance $\sigma^2$, then the sample average $\bar{X}_n$ has 
the normal distribution with mean $\mu$ and variance $\sigma^2/n$. The simple version of the 
central limit theorem that we give in this section says that whenever a random sample 
of size n is taken from any distribution with mean $\mu$ and variance $\sigma^2$, the sample 
average $\bar{X}_n$ will have a distribution that is approximately normal with mean $\mu$ and 
variance $\sigma^2/n$. 

This result was established for a random sample from a Bernoulli distribution 
by A. de Moivre in the early part of the eighteenth century. The proof for a random 
sample from an arbitrary distribution was given independently by J. W. Lindeberg 
and P. Lévy in the early 1920s. A precise statement of their theorem will be given 
now, and an outline of the proof of that theorem will be given later in this section. We 
shall also state another central limit theorem pertaining to the sum of independent 
random variables that are not necessarily identically distributed and shall present 
some examples illustrating both theorems. 


Central Limit Theorem (Lindeberg and Lévy). If the random variables $X_1, \ldots, X_n$ form 
a random sample of size n from a given distribution with mean $\mu$ and variance $\sigma^2$ 
($0 < \sigma^2 < \infty$), then for each fixed number x, 

\[
\lim_{n \to \infty} \Pr\left[\frac{\bar{X}_n - \mu}{\sigma/n^{1/2}} \le x\right] = \Phi(x), \tag{6.3.1}
\]

where $\Phi$ denotes the c.d.f. of the standard normal distribution. 


The interpretation of Eq. (6.3.1) is as follows: If a large random sample is taken 
from any distribution with mean $\mu$ and variance $\sigma^2$, regardless of whether this 
distribution is discrete or continuous, then the distribution of the random variable 
$n^{1/2}(\bar{X}_n - \mu)/\sigma$ will be approximately the standard normal distribution. Therefore, 
the distribution of $\bar{X}_n$ will be approximately the normal distribution with mean $\mu$ 
and variance $\sigma^2/n$, or, equivalently, the distribution of the sum $\sum_{i=1}^n X_i$ will be 
approximately the normal distribution with mean $n\mu$ and variance $n\sigma^2$. It is in this 
last form that the central limit theorem was illustrated in Example 6.3.1. 


Tossing a Coin. Suppose that a fair coin is tossed 900 times. We shall approximate the 
probability of obtaining more than 495 heads. 

For i = 1, ..., 900, let $X_i = 1$ if a head is obtained on the ith toss and let $X_i = 0$ 
otherwise. Then $E(X_i) = 1/2$ and $\operatorname{Var}(X_i) = 1/4$. Therefore, the values $X_1, \ldots, X_{900}$ 
form a random sample of size n = 900 from a distribution with mean 1/2 and variance 
1/4. It follows from the central limit theorem that the distribution of the total number 
of heads $H = \sum_{i=1}^{900} X_i$ will be approximately the normal distribution for which the 
mean is (900)(1/2) = 450, the variance is (900)(1/4) = 225, and the standard deviation 
is $(225)^{1/2} = 15$. Therefore, the variable Z = (H - 450)/15 will have approximately 
the standard normal distribution. Thus, 

\[
\Pr(H > 495) = \Pr\left(\frac{H - 450}{15} > \frac{495 - 450}{15}\right)
= \Pr(Z > 3) \approx 1 - \Phi(3) = 0.0013.
\]

The exact probability is 0.0012 to four decimal places. 
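
Both numbers in this example can be reproduced with Python's standard library. The sketch below is ours, not part of the original example: it evaluates the normal approximation with `statistics.NormalDist` and obtains the exact binomial tail by summing binomial coefficients with exact integer arithmetic.

```python
from fractions import Fraction
from math import comb
from statistics import NormalDist

n, threshold = 900, 495
mean = n * 0.5                  # 450
sd = (n * 0.25) ** 0.5          # 15

# Normal approximation: Pr(Z > 3) = 1 - Phi(3)
approx = 1.0 - NormalDist().cdf((threshold - mean) / sd)

# Exact Pr(H > 495) for H binomial(900, 1/2), using exact integers
tail_count = sum(comb(n, k) for k in range(threshold + 1, n + 1))
exact = float(Fraction(tail_count, 2 ** n))

print(round(approx, 4), round(exact, 4))
```

Summing integers and dividing once avoids accumulating floating-point error across the 405 terms of the tail.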


Sampling from a Uniform Distribution. Suppose that a random sample of size n = 12 is 
taken from the uniform distribution on the interval [0, 1]. We shall approximate the 
value of $\Pr\left(\left|\bar{X}_n - \frac{1}{2}\right| \le 0.1\right)$. 

The mean of the uniform distribution on the interval [0, 1] is 1/2, and the variance 
is 1/12 (see Exercise 3 of Sec. 4.3). Since n = 12 in this example, it follows from the 
central limit theorem that the distribution of $\bar{X}_n$ will be approximately the normal 
distribution with mean 1/2 and variance 1/144. Therefore, the distribution of the 
variable $Z = 12\left(\bar{X}_n - \frac{1}{2}\right)$ will be approximately the standard normal distribution. 
Hence, 

\[
\Pr\left(\left|\bar{X}_n - \frac{1}{2}\right| \le 0.1\right) = \Pr\left(12\left|\bar{X}_n - \frac{1}{2}\right| \le 1.2\right)
= \Pr(|Z| \le 1.2) \approx 2\Phi(1.2) - 1 = 0.7698.
\]

For the special case of n = 12, the random variable Z has the form $Z = \sum_{i=1}^{12} X_i - 6$. 
At one time, some computers produced standard normal pseudo-random numbers 
by adding 12 uniform pseudo-random numbers and subtracting 6. 
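
A small simulation gives a rough check on how good the approximation is for n = 12. The seed and the number of replications below are arbitrary choices of ours, and the simulated proportion is itself subject to sampling error of about 0.002.

```python
import random
from statistics import NormalDist

random.seed(12345)              # arbitrary seed, for reproducibility only
reps, n = 50_000, 12

# Proportion of simulated samples whose mean is within 0.1 of 1/2
hits = sum(
    abs(sum(random.random() for _ in range(n)) / n - 0.5) <= 0.1
    for _ in range(reps)
)
simulated = hits / reps

# Central limit theorem approximation: 2 * Phi(1.2) - 1
clt = 2 * NormalDist().cdf(1.2) - 1

print(round(simulated, 4), round(clt, 4))
```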


Poisson Random Variables. Suppose that $X_1, \ldots, X_n$ form a random sample from the 
Poisson distribution with mean $\theta$. Let $\bar{X}_n$ be the average. Then $\mu = \theta$ and $\sigma^2 = \theta$. 
The central limit theorem says that $n^{1/2}(\bar{X}_n - \theta)/\theta^{1/2}$ has approximately the standard 
normal distribution. In particular, the central limit theorem says that $\bar{X}_n$ should be 
close to $\mu$ with high probability. The probability that $|\bar{X}_n - \theta|$ is less than some small 
number c could be approximated using the standard normal c.d.f.: 

\[
\Pr(|\bar{X}_n - \theta| < c) \approx 2\Phi\left(c\,n^{1/2}\theta^{-1/2}\right) - 1. \tag{6.3.2}
\]



The type of convergence that appears in the central limit theorem, specifically, 
Eq. (6.3.1), arises in other contexts and has a special name. 

Convergence in Distribution/Asymptotic Distribution. Let $X_1, X_2, \ldots$ be a sequence of 
random variables, and for n = 1, 2, ..., let $F_n$ denote the c.d.f. of $X_n$. Also, let $F^*$ be 
a c.d.f. Then it is said that the sequence $X_1, X_2, \ldots$ converges in distribution to $F^*$ if 

\[
\lim_{n \to \infty} F_n(x) = F^*(x), \tag{6.3.3}
\]

for all x at which $F^*(x)$ is continuous. Sometimes, it is simply said that $X_n$ converges 
in distribution to $F^*$, and $F^*$ is called the asymptotic distribution of $X_n$. If $F^*$ has a 
name, then we say that $X_n$ converges in distribution to that name. 


Thus, according to Theorem 6.3.1, as indicated in Eq. (6.3.1), the random variable 
$n^{1/2}(\bar{X}_n - \mu)/\sigma$ converges in distribution to the standard normal distribution, or, 
equivalently, the asymptotic distribution of $n^{1/2}(\bar{X}_n - \mu)/\sigma$ is the standard normal 
distribution. 


Effect of the Central Limit Theorem The central limit theorem provides a plausible 
explanation for the fact that the distributions of many random variables studied in 
physical experiments are approximately normal. For example, a person’s height is 
influenced by many random factors. If the height of each person is determined by 
adding the values of these individual factors, then the distribution of the heights of a 
large number of persons will be approximately normal. In general, the central limit 
theorem indicates that the distribution of the sum of many random variables can be 
approximately normal, even though the distribution of each random variable in the 
sum differs from the normal. 


Determining a Simulation Size. In Example 6.2.2 on page 351, an environmental engineer 
wanted to determine the size of a simulation to estimate the mean proportion of 
water contaminant that was lead. Use of the Chebyshev inequality in that example 
suggested that a simulation of size 2,000,000 will guarantee that the estimate will be 
less than 0.005 away from the true mean proportion with probability at least 0.98. 
In this example, we shall use the central limit theorem to determine a much smaller 
simulation size that should still provide the same accuracy bound. The estimate of the 
mean proportion will be the average $\bar{R}_n$ of all of the simulated proportions $R_1, \ldots, R_n$ 
from the n simulations that will be run. As we noted in Example 6.2.2, the variance 
of each $R_i$ is $\sigma^2 \le 1$, and hence the central limit theorem says that $\bar{R}_n$ has approximately 
the normal distribution with mean equal to the true mean proportion $E(R_1)$ 
and variance at most 1/n. Since the probability of being close to the mean decreases 
as the variance increases, we see that 

\[
\Pr(|\bar{R}_n - E(R_1)| < 0.005) \approx \Phi\left(\frac{0.005}{\sigma/\sqrt{n}}\right) - \Phi\left(\frac{-0.005}{\sigma/\sqrt{n}}\right)
\ge \Phi\left(\frac{0.005}{1/\sqrt{n}}\right) - \Phi\left(\frac{-0.005}{1/\sqrt{n}}\right)
= 2\Phi(0.005\sqrt{n}) - 1.
\]

If we set $2\Phi(0.005\sqrt{n}) - 1 = 0.98$, we obtain 

\[
n = \frac{1}{0.005^2}\,\Phi^{-1}(0.99)^2 = 40{,}000 \times 2.326^2 = 216{,}411.
\]

That is, we only need a little more than 10 percent of the simulation size that the 
Chebyshev inequality suggested. (Since $\sigma^2$ is actually no more than 1/4, we really only 
need n = 54,103. See Exercise 14 in Sec. 6.2 for a proof that a discrete distribution on 
the interval [0, 1] can have variance at most 1/4. The continuous case is slightly more 
complicated, but also true.) 
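
The sample-size calculation can be written as a small Python function. The name `clt_sample_size` and its arguments are ours. Because the code uses the unrounded quantile $\Phi^{-1}(0.99) \approx 2.3263$ instead of the rounded 2.326 used above, and rounds up to an integer, its answers are slightly larger than 216,411 and 54,103.

```python
import math
from statistics import NormalDist

def clt_sample_size(half_width, prob, sigma_max):
    """Smallest n with 2*Phi(half_width*sqrt(n)/sigma_max) - 1 >= prob."""
    z = NormalDist().inv_cdf((1 + prob) / 2)   # e.g. Phi^{-1}(0.99) for prob = 0.98
    return math.ceil((z * sigma_max / half_width) ** 2)

n_crude = clt_sample_size(0.005, 0.98, 1.0)    # uses the bound sigma <= 1
n_sharp = clt_sample_size(0.005, 0.98, 0.5)    # uses the bound sigma^2 <= 1/4

print(n_crude, n_sharp)
```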


Other Examples of Convergence in Distribution In Chapter 5, we saw three exam- 
ples of limit theorems involving discrete distributions. Theorems 5.3.4, 5.4.5, and 5.4.6 
all showed that a sequence of p.f.’s converged to some other p.f. In Exercise 7 in 
Sec. 6.5, you can prove a general result that implies that the three theorems just 
mentioned are examples of convergence in distribution. 


The Delta Method 


Rate of Service. Customers arrive at a queue for service, and the ith customer is served 
in some time $X_i$ after reaching the head of the queue. If we assume that $X_1, \ldots, X_n$ 
form a random sample of service times with mean $\mu$ and finite variance $\sigma^2$, we might 
be interested in using $1/\bar{X}_n$ to estimate the rate of service. The central limit theorem 
tells us something about the approximate distribution of $\bar{X}_n$ if n is large, but what can 
we say about the distribution of $1/\bar{X}_n$? 


Suppose that $X_1, \ldots, X_n$ form a random sample from a distribution that has finite 
mean $\mu$ and finite variance $\sigma^2$. The central limit theorem says that $n^{1/2}(\bar{X}_n - \mu)/\sigma$ has 
approximately the standard normal distribution. Now suppose that we are interested 
in the distribution of some function $\alpha$ of $\bar{X}_n$. We shall assume that $\alpha$ is a differentiable 
function whose derivative is nonzero at $\mu$. We shall approximate the distribution of 
$\alpha(\bar{X}_n)$ by a method known in statistics as the delta method. 


Delta Method. Let $Y_1, Y_2, \ldots$ be a sequence of random variables, and let $F^*$ be a 
continuous c.d.f. Let $\theta$ be a real number, and let $a_1, a_2, \ldots$ be a sequence of positive 
numbers that increase to $\infty$. Suppose that $a_n(Y_n - \theta)$ converges in distribution to $F^*$. 
Let $\alpha$ be a function with continuous derivative such that $\alpha'(\theta) \ne 0$. Then 
$a_n[\alpha(Y_n) - \alpha(\theta)]/\alpha'(\theta)$ converges in distribution to $F^*$. 


Proof We shall give only an outline of the proof. Because $a_n \to \infty$, $Y_n$ must get close 
to $\theta$ with high probability as $n \to \infty$. If not, $|a_n(Y_n - \theta)|$ would go to $\infty$ with nonzero 
probability and then the c.d.f. of $a_n(Y_n - \theta)$ would not converge to a c.d.f. Because $\alpha$ 
is continuous, $\alpha(Y_n)$ must also be close to $\alpha(\theta)$ with high probability. Therefore, we 
shall use a Taylor series expansion of $\alpha(Y_n)$ around $\theta$, 

\[
\alpha(Y_n) \approx \alpha(\theta) + \alpha'(\theta)(Y_n - \theta), \tag{6.3.4}
\]

where we have ignored all terms involving $(Y_n - \theta)^2$ and higher powers. Subtract $\alpha(\theta)$ 
from both sides of Eq. (6.3.4), and then multiply both sides by $a_n/\alpha'(\theta)$ to get 

\[
\frac{a_n}{\alpha'(\theta)}\,[\alpha(Y_n) - \alpha(\theta)] \approx a_n(Y_n - \theta). \tag{6.3.5}
\]

We then conclude that the distribution of the left side of Eq. (6.3.5) will be approximately 
the same as the distribution of the right side of the equation, which 
is approximately $F^*$. 


The most common application of Theorem 6.3.2 occurs when $Y_n$ is the average 
of a random sample from a distribution with finite variance. We state that case in the 
following corollary. 


Delta Method for Average of a Random Sample. Let $X_1, X_2, \ldots$ be a sequence of i.i.d. 
random variables from a distribution with mean $\mu$ and finite variance $\sigma^2$. Let $\alpha$ 
be a function with continuous derivative such that $\alpha'(\mu) \ne 0$. Then the asymptotic 
distribution of 

\[
\frac{n^{1/2}}{\sigma\,\alpha'(\mu)}\,[\alpha(\bar{X}_n) - \alpha(\mu)]
\]

is the standard normal distribution. 


Proof Apply Theorem 6.3.2 with $Y_n = \bar{X}_n$, $a_n = n^{1/2}/\sigma$, $\theta = \mu$, and $F^*$ being the 
standard normal c.d.f. 


A common way to report the result in Corollary 6.3.1 is to say that the distribution 
of $\alpha(\bar{X}_n)$ is approximately the normal distribution with mean $\alpha(\mu)$ and variance 
$\sigma^2[\alpha'(\mu)]^2/n$. 


Rate of Service. In Example 6.3.6, we are interested in the distribution of $\alpha(\bar{X}_n)$ where 
$\alpha(x) = 1/x$ for x > 0. We can apply the delta method by finding $\alpha'(x) = -1/x^2$. It 
follows that the asymptotic distribution of 

\[
-\frac{n^{1/2}\mu^2}{\sigma}\left(\frac{1}{\bar{X}_n} - \frac{1}{\mu}\right)
\]

is the standard normal distribution. Alternatively, we might say that $1/\bar{X}_n$ has approximately 
the normal distribution with mean $1/\mu$ and variance $\sigma^2/[n\mu^4]$. 
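
A simulation can illustrate how close the delta-method variance $\sigma^2/[n\mu^4]$ is in a concrete case. The sketch below assumes exponential service times with mean 1 (so $\mu = \sigma = 1$); that distribution, the sample size n = 200, the number of replications, and the seed are all our assumptions, chosen only for illustration.

```python
import random
import statistics

random.seed(2024)                # arbitrary seed, for reproducibility only
mu, sigma = 1.0, 1.0             # assumed: exponential service times with mean 1
n, reps = 200, 2000

def rate_estimate():
    # 1 / (sample mean of n simulated service times)
    xbar = sum(random.expovariate(1.0 / mu) for _ in range(n)) / n
    return 1.0 / xbar

estimates = [rate_estimate() for _ in range(reps)]

delta_var = sigma ** 2 / (n * mu ** 4)       # delta-method variance
sim_var = statistics.variance(estimates)     # simulated variance of 1 / Xbar_n
sim_mean = statistics.fmean(estimates)       # should be near 1 / mu

print(round(sim_mean, 4), round(sim_var, 5), delta_var)
```

The simulated variance will not match exactly, both because of simulation error and because the delta method ignores the higher-order terms dropped in (6.3.4).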


Variance Stabilizing Transformations If we were to observe a random sample of 
Poisson random variables as in Example 6.3.4, we would assume that $\theta$ is unknown. 
In such a case we cannot compute the probability in Eq. (6.3.2), because the approximate 
variance of $\bar{X}_n$ depends on $\theta$. For this reason, it is sometimes desirable 
to transform $\bar{X}_n$ by a function $\alpha$ so that the approximate distribution of $\alpha(\bar{X}_n)$ has a 
variance that is a known value. Such a function is called a variance stabilizing transformation. 
We can often find a variance stabilizing transformation by running the delta 
method in reverse. In general, we note that the approximate distribution of $\alpha(\bar{X}_n)$ 
has variance $\alpha'(\mu)^2\sigma^2/n$. In order to make this variance constant, we need $\alpha'(\mu)$ to 
be a constant times $1/\sigma$. If $\sigma^2$ is a function $g(\mu)$, then we achieve this goal by letting 

\[
\alpha(\mu) = \int_a^{\mu} \frac{dx}{g(x)^{1/2}}, \tag{6.3.6}
\]

where a is an arbitrary constant that makes the integral finite. 


Poisson Random Variables. In Example 6.3.4, we have $\sigma^2 = \theta = \mu$, so that $g(\mu) = \mu$. 
According to Eq. (6.3.6), we should let 

\[
\alpha(\mu) = \int_0^{\mu} \frac{dx}{x^{1/2}} = 2\mu^{1/2}.
\]

It follows that $2\bar{X}_n^{1/2}$ has approximately the normal distribution with mean $2\theta^{1/2}$ and 
variance 1/n. For each number c > 0, we have 

\[
\Pr\left(\left|2\bar{X}_n^{1/2} - 2\theta^{1/2}\right| < c\right) \approx 2\Phi\left(c\,n^{1/2}\right) - 1. \tag{6.3.7}
\]

In Chapter 8, we shall see how to use Eq. (6.3.7) to estimate $\theta$ when we assume 
that $\theta$ is unknown. 
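
The stabilizing effect can be checked without simulation, because the sum of n i.i.d. Poisson random variables with mean $\theta$ has the Poisson distribution with mean $n\theta$. The Python sketch below (function name, n = 50, and the three values of $\theta$ are our choices) computes the variance of $2\bar{X}_n^{1/2}$ by summing over the Poisson p.f. and multiplies it by n, so that values near 1 indicate agreement with the approximate variance 1/n.

```python
import math

def var_of_transformed_mean(theta, n):
    """Variance of 2*sqrt(Xbar_n), using n*Xbar_n ~ Poisson(n*theta)."""
    lam = n * theta
    upper = int(lam + 12 * math.sqrt(lam) + 50)    # tail beyond this is negligible
    m1 = m2 = 0.0
    for k in range(upper + 1):
        # Poisson p.f. computed on the log scale to avoid overflow
        pmf = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        y = 2.0 * math.sqrt(k / n)
        m1 += pmf * y
        m2 += pmf * y * y
    return m2 - m1 * m1

n = 50
scaled = {theta: n * var_of_transformed_mean(theta, n) for theta in (1.0, 5.0, 20.0)}
print(scaled)    # n * Var(2*sqrt(Xbar_n)); compare with n * Var(Xbar_n) = theta
```

Without the transformation, n times the variance of $\bar{X}_n$ equals $\theta$ and so ranges from 1 to 20 over the same three cases.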
The Central Limit Theorem (Liapounov) for the Sum of Independent Random Variables 


We shall now state a central limit theorem that applies to a sequence of random 
variables $X_1, X_2, \ldots$ that are independent but not necessarily identically distributed. 
This theorem was first proved by A. Liapounov in 1901. We shall assume that $E(X_i) = \mu_i$ 
and $\operatorname{Var}(X_i) = \sigma_i^2$ for i = 1, ..., n. Also, we shall let 

\[
Y_n = \frac{\sum_{i=1}^n X_i - \sum_{i=1}^n \mu_i}{\left(\sum_{i=1}^n \sigma_i^2\right)^{1/2}}. \tag{6.3.8}
\]

Then $E(Y_n) = 0$ and $\operatorname{Var}(Y_n) = 1$. The theorem that is stated next gives a sufficient 
condition for the distribution of this random variable $Y_n$ to be approximately the 
standard normal distribution. 


Theorem 
6.3.3 


Suppose that the random variables $X_1, X_2, \ldots$ are independent and that 
$E(|X_i - \mu_i|^3) < \infty$ for i = 1, 2, .... Also, suppose that 

\[
\lim_{n \to \infty} \frac{\sum_{i=1}^n E\left(|X_i - \mu_i|^3\right)}{\left(\sum_{i=1}^n \sigma_i^2\right)^{3/2}} = 0. \tag{6.3.9}
\]

Finally, let the random variable $Y_n$ be as defined in Eq. (6.3.8). Then, for each fixed 
number x, 

\[
\lim_{n \to \infty} \Pr(Y_n \le x) = \Phi(x). \tag{6.3.10}
\]


The interpretation of this theorem is as follows: If Eq. (6.3.9) is satisfied, then for 
every large value of n, the distribution of $\sum_{i=1}^n X_i$ will be approximately the normal 
distribution with mean $\sum_{i=1}^n \mu_i$ and variance $\sum_{i=1}^n \sigma_i^2$. It should be noted that when 
the random variables $X_1, X_2, \ldots$ are identically distributed and the third moments 
of the variables exist, Eq. (6.3.9) will automatically be satisfied and Eq. (6.3.10) then 
reduces to Eq. (6.3.1). 

The distinction between the theorem of Lindeberg and Lévy and the theorem 
of Liapounov should be emphasized. The theorem of Lindeberg and Lévy applies to 
a sequence of i.i.d. random variables. In order for this theorem to be applicable, it 
is sufficient to assume only that the variance of each random variable is finite. The 
theorem of Liapounov applies to a sequence of independent random variables that 
are not necessarily identically distributed. In order for this theorem to be applicable, 
it must be assumed that the third moment of each random variable is finite and 
satisfies Eq. (6.3.9). 


The Central Limit Theorem for Bernoulli Random Variables By applying the
theorem of Liapounov, we can establish the following result.


Theorem 6.3.4 Suppose that the random variables $X_1, \ldots, X_n$ are independent and $X_i$ has the
Bernoulli distribution with parameter $p_i$ ($i = 1, 2, \ldots$). Suppose also that the infinite
series $\sum_{i=1}^{\infty} p_i(1 - p_i)$ is divergent, and let

$$Y_n = \frac{\sum_{i=1}^{n} X_i - \sum_{i=1}^{n} p_i}{\left(\sum_{i=1}^{n} p_i(1 - p_i)\right)^{1/2}}. \tag{6.3.11}$$

Then for every fixed number $x$,

$$\lim_{n \to \infty} \Pr(Y_n \le x) = \Phi(x). \tag{6.3.12}$$


Proof Here $\Pr(X_i = 1) = p_i$ and $\Pr(X_i = 0) = 1 - p_i$. Therefore,
$E(X_i) = p_i$, $\mathrm{Var}(X_i) = p_i(1 - p_i)$, and

$$E\left(|X_i - p_i|^3\right) = p_i(1 - p_i)^3 + (1 - p_i)p_i^3 = p_i(1 - p_i)\left[p_i^2 + (1 - p_i)^2\right] \le p_i(1 - p_i). \tag{6.3.13}$$

It follows that

$$\frac{\sum_{i=1}^{n} E\left(|X_i - p_i|^3\right)}{\left(\sum_{i=1}^{n} p_i(1 - p_i)\right)^{3/2}} \le \frac{1}{\left(\sum_{i=1}^{n} p_i(1 - p_i)\right)^{1/2}}. \tag{6.3.14}$$

Since the infinite series $\sum_{i=1}^{\infty} p_i(1 - p_i)$ is divergent, then $\sum_{i=1}^{n} p_i(1 - p_i) \to \infty$
as $n \to \infty$, and it can be seen from the relation (6.3.14) that Eq. (6.3.9) will be
satisfied. In turn, it follows from Theorem 6.3.3 that Eq. (6.3.10) will be satisfied.
Since Eq. (6.3.12) is simply a restatement of Eq. (6.3.10) for the particular random
variables being considered here, the proof of the theorem is complete.
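The bound in (6.3.14) can be checked numerically. The following Python sketch evaluates the ratio in Eq. (6.3.9) and the bound in (6.3.14) for independent Bernoulli variables. The function name `liapounov_ratio` and the choice $p_i = 1/2$ for every $i$ are illustrative assumptions, not part of the text.

```python
import math

def liapounov_ratio(probs):
    """Return (ratio, bound) for independent Bernoulli(p_i) variables.

    ratio = sum_i E|X_i - p_i|^3 / (sum_i p_i (1 - p_i))^(3/2), as in Eq. (6.3.9)
    bound = 1 / (sum_i p_i (1 - p_i))^(1/2), the right side of (6.3.14)
    """
    # E|X_i - p_i|^3 = p_i (1 - p_i)^3 + (1 - p_i) p_i^3, as in Eq. (6.3.13)
    third = sum(p * (1 - p) ** 3 + (1 - p) * p ** 3 for p in probs)
    var_sum = sum(p * (1 - p) for p in probs)
    return third / var_sum ** 1.5, 1.0 / math.sqrt(var_sum)

# Hypothetical case: every p_i equals 1/2, so the variance sum n/4 diverges.
for n in (10, 100, 1000, 10000):
    ratio, bound = liapounov_ratio([0.5] * n)
    print(n, ratio, bound)
```

In this case the ratio equals $1/n^{1/2}$ and the bound equals $2/n^{1/2}$, so both tend to 0 as $n$ grows, which is what Eq. (6.3.9) requires.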


Theorem 6.3.4 implies that if the infinite series $\sum_{i=1}^{\infty} p_i(1 - p_i)$ is divergent, then
the distribution of the sum $\sum_{i=1}^{n} X_i$ of a large number of independent Bernoulli
random variables will be approximately the normal distribution with mean $\sum_{i=1}^{n} p_i$
and variance $\sum_{i=1}^{n} p_i(1 - p_i)$. It should be kept in mind, however, that a typical
practical problem will involve only a finite number of random variables $X_1, \ldots, X_n$,
rather than an infinite sequence of random variables. In such a problem, it is not
meaningful to consider whether or not the infinite series $\sum_{i=1}^{\infty} p_i(1 - p_i)$ is divergent,
because only a finite number of values $p_1, \ldots, p_n$ will be specified in the problem.
In a certain sense, therefore, the distribution of the sum $\sum_{i=1}^{n} X_i$ can always be
approximated by a normal distribution. The critical question is whether or not this
normal distribution provides a good approximation to the actual distribution of
$\sum_{i=1}^{n} X_i$. The answer depends, of course, on the values of $p_1, \ldots, p_n$.

Since the normal distribution will be attained more and more closely as
$\sum_{i=1}^{n} p_i(1 - p_i) \to \infty$, the normal distribution provides a good approximation when
the value of $\sum_{i=1}^{n} p_i(1 - p_i)$ is large. Furthermore, since the value of each term
$p_i(1 - p_i)$ is a maximum when $p_i = 1/2$, the approximation will be best when $n$ is
large and the values of $p_1, \ldots, p_n$ are close to 1/2.


Example 6.3.9 Examination Questions. Suppose that an examination contains 99 questions arranged
in a sequence from the easiest to the most difficult. Suppose that the probability that
a particular student will answer the first question correctly is 0.99, the probability that
he will answer the second question correctly is 0.98, and, in general, the probability
that he will answer the $i$th question correctly is $1 - i/100$ for $i = 1, \ldots, 99$. It is
assumed that all questions will be answered independently and that the student must
answer at least 60 questions correctly to pass the examination. We shall determine
the probability that the student will pass.

Let $X_i = 1$ if the $i$th question is answered correctly and $X_i = 0$ otherwise. Then
$E(X_i) = p_i = 1 - (i/100)$ and $\mathrm{Var}(X_i) = p_i(1 - p_i) = (i/100)[1 - (i/100)]$. Also,

$$\sum_{i=1}^{99} p_i = 99 - \frac{1}{100} \sum_{i=1}^{99} i = 99 - \frac{1}{100} \cdot \frac{(99)(100)}{2} = 49.5$$

and

$$\sum_{i=1}^{99} p_i(1 - p_i) = \frac{1}{100} \sum_{i=1}^{99} i - \frac{1}{100^2} \sum_{i=1}^{99} i^2 = 49.5 - \frac{1}{100^2} \cdot \frac{(99)(100)(199)}{6} = 16.665.$$

It follows from the central limit theorem that the distribution of the total number
of questions that are answered correctly, which is $\sum_{i=1}^{99} X_i$, will be approximately
the normal distribution with mean 49.5 and standard deviation $(16.665)^{1/2} = 4.08$.
Therefore, the distribution of the variable

$$Z = \frac{\sum_{i=1}^{99} X_i - 49.5}{4.08}$$

will be approximately the standard normal distribution. It follows that

$$\Pr\left(\sum_{i=1}^{99} X_i \ge 60\right) = \Pr(Z \ge 2.5735) \approx 1 - \Phi(2.5735) = 0.0050.$$
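The arithmetic in this example can be reproduced, and the normal approximation compared with the exact distribution of the sum, using a short Python sketch. The helper names `normal_cdf` and `sum_pmf` are assumptions introduced here; the exact p.f. is obtained by convolving the 99 Bernoulli distributions one at a time, which is one of several possible ways to compute it.

```python
import math

def normal_cdf(z):
    """Standard normal c.d.f., written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sum_pmf(probs):
    """p.f. of a sum of independent Bernoulli(p_i) variables, by convolution."""
    pmf = [1.0]                          # distribution of the empty sum
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this question answered incorrectly
            nxt[k + 1] += mass * p       # this question answered correctly
        pmf = nxt
    return pmf

probs = [1.0 - i / 100.0 for i in range(1, 100)]    # p_i = 1 - i/100
mean = sum(probs)                                   # 49.5
variance = sum(p * (1.0 - p) for p in probs)        # 16.665

# Normal approximation to Pr(sum >= 60), without any continuity correction.
approx = 1.0 - normal_cdf((60 - mean) / math.sqrt(variance))
# Exact probability from the convolved p.f.
exact = sum(sum_pmf(probs)[60:])
print(mean, variance, approx, exact)
```

Printing `approx` next to `exact` shows how far the normal approximation is from the exact tail probability for this particular set of $p_i$ values.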
Outline of Proof of Central Limit Theorem


Convergence of the Moment Generating Functions Moment generating functions 
are important in the study of convergence in distribution because of the following 
theorem, the proof of which is too advanced to be presented here. 


Theorem 6.3.5 Let $X_1, X_2, \ldots$ be a sequence of random variables. For $n = 1, 2, \ldots$, let $F_n$ denote
the c.d.f. of $X_n$, and let $\psi_n$ denote the m.g.f. of $X_n$.

Also, let $X^*$ denote another random variable with c.d.f. $F^*$ and m.g.f. $\psi^*$. Suppose
that the m.g.f.'s $\psi_n$ and $\psi^*$ exist ($n = 1, 2, \ldots$). If $\lim_{n \to \infty} \psi_n(t) = \psi^*(t)$ for all values
of $t$ in some interval around the point $t = 0$, then the sequence $X_1, X_2, \ldots$ converges
in distribution to $X^*$.

In other words, the sequence of c.d.f.'s $F_1, F_2, \ldots$ must converge to the c.d.f. $F^*$
if the corresponding sequence of m.g.f.'s $\psi_1, \psi_2, \ldots$ converges to the m.g.f. $\psi^*$.


Outline of the Proof of Theorem 6.3.1 We are now ready to outline a proof of
Theorem 6.3.1, which is the central limit theorem of Lindeberg and Lévy. We shall assume
that the variables $X_1, \ldots, X_n$ form a random sample of size $n$ from a distribution
with mean $\mu$ and variance $\sigma^2$. We shall also assume, for convenience, that the m.g.f.
of this distribution exists, although the central limit theorem is true even without this
assumption.

For $i = 1, \ldots, n$, let $Y_i = (X_i - \mu)/\sigma$. Then the random variables $Y_1, \ldots, Y_n$ are
i.i.d., and each has mean 0 and variance 1. Furthermore, let

$$Z_n = \frac{n^{1/2}\left(\overline{X}_n - \mu\right)}{\sigma} = \frac{1}{n^{1/2}} \sum_{i=1}^{n} Y_i.$$

We shall show that $Z_n$ converges in distribution to a random variable having the
standard normal distribution, as indicated in Eq. (6.3.1), by showing that the m.g.f.
of $Z_n$ converges to the m.g.f. of the standard normal distribution.

If $\psi(t)$ denotes the m.g.f. of each random variable $Y_i$ ($i = 1, \ldots, n$), then it follows
from Theorem 4.4.4 that the m.g.f. of the sum $\sum_{i=1}^{n} Y_i$ will be $[\psi(t)]^n$. Also, it follows
from Theorem 4.4.3 that the m.g.f. $\zeta_n(t)$ of $Z_n$ will be

$$\zeta_n(t) = \left[\psi\left(\frac{t}{n^{1/2}}\right)\right]^n.$$


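In this problem, $\psi'(0) = E(Y_i) = 0$ and $\psi''(0) = E(Y_i^2) = 1$. Therefore, the Taylor
series expansion of $\psi(t)$ about the point $t = 0$ has the following form:

$$\psi(t) = \psi(0) + t\psi'(0) + \frac{t^2}{2!}\psi''(0) + \frac{t^3}{3!}\psi'''(0) + \cdots = 1 + \frac{t^2}{2} + \frac{t^3}{3!}\psi'''(0) + \cdots.$$

Also,

$$\zeta_n(t) = \left[1 + \frac{t^2}{2n} + \frac{t^3 \psi'''(0)}{3!\, n^{3/2}} + \cdots\right]^n. \tag{6.3.15}$$

Apply Theorem 5.3.3 with $1 + a_n/n$ equal to the expression inside brackets in (6.3.15)
and $c_n = n$. Since

$$\lim_{n \to \infty} \left[\frac{t^2}{2} + \frac{t^3 \psi'''(0)}{3!\, n^{1/2}} + \cdots\right] = \frac{t^2}{2},$$

it follows that

$$\lim_{n \to \infty} \zeta_n(t) = \exp\left(\frac{1}{2} t^2\right). \tag{6.3.16}$$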


Since the right side of Eq. (6.3.16) is the m.g.f. of the standard normal distribution, 
it follows from Theorem 6.3.5 that the asymptotic distribution of Z,, must be the 
standard normal distribution. 
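The convergence in Eq. (6.3.16) can be observed numerically in a simple special case. The Python sketch below assumes, purely for illustration, that each $Y_i$ takes the values $+1$ and $-1$ with probability 1/2 each, so that $Y_i$ has mean 0, variance 1, and m.g.f. $\psi(t) = \cosh(t)$. This distribution and the function name `zeta_n` are assumptions made here, not part of the outline above.

```python
import math

def zeta_n(t, n):
    """m.g.f. of Z_n = n^(-1/2) * (Y_1 + ... + Y_n) when psi(t) = cosh(t)."""
    return math.cosh(t / math.sqrt(n)) ** n

t = 1.0
limit = math.exp(t * t / 2.0)     # m.g.f. of the standard normal distribution at t
for n in (1, 10, 100, 10000):
    print(n, zeta_n(t, n), limit)
```

For this choice of distribution the printed values increase toward $\exp(t^2/2)$ as $n$ grows.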

An outline of the proof of the central limit theorem of Liapounov can also be 
given by proceeding along similar lines, but we shall not consider this problem further 
here. 




Summary 


Two versions of the central limit theorem were given. They conclude that the
distribution of the average of a large number of independent random variables is close
to a normal distribution. One theorem requires that the random variables all have
the same distribution with finite variance. The other theorem does not require that
the random variables be identically distributed, but instead requires that their third
moments exist and satisfy condition (6.3.9). The delta method lets us find the
approximate distribution of a smooth function of a sample average.
Exercises


1. Each minute a machine produces a length of rope with 
mean of 4 feet and standard deviation of 5 inches. Assum- 
ing that the amounts produced in different minutes are 
independent and identically distributed, approximate the 
probability that the machine will produce at least 250 feet 
in one hour. 


2. Suppose that 75 percent of the people in a certain me- 
tropolitan area live in the city and 25 percent of the people 
live in the suburbs. If 1200 people attending a certain con- 
cert represent a random sample from the metropolitan 
area, what is the probability that the number of people 
from the suburbs attending the concert will be fewer than 
270? 


3. Suppose that the distribution of the number of defects 
on any given bolt of cloth is the Poisson distribution with 
mean 5, and the number of defects on each bolt is counted 
for a random sample of 125 bolts. Determine the proba- 
bility that the average number of defects per bolt in the 
sample will be less than 5.5. 


4. Suppose that a random sample of size $n$ is to be taken
from a distribution for which the mean is $\mu$ and the
standard deviation is 3. Use the central limit theorem to
determine approximately the smallest value of $n$ for which
the following relation will be satisfied:

$$\Pr\left(|\overline{X}_n - \mu| < 0.3\right) \ge 0.95.$$


5. Suppose that the proportion of defective items in a 
large manufactured lot is 0.1. What is the smallest random 
sample of items that must be taken from the lot in order 
for the probability to be at least 0.99 that the proportion 
of defective items in the sample will be less than 0.13? 


6. Suppose that three girls A, B, and C throw snowballs at 
a target. Suppose also that girl A throws 10 times, and the 
probability that she will hit the target on any given throw is 
0.3; girl B throws 15 times, and the probability that she will 
hit the target on any given throw is 0.2; and girl C throws 
20 times, and the probability that she will hit the target on 
any given throw is 0.1. Determine the probability that the 
target will be hit at least 12 times. 


7. Suppose that 16 digits are chosen at random with re- 
placement from the set {0, .. . , 9}. What is the probability 
that their average will lie between 4 and 6? 


8. Suppose that people attending a party pour drinks from 
a bottle containing 63 ounces of a certain liquid. Suppose 
also that the expected size of each drink is 2 ounces, that 
the standard deviation of each drink is 1/2 ounce, and 
that all drinks are poured independently. Determine the 
probability that the bottle will not be empty after 36 drinks 
have been poured. 


9. A physicist makes 25 independent measurements of
the specific gravity of a certain body. He knows that the
limitations of his equipment are such that the standard
deviation of each measurement is $\sigma$ units.

a. By using the Chebyshev inequality, find a lower
bound for the probability that the average of his
measurements will differ from the actual specific gravity
of the body by less than $\sigma/4$ units.

b. By using the central limit theorem, find an
approximate value for the probability in part (a).


10. A random sample of $n$ items is to be taken from a
distribution with mean $\mu$ and standard deviation $\sigma$.

a. Use the Chebyshev inequality to determine the
smallest number of items $n$ that must be taken in
order to satisfy the following relation:

$$\Pr\left(|\overline{X}_n - \mu| \le \frac{\sigma}{4}\right) \ge 0.99.$$

b. Use the central limit theorem to determine the
smallest number of items $n$ that must be taken in order to
satisfy the relation in part (a) approximately.


11. Suppose that, on the average, 1/3 of the graduating 
seniors at a certain college have two parents attend the 
graduation ceremony, another third of these seniors have 
one parent attend the ceremony, and the remaining third 
of these seniors have no parents attend. If there are 600 
graduating seniors in a particular class, what is the proba- 
bility that not more than 650 parents will attend the grad- 
uation ceremony? 


12. Let $X_n$ be a random variable having the binomial
distribution with parameters $n$ and $p_n$. Assume that
$\lim_{n \to \infty} n p_n = \lambda$. Prove that the m.g.f. of $X_n$ converges
to the m.g.f. of the Poisson distribution with mean $\lambda$.


13. Suppose that $X_1, \ldots, X_n$ form a random sample from
a normal distribution with unknown mean $\theta$ and variance
$\sigma^2$. Assuming that $\theta \ne 0$, determine the asymptotic
distribution of $\overline{X}_n^3$.


14. Suppose that $X_1, \ldots, X_n$ form a random sample from
a normal distribution with mean 0 and unknown variance
$\sigma^2$.

a. Determine the asymptotic distribution of the statistic
$\left(\frac{1}{n} \sum_{i=1}^{n} X_i^2\right)^{-1}$.

b. Find a variance stabilizing transformation for the
statistic $\frac{1}{n} \sum_{i=1}^{n} X_i^2$.


15. Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random
variables each having the uniform distribution on the interval
$[0, \theta]$ for some real number $\theta > 0$. For each $n$, define $Y_n$ to
be the maximum of $X_1, \ldots, X_n$.

a. Show that the c.d.f. of $Y_n$ is

$$F(y) = \begin{cases} 0 & \text{if } y \le 0, \\ (y/\theta)^n & \text{if } 0 < y < \theta, \\ 1 & \text{if } y \ge \theta. \end{cases}$$

Hint: Read Example 3.9.6.

b. Show that $Z_n = n(Y_n - \theta)$ converges in distribution
to the distribution with c.d.f.

$$F^*(z) = \begin{cases} \exp(z/\theta) & \text{if } z < 0, \\ 1 & \text{if } z \ge 0. \end{cases}$$

Hint: Apply Theorem 5.3.3 after finding the c.d.f. of $Z_n$.

c. Use Theorem 6.3.2 to find the approximate
distribution of $Y_n^2$ when $n$ is large.


Figure 6.4 Comparison of binomial and normal c.d.f.'s.


6.4 The Correction for Continuity 


Some applications of the central limit theorem allow us to approximate the
probability that a discrete random variable $X$ lies in an interval $[a, b]$ by the probability
that a normal random variable lies in that interval. The approximation can be
improved slightly by being careful about how we approximate $\Pr(X = a)$ and
$\Pr(X = b)$.


Approximating a Discrete Distribution by a Continuous Distribution 


Example 6.4.1 A Large Sample. In Example 6.3.1, we illustrated how the normal distribution with
mean 50 and variance 25 could approximate the distribution of a random variable $X$
that has the binomial distribution with parameters 100 and 0.5. In particular, if $Y$ has
the normal distribution with mean 50 and variance 25, we know that $\Pr(Y \le x)$ is close
to $\Pr(X \le x)$ for all $x$. But the approximation has some systematic errors. Figure 6.4
shows the two c.d.f.'s over the range $30 \le x \le 70$. The two c.d.f.'s are very close at
$x = n + 0.5$ for each integer $n$. But for each integer $n$, $\Pr(Y \le x) < \Pr(X \le x)$ for $x$ a
little above $n$ and $\Pr(Y \le x) > \Pr(X \le x)$ for $x$ a little below. We ought to be able to
make use of these systematic discrepancies in order to improve the approximation.
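The systematic discrepancies described in this example can be tabulated directly. The following Python sketch compares the binomial c.d.f. with parameters 100 and 0.5 to the normal c.d.f. with mean 50 and variance 25 at a few points near the integer 50. The helper names `normal_cdf` and `binomial_cdf` and the particular evaluation points are assumptions chosen for illustration.

```python
import math

def normal_cdf(x, mean, sd):
    """c.d.f. of the normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def binomial_cdf(x, n, p):
    """Pr(X <= x) for X binomial with parameters n and p; x may be any real number."""
    top = min(math.floor(x), n)
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(0, top + 1))

n, p = 100, 0.5
mean, sd = n * p, math.sqrt(n * p * (1 - p))      # 50 and 5
for x in (49.9, 50.0, 50.1, 50.5):
    print(x, binomial_cdf(x, n, p), normal_cdf(x, mean, sd))
```

At $x = 50.1$ the normal value falls below the binomial value, at $x = 49.9$ it lies above, and at $x = 50.5$ the two nearly coincide, which matches the pattern described for Fig. 6.4.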


Suppose that X has a discrete distribution that can be approximated by a normal 
distribution, such as in Example 6.4.1. In this section, we shall describe a standard 
method for improving the quality of such an approximation based on the systematic 
discrepancies that were noted at the end of Example 6.4.1. 

Let $f(x)$ be the p.f. of the discrete random variable $X$, and suppose that we wish
to approximate the distribution of $X$ by a continuous distribution with p.d.f. $g(x)$. To
aid the discussion, let $Y$ be a random variable with p.d.f. $g$. Also, for simplicity, we
shall assume that all of the possible values of $X$ are integers. This condition is
satisfied for the binomial, hypergeometric, Poisson, and negative binomial distributions
described in this text.

Figure 6.5 Approximating a bar chart by using a p.d.f.

If the distribution of $Y$ provides a good approximation to the distribution of $X$,
then for all integers $a$ and $b$, we can approximate the discrete probability

$$\Pr(a \le X \le b) = \sum_{x=a}^{b} f(x) \tag{6.4.1}$$

by the continuous probability

$$\Pr(a \le Y \le b) = \int_{a}^{b} g(x)\, dx. \tag{6.4.2}$$

Indeed, this approximation was used in Examples 6.3.2 and 6.3.9, where $g(x)$ was the
appropriate normal p.d.f. derived from the central limit theorem.

This simple approximation has the following shortcoming: Although $\Pr(X \ge a)$
and $\Pr(X > a)$ will typically have different values for the discrete distribution of
$X$, $\Pr(Y \ge a) = \Pr(Y > a)$ because $Y$ has a continuous distribution. Another way of
expressing this shortcoming is as follows: Although $\Pr(X = x) > 0$ for each integer $x$
that is a possible value of $X$, $\Pr(Y = x) = 0$ for all $x$.


Approximating a Bar Chart 


The p.f. $f(x)$ of a discrete random variable $X$ can be represented by a bar chart, as
sketched in Fig. 6.5. For each integer $x$, the probability of $\{X = x\}$ is represented
by the area of a rectangle with a base that extends from $x - \frac{1}{2}$ to $x + \frac{1}{2}$ and with a
height $f(x)$. Thus, the area of the rectangle for which the center of the base is at the
integer $x$ is simply $f(x)$. An approximating p.d.f. $g(x)$ is also sketched in Fig. 6.5. A
bar chart with areas of bars proportional to probabilities is analogous to a histogram
(see page 165) with areas of bars proportional to proportions of a sample.

From this point of view, it can be seen that $\Pr(a \le X \le b)$, as specified in
Eq. (6.4.1), is the sum of the areas of the rectangles in Fig. 6.5 that are centered
at $a, a + 1, \ldots, b$. It can also be seen from Fig. 6.5 that the sum of these areas is
approximated by the integral

$$\Pr\left(a - \frac{1}{2} < Y < b + \frac{1}{2}\right) = \int_{a-(1/2)}^{b+(1/2)} g(x)\, dx. \tag{6.4.3}$$

The adjustment from the integral in (6.4.2) to the integral in (6.4.3) is called the
correction for continuity.

Figure 6.6 Comparison of binomial c.d.f. with normal c.d.f. shifted to the right and
to the left by 0.5.


Example 6.4.2 A Large Sample. At the end of Example 6.4.1, we found that when $x$ is a little above
an integer, the approximating probability $\Pr(Y \le x)$ is a bit smaller than the actual
probability $\Pr(X \le x)$. The correction for continuity shifts the c.d.f. of $Y$ to the left
by 0.5 when we want to compute $\Pr(Y \le x)$ for $x$ a little above an integer. This shift
replaces $\Pr(Y \le x)$ by $\Pr(Y \le x + 0.5)$, which is larger and usually closer to $\Pr(X \le x)$.
Similarly, when we want to compute $\Pr(Y \le x)$ when $x$ is a little below an integer,
the correction for continuity shifts the c.d.f. of $Y$ to the right by 0.5, which replaces
$\Pr(Y \le x)$ by $\Pr(Y \le x - 0.5)$. Figure 6.6 illustrates both of these shifts and shows how
they each approximate the actual binomial c.d.f. better than the unshifted normal
c.d.f. in Fig. 6.4.


If we use the correction for continuity, we find that the probability $f(a)$ of the
single integer $a$ can be approximated as follows:

$$\Pr(X = a) = \Pr\left(a - \frac{1}{2} \le X \le a + \frac{1}{2}\right) \approx \int_{a-(1/2)}^{a+(1/2)} g(x)\, dx. \tag{6.4.4}$$

Similarly,

$$\Pr(X > a) = \Pr(X \ge a + 1) = \Pr\left(X \ge a + \frac{1}{2}\right) \approx \int_{a+(1/2)}^{\infty} g(x)\, dx. \tag{6.4.5}$$
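When $g$ is a normal p.d.f., the integrals in (6.4.3) through (6.4.5) reduce to differences of normal c.d.f. values. The Python sketch below implements them under that assumption; the function names `corrected_interval` and `corrected_upper_tail`, and the binomial distribution with parameters 20 and 1/2 used to try them out, are illustrative choices rather than part of the text.

```python
import math

def normal_cdf(z):
    """Standard normal c.d.f., written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def corrected_interval(a, b, mean, sd):
    """Approximate Pr(a <= X <= b) for integers a <= b, as in Eq. (6.4.3)."""
    return normal_cdf((b + 0.5 - mean) / sd) - normal_cdf((a - 0.5 - mean) / sd)

def corrected_upper_tail(a, mean, sd):
    """Approximate Pr(X > a) for an integer a, as in Eq. (6.4.5)."""
    return 1.0 - normal_cdf((a + 0.5 - mean) / sd)

# Illustration: X binomial with parameters n = 20 and p = 1/2.
n, p = 20, 0.5
mean, sd = n * p, math.sqrt(n * p * (1 - p))

approx_point = corrected_interval(10, 10, mean, sd)   # Eq. (6.4.4) with a = 10
exact_point = math.comb(n, 10) * p ** n
approx_tail = corrected_upper_tail(12, mean, sd)      # Pr(X > 12)
exact_tail = sum(math.comb(n, k) for k in range(13, n + 1)) * p ** n
print(approx_point, exact_point, approx_tail, exact_tail)
```

Calling `corrected_interval(a, a, ...)` with equal endpoints gives the single-point approximation of Eq. (6.4.4), so one helper covers both (6.4.3) and (6.4.4).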


Example 6.4.3 Examination Questions. To illustrate the use of the correction for continuity, we shall
again consider Example 6.3.9. In that example, an examination contains 99 questions
of varying difficulty and it is desired to determine $\Pr(X \ge 60)$, where $X$ denotes the
total number of questions that a particular student answers correctly. Then, under the
conditions of the example, it is found from the central limit theorem that the discrete
distribution of $X$ could be approximated by the normal distribution with mean 49.5
and standard deviation 4.08. Let $Z = (X - 49.5)/4.08$.

If we use the correction for continuity, we obtain

$$\Pr(X \ge 60) = \Pr(X \ge 59.5) = \Pr\left(Z \ge \frac{59.5 - 49.5}{4.08}\right) \approx 1 - \Phi(2.4510) = 0.007.$$

This value is somewhat larger than the value 0.005, which was obtained in Sec. 6.3,
without the correction.


Example 6.4.4 Coin Tossing. Suppose that a fair coin is tossed 20 times and that all tosses are
independent. What is the probability of obtaining exactly 10 heads?

Let $X$ denote the total number of heads obtained in the 20 tosses. According
to the central limit theorem, the distribution of $X$ will be approximately the normal
distribution with mean 10 and standard deviation $[(20)(1/2)(1/2)]^{1/2} = 2.236$. If we
use the correction for continuity,

$$\Pr(X = 10) = \Pr(9.5 \le X \le 10.5) = \Pr\left(-\frac{0.5}{2.236} \le Z \le \frac{0.5}{2.236}\right) \approx \Phi(0.2236) - \Phi(-0.2236) = 0.177.$$

The exact value of $\Pr(X = 10)$ found from the table of binomial probabilities
given at the back of this book is 0.1762. Thus, the normal approximation with the
correction for continuity is quite good.


Summary


Let $X$ be a random variable that takes only integer values. Suppose that $X$ has
approximately the normal distribution with mean $\mu$ and variance $\sigma^2$. Let $a$ and $b$ be
integers, and suppose that we wish to approximate $\Pr(a \le X \le b)$. The correction to
the normal distribution approximation for continuity is to use $\Phi([b + 1/2 - \mu]/\sigma) -
\Phi([a - 1/2 - \mu]/\sigma)$ rather than $\Phi([b - \mu]/\sigma) - \Phi([a - \mu]/\sigma)$ as the approximation.


Exercises


1. Let $X_1, \ldots, X_{39}$ be independent random variables
each having a discrete distribution with p.f.

$$f(x) = \begin{cases} 1/4 & \text{if } x = 0 \text{ or } 2, \\ 1/2 & \text{if } x = 1, \\ 0 & \text{otherwise.} \end{cases}$$

Use the central limit theorem and the correction for
continuity to approximate the probability that $X_1 + \cdots + X_{39}$
is at most 33.


2. Let $X$ denote the total number of successes in 15
Bernoulli trials, with probability of success $p = 0.3$ on each
trial.

a. Determine approximately the value of $\Pr(X = 4)$ by
using the central limit theorem with the correction
for continuity.

b. Compare the answer obtained in part (a) with the
exact value of this probability.


3. Using the correction for continuity, determine the
probability required in Example 6.3.2.


4. Using the correction for continuity, determine the
probability required in Exercise 2 of Sec. 6.3.


5. Using the correction for continuity, determine the
probability required in Exercise 3 of Sec. 6.3.


6. Using the correction for continuity, determine the
probability required in Exercise 6 of Sec. 6.3.


7. Using the correction for continuity, determine the
probability required in Exercise 7 of Sec. 6.3.


6.5 Supplementary Exercises 


1. Suppose that a pair of balanced dice are rolled 120 
times, and let X denote the number of rolls on which the 
sum of the two numbers is 7. Use the central limit theorem 
to determine a value of k such that Pr(|X — 20| < k) is 
approximately 0.95. 


2. Suppose that $X$ has a Poisson distribution with a very
large mean $\lambda$. Explain why the distribution of $X$ can be
approximated by the normal distribution with mean $\lambda$
and variance $\lambda$. In other words, explain why $(X - \lambda)/\lambda^{1/2}$
converges in distribution, as $\lambda \to \infty$, to a random variable
having the standard normal distribution.


3. Suppose that $X$ has the Poisson distribution with mean
10. Use the central limit theorem, both without and with
the correction for continuity, to determine an approximate
value for $\Pr(8 \le X \le 12)$. Use the table of Poisson
probabilities given in the back of this book to assess the quality
of these approximations.


4. Suppose that $X$ is a random variable such that $E(X^k)$
exists and $\Pr(X \ge 0) = 1$. Prove that for $k > 0$ and $t > 0$,

$$\Pr(X \ge t) \le \frac{E(X^k)}{t^k}.$$


5. Suppose that $X_1, \ldots, X_n$ form a random sample from
the Bernoulli distribution with parameter $p$. Let $\overline{X}_n$ be
the sample average. Find a variance stabilizing
transformation for $\overline{X}_n$. Hint: When trying to find the integral of
$(p[1 - p])^{-1/2}$, make the substitution $z = \sqrt{p}$ and then
think about arcsin, the inverse of the sin function.


6. Suppose that $X_1, \ldots, X_n$ form a random sample from
the exponential distribution with mean $\theta$. Let $\overline{X}_n$ be the
sample average. Find a variance stabilizing transformation
for $\overline{X}_n$.


7. Suppose that $X_1, X_2, \ldots$ is a sequence of positive
integer-valued random variables. Suppose that there is a
function $f$ such that for every $m = 1, 2, \ldots$, $\lim_{n \to \infty} \Pr(X_n =
m) = f(m)$, $\sum_{m=1}^{\infty} f(m) = 1$, and $f(x) = 0$ for every $x$ that
is not a positive integer. Let $F$ be the discrete c.d.f. whose
p.f. is $f$. Prove that $X_n$ converges in distribution to $F$.


8. Let $\{p_n\}_{n=1}^{\infty}$ be a sequence of numbers such that $0 <
p_n < 1$ for all $n$. Assume that $\lim_{n \to \infty} p_n = p$ with $0 < p <
1$. Let $X_n$ have the binomial distribution with parameters
$k$ and $p_n$ for some positive integer $k$. Prove that $X_n$
converges in distribution to the binomial distribution with
parameters $k$ and $p$.


9. Suppose that the number of minutes required to serve a 
customer at the checkout counter of a supermarket has an 
exponential distribution for which the mean is 3. Using the 
central limit theorem, approximate the probability that 
the total time required to serve a random sample of 16 
customers will exceed one hour. 


10. Suppose that we model the occurrence of defects on a
fabric manufacturing line as a Poisson process with rate
0.01 per square foot. Use the central limit theorem (both
with and without the correction for continuity) to
approximate the probability that one would find at least 15 defects
in 2000 square feet of fabric.


11. Let X have the gamma distribution with parameters 
n and 3, where n is a large integer. 


a. Explain why one can use the central limit theorem 
to approximate the distribution of X by a normal 
distribution. 


b. Which normal distribution approximates the distri- 
bution of X? 


12. Let X have the negative binomial distribution with 
parameters n and 0.2, where n is a large integer. 
a. Explain why one can use the central limit theorem 
to approximate the distribution of X by a normal 
distribution. 


b. Which normal distribution approximates the distri- 
bution of X? 
7.1 Statistical Inference


Recall our various clinical trial examples. What would we say is the probability that 
a future patient will respond successfully to treatment after we observe the results 
from a collection of other patients? This is the kind of question that statistical 
inference is designed to address. In general, statistical inference consists of making 
probabilistic statements about unknown quantities. For example, we can compute 
means, variances, quantiles, probabilities, and some other quantities yet to be 
introduced concerning unobserved random variables and unknown parameters 
of distributions. Our goal will be to say what we have learned about the unknown 
quantities after observing some data that we believe contain relevant information. 
Here are some other examples of questions that statistical inference can try to 
answer. What can we say about whether a machine is functioning properly after we 
observe some of its output? In a civil lawsuit, what can we say about whether there 
was discrimination after observing how different ethnic groups were treated? The 
methods of statistical inference, which we shall develop to address these questions, 
are built upon the theory of probability covered in the earlier chapters of this text. 


Probability and Statistical Models 


In the earlier chapters of this book, we discussed the theory and methods of probabil- 
ity. As new concepts in probability were introduced, we also introduced examples of 
the use of these concepts in problems that we shall now recognize as statistical infer- 
ence. Before discussing statistical inference formally, it is useful to remind ourselves 
of those probability concepts that will underlie inference. 


Example 7.1.1 Lifetimes of Electronic Components. A company sells electronic components and they
are interested in knowing as much as they can about how long each component is
likely to last. They can collect data on components that have been used under typical
conditions. They choose to use the family of exponential distributions to model the
length of time (in years) from when a component is put into service until it fails.
They would like to model the components as all having the same failure rate $\theta$,
but there is uncertainty about the specific numerical value of $\theta$. To be more precise,
let $X_1, X_2, \ldots$ stand for a sequence of component lifetimes in years. The company
believes that if they knew the failure rate $\theta$, then $X_1, X_2, \ldots$ would be i.i.d. random
variables having the exponential distribution with parameter $\theta$. (See Sec. 5.7 for the
definition of exponential distributions. We are using the symbol $\theta$ for the parameter
of our exponential distributions rather than $\beta$ to match the rest of the notation in this
chapter.) Suppose that the data that the company will observe consist of the values
of $X_1, \ldots, X_m$ but that they are still interested in $X_{m+1}, X_{m+2}, \ldots$. They are also
interested in $\theta$ because it is related to the average lifetime. As we saw in Eq. (5.7.17),
the mean of an exponential random variable with parameter $\theta$ is $1/\theta$, which is why
the company thinks of $\theta$ as the failure rate.

We imagine an experiment whose outcomes are sequences of lifetimes as
described above. As mentioned already, if we knew the value $\theta$, then $X_1, X_2, \ldots$ would
be i.i.d. random variables. In this case, the law of large numbers (Theorem 6.2.4) says
that the average $\frac{1}{n} \sum_{i=1}^{n} X_i$ converges in probability to the mean $1/\theta$. And
Theorem 6.2.5 says that $n / \sum_{i=1}^{n} X_i$ converges in probability to $\theta$. Because $\theta$ is a function
of the sequence of lifetimes that constitute each experimental outcome, it can be
treated as a random variable. Suppose that, before observing the data, the
company believes that the failure rate is probably around 0.5/year but there is quite a
bit of uncertainty about it. They model $\theta$ as a random variable having the gamma
distribution with parameters 1 and 2. To rephrase what was stated earlier, they also
model $X_1, X_2, \ldots$ as conditionally i.i.d. exponential random variables with
parameter $\theta$ given $\theta$. They hope to learn more about $\theta$ from examining the sample data
$X_1, \ldots, X_m$. They can never learn $\theta$ precisely, because that would require
observing the entire infinite sequence $X_1, X_2, \ldots$. For this reason, $\theta$ is only hypothetically
observable.
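The model in this example can be simulated to see why $\theta$ is only hypothetically observable. The Python sketch below draws one value of $\theta$ from the gamma distribution and then a long run of lifetimes given that value. It assumes that "parameters 1 and 2" means shape 1 and rate 2, which is consistent with a prior mean of 0.5 per year; the seed, the sample size, and the variable names are illustrative choices.

```python
import random

rng = random.Random(0)            # fixed seed so that the run is repeatable

# Prior: gamma distribution with parameters 1 and 2 (assumed shape 1, rate 2).
# random.gammavariate takes a scale parameter, so rate 2 becomes scale 1/2.
theta = rng.gammavariate(1.0, 0.5)

# Given theta, the lifetimes are i.i.d. exponential with parameter (rate) theta.
n = 100_000
lifetimes = [rng.expovariate(theta) for _ in range(n)]

sample_mean = sum(lifetimes) / n      # should be near 1/theta
rate_estimate = n / sum(lifetimes)    # should be near theta
print(theta, rate_estimate, 1.0 / theta, sample_mean)
```

With any finite number of lifetimes, `rate_estimate` is only close to `theta`, not equal to it; recovering $\theta$ exactly would require the entire infinite sequence.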


Example 7.1.1 illustrates several features that will be common to most statistical 
inference problems and which constitute what we call a statistical model. 


Definition 7.1.1 Statistical Model. A statistical model consists of an identification of random variables
of interest (both observable and only hypothetically observable), a specification of a
joint distribution or a family of possible joint distributions for the observable random
variables, the identification of any parameters of those distributions that are assumed
unknown and possibly hypothetically observable, and (if desired) a specification for
a (joint) distribution for the unknown parameter(s). When we treat the unknown
parameter(s) $\theta$ as random, then the joint distribution of the observable random
variables indexed by $\theta$ is understood as the conditional distribution of the observable
random variables given $\theta$.


In Example 7.1.1, the observable random variables of interest form the sequence
$X_1, X_2, \ldots$, while the failure rate $\theta$ is hypothetically observable. The family of
possible joint distributions of $X_1, X_2, \ldots$ is indexed by the parameter $\theta$. The joint
distribution of the observables corresponding to the value $\theta$ is that $X_1, X_2, \ldots$ are
i.i.d. random variables each having the exponential distribution with parameter $\theta$.
This is also the conditional distribution of $X_1, X_2, \ldots$ given $\theta$ because we are treating
$\theta$ as a random variable. The distribution of $\theta$ is the gamma distribution with parameters
1 and 2.


Note: Redefining Old Ideas. The reader will notice that a statistical model is nothing 
more than a formal identification of many features that we have been using in various 
examples throughout the earlier chapters of this book. Some examples need only 
a few of the features that make up a complete specification of a statistical model, 
while other examples use the complete specification. In Sections 7.1-7.4, we shall
introduce a considerable amount of terminology, most of which is mere formalization
of concepts that have been introduced and used in several places earlier in the book. 
The purpose of all of this formalism is to help us to keep the concepts organized so 
that we can tell when we are applying the same ideas in new ways and when we are 
introducing new ideas. 

We are now ready formally to introduce statistical inference. 


Statistical Inference. A statistical inference is a procedure that produces a probabilistic 
statement about some or all parts of a statistical model. 


By a “probabilistic statement” we mean a statement that makes use of any of the 
concepts of probability theory that were discussed earlier in the text or are yet to 
be discussed later in the text. Some examples include a mean, a conditional mean, a 
quantile, a variance, a conditional distribution for a random variable given another, 
the probability of an event, a conditional probability of an event given something, 
and so on. In Example 7.1.1, here are some examples of statistical inferences that 
one might wish to make: 


• Produce a random variable Y (a function of X₁, ..., Xₘ) such that Pr(Y ≥ θ | θ) = 0.9. 
• Produce a random variable Y that we expect to be close to θ. 
• Compute how likely it is that the average of the next 10 lifetimes, 
  (1/10)(Xₘ₊₁ + ... + Xₘ₊₁₀), is at least 2. 
• Say something about how confident we are that θ ≤ 0.4 after observing X₁, ..., Xₘ. 


All of these types of inference and others will be discussed in more detail later in this 
book. 
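
As a rough illustration of the third inference in the list, the probability that the average of 10 lifetimes is at least 2 can be approximated by simulation. This sketch assumes that no lifetimes have been observed yet (m = 0), so that θ is drawn from its gamma distribution with shape 1 and rate 2; conditioning on observed data requires the posterior distribution introduced in Sec. 7.2. The function name `prob_average_at_least` and the number of draws are hypothetical.

```python
import random

def prob_average_at_least(threshold, k, draws, rng):
    """Monte Carlo estimate of Pr(average of k lifetimes >= threshold), with no data observed."""
    hits = 0
    for _ in range(draws):
        theta = rng.gammavariate(1.0, 0.5)  # shape 1, scale 1/2 (that is, rate 2)
        avg = sum(rng.expovariate(theta) for _ in range(k)) / k
        hits += avg >= threshold            # True counts as 1
    return hits / draws

rng = random.Random(1)  # arbitrary seed
estimate = prob_average_at_least(2.0, 10, 20_000, rng)
print(estimate)
```

The estimate averages over the uncertainty about θ as well as over the lifetimes themselves, which is why θ is redrawn on every pass through the loop.
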

In Definition 7.1.1, we distinguished between observable and hypothetically ob- 
servable random variables. We reserved the name observable for a random variable 
that we are essentially certain that we could observe if we devoted the necessary ef- 
fort to observe it. The name hypothetically observable was used for a random variable 
that would require infinite resources to observe, such as the limit (as n → ∞) of the 
sample averages of the first n observables. In this text, such hypothetically observ- 
able random variables will correspond to the parameters of the joint distribution of 
the observables as in Example 7.1.1. Because these parameters figure so prominently 
in many of the types of inference problems that we will see, it pays to formalize the 
concept of parameter. 


Parameter/Parameter space. In a problem of statistical inference, a characteristic or 
combination of characteristics that determine the joint distribution for the random 
variables of interest is called a parameter of the distribution. The set Ω of all possible 
values of a parameter θ or of a vector of parameters (θ₁, ..., θₖ) is called the 
parameter space. 


All of the families of distributions introduced earlier (and to be introduced later) 
in this book have parameters that are included in the names of the individual mem- 
bers of the family. For example, the family of binomial distributions has parameters 
that we called n and p, the family of normal distributions is parameterized by the 
mean μ and variance σ² of each distribution, the family of uniform distributions on 
intervals is parameterized by the endpoints of the intervals, the family of exponential 
distributions is parameterized by the rate parameter θ, and so on. 

In Example 7.1.1, the parameter θ (the failure rate) must be positive. Therefore, 
unless certain positive values of θ can be explicitly ruled out as possible values of θ, 
the parameter space Ω will be the set of all positive numbers. As another example, 
suppose that the distribution of the heights of the individuals in a certain population 
is assumed to be the normal distribution with mean μ and variance σ², but that the 
exact values of μ and σ² are unknown. The mean μ and the variance σ² determine 
the particular normal distribution for the heights of individuals. So (μ, σ²) can be 
considered a pair of parameters. In this example of heights, both μ and σ² must be 
positive. Therefore, the parameter space Ω can be taken as the set of all pairs (μ, σ²) 
such that μ > 0 and σ² > 0. If the normal distribution in this example represents the 
distribution of the heights in inches of the individuals in some particular population, 
we might be certain that 30 ≤ μ ≤ 100 and σ² ≤ 50. In this case, the parameter space Ω 
could be taken as the smaller set of all pairs (μ, σ²) such that 30 ≤ μ ≤ 100 and 
0 < σ² ≤ 50. 

The important feature of the parameter space Ω is that it must contain all possible 
values of the parameters in a given problem, in order that we can be certain that the 
actual value of the vector of parameters is a point in Ω. 


A Clinical Trial. Suppose that 40 patients are going to be given a treatment for a 
condition and that we will observe for each patient whether or not they recover from 
the condition. We are most likely also interested in a large collection of additional 
patients besides the 40 to be observed. To be specific, for each patient i = 1, 2, ..., 
let Xᵢ = 1 if patient i recovers, and let Xᵢ = 0 if not. As a collection of possible 
distributions for X₁, X₂, ..., we could choose to say that the Xᵢ are i.i.d. having 
the Bernoulli distribution with parameter p for 0 ≤ p ≤ 1. In this case, the parameter 
p is known to lie in the closed interval [0, 1], and this interval could be taken as the 
parameter space. Notice also that the law of large numbers (Theorem 6.2.4) says that 
p is the limit as n goes to infinity of the proportion of the first n patients who recover. ◄ 


In most problems, there is a natural interpretation for the parameter as a feature 
of the possible distributions of our data. In Example 7.1.2, the parameter p has a 
natural interpretation as the proportion out of a large population of patients given 
the treatment who recover from the condition. In Example 7.1.1, the parameter θ 
has a natural interpretation as a failure rate, that is, one over the average lifetime 
of a large population of lifetimes. In such cases, inference about parameters can 
be interpreted as inference about the feature that the parameter represents. In 
this text, all parameters will have such natural interpretations. In examples that 
one encounters outside of an introductory course, interpretations may not be as 
straightforward. 


Examples of Statistical Inference 


Here are some of the examples of statistical models and inferences that were intro- 
duced earlier in the text. 


A Clinical Trial. The clinical trial introduced in Example 2.1.4 was concerned with 
how likely patients are to avoid relapse while under various treatments. For each i, 
let Xᵢ = 1 if patient i in the imipramine group avoids relapse and Xᵢ = 0 otherwise. 
Let P stand for the proportion of patients who avoid relapse out of a large group 
receiving imipramine treatment. If P is unknown, we can model X₁, X₂, ... as i.i.d. 
Bernoulli random variables with parameter p conditional on P = p. The patients in 
the imipramine column of Table 2.1 should provide us with some information that 
changes our uncertainty about P. A statistical inference would consist of making 
a probability statement about the data and/or P, and what the data and P tell us 
about each other. For instance, in Example 4.7.8, we assumed that P had the uniform 
distribution on the interval [0, 1], and we found the conditional distribution of P given 
the observed results of the study. We also computed the conditional mean of P given 
the study results as well as the M.S.E. for trying to predict P both before and after 
observing the results of the study. ◄ 


Radioactive Particles. In Example 5.7.8, radioactive particles reach a target according 
to a Poisson process with unknown rate β. In Exercise 22 of Sec. 5.7, you were asked 
to find the conditional distribution of β after observing the Poisson process for a 
certain amount of time. ◄ 


Anthropometry of Flea Beetles. In Example 5.10.2, we plotted two physical measure- 
ments from a sample of 31 flea beetles together with contours of a bivariate normal 
distribution. The family of bivariate normal distributions is parameterized by five 
quantities: the two means, the two variances, and the correlation. The choice of which 
set of five parameters to use for the fitted distribution is a form of statistical inference 
known as estimation. ◄ 


Interval for Mean. Suppose that the heights of men in a certain population follow 
the normal distribution with mean μ and variance 9, as in Example 5.6.7. This time, 
assume that we do not know the value of the mean μ, but rather we wish to learn about 
it by sampling from the population. Suppose that we decide to sample n = 36 men and 
let X̄ₙ stand for the average of their heights. Then the interval (X̄ₙ − 0.98, X̄ₙ + 0.98) 
computed in Example 5.6.8 has the property that it will contain the value of μ with 
probability 0.95. ◄ 
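
The half-width 0.98 is consistent with 1.96σ/√n for σ = 3 and n = 36, and the 0.95 coverage can be checked by simulation. The sketch below assumes the interval was built from the 0.975 quantile of the standard normal distribution; the value `mu_true`, the seed, and the number of repetitions are hypothetical and are used only to generate data.

```python
import random
from statistics import NormalDist, fmean

sigma, n = 3.0, 36                  # variance 9, sample size 36
z = NormalDist().inv_cdf(0.975)     # about 1.96
half_width = z * sigma / n ** 0.5   # 1.96 * 3 / 6, about 0.98

def covers(mu, rng):
    """Simulate one sample of n heights and report whether the interval contains mu."""
    xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
    return xbar - half_width < mu < xbar + half_width

rng = random.Random(2)              # arbitrary seed
mu_true = 68.0                      # hypothetical mean, used only to generate data
repetitions = 5_000
coverage = sum(covers(mu_true, rng) for _ in range(repetitions)) / repetitions
print(round(half_width, 2), coverage)
```

The coverage does not depend on the value chosen for `mu_true`, which is what allows the probability statement to be made without knowing μ.
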


Discrimination in Jury Selection. In Example 5.8.4, we were interested in whether 
there was evidence of discrimination against Mexican Americans in juror selection. 
Figure 5.8 shows how people who came into the case with different opinions about 
the extent of discrimination (if any) could alter their opinions in the light of learning 
the numerical evidence presented in the case. ◄ 


Service Times in a Queue. Suppose that customers in a queue must wait for service, 
and that we get to observe the service times of several customers. Suppose that we 
are interested in the rate at which customers are served. In Example 5.7.3, we let Z 
stand for the service rate, and in Example 5.7.4, we showed how to find the conditional 
distribution of Z given several observed service times. ◄ 


General Classes of Inference Problems 


Prediction One form of inference is to try to predict random variables that have 
not yet been observed. In Example 7.1.1, we might be interested in the average of 
the next 10 lifetimes, (1/10)(Xₘ₊₁ + ... + Xₘ₊₁₀). In the clinical trial example 
(Example 7.1.3), we might be interested in predicting how many patients from the 
next set of patients in the imipramine group will have successful outcomes. In 
virtually every statistical inference problem in which we have not observed all of 
the relevant data, prediction 
is possible. When the unobserved quantity to be predicted is a parameter, prediction 
is usually called estimation, as in Example 7.1.5. 


Statistical Decision Problems In many statistical inference problems, after the ex- 
perimental data have been analyzed, we must choose a decision from some available 
class of decisions with the property that the consequences of each available decision 
depend on the unknown value of some parameter. For example, we might have to 
estimate the unknown failure rate θ of our electronic components when the consequences 
depend on how close our estimate is to the correct value θ. As another 
example, we might have to decide whether the unknown proportion P of patients in 
the imipramine group (Example 7.1.3) is larger or smaller than some specified con- 
stant when the consequences depend on where P lies relative to the constant. This 
last type of inference is closely related to hypothesis testing, the subject of Chapter 9. 


Experimental Design In some statistical inference problems, we have some control 
over the type or the amount of experimental data that will be collected. For example, 
consider an experiment to determine the mean tensile strength of a certain type of 
alloy as a function of the pressure and temperature at which the alloy is produced. 
Within the limits of certain budgetary and time constraints, it may be possible for 
the experimenter to choose the levels of pressure and temperature at which experi- 
mental specimens of the alloy are to be produced, and also to specify the number of 
specimens to be produced at each of these levels. 

Such a problem, in which the experimenter can choose (at least to some extent) 
the particular experiment that is to be carried out, is called a problem of experimental 
design. Of course, the design of an experiment and the statistical analysis of the 
experimental data are closely related. One cannot design an effective experiment 
without considering the subsequent statistical analysis that is to be carried out on 
the data that will be obtained. And one cannot carry out a meaningful statistical 
analysis of experimental data without considering the particular type of experiment 
from which the data were derived. 


Other Inferences The general classes of problems described above, as well as the 
more specific examples that appeared earlier, are intended as illustrations of types 
of statistical inferences that we will be able to perform with the theory and methods 
introduced in this text. The range of possible models, inferences, and methods that 
can arise when data are observed in real research problems far exceeds what we can 
introduce here. It is hoped that gaining an understanding of the problems that we 
can cover here will give the reader an appreciation for what needs to be done when 
a more challenging statistical problem arises. 


Definition of a Statistic 


Failure Times of Ball Bearings. In Example 5.6.9, we had a sample of the numbers of 
millions of revolutions before failure for 23 ball bearings. We modeled the lifetimes 
as a random sample from a lognormal distribution. We might suppose that the 
parameters μ and σ² of that lognormal distribution are unknown and that we might 
wish to make some inference about them. We would want to make use of the 23 
observed values in making any such inference. But do we need to keep track of all 
23 values or are there some summaries of the data on which our inference will be 
based? ◄ 

Each statistical inference that we will learn how to perform in this book will be 
based on one or a few summaries of the available data. Such data summaries arise 
so often and are so fundamental to inference that they receive a special name. 


Definition 7.1.4 Statistic. Suppose that the observable random variables of interest are X₁, ..., Xₙ. 
Let r be an arbitrary real-valued function of n real variables. Then the random 
variable T = r(X₁, ..., Xₙ) is called a statistic. 


Three examples of statistics are the sample mean X̄ₙ, the maximum Yₙ of the 
values of X₁, ..., Xₙ, and the function r(X₁, ..., Xₙ), which has the constant value 
3 for all values of X₁, ..., Xₙ. 
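
In code, a statistic is simply a function that maps the observed values to a real number. The three statistics just listed might be sketched as follows; the function names and the data values are hypothetical and chosen only for illustration.

```python
from statistics import fmean

def sample_mean(xs):
    """The sample mean of the observations."""
    return fmean(xs)

def sample_max(xs):
    """The maximum of the observations."""
    return max(xs)

def constant_three(xs):
    """Takes the value 3 for every data set; a statistic, although not a useful one."""
    return 3.0

data = [17.9, 28.9, 33.0, 41.5, 42.1]  # hypothetical observed values
for r in (sample_mean, sample_max, constant_three):
    print(r.__name__, r(data))
```

Each function uses only the observed values and no unknown parameter, which is what makes it computable from the data alone.
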


Example 7.1.10 Failure Times of Ball Bearings. In Example 7.1.9, suppose that we were interested in 
making a statement about how far μ is from 40. Then we might want to use the statistic 


in our inference procedure. In this case, T is a naive measure of how far the data 
suggest that μ is from 40. ◄ 


Example 7.1.11 Interval for Mean. In Example 7.1.6, we constructed an interval that has probability 
0.95 of containing μ. The endpoints of that interval, namely, X̄ₙ − 0.98 and X̄ₙ + 0.98, 
are statistics. ◄ 


Many inferences can proceed without explicitly constructing statistics as a pre- 
liminary step. However, most inferences will involve the use of statistics that could 
be identified in advance. And knowing which statistics are useful in which inferences 
can greatly simplify the implementation of the inference. Expressing an inference in 
terms of statistics can also help us to decide how well the inference meets our needs. 
For instance, in Example 7.1.10, if we estimate |μ − 40| by T, we can use the distribution 
of T to help determine how likely it is that T differs from |μ − 40| by a large 
amount. As we construct specific inferences later in this book, we will draw attention 
to those statistics that play important roles in the inference. 


Parameters as Random Variables 


There is some controversy over whether parameters should be treated as random 
variables or merely as numbers that index a distribution. For instance, in Exam- 
ple 7.1.3, we let P stand for the proportion of the patients who avoid relapse from 
a large group receiving imipramine. We then say that X₁, X₂, ... are i.i.d. Bernoulli 
random variables with parameter p conditional on P = p. Here, we are explicitly 
thinking of P as a random variable, and we give it a distribution. An alternative 
would be to say that X₁, X₂, ... are i.i.d. Bernoulli random variables with parameter 
p where p is unknown and leave it at that. 

If we really want to compute something like the conditional probability that the 
proportion P is greater than 0.5 given the observations of the first 40 patients, then 
we need the conditional distribution of P given the first 40 patients, and we must 
treat P as a random variable. On the other hand, if we are only interested in making 
probability statements that are indexed by the value of p, then we do not need to 
think about a random variable called P. For example, we might wish to find two 
random variables Y₁ and Y₂ (functions of X₁, ..., X₄₀) such that, no matter what p 
equals, the probability that Y₁ ≤ p ≤ Y₂ is at least 0.9. Some of the inferences that 
we shall discuss later in this book are of the former type that require treating P as a 
random variable, and some are of the latter type in which p is merely an index for a 
distribution. 

Some statisticians believe that it is possible and useful to treat parameters as 
random variables in every statistical inference problem. They believe that the dis- 
tribution of the parameter is a subjective probability distribution in the sense that 
it represents an individual experimenter’s information and subjective beliefs about 
where the true value of the parameter is likely to lie. Once they assign a distribution 
for a parameter, that distribution is no different from any other probability distri- 
bution used in the field of statistics, and all of the rules of probability theory apply 
to every distribution. Indeed, in all of the cases described in this book, the parame- 
ters can actually be identified as limits of functions of large collections of potential 
observations. Here is a typical example. 


Parameter as a Limit of Random Variables. In Example 7.1.3, the parameter P can be 
understood as follows: Imagine an infinite sequence of potential patients receiving 
imipramine treatment. Assume that for every integer n, the outcomes of every ordered 
subset of n patients from that infinite sequence have the same joint distribution 
as the outcomes of every other ordered subset of n patients. In other words, assume 
that the order in which the patients appear in the sequence is irrelevant to the joint 
distribution of the patient outcomes. Let Pₙ be the proportion of patients who don’t 
relapse out of the first n. It can be shown that the probability is 1 that Pₙ converges 
to something as n → ∞. That something can be thought of as P, which we have been 
calling the proportion of successes in a very large population. In this sense, P is a random 
variable because it is a function of other random variables. A similar argument 
can be made in all of the statistical models in this book involving parameters, but 
the mathematics needed to make these arguments precise is too advanced to present 
here. (Chapter 1 of Schervish (1995) contains the necessary details.) Statisticians who 
argue as in this example are said to adhere to the Bayesian philosophy of statistics 
and are called Bayesians. ◄ 


There is another line of reasoning that leads naturally to treating P as a ran- 
dom variable in Example 7.1.12 without relying on an infinite sequence of potential 
patients. Suppose that the number of potential patients is enough larger than any sam- 
ple that we will see to make the approximation in Theorem 5.3.4 applicable. Then 
P is just the proportion of successes among the large population of potential pa- 
tients. Conditional on P = p, the number of successes in a sample of n patients will 
be approximately a binomial random variable with parameters n and p according to 
Theorem 5.3.4. If the outcomes of the patients in the sample are random variables, 
then it makes sense that the proportion of successes among those patients is also 
random. 
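
The quality of this approximation can be examined numerically. The sketch below assumes that Theorem 5.3.4 is the result that sampling without replacement from a large finite population (the hypergeometric distribution) is close to binomial sampling; the population sizes, the sample size n = 40, and the proportion p = 0.3 are hypothetical values chosen for illustration.

```python
from math import comb

def hypergeom_pf(x, n, successes, population):
    """Pr(x successes in a sample of n drawn without replacement)."""
    failures = population - successes
    return comb(successes, x) * comb(failures, n - x) / comb(population, n)

def binom_pf(x, n, p):
    """Binomial p.f. with parameters n and p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 40, 0.3
max_gaps = []
for population in (100, 1_000, 100_000):  # hypothetical population sizes
    successes = round(p * population)
    # Largest pointwise difference between the two p.f.'s over x = 0, ..., n.
    gap = max(abs(hypergeom_pf(x, n, successes, population) - binom_pf(x, n, p))
              for x in range(n + 1))
    max_gaps.append(gap)
    print(population, gap)
```

The largest discrepancy between the two p.f.'s shrinks as the population grows relative to the sample, which is the sense in which the population must be "enough larger" than the sample.
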

There is another group of statisticians who believe that in many problems it 
is not appropriate to assign a distribution to a parameter but claim instead that 
the true value of the parameter is a certain fixed number whose value happens to 
be unknown to the experimenter. These statisticians would assign a distribution to 
a parameter only when there is extensive previous information about the relative 
frequencies with which similar parameters have taken each of their possible values 
in past experiments. If two different scientists could agree on which past experiments 
were similar to the present experiment, then they might agree on a distribution 
to be assigned to the parameter. For example, suppose that the proportion θ of 
defective items in a certain large manufactured lot is unknown. Suppose also that 
the same manufacturer has produced many such lots of items in the past and that 
detailed records have been kept about the proportions of defective items in past lots. 
The relative frequencies for past lots could then be used to construct a distribution 
for θ. Statisticians who would argue this way are said to adhere to the frequentist 
philosophy of statistics and are called frequentists. 

The frequentists rely on the assumption that there exist infinite sequences of 
random variables in order to make sense of most of their probability statements. Once 
one assumes the existence of such an infinite sequence, one finds that the parameters 
of the distributions being used are limits of functions of the infinite sequences, just as 
do the Bayesians described above. In this way, the parameters are random variables 
because they are functions of random variables. The point of disagreement between 
the two groups is whether it is useful or even possible to assign a distribution to such 
parameters. 

Both Bayesians and frequentists agree on the usefulness of families of distributions 
for observations indexed by parameters. Bayesians refer to the distribution 
indexed by parameter value θ as the conditional distribution of the observations 
given that the parameter equals θ. Frequentists refer to the distribution indexed by 
θ as the distribution of the observations when θ is the true value of the parameter. 
The two groups agree that whenever a distribution can be assigned to a parameter, 
the theory and methods to be described in this chapter are applicable and useful. In 
Sections 7.2-7.4, we shall explicitly assume that each parameter is a random 
variable and we shall assign it a distribution that represents the probabilities that the 
parameter lies in various subsets of the parameter space. Beginning in Sec. 7.5, we 
shall consider techniques of estimation that are not based on assigning distributions 
to parameters. 


Exercises 


1. Identify the components of the statistical model (as 
defined in Definition 7.1.1) in Example 7.1.3. 


2. Identify two statistical inferences mentioned in Exam- 
ple 7.1.3. 


3. In Examples 7.1.4 and 5.7.8 (page 323), identify the 
components of the statistical model as defined in Defini- 
tion 7.1.1. 


4. In Example 7.1.6, identify the components of the sta- 
tistical model as defined in Definition 7.1.1. 


5. In Example 7.1.6, identify any statistical inference men- 
tioned. 


6. In Example 5.8.3 (page 328), identify the components 
of the statistical model as defined in Definition 7.1.1. 


7. In Example 5.4.7 (page 293), identify the components 
of the statistical model as defined in Definition 7.1.1. 


7.2 Prior and Posterior Distributions 


The distribution of a parameter before observing any data is called the prior 
distribution of the parameter. The conditional distribution of the parameter given 
the observed data is called the posterior distribution. If we plug the observed values 
of the data into the conditional p.f. or p.d.f. of the data given the parameter, the 
result is a function of the parameter alone, which is called the likelihood function. 


The Prior Distribution 


Lifetimes of Electronic Components. In Example 7.1.1, lifetimes X₁, X₂, ... of electronic 
components were modeled as i.i.d. exponential random variables with parameter 
θ conditional on θ, and θ was interpreted as the failure rate of the components. 
Indeed, we noted that n/(X₁ + ... + Xₙ) should converge in probability to θ as n goes to 
∞. We then said that θ had the gamma distribution with parameters 1 and 2. ◄ 


The distribution of θ mentioned at the end of Example 7.2.1 was assigned before observing 
any of the component lifetimes. For this reason, we call it a prior distribution. 


Prior Distribution/p.f./p.d.f. Suppose that one has a statistical model with parameter θ. 
If one treats θ as random, then the distribution that one assigns to θ before observing 
the other random variables of interest is called its prior distribution. If the parameter 
space is at most countable, then the prior distribution is discrete and its p.f. is called 
the prior p.f. of θ. If the prior distribution is a continuous distribution, then its p.d.f. 
is called the prior p.d.f. of θ. We shall commonly use the symbol ξ(θ) to denote the 
prior p.f. or p.d.f. as a function of θ. 


When one treats the parameter as a random variable, the name “prior distribu- 
tion” is merely another name for the marginal distribution of the parameter. 


Fair or Two-Headed Coin. Let θ denote the probability of obtaining a head when a 
certain coin is tossed, and suppose that it is known that the coin either is fair or has 
a head on each side. Therefore, the only possible values of θ are θ = 1/2 and θ = 1. If 
the prior probability that the coin is fair is 0.8, then the prior p.f. of θ is ξ(1/2) = 0.8 
and ξ(1) = 0.2. ◄ 


Proportion of Defective Items. Suppose that the proportion θ of defective items in a 
large manufactured lot is unknown and that the prior distribution assigned to θ is the 
uniform distribution on the interval [0, 1]. Then the prior p.d.f. of θ is 

$$
\xi(\theta) = \begin{cases} 1 & \text{for } 0 < \theta < 1, \\ 0 & \text{otherwise.} \end{cases} \tag{7.2.1}
$$

◄ 


The prior distribution of a parameter θ must be a probability distribution over 
the parameter space Ω. We assume that the experimenter or statistician will be able 
to summarize his previous information and knowledge about where in Ω the value of 
θ is likely to lie by constructing a probability distribution on the set Ω. In other words, 
before the experimental data have been collected or observed, the experimenter’s 
past experience and knowledge will lead him to believe that θ is more likely to lie 
in certain regions of Ω than in others. We shall assume that the relative likelihoods 
of the different regions can be expressed in terms of a probability distribution on Ω, 
namely, the prior distribution of θ. 


Lifetimes of Fluorescent Lamps. Suppose that the lifetimes (in hours) of fluorescent 
lamps of a certain type are to be observed and that the lifetime of any particular 
lamp has the exponential distribution with parameter θ. Suppose also that the exact 
value of θ is unknown, and on the basis of previous experience the prior distribution 
of θ is taken as the gamma distribution for which the mean is 0.0002 and the standard 
deviation is 0.0001. We shall determine the prior p.d.f. of θ. 

Suppose that the prior distribution of θ is the gamma distribution with parameters 
α₀ and β₀. It was shown in Theorem 5.7.5 that the mean of this distribution 
is α₀/β₀ and the variance is α₀/β₀². Therefore, α₀/β₀ = 0.0002 and α₀^(1/2)/β₀ = 0.0001. 
Solving these two equations gives α₀ = 4 and β₀ = 20,000. It follows from Eq. (5.7.13) 
that the prior p.d.f. of θ for θ > 0 is as follows: 

$$
\xi(\theta) = \frac{(20{,}000)^4}{3!}\, \theta^3 e^{-20{,}000\,\theta}. \tag{7.2.2}
$$

Also, ξ(θ) = 0 for θ ≤ 0. ◄ 
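
The two moment equations can be solved in general: dividing the mean by the variance gives the rate, and multiplying the mean by the rate gives the shape. The sketch below applies this to the lamp example; the function names `gamma_from_mean_sd` and `prior_pdf` are hypothetical, and `prior_pdf` assumes an integer shape so that the gamma function reduces to a factorial.

```python
from math import exp, factorial

def gamma_from_mean_sd(mean, sd):
    """Solve alpha/beta = mean and sqrt(alpha)/beta = sd for (alpha, beta)."""
    beta = mean / sd**2   # rate: mean divided by variance
    alpha = mean * beta   # shape: mean times rate
    return alpha, beta

def prior_pdf(theta, alpha=4, beta=20_000):
    """Gamma p.d.f. with integer shape alpha and rate beta; zero for theta <= 0."""
    if theta <= 0:
        return 0.0
    return beta**alpha / factorial(alpha - 1) * theta ** (alpha - 1) * exp(-beta * theta)

alpha0, beta0 = gamma_from_mean_sd(0.0002, 0.0001)
print(alpha0, beta0)              # approximately 4 and 20,000
print(prior_pdf(0.0002))          # prior density at the prior mean
```

With the default arguments, `prior_pdf` evaluates the same expression as the prior p.d.f. derived for the lamps.
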


In the remainder of this section and Sections 7.3 and 7.4, we shall focus on 
statistical inference problems in which the parameter θ is a random variable of 
interest and hence will need to be assigned a distribution. In such problems, we shall 
refer to the distribution indexed by θ for the other random variables of interest 
as the conditional distribution for those random variables given θ. For example, 
this is precisely the language used in Example 7.2.1 where the parameter is θ, the 
failure rate. In referring to the conditional p.f. or p.d.f. of random variables, such as 
X₁, X₂, ... in Example 7.2.1, we shall use the notation of conditional p.f.’s and p.d.f.’s. 


For example, if we let X = (X₁, ..., Xₘ) in Example 7.2.1, the conditional p.d.f. of 
X given θ is 

$$
f_m(\mathbf{x} \mid \theta) = \begin{cases} \theta^m \exp(-\theta[x_1 + \cdots + x_m]) & \text{for all } x_i > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{7.2.3}
$$

In many problems, such as Example 7.2.1, the observable data X₁, X₂, ... are 
modeled as a random sample from a univariate distribution indexed by θ. In these 
cases, let f(x|θ) denote the p.f. or p.d.f. of a single random variable under the 
distribution indexed by θ. In such a case, using the above notation, 

$$
f_n(\mathbf{x} \mid \theta) = f(x_1 \mid \theta) \cdots f(x_n \mid \theta).
$$


When we treat θ as a random variable, f(x|θ) is the conditional p.f. or p.d.f. of 
each observation Xᵢ given θ, and the observations are conditionally i.i.d. given θ. 
In summary, the following two expressions are to be understood as equivalent: 


• X₁, ..., Xₙ form a random sample with p.f. or p.d.f. f(x|θ). 
• X₁, ..., Xₙ are conditionally i.i.d. given θ with conditional p.f. or p.d.f. f(x|θ). 

Although we shall generally use the wording in the first bullet above for simplicity, 
it is often useful to remember that the two wordings are equivalent when we treat θ 
as a random variable. 
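
For the exponential model of Example 7.2.1, the product of the individual conditional p.d.f.'s can be checked numerically against the closed form θⁿ exp(−θ[x₁ + ... + xₙ]). The following sketch is illustrative only; the lifetimes, the value of θ, and the function names are hypothetical.

```python
from math import exp, prod

def exp_pdf(x, theta):
    """p.d.f. f(x | theta) of one exponential observation with rate theta."""
    return theta * exp(-theta * x) if x > 0 else 0.0

def joint_pdf_product(xs, theta):
    """Joint conditional p.d.f. computed as f(x_1 | theta) * ... * f(x_n | theta)."""
    return prod(exp_pdf(x, theta) for x in xs)

def joint_pdf_closed_form(xs, theta):
    """theta^n * exp(-theta * sum(xs)) when every x_i > 0, and 0 otherwise."""
    if any(x <= 0 for x in xs):
        return 0.0
    return theta ** len(xs) * exp(-theta * sum(xs))

xs = [0.8, 2.5, 1.1]  # hypothetical lifetimes
theta = 0.7           # hypothetical parameter value
print(joint_pdf_product(xs, theta), joint_pdf_closed_form(xs, theta))
```

The two printed values agree up to floating-point rounding, and both functions return 0 as soon as any observation is nonpositive.
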


Sensitivity Analysis and Improper Priors In Example 2.3.8 on page 84, we saw a 
situation in which two very different sets of prior probabilities were used for a collection 
of events. After we observed data, however, the posterior probabilities were 
quite similar. In Example 5.8.4 on page 330, we used a large collection of prior distributions 
for a parameter in order to see how much impact the prior distribution 
had on the posterior probability of a single important event. It is a common practice 
to compare the posterior distributions that arise from several different prior distri- 
butions in order to see how much effect the prior distribution has on the answers to 
important questions. Such comparisons are called sensitivity analysis. 

It is very often the case that different prior distributions do not make much 
difference after the data have been observed. This is especially true if there are a lot of 
data or if the prior distributions being compared are very spread out. This observation 
has two important implications. First, the fact that different experimenters might not 
agree on a prior distribution becomes less important if there are a lot of data. Second, 
experimenters might be less inclined to spend time specifying a prior distribution if 
it is not going to matter much which one is specified. Unfortunately, if one does not 
specify some prior distribution, there is no way to calculate a conditional distribution 
of the parameter given the data. 
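
A small sensitivity analysis can be carried out numerically. The sketch below is an illustration under stated assumptions, not a method taken from the text: it treats a proportion θ observed through Bernoulli trials, approximates the posterior mean on a grid, and compares the uniform prior with a hypothetical alternative prior p.d.f. 2θ on (0, 1). The data (3 successes in 10 trials and 300 in 1000) are also hypothetical.

```python
from math import exp, log

def posterior_mean(prior, successes, n, grid_size=2_000):
    """Posterior mean of a proportion on a midpoint grid; prior may be unnormalized."""
    thetas = [(j + 0.5) / grid_size for j in range(grid_size)]
    # Work with logs and subtract the maximum to avoid underflow for large n.
    log_like = [successes * log(t) + (n - successes) * log(1 - t) for t in thetas]
    top = max(log_like)
    weights = [prior(t) * exp(ll - top) for t, ll in zip(thetas, log_like)]
    return sum(t * w for t, w in zip(thetas, weights)) / sum(weights)

def uniform_prior(t):
    return 1.0

def increasing_prior(t):
    return 2.0 * t  # hypothetical alternative prior p.d.f. on (0, 1)

gaps = {}
for successes, n in ((3, 10), (300, 1_000)):  # hypothetical data, 30% successes
    a = posterior_mean(uniform_prior, successes, n)
    b = posterior_mean(increasing_prior, successes, n)
    gaps[n] = abs(a - b)
    print(n, round(a, 4), round(b, 4))
```

With 10 observations the two posterior means differ noticeably, while with 1000 observations they nearly coincide, in line with the remark that the choice of prior matters less when there are a lot of data.
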

As an expedient, there are some calculations available that attempt to capture 
the idea that the data contain much more information than is available a priori. 
Usually, these calculations involve using a function ξ(θ) as if it were a prior p.d.f. for 
the parameter θ but such that ∫ ξ(θ) dθ = ∞, which clearly violates the definition 
of p.d.f. Such priors are called improper. We shall discuss improper priors in more 
detail in Sec. 7.3. 


The Posterior Distribution 


Lifetimes of Fluorescent Lamps. In Example 7.2.4, we constructed a prior distribution 
for the parameter θ that specifies the exponential distribution for a collection of lifetimes 
of fluorescent lamps. Suppose that we observe a collection of n such lifetimes. 
How would we change the distribution of θ to take account of the observed data? ◄ 


Posterior Distribution/p.f./p.d.f. Consider a statistical inference problem with parameter
$\theta$ and random variables $X_1, \ldots, X_n$ to be observed. The conditional distribution
of $\theta$ given $X_1, \ldots, X_n$ is called the posterior distribution of $\theta$. The conditional p.f. or
p.d.f. of $\theta$ given $X_1 = x_1, \ldots, X_n = x_n$ is called the posterior p.f. or posterior p.d.f. of $\theta$
and is typically denoted $\xi(\theta|x_1, \ldots, x_n)$.


When one treats the parameter as a random variable, the name “posterior distribution”
is merely another name for the conditional distribution of the parameter
given the data. Bayes’ theorem for random variables (3.6.13) and for random vectors
(3.7.15) tells us how to compute the posterior p.d.f. or p.f. of $\theta$ after observing
data. We shall review the derivation of Bayes’ theorem here using the specific notation
of prior distributions and parameters.


Suppose that the $n$ random variables $X_1, \ldots, X_n$ form a random sample from a
distribution for which the p.d.f. or the p.f. is $f(x|\theta)$. Suppose also that the value of
the parameter $\theta$ is unknown and the prior p.d.f. or p.f. of $\theta$ is $\xi(\theta)$. Then the posterior
p.d.f. or p.f. of $\theta$ is

$$\xi(\theta|\boldsymbol{x}) = \frac{f(x_1|\theta) \cdots f(x_n|\theta)\,\xi(\theta)}{g_n(\boldsymbol{x})} \quad \text{for } \theta \in \Omega,$$

where $g_n$ is the marginal joint p.d.f. or p.f. of $X_1, \ldots, X_n$.


Proof For simplicity, we shall assume that the parameter space $\Omega$ is either an interval
of the real line or the entire real line and that $\xi(\theta)$ is a prior p.d.f. on $\Omega$, rather than
a prior p.f. However, the proof that will be given here can be adapted easily to a
problem in which $\xi(\theta)$ is a p.f.

Since the random variables $X_1, \ldots, X_n$ form a random sample from the distribution
for which the p.d.f. is $f(x|\theta)$, it follows from Sec. 3.7 that their conditional joint
p.d.f. or p.f. $f_n(x_1, \ldots, x_n|\theta)$ given $\theta$ is

$$f_n(x_1, \ldots, x_n|\theta) = f(x_1|\theta) \cdots f(x_n|\theta). \tag{7.2.4}$$

If we use the vector notation $\boldsymbol{x} = (x_1, \ldots, x_n)$, then the joint p.d.f. in Eq. (7.2.4)
can be written more compactly as $f_n(\boldsymbol{x}|\theta)$. Eq. (7.2.4) merely expresses the fact that
$X_1, \ldots, X_n$ are conditionally independent and identically distributed given $\theta$, each
having p.d.f. or p.f. $f(x|\theta)$.

If we multiply the conditional joint p.d.f. or p.f. by the p.d.f. $\xi(\theta)$, we obtain the
$(n + 1)$-dimensional joint p.d.f. (or p.f./p.d.f.) of $X_1, \ldots, X_n$ and $\theta$ in the form

$$f(\boldsymbol{x}, \theta) = f_n(\boldsymbol{x}|\theta)\,\xi(\theta). \tag{7.2.5}$$

The marginal joint p.d.f. or p.f. of $X_1, \ldots, X_n$ can now be obtained by integrating
the right-hand side of Eq. (7.2.5) over all values of $\theta$. Therefore, the $n$-dimensional
marginal joint p.d.f. or p.f. $g_n(\boldsymbol{x})$ of $X_1, \ldots, X_n$ can be written in the form

$$g_n(\boldsymbol{x}) = \int_\Omega f_n(\boldsymbol{x}|\theta)\,\xi(\theta)\,d\theta. \tag{7.2.6}$$

Eq. (7.2.6) is just an instance of the law of total probability for random vectors
(3.7.14).
Furthermore, the conditional p.d.f. of $\theta$ given that $X_1 = x_1, \ldots, X_n = x_n$, namely,
$\xi(\theta|\boldsymbol{x})$, must be equal to $f(\boldsymbol{x}, \theta)$ divided by $g_n(\boldsymbol{x})$. Thus, we have

$$\xi(\theta|\boldsymbol{x}) = \frac{f_n(\boldsymbol{x}|\theta)\,\xi(\theta)}{g_n(\boldsymbol{x})} \quad \text{for } \theta \in \Omega, \tag{7.2.7}$$

which is Bayes’ theorem restated for parameters and random samples. If $\xi(\theta)$ is a
p.f., so that the prior distribution is discrete, just replace the integral in (7.2.6) by the
sum over all of the possible values of $\theta$. ∎
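
When the prior is discrete, the three steps of the proof translate directly into a short computation. The following Python sketch is an illustration only and is not part of the original development. The function name `posterior_pf`, the two candidate parameter values, the prior probabilities, and the data are all hypothetical choices made for this example.

```python
def posterior_pf(prior, likelihood):
    """Posterior p.f. for a discrete prior.

    prior: dict mapping each possible value of theta to its prior probability.
    likelihood: function returning f_n(x | theta) for the observed data x.
    """
    # Joint p.f./p.d.f. of the data and theta, as in Eq. (7.2.5).
    joint = {theta: likelihood(theta) * p for theta, p in prior.items()}
    # Marginal of the data: Eq. (7.2.6) with the integral replaced by a sum.
    g = sum(joint.values())
    # Bayes' theorem, Eq. (7.2.7).
    return {theta: j / g for theta, j in joint.items()}


# Hypothetical data: four Bernoulli trials with three successes.
data = [1, 0, 1, 1]


def bernoulli_likelihood(theta):
    """f_n(x | theta) for conditionally i.i.d. Bernoulli observations."""
    value = 1.0
    for x in data:
        value *= theta if x == 1 else 1.0 - theta
    return value


# Hypothetical prior: theta is 0.2 or 0.5, each with prior probability 0.5.
posterior = posterior_pf({0.2: 0.5, 0.5: 0.5}, bernoulli_likelihood)
print(posterior)
```

Because three of the four hypothetical trials were successes, most of the posterior probability moves to the larger candidate value of the parameter.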


Lifetimes of Fluorescent Lamps. Suppose again, as in Examples 7.2.4 and 7.2.5, that the
distribution of the lifetimes of fluorescent lamps of a certain type is the exponential
distribution with parameter $\theta$, and the prior distribution of $\theta$ is a particular gamma
distribution for which the p.d.f. $\xi(\theta)$ is given by Eq. (7.2.2). Suppose also that the
lifetimes $X_1, \ldots, X_n$ of a random sample of $n$ lamps of this type are observed. We
shall determine the posterior p.d.f. of $\theta$ given that $X_1 = x_1, \ldots, X_n = x_n$.

By Eq. (5.7.16), the p.d.f. of each observation $X_i$ is

$$f(x|\theta) = \begin{cases} \theta e^{-\theta x} & \text{for } x > 0, \\ 0 & \text{otherwise.} \end{cases}$$

The joint p.d.f. of $X_1, \ldots, X_n$ can be written in the following form, for $x_i > 0$
$(i = 1, \ldots, n)$:

$$f_n(\boldsymbol{x}|\theta) = \prod_{i=1}^{n} \theta e^{-\theta x_i} = \theta^n e^{-\theta y},$$

where $y = \sum_{i=1}^{n} x_i$. As $f_n(\boldsymbol{x}|\theta)$ will be used in constructing the posterior distribution
of $\theta$, it is now apparent that the statistic $Y = \sum_{i=1}^{n} X_i$ will be used in any inference
that makes use of the posterior distribution.


Since the prior p.d.f. $\xi(\theta)$ is given by Eq. (7.2.2), it follows that for $\theta > 0$,

$$f_n(\boldsymbol{x}|\theta)\,\xi(\theta) \propto \theta^{n+3} e^{-(y+20{,}000)\theta}, \tag{7.2.8}$$

where a constant factor that does not depend on $\theta$ has been omitted. This factor
cancels when we divide by the integral over $\theta$.

We need to compute $g_n(\boldsymbol{x})$, which is the integral of (7.2.8) over all $\theta$:

$$g_n(\boldsymbol{x}) = \int_0^\infty \theta^{n+3} e^{-(y+20{,}000)\theta}\,d\theta = \frac{\Gamma(n+4)}{(y+20{,}000)^{n+4}},$$

where the last equality follows from Theorem 5.7.3. Hence,

$$\xi(\theta|\boldsymbol{x}) = \frac{\theta^{n+3} e^{-(y+20{,}000)\theta}}{\Gamma(n+4)/(y+20{,}000)^{n+4}} = \frac{(y+20{,}000)^{n+4}}{\Gamma(n+4)}\,\theta^{n+3} e^{-(y+20{,}000)\theta} \tag{7.2.9}$$

for $\theta > 0$. When we compare this expression with Eq. (5.7.13), we can see that it is
the p.d.f. of the gamma distribution with parameters $n + 4$ and $y + 20{,}000$. Hence,
this gamma distribution is the posterior distribution of $\theta$.

As a specific example, suppose that we observe the following $n = 5$ lifetimes
in hours: 2911, 3403, 3237, 3509, and 3118. Then $y = 16{,}178$, and the posterior
distribution of $\theta$ is the gamma distribution with parameters 9 and 36,178. The top
panel of Fig. 7.1 displays both the prior and posterior p.d.f.’s in this example. It is
clear that the data have caused the distribution of $\theta$ to change somewhat from the
prior to the posterior.

Figure 7.1 Prior and posterior p.d.f.’s in Example 7.2.6. The top panel is based on the
original prior. The bottom panel is based on the alternative prior that was part of the
sensitivity analysis. [Figure not reproduced: each panel plots the prior (dashed) and the
posterior (solid) p.d.f. against $\theta$ for $\theta$ between 0 and about 0.0015.]

At this point, it might be appropriate to perform a sensitivity analysis. For
example, how would the posterior distribution change if we had chosen a different
prior distribution? To be specific, consider the gamma prior with parameters 1 and
1000. This prior has a mean five times as big as that of the original prior (0.001
rather than 0.0002) and a standard deviation ten times as big (0.001 rather than
0.0001). The posterior distribution would then be the gamma distribution
with parameters 6 and 17,178. The p.d.f.’s of this pair of prior and posterior are plotted
in the lower panel of Fig. 7.1. One can see that both the prior and the posterior in
the bottom panel are more spread out than their counterparts in the upper panel. It
is clear that the choice of prior distribution is going to make a difference with this
small data set. ◁
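
The arithmetic of this example is easy to reproduce. The following Python sketch is an illustration rather than part of the original text. It assumes the general update in which a gamma prior with parameters $\alpha$ and $\beta$ becomes a gamma posterior with parameters $\alpha + n$ and $\beta + y$; Eq. (7.2.9) is the case $\alpha = 4$, $\beta = 20{,}000$, and the posterior quoted for the alternative prior is the case $\alpha = 1$, $\beta = 1000$. The function names are hypothetical.

```python
import math


def gamma_posterior(alpha, beta, lifetimes):
    """Update a gamma(alpha, beta) prior with exponential lifetimes.

    Returns the posterior parameters (alpha + n, beta + y), where n is the
    number of lifetimes and y is their sum.
    """
    return alpha + len(lifetimes), beta + sum(lifetimes)


def gamma_mean_sd(alpha, beta):
    """Mean and standard deviation of the gamma(alpha, beta) distribution."""
    return alpha / beta, math.sqrt(alpha) / beta


lifetimes = [2911, 3403, 3237, 3509, 3118]  # hours, as observed in the example

priors = {"original": (4, 20_000), "alternative": (1, 1_000)}
posteriors = {
    name: gamma_posterior(a, b, lifetimes) for name, (a, b) in priors.items()
}

for name in priors:
    print(name, "prior", priors[name], gamma_mean_sd(*priors[name]))
    print(name, "posterior", posteriors[name], gamma_mean_sd(*posteriors[name]))
```

Printing the mean and standard deviation of each distribution gives a quick numerical counterpart to the visual comparison of the two panels of Fig. 7.1.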


The names “prior” and “posterior” derive from the Latin words for “former”
and “coming after.” The prior distribution is the distribution of $\theta$ that comes before
observing the data, and the posterior distribution comes after observing the data.


The Likelihood Function 


The denominator on the right side of Eq. (7.2.7) is simply the integral of the numerator
over all possible values of $\theta$. Although the value of this integral depends on
the observed values $x_1, \ldots, x_n$, it does not depend on $\theta$ and it may be treated as a
constant when the right-hand side of Eq. (7.2.7) is regarded as a p.d.f. of $\theta$. We may
therefore replace Eq. (7.2.7) with the following relation:

$$\xi(\theta|\boldsymbol{x}) \propto f_n(\boldsymbol{x}|\theta)\,\xi(\theta). \tag{7.2.10}$$

The proportionality symbol $\propto$ is used here to indicate that the left side is equal to the
right side except possibly for a constant factor, the value of which may depend on
the observed values $x_1, \ldots, x_n$ but does not depend on $\theta$. The appropriate constant
factor that will establish the equality of the two sides in the relation (7.2.10) can be
determined at any time by using the fact that $\int_\Omega \xi(\theta|\boldsymbol{x})\,d\theta = 1$, because $\xi(\theta|\boldsymbol{x})$ is a
p.d.f. of $\theta$.

One of the two functions on the right-hand side of Eq. (7.2.10) is the prior p.d.f.
of $\theta$. The other function has a special name also.


Likelihood Function. When the joint p.d.f. or the joint p.f. $f_n(\boldsymbol{x}|\theta)$ of the observations
in a random sample is regarded as a function of $\theta$ for given values of $x_1, \ldots, x_n$, it is
called the likelihood function.


The relation (7.2.10) states that the posterior p.d.f. of $\theta$ is proportional to the product
of the likelihood function and the prior p.d.f. of $\theta$.

By using the proportionality relation (7.2.10), it is often possible to determine
the posterior p.d.f. of $\theta$ without explicitly performing the integration in Eq. (7.2.6).
If we can recognize the right side of the relation (7.2.10) as being equal to one of the
standard p.d.f.’s introduced in Chapter 5 or elsewhere in this book, except possibly
for a constant factor, then we can easily determine the appropriate factor that will
convert the right side of (7.2.10) into a proper p.d.f. of $\theta$. We shall illustrate these
ideas by considering again Example 7.2.3.


Proportion of Defective Items. Suppose again, as in Example 7.2.3, that the proportion
$\theta$ of defective items in a large manufactured lot is unknown and that the prior
distribution of $\theta$ is a uniform distribution on the interval [0, 1]. Suppose also that
a random sample of $n$ items is taken from the lot, and for $i = 1, \ldots, n$, let $X_i = 1$ if
the $i$th item is defective, and let $X_i = 0$ otherwise. Then $X_1, \ldots, X_n$ form Bernoulli
trials with parameter $\theta$. We shall determine the posterior p.d.f. of $\theta$.

It follows from Eq. (5.2.2) that the p.f. of each observation $X_i$ is

$$f(x|\theta) = \begin{cases} \theta^{x}(1-\theta)^{1-x} & \text{for } x = 0, 1, \\ 0 & \text{otherwise.} \end{cases}$$

Hence, if we let $y = \sum_{i=1}^{n} x_i$, then the joint p.f. of $X_1, \ldots, X_n$ can be written in the
following form for $x_i = 0$ or 1 $(i = 1, \ldots, n)$:

$$f_n(\boldsymbol{x}|\theta) = \theta^{y}(1-\theta)^{n-y}. \tag{7.2.11}$$

Since the prior p.d.f. $\xi(\theta)$ is given by Eq. (7.2.1), it follows that for $0 < \theta < 1$,

$$f_n(\boldsymbol{x}|\theta)\,\xi(\theta) = \theta^{y}(1-\theta)^{n-y}. \tag{7.2.12}$$

When we compare this expression with Eq. (5.8.3), we can see that, except for
a constant factor, it is the p.d.f. of the beta distribution with parameters $\alpha = y + 1$
and $\beta = n - y + 1$. Since the posterior p.d.f. $\xi(\theta|\boldsymbol{x})$ is proportional to the right side
of Eq. (7.2.12), it follows that $\xi(\theta|\boldsymbol{x})$ must be the p.d.f. of the beta distribution with
parameters $\alpha = y + 1$ and $\beta = n - y + 1$. Therefore, for $0 < \theta < 1$,

$$\xi(\theta|\boldsymbol{x}) = \frac{\Gamma(n+2)}{\Gamma(y+1)\,\Gamma(n-y+1)}\,\theta^{y}(1-\theta)^{n-y}. \tag{7.2.13}$$

In this example, the statistic $Y = \sum_{i=1}^{n} X_i$ is being used to construct the posterior
distribution, and hence will be used in any inference that is based on the posterior
distribution. ◁


Note: Normalizing Constant for Posterior p.d.f. The steps that got us from (7.2.12)
to (7.2.13) are an example of a very common technique for determining a posterior
p.d.f. We can drop any inconvenient constant factor from the prior p.d.f. and from the
likelihood function before we multiply them together as in (7.2.10). Then we look at
the resulting product, call it $g(\theta)$, to see if we recognize it as looking like part of a
p.d.f. that we have seen elsewhere. If indeed we find a named distribution with p.d.f.
equal to $c\,g(\theta)$, then our posterior p.d.f. is also $c\,g(\theta)$, and our posterior distribution
has the corresponding name, just as in Example 7.2.7.
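
The claim that the recognized constant is the right one can be checked numerically. The Python sketch below is an illustration only. It takes the unnormalized product $\theta^{y}(1-\theta)^{n-y}$ from (7.2.12), integrates it over [0, 1] with a simple midpoint rule, and compares the reciprocal of that integral with the beta constant in (7.2.13). The values $n = 10$ and $y = 4$, the helper names, and the choice of the midpoint rule are assumptions made for this example.

```python
import math


def unnormalized_posterior(theta, y, n):
    """The product of likelihood and uniform prior, theta^y (1 - theta)^(n - y)."""
    return theta**y * (1.0 - theta) ** (n - y)


def midpoint_integral(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


n, y = 10, 4  # hypothetical sample size and number of defectives

# Constant found by brute-force integration of the unnormalized product.
numeric_constant = 1.0 / midpoint_integral(
    lambda t: unnormalized_posterior(t, y, n), 0.0, 1.0
)

# Constant read off from the beta p.d.f. with parameters y + 1 and n - y + 1.
beta_constant = math.gamma(n + 2) / (math.gamma(y + 1) * math.gamma(n - y + 1))

print(numeric_constant, beta_constant)
```

The two printed numbers agree to many digits, which is the practical content of the note: once the functional form is recognized, no integration is needed.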


Sequential Observations and Prediction 


In many experiments, the observations $X_1, \ldots, X_n$, which form the random sample,
must be obtained sequentially, that is, one at a time. In such an experiment, the
value of $X_1$ is observed first, the value of $X_2$ is observed next, the value of $X_3$ is then
observed, and so on. Suppose that the prior p.d.f. of the parameter $\theta$ is $\xi(\theta)$. After
the value $x_1$ of $X_1$ has been observed, the posterior p.d.f. $\xi(\theta|x_1)$ can be calculated in
the usual way from the relation

$$\xi(\theta|x_1) \propto f(x_1|\theta)\,\xi(\theta). \tag{7.2.14}$$

Since $X_1$ and $X_2$ are conditionally independent given $\theta$, the conditional p.f. or
p.d.f. of $X_2$ given $\theta$ and $X_1 = x_1$ is the same as that given $\theta$ alone, namely, $f(x_2|\theta)$.
Hence, the posterior p.d.f. of $\theta$ in Eq. (7.2.14) serves as the prior p.d.f. of $\theta$ when the
value of $X_2$ is to be observed. Thus, after the value $x_2$ of $X_2$ has been observed, the
posterior p.d.f. $\xi(\theta|x_1, x_2)$ can be calculated from the relation

$$\xi(\theta|x_1, x_2) \propto f(x_2|\theta)\,\xi(\theta|x_1). \tag{7.2.15}$$

We can continue in this way, calculating an updated posterior p.d.f. of $\theta$ after each
observation and using that p.d.f. as the prior p.d.f. of $\theta$ for the next observation. The
posterior p.d.f. $\xi(\theta|x_1, \ldots, x_{n-1})$ after the values $x_1, \ldots, x_{n-1}$ have been observed will
ultimately be the prior p.d.f. of $\theta$ for the final observed value of $X_n$. The posterior
p.d.f. after all values $x_1, \ldots, x_n$ have been observed will therefore be specified by
the relation

$$\xi(\theta|\boldsymbol{x}) \propto f(x_n|\theta)\,\xi(\theta|x_1, \ldots, x_{n-1}). \tag{7.2.16}$$

Alternatively, after all $n$ values $x_1, \ldots, x_n$ have been observed, we could calculate
the posterior p.d.f. $\xi(\theta|\boldsymbol{x})$ in the usual way by combining the joint p.d.f. $f_n(\boldsymbol{x}|\theta)$
with the original prior p.d.f. $\xi(\theta)$, as indicated in Eq. (7.2.7). It can be shown (see
Exercise 8) that the posterior p.d.f. $\xi(\theta|\boldsymbol{x})$ will be the same regardless of whether it is
calculated directly by using Eq. (7.2.7) or sequentially by using Eqs. (7.2.14), (7.2.15),
and (7.2.16). This property was illustrated in Sec. 2.3 (see page 80) for a coin that is
known either to be fair or to have a head on each side. After each toss of the coin,
the posterior probability that the coin is fair is updated.
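
The agreement between the sequential and the direct calculation can be seen numerically without appealing to any special family of distributions. The Python sketch below is an illustration under stated assumptions: the parameter is restricted to a finite grid of values in (0, 1), the prior puts equal probability on each grid point, and the observations are hypothetical Bernoulli outcomes. The names `normalize`, `bernoulli_pf`, and `joint_pf` are introduced only for this example.

```python
def normalize(weights):
    """Rescale nonnegative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def bernoulli_pf(x, theta):
    """f(x | theta) for a single Bernoulli observation."""
    return theta if x == 1 else 1.0 - theta


def joint_pf(xs, theta):
    """f_n(x | theta): product of the individual p.f.'s."""
    value = 1.0
    for x in xs:
        value *= bernoulli_pf(x, theta)
    return value


grid = [(k + 0.5) / 200 for k in range(200)]  # discretized parameter space
prior = normalize([1.0] * len(grid))          # equal weight on each grid point
data = [1, 0, 0, 1, 1]                        # hypothetical observations

# Sequential updating: each posterior becomes the prior for the next
# observation, as in relations (7.2.14) through (7.2.16).
sequential = prior
for x in data:
    sequential = normalize(
        [bernoulli_pf(x, t) * w for t, w in zip(grid, sequential)]
    )

# Direct updating: combine the joint p.f. with the original prior, as in (7.2.7).
direct = normalize([joint_pf(data, t) * w for t, w in zip(grid, prior)])

max_difference = max(abs(s - d) for s, d in zip(sequential, direct))
print(max_difference)
```

The largest difference between the two sets of posterior probabilities is at the level of floating-point rounding error.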

The proportionality constants in Eqs. (7.2.14)–(7.2.16) have a useful interpretation.
For example, in (7.2.16) the proportionality constant is 1 over the integral of
the right side with respect to $\theta$. But this integral is the conditional p.d.f. or p.f. of $X_n$
given $X_1 = x_1, \ldots, X_{n-1} = x_{n-1}$, according to the conditional version of the law of
total probability (3.7.16). For example, if $\theta$ has a continuous distribution,

$$f(x_n|x_1, \ldots, x_{n-1}) = \int f(x_n|\theta)\,\xi(\theta|x_1, \ldots, x_{n-1})\,d\theta. \tag{7.2.17}$$

The proportionality constant in (7.2.16) is 1 over (7.2.17). So, if we are interested in
predicting the $n$th observation in a sequence after observing the first $n - 1$, we can
use (7.2.17), which is also 1 over the proportionality constant in Eq. (7.2.16), as the
conditional p.f. or p.d.f. of $X_n$ given the first $n - 1$ observations.


Lifetimes of Fluorescent Lamps. In Example 7.2.6, conditional on $\theta$, the lifetimes of
fluorescent lamps are independent exponential random variables with parameter $\theta$.
We also observed the lifetimes of five lamps, and the posterior distribution of $\theta$ was
found to be the gamma distribution with parameters 9 and 36,178. Suppose that we
want to predict the lifetime $X_6$ of the next lamp.

The conditional p.d.f. of $X_6$, the lifetime of the next lamp, given the first five
lifetimes equals the integral of $\xi(\theta|\boldsymbol{x}) f(x_6|\theta)$ with respect to $\theta$. The posterior p.d.f. of
$\theta$ is $\xi(\theta|\boldsymbol{x}) = 2.633 \times 10^{36}\,\theta^{8} e^{-36{,}178\theta}$ for $\theta > 0$. So, for $x_6 > 0$,

$$\begin{aligned}
f(x_6|\boldsymbol{x}) &= \int_0^\infty 2.633 \times 10^{36}\,\theta^{8} e^{-36{,}178\theta}\,\theta e^{-x_6\theta}\,d\theta \\
&= 2.633 \times 10^{36} \int_0^\infty \theta^{9} e^{-(x_6 + 36{,}178)\theta}\,d\theta \\
&= 2.633 \times 10^{36}\,\frac{\Gamma(10)}{(x_6 + 36{,}178)^{10}} = \frac{9.555 \times 10^{41}}{(x_6 + 36{,}178)^{10}}.
\end{aligned} \tag{7.2.18}$$

We can use this p.d.f. to perform any calculation we wish concerning the distribution
of $X_6$ given the observed lifetimes. For example, the probability that the sixth lamp
lasts more than 3000 hours equals

$$\Pr(X_6 > 3000|\boldsymbol{x}) = \int_{3000}^\infty \frac{9.555 \times 10^{41}}{(x_6 + 36{,}178)^{10}}\,dx_6 = \frac{9.555 \times 10^{41}}{9 \times 39{,}178^{9}} = 0.4882.$$

Finally, we can continue the sensitivity analysis that was started in Example 7.2.6.
If it is important to know the probability that the next lifetime is at least 3000, we can
see how much influence the choice of prior distribution has made on this calculation.
Using the second prior distribution (gamma with parameters 1 and 1000), we found
that the posterior distribution of $\theta$ was the gamma distribution with parameters 6
and 17,178. We could compute the conditional p.d.f. of $X_6$ given the observed data
in the same way as we did with the original posterior, and it would be

$$f(x_6|\boldsymbol{x}) = \frac{1.542 \times 10^{26}}{(x_6 + 17{,}178)^{7}} \quad \text{for } x_6 > 0. \tag{7.2.19}$$

With this p.d.f., the probability that $X_6 > 3000$ is

$$\Pr(X_6 > 3000|\boldsymbol{x}) = \int_{3000}^\infty \frac{1.542 \times 10^{26}}{(x_6 + 17{,}178)^{7}}\,dx_6 = \frac{1.542 \times 10^{26}}{6 \times 20{,}178^{6}} = 0.3807.$$

Figure 7.2 Two possible conditional p.d.f.’s, Eqs. (7.2.18) and (7.2.19), for $X_6$ given the
observed data in Example 7.2.8. The two p.d.f.’s were computed using the two different
posterior distributions that were derived from the two different prior distributions in
Example 7.2.6. [Figure not reproduced: “Comparison of conditional distributions of next
observation”; the p.d.f. is plotted against $x_6$ from 0 to 30,000, with a solid curve for the
original prior and a dashed curve for the alternative prior.]

As we noted at the end of Example 7.2.6, the different priors make a considerable
difference in the inferences that we can make. If it is important to have a precise value
of $\Pr(X_6 > 3000|\boldsymbol{x})$, we need a larger sample. The two different p.d.f.’s of $X_6$ given $\boldsymbol{x}$
can be compared in Fig. 7.2. The p.d.f. from Eq. (7.2.18) is higher for intermediate
values of $x_6$, while the one from Eq. (7.2.19) is higher for the extreme values of $x_6$. ◁
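
Both predictive probabilities in this example follow one pattern, which the Python sketch below makes explicit. It is an illustration, not part of the original text, and the function names are hypothetical. It assumes that when the posterior of $\theta$ is the gamma distribution with parameters $\alpha$ and $\beta$, carrying out the integral in (7.2.17) gives the conditional p.d.f. $\alpha\beta^{\alpha}/(\beta + x)^{\alpha+1}$ for the next lifetime, so that the probability of exceeding $t$ is $[\beta/(\beta + t)]^{\alpha}$. Equations (7.2.18) and (7.2.19) are the cases $(9, 36{,}178)$ and $(6, 17{,}178)$.

```python
def predictive_pdf(x, alpha, beta):
    """Conditional p.d.f. of the next lifetime at x > 0.

    Assumes theta has a gamma(alpha, beta) posterior and the lifetime is
    exponential with parameter theta given theta.
    """
    return alpha * beta**alpha / (beta + x) ** (alpha + 1)


def prob_exceeds(t, alpha, beta):
    """Pr(next lifetime > t | data), the integral of predictive_pdf over (t, inf)."""
    return (beta / (beta + t)) ** alpha


posteriors = {"original prior": (9, 36_178), "alternative prior": (6, 17_178)}

for name, (alpha, beta) in posteriors.items():
    print(name, round(prob_exceeds(3000, alpha, beta), 4))
```

The two printed values can be compared with the probabilities 0.4882 and 0.3807 obtained above.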


Summary 


The prior distribution of a parameter describes our uncertainty about the parameter 
before observing any data. The likelihood function is the conditional p.d.f. or p.f. of 
the data given the parameter when regarded as a function of the parameter with the 
observed data plugged in. The likelihood tells us how much the data will alter our 
uncertainty. Large values of the likelihood correspond to parameter values where the 
posterior p.d.f. or p.f. will be higher than the prior. Low values of the likelihood occur 
at parameter values where the posterior will be lower than the prior. The posterior 
distribution of the parameter is the conditional distribution of the parameter given 
the data. It is obtained using Bayes’ theorem for random variables, which we first saw 
on page 148. We can predict future observations that are conditionally independent 
of the observed data given $\theta$ by using the conditional version of the law of total
probability that we saw on page 163. 


Exercises


1. Consider again the situation described in Example
7.2.8. This time, suppose that the experimenter believes
that the prior distribution of $\theta$ is the gamma distribution
with parameters 1 and 5000. What would this experimenter
compute as the value of $\Pr(X_6 > 3000|\boldsymbol{x})$?


2. Suppose that the proportion $\theta$ of defective items in a
large manufactured lot is known to be either 0.1 or 0.2,
and the prior p.f. of $\theta$ is as follows:

$$\xi(0.1) = 0.7 \quad \text{and} \quad \xi(0.2) = 0.3.$$

Suppose also that when eight items are selected at random
from the lot, it is found that exactly two of them are
defective. Determine the posterior p.f. of $\theta$.


3. Suppose that the number of defects on a roll of magnetic
recording tape has a Poisson distribution for which
the mean $\lambda$ is either 1.0 or 1.5, and the prior p.f. of $\lambda$ is as
follows:

$$\xi(1.0) = 0.4 \quad \text{and} \quad \xi(1.5) = 0.6.$$

If a roll of tape selected at random is found to have three
defects, what is the posterior p.f. of $\lambda$?


4. Suppose that the prior distribution of some parameter
$\theta$ is a gamma distribution for which the mean is 10 and the
variance is 5. Determine the prior p.d.f. of $\theta$.


5. Suppose that the prior distribution of some parameter
$\theta$ is a beta distribution for which the mean is 1/3 and the
variance is 1/45. Determine the prior p.d.f. of $\theta$.


6. Suppose that the proportion $\theta$ of defective items in a
large manufactured lot is unknown, and the prior distribution
of $\theta$ is the uniform distribution on the interval [0, 1].
When eight items are selected at random from the lot, it is
found that exactly three of them are defective. Determine
the posterior distribution of $\theta$.


7. Consider again the problem described in Exercise 6,
but suppose now that the prior p.d.f. of $\theta$ is as follows:

$$\xi(\theta) = \begin{cases} 2(1 - \theta) & \text{for } 0 < \theta < 1, \\ 0 & \text{otherwise.} \end{cases}$$

As in Exercise 6, suppose that in a random sample of eight
items exactly three are found to be defective. Determine
the posterior distribution of $\theta$.


8. Suppose that $X_1, \ldots, X_n$ form a random sample from
a distribution for which the p.d.f. is $f(x|\theta)$, the value of $\theta$
is unknown, and the prior p.d.f. of $\theta$ is $\xi(\theta)$. Show that the
posterior p.d.f. $\xi(\theta|\boldsymbol{x})$ is the same regardless of whether it
is calculated directly by using Eq. (7.2.7) or sequentially
by using Eqs. (7.2.14), (7.2.15), and (7.2.16).


9. Consider again the problem described in Exercise 6,
and assume the same prior distribution of $\theta$. Suppose now,
however, that instead of selecting a random sample of
eight items from the lot, we perform the following experiment:
Items from the lot are selected at random one by
one until exactly three defectives have been found. If we
find that we must select a total of eight items in this experiment,
what is the posterior distribution of $\theta$ at the end of
the experiment?


10. Suppose that a single observation $X$ is to be taken
from the uniform distribution on the interval
$[\theta - \frac{1}{2}, \theta + \frac{1}{2}]$, the value of $\theta$ is unknown, and the prior distribution
of $\theta$ is the uniform distribution on the interval [10, 20].
If the observed value of $X$ is 12, what is the posterior
distribution of $\theta$?


11. Consider again the conditions of Exercise 10, and
assume the same prior distribution of $\theta$. Suppose now,
however, that six observations are selected at random
from the uniform distribution on the interval
$[\theta - \frac{1}{2}, \theta + \frac{1}{2}]$, and their values are 11.0, 11.5, 11.7, 11.1, 11.4, and
10.9. Determine the posterior distribution of $\theta$.


7.3. Conjugate Prior Distributions 


For each of the most popular statistical models, there exists a family of distributions
for the parameter with a very special property. If the prior distribution is chosen to
be a member of that family, then the posterior distribution will also be a member of
that family. Such a family of distributions is called a conjugate family. Choosing a
prior distribution from a conjugate family will typically make it particularly simple
to calculate the posterior distribution.


Sampling from a Bernoulli Distribution 


Example 7.3.1 A Clinical Trial. In Example 5.8.5 (page 330), we were observing patients in a
clinical trial. The proportion $P$ of successful outcomes among all possible patients was
a random variable for which we chose a distribution from the family of beta
distributions. This choice made the calculation of the conditional distribution of $P$ given
the observed data very simple at the end of that example. Indeed, the conditional
distribution of $P$ given the data was another member of the beta family. ◁


That the result in Example 7.3.1 occurs in general is the subject of the next theorem.


Theorem 7.3.1 Suppose that $X_1, \ldots, X_n$ form a random sample from the Bernoulli distribution with
parameter $\theta$, which is unknown $(0 < \theta < 1)$. Suppose also that the prior distribution
of $\theta$ is the beta distribution with parameters $\alpha > 0$ and $\beta > 0$. Then the posterior
distribution of $\theta$ given that $X_i = x_i$ $(i = 1, \ldots, n)$ is the beta distribution with parameters
$\alpha + \sum_{i=1}^{n} x_i$ and $\beta + n - \sum_{i=1}^{n} x_i$.


Theorem 7.3.1 is just a restatement of Theorem 5.8.2 (page 329), and its proof is 
essentially the calculation in Example 5.8.3. 


Updating the Posterior Distribution One implication of Theorem 7.3.1 is the
following: Suppose that the proportion $\theta$ of defective items in a large shipment is
unknown, the prior distribution of $\theta$ is the beta distribution with parameters $\alpha$ and $\beta$,
and $n$ items are selected one at a time at random from the shipment and inspected.
Assume that the items are conditionally independent given $\theta$. If the first item
inspected is defective, the posterior distribution of $\theta$ will be the beta distribution with
parameters $\alpha + 1$ and $\beta$. If the first item is nondefective, the posterior distribution
will be the beta distribution with parameters $\alpha$ and $\beta + 1$. The process can be continued
in the following way: Each time an item is inspected, the current posterior beta
distribution of $\theta$ is changed to a new beta distribution in which the value of either the
parameter $\alpha$ or the parameter $\beta$ is increased by one unit. The value of $\alpha$ is increased
by one unit each time a defective item is found, and the value of $\beta$ is increased by
one unit each time a nondefective item is found.


Conjugate Family/Hyperparameters. Let $X_1, X_2, \ldots$ be conditionally i.i.d. given $\theta$ with
common p.f. or p.d.f. $f(x|\theta)$. Let $\Psi$ be a family of possible distributions over the
parameter space $\Omega$. Suppose that, no matter which prior distribution $\xi$ we choose
from $\Psi$, no matter how many observations $\boldsymbol{X} = (X_1, \ldots, X_n)$ we observe, and no
matter what are their observed values $\boldsymbol{x} = (x_1, \ldots, x_n)$, the posterior distribution
$\xi(\theta|\boldsymbol{x})$ is a member of $\Psi$. Then $\Psi$ is called a conjugate family of prior distributions
for samples from the distributions $f(x|\theta)$. It is also said that the family $\Psi$ is closed
under sampling from the distributions $f(x|\theta)$. Finally, if the distributions in $\Psi$ are
parametrized by further parameters, then the associated parameters for the prior
distribution are called the prior hyperparameters and the associated parameters of
the posterior distribution are called the posterior hyperparameters.


Theorem 7.3.1 says that the family of beta distributions is a conjugate family of prior
distributions for samples from a Bernoulli distribution. If the prior distribution of $\theta$
is a beta distribution, then the posterior distribution at each stage of sampling will
also be a beta distribution, regardless of the observed values in the sample. Also, the
family of beta distributions is closed under sampling from Bernoulli distributions.
The parameters $\alpha$ and $\beta$ in Theorem 7.3.1 are the prior hyperparameters. The
corresponding parameters of the posterior distributions ($\alpha + \sum_{i=1}^{n} x_i$ and $\beta + n - \sum_{i=1}^{n} x_i$)
are the posterior hyperparameters. The statistic $\sum_{i=1}^{n} X_i$ is needed to compute the
posterior distribution, hence it will be needed to perform any inference based on the
posterior distribution. Exercises 23 and 24 introduce a general collection of p.d.f.’s
$f(x|\theta)$ for which conjugate families of priors exist. Most of the familiar named
distributions are covered by these exercises. The various uniform distributions are notable
exceptions.


The Variance of the Posterior Beta Distribution. Suppose that the proportion $\theta$ of
defective items in a large shipment is unknown, the prior distribution of $\theta$ is the
uniform distribution on the interval [0, 1], and items are to be selected at random
from the shipment and inspected until the variance of the posterior distribution of $\theta$
has been reduced to the value 0.01 or less. We shall determine the total number of
defective and nondefective items that must be obtained before the sampling process
is stopped.

As stated in Sec. 5.8, the uniform distribution on the interval [0, 1] is the beta
distribution with parameters 1 and 1. Therefore, after $y$ defective items and $z$
nondefective items have been obtained, the posterior distribution of $\theta$ will be the beta
distribution with $\alpha = y + 1$ and $\beta = z + 1$. It was shown in Theorem 5.8.3 that the
variance of the beta distribution with parameters $\alpha$ and $\beta$ is $\alpha\beta/[(\alpha + \beta)^2(\alpha + \beta + 1)]$.
Therefore, the variance $V$ of the posterior distribution of $\theta$ will be

$$V = \frac{(y + 1)(z + 1)}{(y + z + 2)^2 (y + z + 3)}.$$

Sampling is to stop as soon as the number of defectives $y$ and the number of
nondefectives $z$ that have been obtained are such that $V \le 0.01$. It can be shown (see
Exercise 2) that it will not be necessary to select more than 22 items, but it is
necessary to select at least seven items. ◁
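
The two bounds quoted at the end of this example can be confirmed by direct enumeration. The Python sketch below is an illustration only and does not replace the argument requested in Exercise 2. The function names are hypothetical, and exact rational arithmetic is used so that the boundary case $V = 0.01$ is not affected by rounding.

```python
from fractions import Fraction

LIMIT = Fraction(1, 100)  # sampling stops once V <= 0.01


def posterior_variance(y, z):
    """Variance of the beta(y + 1, z + 1) posterior after y defectives, z nondefectives."""
    a, b = y + 1, z + 1
    return Fraction(a * b, (a + b) ** 2 * (a + b + 1))


def some_split_stops(n):
    """True if at least one split of n items into (y, z) gives V <= 0.01."""
    return any(posterior_variance(y, n - y) <= LIMIT for y in range(n + 1))


def every_split_stops(n):
    """True if every split of n items into (y, z) gives V <= 0.01."""
    return all(posterior_variance(y, n - y) <= LIMIT for y in range(n + 1))


earliest = next(n for n in range(100) if some_split_stops(n))
latest = next(n for n in range(100) if every_split_stops(n))
print(earliest, latest)
```

The enumeration reports seven as the smallest number of items after which stopping is possible and 22 as the number of items after which stopping is certain, in agreement with the statement above.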


Glove Use by Nurses. Friedland et al. (1992) studied 23 nurses in an inner-city
hospital before and after an educational program on the importance of wearing gloves.
They recorded whether or not the nurses wore gloves during procedures in which
they might come in contact with bodily fluids. Before the educational program the
nurses were observed during 51 procedures, and they wore gloves in only 13 of them.
Let $\theta$ be the probability that a nurse will wear gloves two months after the
educational program. We might be interested in how $\theta$ compares to 13/51, the observed
proportion before the program.

We shall consider two different prior distributions for $\theta$ in order to see how
sensitive the posterior distribution of $\theta$ is to the choice of prior distribution. The
first prior distribution will be uniform on the interval [0, 1], which is also the beta
distribution with parameters 1 and 1. The second prior distribution will be the beta
distribution with parameters 13 and 38. This second prior distribution has much
smaller variance than the first and has its mean at 13/51. Someone holding the second
prior distribution believes fairly strongly that the educational program will have no
noticeable effect.

Two months after the educational program, 56 procedures were observed with
the nurses wearing gloves in 50 of them. The posterior distribution of $\theta$, based
on the first prior, would then be the beta distribution with parameters $1 + 50 = 51$
and $1 + 6 = 7$. In particular, the posterior mean of $\theta$ is $51/(51 + 7) = 0.88$, and the
posterior probability that $\theta > 2 \times 13/51$ is essentially 1. Based on the second prior, the
posterior distribution would be the beta distribution with parameters $13 + 50 = 63$
and $38 + 6 = 44$. The posterior mean would be 0.59, and the posterior probability that
$\theta > 2 \times 13/51$ is 0.95. So, even to someone who was initially skeptical, the educational
program seems to have been quite effective. The probability is quite high that nurses
are at least twice as likely to wear gloves after the program as they were before.

Figure 7.3 shows the p.d.f.’s of both of the posterior distributions computed
above. The distributions are clearly very different. For example, the first posterior
gives probability greater than 0.99 that $\theta > 0.7$, while the second gives probability
less than 0.001 to $\theta > 0.7$. However, since we are only interested in the probability
that $\theta > 2 \times 13/51 = 0.5098$, we see that both posteriors agree that this probability is
quite large. ◁
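
The posterior hyperparameters, the posterior means, and the posterior probability that $\theta$ exceeds $2 \times 13/51$ can be recomputed with a few lines of code. The Python sketch below is an illustration only. It applies the beta update of Theorem 7.3.1 and approximates the tail probability by a midpoint-rule integral of the beta p.d.f.; the function names, the number of integration steps, and the use of numerical integration rather than a library routine are choices made for this example.

```python
import math


def beta_pdf(t, a, b):
    """p.d.f. of the beta(a, b) distribution at 0 < t < 1, computed on the log scale."""
    log_c = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_c + (a - 1) * math.log(t) + (b - 1) * math.log(1.0 - t))


def beta_tail(c, a, b, steps=20_000):
    """Midpoint-rule approximation to Pr(theta > c) under beta(a, b)."""
    h = (1.0 - c) / steps
    return h * sum(beta_pdf(c + (k + 0.5) * h, a, b) for k in range(steps))


gloves, no_gloves = 50, 6   # outcomes observed after the program
threshold = 2 * 13 / 51     # twice the proportion observed before the program

priors = {"uniform": (1, 1), "beta(13, 38)": (13, 38)}
results = {}
for name, (a0, b0) in priors.items():
    a, b = a0 + gloves, b0 + no_gloves   # posterior hyperparameters
    results[name] = (a, b, a / (a + b), beta_tail(threshold, a, b))
    print(name, results[name])
```

Each printed tuple contains the two posterior hyperparameters, the posterior mean, and the approximate posterior probability that $\theta$ exceeds the threshold, which can be set beside the values reported in the example.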


Figure 7.3 Posterior p.d.f.’s in Example 7.3.3. The curves are labeled by the prior that
led to the corresponding posterior. [Figure not reproduced: the two posterior p.d.f.’s are
plotted against $\theta$, one for the uniform prior and one, dashed, for the beta(13, 38) prior.]


Sampling from a Poisson Distribution 


Customer Arrivals. A store owner models customer arrivals as a Poisson process with
unknown rate $\theta$ per hour. She assigns $\theta$ a gamma prior distribution with parameters
3 and 2. Let $X$ be the number of customers that arrive in a specific one-hour period.
If $X = 3$ is observed, the store owner wants to update the distribution of $\theta$. ◁


When samples are taken from a Poisson distribution, the family of gamma 
distributions is a conjugate family of prior distributions. This relationship is shown 
in the next theorem. 


Suppose that $X_1, \ldots, X_n$ form a random sample from the Poisson distribution with
mean $\theta > 0$, and $\theta$ is unknown. Suppose also that the prior distribution of $\theta$ is the
gamma distribution with parameters $\alpha > 0$ and $\beta > 0$. Then the posterior distribution
of $\theta$, given that $X_i = x_i$ $(i = 1, \ldots, n)$, is the gamma distribution with parameters
$\alpha + \sum_{i=1}^{n} x_i$ and $\beta + n$.


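Proof Let $y = \sum_{i=1}^{n} x_i$. Then the likelihood function $f_n(\boldsymbol{x}|\theta)$ satisfies the relation

$$f_n(\boldsymbol{x}|\theta) \propto e^{-n\theta}\,\theta^{y}.$$

In this relation, a factor that involves $\boldsymbol{x}$ but does not depend on $\theta$ has been dropped
from the right side. Furthermore, the prior p.d.f. of $\theta$ has the form

$$\xi(\theta) \propto \theta^{\alpha - 1} e^{-\beta\theta} \quad \text{for } \theta > 0.$$

Since the posterior p.d.f. $\xi(\theta|\boldsymbol{x})$ is proportional to $f_n(\boldsymbol{x}|\theta)\,\xi(\theta)$, it follows that

$$\xi(\theta|\boldsymbol{x}) \propto \theta^{\alpha + y - 1} e^{-(\beta + n)\theta} \quad \text{for } \theta > 0.$$

The right side of this relation can be recognized as being, except for a constant factor,
the p.d.f. of the gamma distribution with parameters $\alpha + y$ and $\beta + n$. Therefore, the
posterior distribution of $\theta$ is as specified in the theorem. ∎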


In Theorem 7.3.2, the numbers $\alpha$ and $\beta$ are the prior hyperparameters, while
$\alpha + \sum_{i=1}^n x_i$ and $\beta + n$ are the posterior hyperparameters. Note that the statistic
$Y = \sum_{i=1}^n X_i$ is used to compute the posterior distribution of $\theta$, and hence it will be part
of any inference based on the posterior.


Example 7.3.5 Customer Arrivals. In Example 7.3.4, we can apply Theorem 7.3.2 with $n = 1$, $\alpha = 3$,
$\beta = 2$, and $x_1 = 3$. The posterior distribution of $\theta$ given $X = 3$ is the gamma
distribution with parameters 6 and 3. ◄
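The update in Theorem 7.3.2 is simple enough to sketch in a few lines of Python. The function name `gamma_poisson_update` is a hypothetical helper chosen for this illustration, not something defined in the text; it only adds the sum of the counts to the first hyperparameter and the sample size to the second.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Posterior hyperparameters for a gamma(alpha, beta) prior on a Poisson mean.

    Follows Theorem 7.3.2: alpha gains the sum of the counts and beta gains
    the number of observations. The helper name is illustrative only.
    """
    counts = list(counts)
    if alpha <= 0 or beta <= 0:
        raise ValueError("a proper gamma prior needs alpha > 0 and beta > 0")
    return alpha + sum(counts), beta + len(counts)


# Store owner's prior: gamma(3, 2); one observed hour with X = 3 arrivals.
post_alpha, post_beta = gamma_poisson_update(3, 2, [3])
print(post_alpha, post_beta)  # 6 3
```

The printed pair matches the posterior hyperparameters 6 and 3 obtained for the customer-arrivals data.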


Example 7.3.6 The Variance of the Posterior Gamma Distribution. Consider a Poisson distribution for
which the mean $\theta$ is unknown, and suppose that the prior p.d.f. of $\theta$ is as follows:

\[ \xi(\theta) = \begin{cases} 2e^{-2\theta} & \text{for } \theta > 0, \\ 0 & \text{for } \theta \le 0. \end{cases} \]

Suppose also that observations are to be taken at random from the given Poisson
distribution until the variance of the posterior distribution of $\theta$ has been reduced to
the value 0.01 or less. We shall determine the number of observations that must be
taken before the sampling process is stopped.

The given prior p.d.f. $\xi(\theta)$ is the p.d.f. of the gamma distribution with prior
hyperparameters $\alpha = 1$ and $\beta = 2$. Therefore, after we have obtained $n$ observed
values $x_1, \ldots, x_n$, the sum of which is $y = \sum_{i=1}^n x_i$, the posterior distribution of $\theta$ will
be the gamma distribution with posterior hyperparameters $y + 1$ and $n + 2$. It was
shown in Theorem 5.4.2 that the variance of the gamma distribution with parameters
$\alpha$ and $\beta$ is $\alpha/\beta^2$. Therefore, the variance $V$ of the posterior distribution of $\theta$ will be

\[ V = \frac{y + 1}{(n + 2)^2}. \]

Sampling is to stop as soon as the sequence of observed values $x_1, \ldots, x_n$ is such that
$V \le 0.01$. Unlike Example 7.3.2, there is no uniform bound on how large $n$ needs to
be because $y$ can be arbitrarily large no matter what $n$ is. Clearly, it takes at least
$n = 8$ observations before $V \le 0.01$, since even with $y = 0$ we need $(n + 2)^2 \ge 100$. ◄
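The stopping rule in this example can be traced numerically. The following is a minimal sketch, assuming the gamma prior with hyperparameters 1 and 2 from the example; the function name `first_stopping_time` and the all-zero data are made up for illustration.

```python
def first_stopping_time(counts, alpha=1.0, beta=2.0, threshold=0.01):
    """Return the first n at which the posterior variance is <= threshold.

    The posterior after n Poisson counts with sum y is gamma(alpha + y, beta + n),
    whose variance is (alpha + y) / (beta + n) ** 2. Returns None if the
    threshold is never reached within the supplied counts.
    """
    y = 0
    for n, x in enumerate(counts, start=1):
        y += x
        variance = (alpha + y) / (beta + n) ** 2
        if variance <= threshold:
            return n
    return None


# Best case for early stopping: every observed count is 0 (hypothetical data).
print(first_stopping_time([0] * 20))  # 8
```

Larger observed counts push the stopping time later, which is why no uniform bound on $n$ exists in the Poisson case.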


Sampling from a Normal Distribution 


Example 7.3.7 Automobile Emissions. Consider again the sampling of automobile emissions, in
particular oxides of nitrogen, described in Example 5.6.1 on page 302. Prior to observing
the data, suppose that an engineer believed that each emissions measurement had the
normal distribution with mean $\theta$ and standard deviation 0.5 but that $\theta$ was unknown.
The engineer's uncertainty about $\theta$ might be described by another normal distribution
with mean 2.0 and standard deviation 1.0. After seeing the data in Fig. 5.1, how
would this engineer describe her uncertainty about $\theta$? ◄


When samples are taken from a normal distribution for which the value of the
mean $\theta$ is unknown but the value of the variance $\sigma^2$ is known, the family of normal
distributions is itself a conjugate family of prior distributions, as is shown in the next
theorem.


Theorem 7.3.3 Suppose that $X_1, \ldots, X_n$ form a random sample from a normal distribution for which
the value of the mean $\theta$ is unknown and the value of the variance $\sigma^2 > 0$ is known.
Suppose also that the prior distribution of $\theta$ is the normal distribution with mean $\mu_0$
and variance $v_0^2$. Then the posterior distribution of $\theta$ given that $X_i = x_i$ ($i = 1, \ldots, n$)
is the normal distribution with mean $\mu_1$ and variance $v_1^2$, where

\[ \mu_1 = \frac{\sigma^2 \mu_0 + n v_0^2 \bar{x}_n}{\sigma^2 + n v_0^2} \tag{7.3.1} \]

and

\[ v_1^2 = \frac{\sigma^2 v_0^2}{\sigma^2 + n v_0^2}. \tag{7.3.2} \]


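Proof The likelihood function $f_n(\mathbf{x}|\theta)$ has the form

\[ f_n(\mathbf{x}|\theta) \propto \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \theta)^2\right]. \]

Here a constant factor has been dropped from the right side. The method of
completing the square (see Exercise 24 in Sec. 5.6) tells us that

\[ \sum_{i=1}^n (x_i - \theta)^2 = n(\theta - \bar{x}_n)^2 + \sum_{i=1}^n (x_i - \bar{x}_n)^2. \]

By omitting a factor that involves $x_1, \ldots, x_n$ but does not depend on $\theta$, we may rewrite
$f_n(\mathbf{x}|\theta)$ in the following form:

\[ f_n(\mathbf{x}|\theta) \propto \exp\left[-\frac{n}{2\sigma^2}(\theta - \bar{x}_n)^2\right]. \]

Since the prior p.d.f. $\xi(\theta)$ has the form

\[ \xi(\theta) \propto \exp\left[-\frac{1}{2v_0^2}(\theta - \mu_0)^2\right], \]

it follows that the posterior p.d.f. $\xi(\theta|\mathbf{x})$ satisfies the relation

\[ \xi(\theta|\mathbf{x}) \propto \exp\left\{-\frac{1}{2}\left[\frac{n}{\sigma^2}(\theta - \bar{x}_n)^2 + \frac{1}{v_0^2}(\theta - \mu_0)^2\right]\right\}. \]

If $\mu_1$ and $v_1^2$ are as specified in Eqs. (7.3.1) and (7.3.2), completing the square
again establishes the following identity:

\[ \frac{n}{\sigma^2}(\theta - \bar{x}_n)^2 + \frac{1}{v_0^2}(\theta - \mu_0)^2 = \frac{1}{v_1^2}(\theta - \mu_1)^2 + \frac{n}{\sigma^2 + n v_0^2}(\bar{x}_n - \mu_0)^2. \]

Since the final term on the right side of this equation does not involve $\theta$, it can be
absorbed in the proportionality factor, and we obtain the relation

\[ \xi(\theta|\mathbf{x}) \propto \exp\left[-\frac{1}{2v_1^2}(\theta - \mu_1)^2\right]. \]

The right side of this relation can be recognized as being, except for a constant factor,
the p.d.f. of the normal distribution with mean $\mu_1$ and variance $v_1^2$. Therefore, the
posterior distribution of $\theta$ is as specified in the theorem. ∎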


In Theorem 7.3.3, the numbers $\mu_0$ and $v_0^2$ are the prior hyperparameters, while $\mu_1$
and $v_1^2$ are the posterior hyperparameters. Notice that the statistic $\bar{X}_n$ is used in the
construction of the posterior distribution, and hence will play a role in any inference
based on the posterior.


Example 7.3.8 Automobile Emissions. We can apply Theorem 7.3.3 to answer the question at the end
of Example 7.3.7. In the notation of the theorem, we have $n = 46$, $\sigma^2 = 0.5^2 = 0.25$,
$\mu_0 = 2$, and $v_0^2 = 1.0$. The average of the 46 measurements is $\bar{x}_n = 1.329$. The posterior
distribution of $\theta$ is then the normal distribution with mean and variance given by

\[ \mu_1 = \frac{0.25 \times 2 + 46 \times 1 \times 1.329}{0.25 + 46 \times 1} = 1.333, \]

\[ v_1^2 = \frac{0.25 \times 1}{0.25 + 46 \times 1} = 0.0054. \]

◄
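Equations (7.3.1) and (7.3.2) translate directly into code. The sketch below uses a hypothetical helper name, `normal_known_variance_update`, and reproduces the automobile-emissions numbers quoted above.

```python
def normal_known_variance_update(mu0, v0_sq, sigma_sq, n, xbar):
    """Posterior mean and variance from Eqs. (7.3.1) and (7.3.2).

    mu0, v0_sq : prior mean and prior variance of theta
    sigma_sq   : known variance of each observation
    n, xbar    : sample size and sample mean
    """
    denom = sigma_sq + n * v0_sq
    mu1 = (sigma_sq * mu0 + n * v0_sq * xbar) / denom
    v1_sq = sigma_sq * v0_sq / denom
    return mu1, v1_sq


# Automobile emissions: n = 46, sigma^2 = 0.25, mu0 = 2, v0^2 = 1, xbar = 1.329.
mu1, v1_sq = normal_known_variance_update(2.0, 1.0, 0.25, 46, 1.329)
print(round(mu1, 3), round(v1_sq, 4))  # 1.333 0.0054
```

With 46 observations and a prior variance four times the observation variance, the posterior mean sits almost exactly at the sample mean.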


The mean $\mu_1$ of the posterior distribution of $\theta$, as given in Eq. (7.3.1), can be
rewritten as follows:

\[ \mu_1 = \frac{\sigma^2}{\sigma^2 + n v_0^2}\,\mu_0 + \frac{n v_0^2}{\sigma^2 + n v_0^2}\,\bar{x}_n. \tag{7.3.3} \]


It can be seen from Eq. (7.3.3) that $\mu_1$ is a weighted average of the mean $\mu_0$ of the prior
distribution and the sample mean $\bar{x}_n$. Furthermore, it can be seen that the relative
weight given to $\bar{x}_n$ satisfies the following three properties: (1) For fixed values of $v_0^2$
and $\sigma^2$, the larger the sample size $n$, the greater will be the relative weight that is given
to $\bar{x}_n$. (2) For fixed values of $v_0^2$ and $n$, the larger the variance $\sigma^2$ of each observation
in the sample, the smaller will be the relative weight that is given to $\bar{x}_n$. (3) For fixed
values of $\sigma^2$ and $n$, the larger the variance $v_0^2$ of the prior distribution, the larger will
be the relative weight that is given to $\bar{x}_n$.

Moreover, it can be seen from Eq. (7.3.2) that the variance $v_1^2$ of the posterior
distribution of $\theta$ depends on the number $n$ of observations that have been taken
but does not depend on the magnitudes of the observed values. Suppose, therefore,
that a random sample of $n$ observations is to be taken from a normal distribution
for which the value of the mean $\theta$ is unknown, the value of the variance is known,
and the prior distribution of $\theta$ is a specified normal distribution. Then, before any
observations have been taken, we can use Eq. (7.3.2) to calculate the actual value
of the variance $v_1^2$ of the posterior distribution. However, the value of the mean $\mu_1$
of the posterior distribution will depend on the observed values that are obtained
in the sample. The fact that the variance of the posterior distribution depends only
on the number of observations is due to the assumption that the variance $\sigma^2$ of the
individual observations is known. In Sec. 8.6, we shall relax this assumption.


Example 7.3.9 The Variance of the Posterior Normal Distribution. Suppose that observations are to
be taken at random from the normal distribution with mean $\theta$ and variance 1, and
that $\theta$ is unknown. Assume that the prior distribution of $\theta$ is a normal distribution
with variance 4. Also, observations are to be taken until the variance of the posterior
distribution of $\theta$ has been reduced to the value 0.01 or less. We shall determine the
number of observations that must be taken before the sampling process is stopped.

It follows from Eq. (7.3.2) that after $n$ observations have been taken, the variance
$v_1^2$ of the posterior distribution of $\theta$ will be

\[ v_1^2 = \frac{4}{4n + 1}. \]

Therefore, the relation $v_1^2 \le 0.01$ will be satisfied if and only if $n \ge 99.75$. Hence, the
relation $v_1^2 \le 0.01$ will be satisfied after 100 observations have been taken and not
before then. ◄
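Because the posterior variance in Eq. (7.3.2) does not depend on the data, the required sample size can be computed before sampling begins. The following sketch solves $\sigma^2 v_0^2/(\sigma^2 + n v_0^2) \le$ target for the smallest integer $n$; the function names are hypothetical, and the final loop is only a guard against floating-point rounding.

```python
import math


def posterior_variance(sigma_sq, v0_sq, n):
    """Posterior variance from Eq. (7.3.2)."""
    return sigma_sq * v0_sq / (sigma_sq + n * v0_sq)


def min_sample_size(sigma_sq, v0_sq, target):
    """Smallest n with posterior variance <= target.

    Rearranging Eq. (7.3.2) gives n >= sigma_sq * (v0_sq - target) / (target * v0_sq).
    """
    if v0_sq <= target:
        return 0  # the prior is already precise enough
    bound = sigma_sq * (v0_sq - target) / (target * v0_sq)
    n = max(0, math.floor(bound))
    while posterior_variance(sigma_sq, v0_sq, n) > target:
        n += 1
    return n


# Observation variance 1, prior variance 4, target posterior variance 0.01.
print(min_sample_size(1.0, 4.0, 0.01))  # 100
```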


Figure 7.4 Histogram of percentage differences between observed and advertised calories
in Example 7.3.10. (Horizontal axis: laboratory calories minus label calories; vertical
axis: number of foods.)


Example 7.3.10 Calorie Counts on Food Labels. Allison, Heshka, Sepulveda, and Heymsfield (1993)
sampled 20 nationally prepared foods and compared the stated calorie contents per
gram from the labels to calorie contents determined in the laboratory. Figure 7.4 is
a histogram of the percentage differences between the observed laboratory calorie
measurements and the advertised calorie contents on the labels of the foods. Suppose
that we model the conditional distribution of the differences given $\theta$ as the normal
distribution with mean $\theta$ and variance 100. (In this section, we assume that the
variance is known. In Sec. 8.6, we will be able to deal with the case in which the
mean and the variance are treated as random variables with a joint distribution.) We
will use a prior distribution for $\theta$ that is the normal distribution with mean 0 and a
variance of 60. The data $\mathbf{X}$ comprise the collection of 20 differences in Fig. 7.4, whose
average is 0.125. The posterior distribution of $\theta$ would then be the normal distribution
with mean

\[ \mu_1 = \frac{100 \times 0 + 20 \times 60 \times 0.125}{100 + 20 \times 60} = 0.1154, \]

and variance

\[ v_1^2 = \frac{100 \times 60}{100 + 20 \times 60} = 4.62. \]

For example, we might be interested in whether or not the packagers are
systematically understating the calories in their food by at least 1 percent. This would
correspond to $\theta > 1$. Using Theorem 5.6.6, we can find

\[ \Pr(\theta > 1 | \mathbf{x}) = 1 - \Phi\left(\frac{1 - 0.1154}{\sqrt{4.62}}\right) = 1 - \Phi(0.4116) = 0.3403. \]

There is a nonnegligible, but not overwhelming, chance that the packagers are
shaving a percent or more off of their labels. ◄
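The posterior probability in the calorie example can be checked with the Python standard library alone. This is a sketch under the example's stated assumptions (known variance 100, normal prior with mean 0 and variance 60); the helper `normal_cdf` is a hypothetical name that evaluates the standard normal c.d.f. through `math.erf`.

```python
import math


def normal_cdf(z):
    """Standard normal c.d.f., written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Quantities from the calorie-count example.
sigma_sq, mu0, v0_sq, n, xbar = 100.0, 0.0, 60.0, 20, 0.125

denom = sigma_sq + n * v0_sq
mu1 = (sigma_sq * mu0 + n * v0_sq * xbar) / denom   # Eq. (7.3.1)
v1_sq = sigma_sq * v0_sq / denom                    # Eq. (7.3.2)

# Posterior probability that theta exceeds 1.
prob = 1.0 - normal_cdf((1.0 - mu1) / math.sqrt(v1_sq))
print(round(mu1, 4), round(v1_sq, 2), round(prob, 2))  # 0.1154 4.62 0.34
```

The result agrees with the value 0.3403 reported in the example up to the rounding used there.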


Sampling from an Exponential Distribution 


Example 7.3.11 Lifetimes of Electronic Components. In Example 7.2.1, suppose that we observe the
lifetimes of three components, $X_1 = 3$, $X_2 = 1.5$, and $X_3 = 2.1$. These were modeled
as i.i.d. exponential random variables given $\theta$. Our prior distribution for $\theta$ was the
gamma distribution with parameters 1 and 2. What is the posterior distribution of $\theta$
given these observed lifetimes? ◄


When sampling from an exponential distribution for which the value of the
parameter $\theta$ is unknown, the family of gamma distributions serves as a conjugate
family of prior distributions, as shown in the next theorem.


Theorem 7.3.4 Suppose that $X_1, \ldots, X_n$ form a random sample from the exponential distribution
with parameter $\theta > 0$ that is unknown. Suppose also that the prior distribution of
$\theta$ is the gamma distribution with parameters $\alpha > 0$ and $\beta > 0$. Then the posterior
distribution of $\theta$ given that $X_i = x_i$ ($i = 1, \ldots, n$) is the gamma distribution with
parameters $\alpha + n$ and $\beta + \sum_{i=1}^n x_i$.


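Proof Again, let $y = \sum_{i=1}^n x_i$. Then the likelihood function $f_n(\mathbf{x}|\theta)$ is

\[ f_n(\mathbf{x}|\theta) = \theta^n e^{-\theta y}. \]

Also, the prior p.d.f. $\xi(\theta)$ has the form

\[ \xi(\theta) \propto \theta^{\alpha-1}e^{-\beta\theta} \quad \text{for } \theta > 0. \]

It follows, therefore, that the posterior p.d.f. $\xi(\theta|\mathbf{x})$ has the form

\[ \xi(\theta|\mathbf{x}) \propto \theta^{\alpha+n-1}e^{-(\beta+y)\theta} \quad \text{for } \theta > 0. \]

The right side of this relation can be recognized as being, except for a constant factor,
the p.d.f. of the gamma distribution with parameters $\alpha + n$ and $\beta + y$. Therefore, the
posterior distribution of $\theta$ is as specified in the theorem. ∎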


The posterior distribution of $\theta$ in Theorem 7.3.4 depends on the observed value
of the statistic $Y = \sum_{i=1}^n X_i$; hence, every inference about $\theta$ based on the posterior
distribution will depend on the observed value of $Y$.


Example 7.3.12 Lifetimes of Electronic Components. In Example 7.3.11, we can apply Theorem 7.3.4
to find the posterior distribution. In the notation of the theorem and its proof, we
have $n = 3$, $\alpha = 1$, $\beta = 2$, and

\[ y = \sum_{i=1}^n x_i = 3 + 1.5 + 2.1 = 6.6. \]

The posterior distribution of $\theta$ is then the gamma distribution with parameters
$1 + 3 = 4$ and $2 + 6.6 = 8.6$. ◄
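Theorem 7.3.4 swaps the roles that the sample size and the data sum play in the Poisson case: here the first hyperparameter gains $n$ and the second gains the sum of the observations. A minimal sketch, with `gamma_exponential_update` as a hypothetical helper name:

```python
def gamma_exponential_update(alpha, beta, lifetimes):
    """Posterior hyperparameters for a gamma(alpha, beta) prior on an exponential rate.

    Follows Theorem 7.3.4: alpha gains the number of observations and beta
    gains their sum. The helper name is illustrative only.
    """
    lifetimes = list(lifetimes)
    return alpha + len(lifetimes), beta + sum(lifetimes)


# Prior gamma(1, 2); observed lifetimes 3, 1.5, and 2.1.
post_alpha, post_beta = gamma_exponential_update(1, 2, [3, 1.5, 2.1])
print(post_alpha, round(post_beta, 1))  # 4 8.6
```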


The reader should note that Theorem 7.3.4 would have greatly shortened the 
derivation of the posterior distribution in Example 7.2.6. 


Improper Prior Distributions 


In Sec. 7.2, we mentioned improper priors as expedients that try to capture the 
idea that there is much more information in the data than is captured in our prior 
distribution. Each of the conjugate families that we have seen in this section has an 
improper prior as a limiting case. 


Figure 7.5 The posterior probabilities from Examples 2.3.7 (X) and 2.3.8 (bars)
together with the posterior p.d.f. from Example 7.3.13 (solid line).


Example 7.3.13 A Clinical Trial. What we illustrate here will apply to all examples in which the data
comprise a conditionally i.i.d. sample (given $\theta$) from the Bernoulli distribution with
parameter $\theta$. Consider the subjects in the imipramine group in Example 2.1.4. The
proportion of successes among all patients who might get imipramine had been called
$P$ in earlier examples, but let us call it $\theta$ this time in keeping with the general notation
of this chapter. Suppose that $\theta$ has the beta distribution with parameters $\alpha$ and $\beta$,
a general conjugate prior. There are $n = 40$ patients in the imipramine group, and
22 of them are successes. The posterior distribution of $\theta$ is the beta distribution with
parameters $\alpha + 22$ and $\beta + 18$, as we saw in Theorem 7.3.1. The mean of the posterior
distribution is $(\alpha + 22)/(\alpha + \beta + 40)$. If $\alpha$ and $\beta$ are small, then the posterior mean
is close to 22/40, which is the observed proportion of successes. Indeed, if $\alpha = \beta = 0$,
which does not correspond to a real beta distribution, then the posterior mean is
exactly 22/40. However, we can look at what happens as $\alpha$ and $\beta$ get close to 0.
The beta p.d.f. (ignoring the constant factor) is $\theta^{\alpha-1}(1 - \theta)^{\beta-1}$. We can set $\alpha = \beta = 0$
and pretend that $\xi(\theta) \propto \theta^{-1}(1 - \theta)^{-1}$ is the prior p.d.f. of $\theta$. The likelihood function
is $f_{40}(\mathbf{x}|\theta) = \binom{40}{22}\theta^{22}(1 - \theta)^{18}$. We can ignore the constant factor $\binom{40}{22}$ and obtain the
product

\[ \xi(\theta|\mathbf{x}) \propto \theta^{21}(1 - \theta)^{17}, \quad \text{for } 0 < \theta < 1. \]

This is easily recognized as being the same as the p.d.f. of the beta distribution with
parameters 22 and 18 except for a constant factor. So, if we use the improper "beta
distribution" prior with prior hyperparameters 0 and 0, we get the beta posterior
distribution for $\theta$ with posterior hyperparameters 22 and 18. Notice that Theorem 7.3.1
yields the correct posterior distribution even in this improper prior case. Figure 7.5
adds the p.d.f. of the posterior beta distribution calculated here to Fig. 2.4, which
depicted the posterior probabilities for two different discrete prior distributions. All
three posteriors are pretty close. ◄
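The limiting behavior described in this example can be seen numerically by shrinking both prior hyperparameters toward 0. The sketch below uses a hypothetical helper, `beta_posterior_mean`, and the clinical-trial counts of 22 successes and 18 failures; the particular hyperparameter values in the loop are arbitrary choices for illustration.

```python
def beta_posterior_mean(a, b, successes, failures):
    """Mean of the beta(a + successes, b + failures) posterior distribution."""
    return (a + successes) / (a + b + successes + failures)


# As both prior hyperparameters shrink toward 0, the posterior mean
# approaches the observed proportion 22/40.
for hyper in (1.0, 0.1, 0.01, 0.0):
    print(hyper, beta_posterior_mean(hyper, hyper, 22, 18))
```

With both hyperparameters set to 0, the function returns the sample proportion itself, which is the improper-prior case.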


Definition 7.3.2 Improper Prior. Let $\xi$ be a nonnegative function whose domain includes the parameter
space of a statistical model. Suppose that $\int \xi(\theta)\,d\theta = \infty$. If we pretend as if $\xi(\theta)$ is
the prior p.d.f. of $\theta$, then we are using an improper prior for $\theta$.


Definition 7.3.2 is not of much use in determining an improper prior to use in a
particular application. There are many methods for choosing an improper prior, and
the hope is that they all lead to similar posterior distributions so that it does not much
matter which of them one chooses. The most straightforward method for choosing
an improper prior is to start with the family of conjugate prior distributions, if there
is such a family. In most cases, if the parameterization of the conjugate family (prior
hyperparameters) is chosen carefully, the posterior hyperparameters will each equal
the corresponding prior hyperparameter plus a statistic. One would then replace each
of those prior hyperparameters by 0 in the formula for the prior p.d.f. This generally
results in a function that satisfies Definition 7.3.2. In Example 7.3.13, each of the
posterior hyperparameters was equal to the corresponding prior hyperparameter
plus some statistic. In that example, we replaced both prior hyperparameters by
0 to obtain the improper prior. Here are some more examples. The method just
described needs to be modified if one chooses an "inconvenient" parameterization
of the conjugate prior, as in Example 7.3.15 below.


Example 7.3.14 Prussian Army Deaths. Bortkiewicz (1898) counted the numbers of Prussian soldiers
killed by horsekick (a more serious problem in the nineteenth century than it is
today) in 14 army units for each of 20 years, a total of 280 counts. The 280 counts have
the following values: 144 counts are 0, 91 counts are 1, 32 counts are 2, 11 counts are
3, and 2 counts are 4. No unit suffered more than four deaths by horsekick during
any one year. (These data were reported and analyzed by Winsor, 1947.) Suppose
that we were going to model the 280 counts as a random sample of Poisson random
variables $X_1, \ldots, X_{280}$ with mean $\theta$ conditional on the parameter $\theta$. A conjugate
prior would be a member of the gamma family with prior hyperparameters $\alpha$ and
$\beta$. Theorem 7.3.2 says that the posterior distribution of $\theta$ would be the gamma
distribution with posterior hyperparameters $\alpha + 196$ and $\beta + 280$, since the sum of the
280 counts equals 196. Unless either $\alpha$ or $\beta$ is very large, the posterior gamma
distribution is nearly the same as the gamma distribution with posterior hyperparameters
196 and 280. This posterior distribution would seem to be the result of using a
conjugate prior with prior hyperparameters 0 and 0. Ignoring the constant factor, the
p.d.f. of the gamma distribution with parameters $\alpha$ and $\beta$ is $\theta^{\alpha-1}e^{-\beta\theta}$ for $\theta > 0$. If we
let $\alpha = 0$ and $\beta = 0$ in this formula, we get the improper prior "p.d.f." $\xi(\theta) = \theta^{-1}$ for
$\theta > 0$. Pretending as if this really were a prior p.d.f. and applying Bayes' theorem for
random variables (Theorem 3.6.4) would yield

\[ \xi(\theta|\mathbf{x}) \propto \theta^{195}e^{-280\theta}, \quad \text{for } \theta > 0. \]

This is easily recognized as being the p.d.f. of the gamma distribution with parameters
196 and 280, except for a constant factor. The result in this example applies to all
cases in which we model data with Poisson distributions. The improper "gamma
distribution" with prior hyperparameters 0 and 0 can be used in Theorem 7.3.2, and
the conclusion will still hold. ◄
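The sufficient statistics for the horsekick data follow directly from the frequency table given in the example. The sketch below tallies them and forms the posterior hyperparameters under the improper prior with both hyperparameters equal to 0; the variable names are illustrative, and the posterior mean uses the fact that a gamma distribution with parameters $\alpha$ and $\beta$ has mean $\alpha/\beta$.

```python
# Frequency table from the example: deaths per unit-year -> number of counts.
freq = {0: 144, 1: 91, 2: 32, 3: 11, 4: 2}

n = sum(freq.values())                                  # number of observations
y = sum(deaths * count for deaths, count in freq.items())  # total deaths

# Improper "gamma(0, 0)" prior plugged into Theorem 7.3.2.
post_alpha, post_beta = 0 + y, 0 + n
post_mean = post_alpha / post_beta

print(n, y, post_mean)  # 280 196 0.7
```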


Example 7.3.15 Failure Times of Ball Bearings. Suppose that we model the 23 logarithms of failure times
of ball bearings from Example 5.6.9 as normal random variables $X_1, \ldots, X_{23}$ with
mean $\theta$ and variance 0.25. A conjugate prior for $\theta$ would be the normal distribution
with mean $\mu_0$ and variance $v_0^2$ for some $\mu_0$ and $v_0^2$. The average of the 23 log-failure
times is 4.15, so the posterior distribution of $\theta$ would be the normal distribution with
mean $\mu_1 = (0.25\mu_0 + 23 \times 4.15 v_0^2)/(0.25 + 23 v_0^2)$ and variance $v_1^2 = (0.25 v_0^2)/(0.25 + 23 v_0^2)$.
If we let $v_0^2 \to \infty$ in the formulas for $\mu_1$ and $v_1^2$, we get $\mu_1 \to 4.15$ and
$v_1^2 \to 0.25/23$. Having infinite variance for the prior distribution of $\theta$ is like saying that $\theta$
is equally likely to be anywhere on the real number line. This same thing happens
in every example in which we model data $X_1, \ldots, X_n$ as a random sample from the
normal distribution with mean $\theta$ and known variance $\sigma^2$ conditional on $\theta$. If we use
an improper "normal distribution" prior with variance $\infty$ (the prior mean does not
matter), the calculation in Theorem 7.3.3 would yield a posterior distribution that is
the normal distribution with mean $\bar{x}_n$ and variance $\sigma^2/n$. The improper prior "p.d.f."
in this case is $\xi(\theta)$ equal to a constant.

This example would be an application of the method described after
Definition 7.3.2 if we had described the conjugate prior distribution in terms of the following
"more convenient" hyperparameters: 1 over the variance, $u_0 = 1/v_0^2$, and the mean
over the variance, $t_0 = \mu_0/v_0^2$. In terms of these hyperparameters, the posterior
distribution has 1 over its variance equal to $u_1 = u_0 + n/0.25$ and mean over variance
equal to $t_1 = \mu_1/v_1^2 = t_0 + 23 \times 4.15/0.25$. Each of $u_1$ and $t_1$ has the form of the
corresponding prior hyperparameter plus a statistic. The improper prior with $u_0 = t_0 = 0$
also has $\xi(\theta)$ equal to a constant. ◄
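The "more convenient" parameterization makes the update additive, which is easy to express in code. In this sketch, `precision_update` is a hypothetical helper: `u` denotes 1 over the variance and `t` denotes the mean over the variance, as in the example, and setting both prior values to 0 corresponds to the improper constant prior.

```python
def precision_update(u0, t0, sigma_sq, n, xbar):
    """Additive update of (1/variance, mean/variance) for a normal mean.

    u1 = u0 + n / sigma_sq and t1 = t0 + n * xbar / sigma_sq, so each
    posterior hyperparameter is the prior one plus a statistic.
    """
    u1 = u0 + n / sigma_sq
    t1 = t0 + n * xbar / sigma_sq
    return u1, t1


# Improper prior (u0 = t0 = 0) with the ball-bearing data:
# 23 observations, known variance 0.25, sample mean 4.15.
u1, t1 = precision_update(0.0, 0.0, 0.25, 23, 4.15)
post_mean = t1 / u1      # recovers the sample mean
post_var = 1.0 / u1      # recovers sigma^2 / n
print(round(post_mean, 2), round(post_var, 5))  # 4.15 0.01087
```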


There are improper priors for other sampling models, also. The reader can verify
(in Exercise 21) that the "gamma distribution" with parameters 0 and 0 leads to
results similar to those in Example 7.3.14 when the data are a random sample from
an exponential distribution. Exercises 23 and 24 introduce a general collection of
p.d.f.'s $f(x|\theta)$ for which it is easy to construct improper priors.

Improper priors were introduced for cases in which the observed data contain
much more information than is represented by our prior distribution. Implicitly, we
are assuming that the data are rather informative. When the data do not contain
much information, improper priors may be highly inappropriate.


Example 7.3.16 Very Rare Events. In Example 5.4.7, we discussed a drinking water contaminant
known as cryptosporidium that generally occurs in very low concentrations. Suppose
that a water authority models the oocysts of cryptosporidium in the water supply as
a Poisson process with rate of $\theta$ oocysts per liter. They decide to sample 25 liters of
water to learn about $\theta$. Suppose that they use the improper gamma prior with "p.d.f."
$\theta^{-1}$. (This is the same improper prior used in Example 7.3.14.) If the 25-liter sample
contains no oocysts, the water authority would be led to a posterior distribution
for $\theta$ that was the gamma distribution with parameters 0 and 25, which is not a real
distribution. No matter how many liters are sampled, the posterior distribution will
not be a real distribution until at least one oocyst is observed. When sampling for rare
events, one might be forced to quantify prior information in the form of a proper prior
distribution in order to be able to make inferences based on the posterior distribution. ◄
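The failure described in this example is easy to detect mechanically: a gamma distribution requires both parameters to be positive, so the posterior under the improper prior $\theta^{-1}$ is a real distribution only once the total count is positive. The helper below, `improper_gamma_posterior`, is a hypothetical name used only for this sketch.

```python
def improper_gamma_posterior(total_count, n_units):
    """Posterior 'gamma' hyperparameters under the improper prior 1/theta.

    Plugs prior hyperparameters 0 and 0 into Theorem 7.3.2 and reports whether
    the result is a proper gamma distribution (both parameters positive).
    """
    alpha_post = 0 + total_count
    beta_post = 0 + n_units
    is_proper = alpha_post > 0 and beta_post > 0
    return alpha_post, beta_post, is_proper


print(improper_gamma_posterior(0, 25))  # (0, 25, False)
print(improper_gamma_posterior(1, 25))  # (1, 25, True)
```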


Summary 


For each of several different statistical models for data given the parameter, we 
found a conjugate family of distributions for the parameter. These families have the 
property that if the prior distribution is chosen from the family, then the posterior 
distribution is a member of the family. For data with distributions related to the 
Bernoulli, such as binomial, geometric, and negative binomial, the conjugate family 
for the success probability parameter is the family of beta distributions. For data with 
distributions related to the Poisson process, such as Poisson, gamma (with known first 
parameter), and exponential, the conjugate family for the rate parameter is the family 
of gamma distributions. For data having a normal distribution with known variance, 
the conjugate family for the mean is the normal family. We also described the use 
of improper priors. Improper priors are not true probability distributions, but if we 
pretend that they are, we will compute posterior distributions that approximate the 
posteriors that we would have obtained using proper conjugate priors with extreme 
values of the prior hyperparameters. 


Exercises


1. Consider again the situation described in Example 7.3.10. Once again, suppose
that the prior distribution of $\theta$ is a normal distribution with mean 0, but this time let
the prior variance be $v^2 > 0$. If the posterior mean of $\theta$ is 0.12, what value of $v^2$
was used?


2. Show that in Example 7.3.2 it must be true that $V \le 0.01$ after 22 items have
been selected. Also show that $V > 0.01$ until at least seven items have been selected.


3. Suppose that the proportion $\theta$ of defective items in a large shipment is unknown
and that the prior distribution of $\theta$ is the beta distribution with parameters 2 and 200. If
100 items are selected at random from the shipment and
if three of these items are found to be defective, what is
the posterior distribution of $\theta$?


4. Consider again the conditions of Exercise 3. Suppose
that after a certain statistician has observed that there
were three defective items among the 100 items selected
at random, the posterior distribution that she assigns to $\theta$
is a beta distribution for which the mean is 2/51 and the
variance is $98/[(51)^2(103)]$. What prior distribution had
the statistician assigned to $\theta$?


5. Suppose that the number of defects in a 1200-foot roll
of magnetic recording tape has a Poisson distribution for
which the value of the mean $\theta$ is unknown and that the
prior distribution of $\theta$ is the gamma distribution with
parameters $\alpha = 3$ and $\beta = 1$. When five rolls of this tape are
selected at random and inspected, the numbers of defects
found on the rolls are 2, 2, 6, 0, and 3. Determine the
posterior distribution of $\theta$.


6. Let $\theta$ denote the average number of defects per 100
feet of a certain type of magnetic tape. Suppose that the
value of $\theta$ is unknown and that the prior distribution of
$\theta$ is the gamma distribution with parameters $\alpha = 2$ and
$\beta = 10$. When a 1200-foot roll of this tape is inspected,
exactly four defects are found. Determine the posterior
distribution of $\theta$.


7. Suppose that the heights of the individuals in a certain
population have a normal distribution for which the value
of the mean $\theta$ is unknown and the standard deviation is
2 inches. Suppose also that the prior distribution of $\theta$ is a
normal distribution for which the mean is 68 inches and
the standard deviation is 1 inch. If 10 people are selected
at random from the population, and their average height is
found to be 69.5 inches, what is the posterior distribution
of $\theta$?


8. Consider again the problem described in Exercise 7.


a. Which interval 1 inch long had the highest prior
probability of containing the value of $\theta$?


b. Which interval 1 inch long has the highest posterior
probability of containing the value of $\theta$?


c. Find the values of the probabilities in parts (a) and
(b).

9. Suppose that a random sample of 20 observations is
taken from a normal distribution for which the value of the
mean $\theta$ is unknown and the variance is 1. After the sample
values have been observed, it is found that $\bar{X}_n = 10$, and
that the posterior distribution of $\theta$ is a normal distribution
for which the mean is 8 and the variance is 1/25. What was
the prior distribution of $\theta$?


10. Suppose that a random sample is to be taken from
a normal distribution for which the value of the mean
$\theta$ is unknown and the standard deviation is 2, and the
prior distribution of $\theta$ is a normal distribution for which
the standard deviation is 1. What is the smallest number
of observations that must be included in the sample in
order to reduce the standard deviation of the posterior
distribution of $\theta$ to the value 0.1?


11. Suppose that a random sample of 100 observations is
to be taken from a normal distribution for which the value
of the mean $\theta$ is unknown and the standard deviation is
2, and the prior distribution of $\theta$ is a normal distribution.
Show that no matter how large the standard deviation
of the prior distribution is, the standard deviation of the
posterior distribution will be less than 1/5.


12. Suppose that the time in minutes required to serve a
customer at a certain facility has an exponential distribution
for which the value of the parameter $\theta$ is unknown
and that the prior distribution of $\theta$ is a gamma distribution
for which the mean is 0.2 and the standard deviation
is 1. If the average time required to serve a random sample
of 20 customers is observed to be 3.8 minutes, what is
the posterior distribution of $\theta$?


13. For a distribution with mean $\mu \ne 0$ and standard deviation
$\sigma > 0$, the coefficient of variation of the distribution
is defined as $\sigma/|\mu|$. Consider again the problem described
in Exercise 12, and suppose that the coefficient of variation
of the prior gamma distribution of $\theta$ is 2. What is the
smallest number of customers that must be observed in order
to reduce the coefficient of variation of the posterior
distribution to 0.1?


14. Show that the family of beta distributions is a con- 
jugate family of prior distributions for samples from a 
negative binomial distribution with a known value of the 
parameter r and an unknown value of the parameter p 
(0 < p <1). 


15. Let $\xi(\theta)$ be a p.d.f. that is defined as follows for constants
$\alpha > 0$ and $\beta > 0$:

\[ \xi(\theta) = \begin{cases} \dfrac{\beta^\alpha}{\Gamma(\alpha)}\,\theta^{-(\alpha+1)}e^{-\beta/\theta} & \text{for } \theta > 0, \\ 0 & \text{for } \theta \le 0. \end{cases} \]

A distribution with this p.d.f. is called an inverse gamma
distribution.


a. Verify that $\xi(\theta)$ is actually a p.d.f. by verifying that
$\int_0^\infty \xi(\theta)\,d\theta = 1$.

b. Consider the family of probability distributions that
can be represented by a p.d.f. $\xi(\theta)$ having the given
form for all possible pairs of constants $\alpha > 0$ and $\beta > 0$.
Show that this family is a conjugate family of prior
distributions for samples from a normal distribution
with a known value of the mean $\mu$ and an unknown
value of the variance $\theta$.


16. Suppose that in Exercise 15 the parameter is taken as
the standard deviation of the normal distribution, rather
than the variance. Determine a conjugate family of prior
distributions for samples from a normal distribution with
a known value of the mean $\mu$ and an unknown value of
the standard deviation $\sigma$.


17. Suppose that the number of minutes a person must
wait for a bus each morning has the uniform distribution
on the interval $[0, \theta]$, where the value of the endpoint $\theta$
is unknown. Suppose also that the prior p.d.f. of $\theta$ is as
follows:

\[ \xi(\theta) = \begin{cases} \dfrac{192}{\theta^4} & \text{for } \theta \ge 4, \\ 0 & \text{otherwise.} \end{cases} \]

If the observed waiting times on three successive mornings
are 5, 3, and 8 minutes, what is the posterior p.d.f. of $\theta$?


18. The Pareto distribution with parameters $x_0$ and $\alpha$
($x_0 > 0$ and $\alpha > 0$) is defined in Exercise 16 of Sec. 5.7.
Show that the family of Pareto distributions is a conjugate
family of prior distributions for samples from a uniform
distribution on the interval $[0, \theta]$, where the value of the
endpoint $\theta$ is unknown.


19. Suppose that $X_1, \ldots, X_n$ form a random sample from
a distribution for which the p.d.f. $f(x|\theta)$ is as follows:

\[ f(x|\theta) = \begin{cases} \theta x^{\theta-1} & \text{for } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases} \]

Suppose also that the value of the parameter $\theta$ is unknown
($\theta > 0$), and the prior distribution of $\theta$ is the gamma
distribution with parameters $\alpha$ and $\beta$ ($\alpha > 0$ and $\beta > 0$).
Determine the mean and the variance of the posterior
distribution of $\theta$.


20. Suppose that we model the lifetimes (in months) of
electronic components as independent exponential random
variables with unknown parameter $\beta$. We model $\beta$
as having the gamma distribution with parameters $a$ and
$b$. We believe that the mean lifetime is four months before
we see any data. If we were to observe 10 components with
an average observed lifetime of six months, we would then
claim that the mean lifetime is five months. Determine $a$
and $b$. Hint: Use Exercise 21 in Sec. 5.7.


21. Suppose that $X_1, \ldots, X_n$ form a random sample from
the exponential distribution with parameter $\theta$. Let the
prior distribution of $\theta$ be improper with "p.d.f." $1/\theta$ for
$\theta > 0$. Find the posterior distribution of $\theta$ and show that
the posterior mean of $\theta$ is $1/\bar{x}_n$.


22. Consider the data in Example 7.3.10. This time, suppose
that we use the improper prior "p.d.f." $\xi(\theta) = 1$ (for
all $\theta$). Find the posterior distribution of $\theta$ and the posterior
probability that $\theta > 1$.


23. Consider a distribution for which the p.d.f. or the p.f.
is $f(x|\theta)$, where $\theta$ belongs to some parameter space $\Omega$. It
is said that the family of distributions obtained by letting
$\theta$ vary over all values in $\Omega$ is an exponential family, or
a Koopman-Darmois family, if $f(x|\theta)$ can be written as
follows for $\theta \in \Omega$ and all values of $x$:

\[ f(x|\theta) = a(\theta)\,b(x)\exp[c(\theta)\,d(x)]. \]

Here $a(\theta)$ and $c(\theta)$ are arbitrary functions of $\theta$, and $b(x)$
and $d(x)$ are arbitrary functions of $x$. Let

\[ H = \left\{(\alpha, \beta) : \int_\Omega a(\theta)^\alpha \exp[c(\theta)\,\beta]\,d\theta < \infty\right\}. \]

For each $(\alpha, \beta) \in H$, let

\[ \xi_{\alpha,\beta}(\theta) = \frac{a(\theta)^\alpha \exp[c(\theta)\,\beta]}{\int_\Omega a(\eta)^\alpha \exp[c(\eta)\,\beta]\,d\eta}, \]

and let $\Psi$ be the set of all probability distributions that
have p.d.f.'s of the form $\xi_{\alpha,\beta}(\theta)$ for some $(\alpha, \beta) \in H$.


a. Show that $\Psi$ is a conjugate family of prior distributions
for samples from $f(x|\theta)$.


b. Suppose that we observe a random sample of size $n$
from the distribution with p.d.f. $f(x|\theta)$. If the prior
p.d.f. of $\theta$ is $\xi_{\alpha_0,\beta_0}$, show that the posterior
hyperparameters are

\[ \alpha_1 = \alpha_0 + n, \qquad \beta_1 = \beta_0 + \sum_{i=1}^n d(x_i). \]


24. Show that each of the following families of distributions
is an exponential family, as defined in Exercise 23:


a. The family of Bernoulli distributions with an un- 
known value of the parameter p 


b. The family of Poisson distributions with an unknown 
mean 


c. The family of negative binomial distributions for 
which the value of r is known and the value of p 
is unknown 


d. The family of normal distributions with an unknown 
mean and a known variance 


e. The family of normal distributions with an unknown 
variance and a known mean 


f. The family of gamma distributions for which the
value of $\alpha$ is unknown and the value of $\beta$ is known

g. The family of gamma distributions for which the
value of $\alpha$ is known and the value of $\beta$ is unknown

h. The family of beta distributions for which the value
of $\alpha$ is unknown and the value of $\beta$ is known

i. The family of beta distributions for which the value
of $\alpha$ is known and the value of $\beta$ is unknown


25. Show that the family of uniform distributions on the
intervals $[0, \theta]$ for $\theta > 0$ is not an exponential family as
defined in Exercise 23. Hint: Look at the support of each
uniform distribution.


26. Show that the family of discrete uniform distributions
on the sets of integers $\{0, 1, \ldots, \theta\}$ for $\theta$ a nonnegative
integer is not an exponential family as defined in Exercise 23.
7.4 Bayes Estimators


An estimator of a parameter is some function of the data that we hope is close to 
the parameter. A Bayes estimator is an estimator that is chosen to minimize the 
posterior mean of some measure of how far the estimator is from the parameter, 
such as squared error or absolute error. 


Nature of an Estimation Problem 


Example 7.4.1. Calorie Counts on Food Labels. In Example 7.3.10, we found the posterior distribution
of $\theta$, the mean percentage difference between measured and advertised calorie
counts. A consumer group might wish to report a single number as an estimate of $\theta$
without specifying the entire distribution for $\theta$. How to choose such a single-number
estimate in general is the subject of this section. ◄


We begin with a definition that is appropriate for a real-valued parameter such 
as in Example 7.4.1. A more general definition will follow after we become more 
familiar with the concept of estimation. 


Definition 7.4.1. Estimator/Estimate. Let $X_1, \ldots, X_n$ be observable data whose joint distribution is
indexed by a parameter $\theta$ taking values in a subset $\Omega$ of the real line. An estimator
of the parameter $\theta$ is a real-valued function $\delta(X_1, \ldots, X_n)$. If $X_1 = x_1, \ldots, X_n = x_n$
are observed, then $\delta(x_1, \ldots, x_n)$ is called the estimate of $\theta$.


Notice that every estimator is, by nature of being a function of data, a statistic in the 
sense of Definition 7.1.4. 

Because the value of $\theta$ must belong to the set $\Omega$, it might seem reasonable to
require that every possible value of an estimator $\delta(X_1, \ldots, X_n)$ must also belong
to $\Omega$. We shall not require this restriction, however. If an estimator can take values
outside of the parameter space $\Omega$, the experimenter will need to decide in the specific
problem whether that seems appropriate or not. It may turn out that every estimator
that takes values only inside $\Omega$ has other even less desirable properties.

In Definition 7.4.1, we distinguished between the terms estimator and estimate.
Because an estimator $\delta(X_1, \ldots, X_n)$ is a function of the random variables $X_1, \ldots, X_n$,
the estimator itself is a random variable, and its probability distribution can be
derived from the joint distribution of $X_1, \ldots, X_n$, if desired. On the other hand, an
estimate is a specific value $\delta(x_1, \ldots, x_n)$ of the estimator that is determined by using
specific observed values $x_1, \ldots, x_n$. If we use the vector notation $\mathbf{X} = (X_1, \ldots, X_n)$
and $\mathbf{x} = (x_1, \ldots, x_n)$, then an estimator is a function $\delta(\mathbf{X})$ of the random vector $\mathbf{X}$, and
an estimate is a specific value $\delta(\mathbf{x})$. It will often be convenient to denote an estimator
$\delta(\mathbf{X})$ simply by the symbol $\delta$.


Loss Functions 


Example 7.4.2. Calorie Counts on Food Labels. In Example 7.4.1, the consumer group may feel that the
farther their estimate $\delta(\mathbf{x})$ is from the true mean difference $\theta$, the more embarrassment
and possible legal action they will encounter. Ideally, they would like to quantify the
amount of negative repercussions as a function of $\theta$ and the estimate $\delta(\mathbf{x})$. Then they
could have some idea how likely it is that they will encounter various levels of hassle
as a result of their estimation. ◄
The foremost requirement of a good estimator $\delta$ is that it yield an estimate of
$\theta$ that is close to the actual value of $\theta$. In other words, a good estimator is one for
which it is highly probable that the error $\delta(\mathbf{X}) - \theta$ will be close to 0. We shall assume
that for each possible value of $\theta \in \Omega$ and each possible estimate $a$, there is a number
$L(\theta, a)$ that measures the loss or cost to the statistician when the true value of the
parameter is $\theta$ and her estimate is $a$. Typically, the greater the distance between $a$
and $\theta$, the larger will be the value of $L(\theta, a)$.


Definition 7.4.2. Loss Function. A loss function is a real-valued function of two variables, $L(\theta, a)$,
where $\theta \in \Omega$ and $a$ is a real number. The interpretation is that the statistician loses
$L(\theta, a)$ if the parameter equals $\theta$ and the estimate equals $a$.


As before, let $\xi(\theta)$ denote the prior p.d.f. of $\theta$ on the set $\Omega$, and consider a problem
in which the statistician must estimate the value of $\theta$ without being able to observe
the values in a random sample. If the statistician chooses a particular estimate $a$, then
her expected loss will be

$$E[L(\theta, a)] = \int_\Omega L(\theta, a)\,\xi(\theta)\,d\theta. \tag{7.4.1}$$

We shall assume that the statistician wishes to choose an estimate $a$ for which the
expected loss in Eq. (7.4.1) is a minimum.


Definition of a Bayes Estimator 


Suppose now that the statistician can observe the value $\mathbf{x}$ of the random vector $\mathbf{X}$
before estimating $\theta$, and let $\xi(\theta|\mathbf{x})$ denote the posterior p.d.f. of $\theta$ on $\Omega$. (The case of
a discrete parameter can be handled in similar fashion.) For each estimate $a$ that the
statistician might use, her expected loss in this case will be

$$E[L(\theta, a)|\mathbf{x}] = \int_\Omega L(\theta, a)\,\xi(\theta|\mathbf{x})\,d\theta. \tag{7.4.2}$$

Hence, the statistician should now choose an estimate $a$ for which the expectation in
Eq. (7.4.2) is a minimum.

For each possible value $\mathbf{x}$ of the random vector $\mathbf{X}$, let $\delta^*(\mathbf{x})$ denote a value of
the estimate $a$ for which the expected loss in Eq. (7.4.2) is a minimum. Then the
function $\delta^*(\mathbf{X})$ for which the values are specified in this way will be an estimator of
$\theta$.


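Definition 7.4.3. Bayes Estimator/Estimate. Let $L(\theta, a)$ be a loss function. For each possible value $\mathbf{x}$ of
$\mathbf{X}$, let $\delta^*(\mathbf{x})$ be a value of $a$ such that $E[L(\theta, a)|\mathbf{x}]$ is minimized. Then $\delta^*$ is called a
Bayes estimator of $\theta$. Once $\mathbf{X} = \mathbf{x}$ is observed, $\delta^*(\mathbf{x})$ is called a Bayes estimate of $\theta$.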


Another way to describe a Bayes estimator $\delta^*$ is to note that, for each possible value
$\mathbf{x}$ of $\mathbf{X}$, the value $\delta^*(\mathbf{x})$ is chosen so that

$$E[L(\theta, \delta^*(\mathbf{x}))|\mathbf{x}] = \min_{a} E[L(\theta, a)|\mathbf{x}]. \tag{7.4.3}$$

In summary, we have considered an estimation problem in which a random sample
$\mathbf{X} = (X_1, \ldots, X_n)$ is to be taken from a distribution involving a parameter $\theta$ that
has an unknown value in some specified set $\Omega$. For every given loss function $L(\theta, a)$
and every prior p.d.f. $\xi(\theta)$, the Bayes estimator of $\theta$ is the estimator $\delta^*(\mathbf{X})$ for which
Eq. (7.4.3) is satisfied for every possible value $\mathbf{x}$ of $\mathbf{X}$. It should be emphasized that
the form of the Bayes estimator will depend on both the loss function that is used
in the problem and the prior distribution that is assigned to $\theta$. In the problems described
in this text, Bayes estimators will exist. However, there are more complicated
situations in which no function $\delta^*$ satisfies (7.4.3).
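The minimization in Eq. (7.4.3) can be illustrated numerically. The following is only a minimal sketch for a discrete parameter, assuming a made-up three-point posterior and a grid of candidate estimates; the function name `bayes_estimate_on_grid` and all numbers are hypothetical and are not taken from the text. The two losses tried are the squared error and absolute error mentioned at the start of this section.

```python
from typing import Callable, Sequence


def bayes_estimate_on_grid(
    thetas: Sequence[float],
    posterior_probs: Sequence[float],
    loss: Callable[[float, float], float],
    candidates: Sequence[float],
) -> float:
    """Return the candidate a that minimizes the posterior expected loss
    sum over theta of L(theta, a) * P(theta | x)."""

    def expected_loss(a: float) -> float:
        return sum(loss(t, a) * p for t, p in zip(thetas, posterior_probs))

    return min(candidates, key=expected_loss)


# Hypothetical discrete posterior distribution on three parameter values.
thetas = [0.2, 0.5, 0.8]
posterior = [0.2, 0.5, 0.3]

# Candidate estimates a = 0.000, 0.001, ..., 1.000.
candidates = [i / 1000 for i in range(1001)]

squared = bayes_estimate_on_grid(
    thetas, posterior, lambda t, a: (t - a) ** 2, candidates
)
absolute = bayes_estimate_on_grid(
    thetas, posterior, lambda t, a: abs(t - a), candidates
)

print(squared)   # grid minimizer under squared error
print(absolute)  # grid minimizer under absolute error
```

For this posterior the mean is 0.53 and the unique median is 0.5, and the grid search should return those two values respectively. A grid search of this kind is crude, but it applies to any loss function that can be evaluated.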


Different Loss Functions 


By far, the most commonly used loss function in estimation problems is the squared 
error loss function. 


Definition 7.4.4. Squared Error Loss Function. The loss function

$$L(\theta, a) = (\theta - a)^2 \tag{7.4.4}$$

is called squared error loss.

When the squared error loss function is used, the Bayes estimate $\delta^*(\mathbf{x})$ for each
observed value of $\mathbf{x}$ will be the value of $a$ for which the expectation $E[(\theta - a)^2|\mathbf{x}]$ is a
minimum. Theorem 4.7.3 states that, when the expectation of $(\theta - a)^2$ is calculated
with respect to the posterior distribution of $\theta$, this expectation will be a minimum
when $a$ is chosen to be equal to the mean $E(\theta|\mathbf{x})$ of the posterior distribution, if that
posterior mean is finite. If the posterior mean of $\theta$ is not finite, then the expected loss
is infinite for every possible estimate $a$. Hence, we have the following corollary to
Theorem 4.7.3.


Let $\theta$ be a real-valued parameter. Suppose that the squared error loss function (7.4.4)
is used and that the posterior mean of $\theta$, $E(\theta|\mathbf{X})$, is finite. Then, a Bayes estimator
of $\theta$ is $\delta^*(\mathbf{X}) = E(\theta|\mathbf{X})$.
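A short way to see why the posterior mean is the minimizer is the usual variance decomposition, sketched here under the added assumption that the posterior variance is finite. Writing $m = E(\theta|\mathbf{x})$ and expanding $\theta - a = (\theta - m) + (m - a)$, the cross term has posterior expectation zero, so

$$E[(\theta - a)^2|\mathbf{x}] = \mathrm{Var}(\theta|\mathbf{x}) + (m - a)^2.$$

The first term does not depend on $a$, and the second is smallest (equal to 0) when $a = m$.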


Example 7.4.3. Estimating the Parameter of a Bernoulli Distribution. Let the random sample $X_1, \ldots, X_n$
be taken from the Bernoulli distribution with parameter $\theta$, which is unknown and
must be estimated. Let the prior distribution of $\theta$ be the beta distribution with
parameters $\alpha > 0$ and $\beta > 0$. Suppose that the squared error loss function is used,
as specified by Eq. (7.4.4), for $0 < \theta < 1$ and $0 < a < 1$. We shall determine the Bayes
estimator of $\theta$.

For observed values $x_1, \ldots, x_n$, let $y = \sum_{i=1}^{n} x_i$. Then it follows from Theorem 7.3.1
that the posterior distribution of $\theta$ will be the beta distribution with parameters
$\alpha_1 = \alpha + y$ and $\beta_1 = \beta + n - y$. Since the mean of the beta distribution with
parameters $\alpha_1$ and $\beta_1$ is $\alpha_1/(\alpha_1 + \beta_1)$, the mean of this posterior distribution of $\theta$ will
be $(\alpha + y)/(\alpha + \beta + n)$. The Bayes estimate $\delta^*(\mathbf{x})$ will be equal to this value for each
observed vector $\mathbf{x}$. Therefore, the Bayes estimator $\delta^*(\mathbf{X})$ is specified as follows:

$$\delta^*(\mathbf{X}) = \frac{\alpha + \sum_{i=1}^{n} X_i}{\alpha + \beta + n}. \tag{7.4.5}$$

◄


Example 7.4.4. Estimating the Mean of a Normal Distribution. Suppose that a random sample $X_1, \ldots,
X_n$ is to be taken from a normal distribution for which the value of the mean $\theta$ is
unknown and the value of the variance $\sigma^2$ is known. Suppose also that the prior
distribution of $\theta$ is the normal distribution with mean $\mu_0$ and variance $v_0^2$. Suppose,
finally, that the squared error loss function is to be used, as specified in Eq. (7.4.4),
for $-\infty < \theta < \infty$ and $-\infty < a < \infty$. We shall determine the Bayes estimator of $\theta$.

It follows from Theorem 7.3.3 that for all observed values $x_1, \ldots, x_n$, the posterior
distribution of $\theta$ will be a normal distribution with mean $\mu_1$ specified by
Eq. (7.3.1). Therefore, the Bayes estimator $\delta^*(\mathbf{X})$ is specified as follows:

$$\delta^*(\mathbf{X}) = \frac{\sigma^2 \mu_0 + n v_0^2 \bar{X}_n}{\sigma^2 + n v_0^2}. \tag{7.4.6}$$

The posterior variance of $\theta$ does not enter into this calculation. ◄


Another commonly used loss function in estimation problems is the absolute 
error loss function. 


Definition 7.4.5. Absolute Error Loss Function. The loss function

$$L(\theta, a) = |\theta - a| \tag{7.4.7}$$

is called absolute error loss.

For every observed value of $\mathbf{x}$, the Bayes estimate $\delta^*(\mathbf{x})$ will now be the value of $a$
for which the expectation $E(|\theta - a|\,|\mathbf{x})$ is a minimum. It was shown in Theorem 4.5.3
that for every given probability distribution of $\theta$, the expectation of $|\theta - a|$ will be a
minimum when $a$ is chosen to be equal to a median of the distribution of $\theta$. Therefore,
when the expectation of $|\theta - a|$ is calculated with respect to the posterior distribution
of $\theta$, this expectation will be a minimum when $a$ is chosen to be a median of the
posterior distribution of $\theta$.

When the absolute error loss function (7.4.7) is used, a Bayes estimator of a real-valued
parameter is $\delta^*(\mathbf{X})$ equal to a median of the posterior distribution of $\theta$.


We shall now reconsider Examples 7.4.3 and 7.4.4, but we shall use the absolute 
error loss function instead of the squared error loss function. 


Example 7.4.5. Estimating the Parameter of a Bernoulli Distribution. Consider again the conditions
of Example 7.4.3, but suppose now that the absolute error loss function is used,
as specified by Eq. (7.4.7). For all observed values $x_1, \ldots, x_n$, the Bayes estimate
$\delta^*(\mathbf{x})$ will be equal to the median of the posterior distribution of $\theta$, which is the beta
distribution with parameters $\alpha + y$ and $\beta + n - y$. There is no simple expression for
this median. It must be determined by numerical approximations for each given set
of observed values. Most statistical computer software can compute the median of
an arbitrary beta distribution.

As a specific example, consider the situation described in Example 7.3.13 in
which an improper prior was used. The posterior distribution of $\theta$ in that example was
the beta distribution with parameters 22 and 18. The mean of this beta distribution
is 22/40 = 0.55. The median is 0.5508. ◄
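The numerical approximation of a beta median can be sketched with the standard library alone. The sketch below assumes both beta parameters are positive integers, so that the c.d.f. can be written as a binomial tail sum, and then locates the median by bisection; the function names are illustrative, and statistical software would normally provide a quantile routine for arbitrary parameters instead.

```python
from math import comb


def beta_cdf_integer(x: float, a: int, b: int) -> float:
    """C.d.f. of the beta distribution with positive integer parameters a and b.

    Uses the identity: the beta(a, b) c.d.f. at x equals the probability
    that a binomial(a + b - 1, x) random variable is at least a.
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def beta_median_integer(a: int, b: int, tol: float = 1e-10) -> float:
    """Bisection search for the point where the c.d.f. equals 1/2."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf_integer(mid, a, b) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


mean = 22 / (22 + 18)                 # Bayes estimate under squared error loss
median = beta_median_integer(22, 18)  # Bayes estimate under absolute error loss
print(round(mean, 4), round(median, 4))
```

The two printed values can be compared with the 0.55 and 0.5508 quoted in the example.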


Example 7.4.6. Estimating the Mean of a Normal Distribution. Consider again the conditions of Example 7.4.4,
but suppose now that the absolute error loss function is used, as specified
by Eq. (7.4.7). For all observed values $x_1, \ldots, x_n$, the Bayes estimate $\delta^*(\mathbf{x})$ will be
equal to the median of the posterior normal distribution of $\theta$. However, since the
mean and the median of each normal distribution are equal, $\delta^*(\mathbf{x})$ is also equal to
the mean of the posterior distribution. Therefore, the Bayes estimator with respect
to the absolute error loss function is the same as the Bayes estimator with respect to
the squared error loss function, and it is again given by Eq. (7.4.6). ◄








Other Loss Functions. Although the squared error loss function and, to a lesser
extent, the absolute error loss function are the most commonly used ones in estimation
problems, neither of these loss functions may be appropriate in a particular
problem. In some problems, it might be appropriate to use a loss function having the
form $L(\theta, a) = |\theta - a|^k$, where $k$ is some positive number other than 1 or 2. In other
problems, the loss that results when the error $|\theta - a|$ has a given magnitude might
depend on the actual value of $\theta$. In such a problem, it might be appropriate to use a
loss function having the form $L(\theta, a) = \lambda(\theta)(\theta - a)^2$ or $L(\theta, a) = \lambda(\theta)|\theta - a|$, where
$\lambda(\theta)$ is a given positive function of $\theta$. In still other problems, it might be more costly
to overestimate the value of $\theta$ by a certain amount than to underestimate it by the
same amount. One specific loss function that reflects this property is as follows:


$$L(\theta, a) = \begin{cases} 3(\theta - a)^2 & \text{for } \theta \le a, \\ (\theta - a)^2 & \text{for } \theta > a. \end{cases}$$

Various other types of loss functions might be relevant in specific estimation
problems. However, in this book we shall give most of our attention to the squared
error and absolute error loss functions.
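For a loss such as the asymmetric one above, there is generally no named summary of the posterior to fall back on, and the Bayes estimate can be approximated numerically. The following rough sketch reuses, purely for illustration, the beta posterior with parameters 22 and 18 from Example 7.4.5, discretized on a grid of 500 midpoints; the grid size and helper names are arbitrary choices, not part of the text.

```python
def asymmetric_loss(theta: float, a: float) -> float:
    """Overestimation (theta <= a) costs three times the squared error."""
    err_sq = (theta - a) ** 2
    return 3 * err_sq if theta <= a else err_sq


def squared_loss(theta: float, a: float) -> float:
    return (theta - a) ** 2


# Midpoint grid on (0, 1), used both for theta and for candidate estimates.
grid = [(i + 0.5) / 500 for i in range(500)]

# Unnormalized beta(22, 18) density theta^21 (1 - theta)^17, then normalized
# so that the grid weights sum to 1.
weights = [t**21 * (1 - t) ** 17 for t in grid]
total = sum(weights)
post = [w / total for w in weights]


def expected_loss(a: float, loss) -> float:
    """Approximate posterior expected loss of the estimate a."""
    return sum(loss(t, a) * p for t, p in zip(grid, post))


asym = min(grid, key=lambda a: expected_loss(a, asymmetric_loss))
sym = min(grid, key=lambda a: expected_loss(a, squared_loss))
print(sym, asym)
```

Because overestimates are penalized more heavily, the minimizer under the asymmetric loss should fall below the posterior mean of 0.55 that the symmetric squared error loss recovers.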


The Bayes Estimate for Large Samples 


Effect of Different Prior Distributions. Suppose that the proportion $\theta$ of defective
items in a large shipment is unknown and that the prior distribution of $\theta$ is the uniform
distribution on the interval [0, 1]. Suppose also that the value of $\theta$ must be estimated,
and that the squared error loss function is used. Suppose, finally, that in a random
sample of 100 items from the shipment, exactly 10 items are found to be defective.
Since the uniform distribution is the beta distribution with parameters $\alpha = 1$ and
$\beta = 1$, and since $n = 100$ and $y = 10$ for the given sample, it follows from Eq. (7.4.5)
that the Bayes estimate is $\delta^*(\mathbf{x}) = 11/102 = 0.108$.

Next, suppose that the prior p.d.f. of $\theta$ has the form $\xi(\theta) = 2(1 - \theta)$ for $0 < \theta < 1$,
instead of being a uniform distribution, and that again in a random sample of 100
items, exactly 10 items are found to be defective. Since $\xi(\theta)$ is the p.d.f. of the beta
distribution with parameters $\alpha = 1$ and $\beta = 2$, it follows from Eq. (7.4.5) that in this
case the Bayes estimate of $\theta$ is $\delta^*(\mathbf{x}) = 11/103 = 0.107$.

The two prior distributions considered here are quite different. The mean of the
uniform prior distribution is 1/2, and the mean of the other beta prior distribution
is 1/3. Nevertheless, because the number of observations in the sample is so large
($n = 100$), the Bayes estimates with respect to the two different prior distributions
are almost the same. Furthermore, the values of both estimates are very close to the
observed proportion of defective items in the sample, which is $\bar{x}_n = 0.1$.
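The two estimates above can be reproduced exactly with rational arithmetic. This is a small sketch of the posterior-mean formula $(\alpha + y)/(\alpha + \beta + n)$ from Eq. (7.4.5); the helper name `beta_posterior_mean` is illustrative, and it assumes integer prior parameters so that `Fraction` can be used.

```python
from fractions import Fraction


def beta_posterior_mean(alpha: int, beta: int, n: int, y: int) -> Fraction:
    """Posterior mean of theta for a beta(alpha, beta) prior after observing
    y successes (here, defective items) in n Bernoulli trials."""
    return Fraction(alpha + y, alpha + beta + n)


uniform_prior = beta_posterior_mean(1, 1, n=100, y=10)  # prior mean 1/2
other_prior = beta_posterior_mean(1, 2, n=100, y=10)    # prior mean 1/3

print(uniform_prior, round(float(uniform_prior), 3))
print(other_prior, round(float(other_prior), 3))
print(float(uniform_prior - other_prior))  # gap between the two estimates
```

Rerunning the sketch with a much smaller sample, for example `n=10, y=1`, gives a rough sense of how much more the choice of prior matters when there are few observations.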


Example 7.4.7. Chest Measurements of Scottish Soldiers. Quetelet (1846) reported (with some errors)
data on the chest measurements (in inches) of 5732 Scottish militiamen. These data
appeared earlier in an 1817 medical journal and are discussed by Stigler (1986). Figure 7.6
shows a histogram of the data. Suppose that we were to model the individual
chest measurements as a random sample (given $\theta$) of normal random variables with
mean $\theta$ and variance 4. The average chest measurement is $\bar{x}_n = 39.85$. If $\theta$ had the
normal prior distribution with mean $\mu_0$ and variance $v_0^2$, then using Eq. (7.3.1) the
posterior distribution of $\theta$ would be normal with mean

$$\mu_1 = \frac{4\mu_0 + 5732 \times v_0^2 \times 39.85}{4 + 5732 \times v_0^2}.$$

The Bayes estimate will then be $\delta(\mathbf{x}) = \mu_1$. Notice that, unless $\mu_0$ is incredibly large or
$v_0^2$ is very small, we will have $\mu_1$ nearly equal to 39.85 and the posterior variance $v_1^2$ nearly equal to 4/5732.
Indeed, if the prior p.d.f. of $\theta$ is any continuous function that is positive around
$\theta = 39.85$ and is not extremely large when $\theta$ is far from 39.85, then the posterior
p.d.f. of $\theta$ will very nearly be the normal p.d.f. with mean 39.85 and variance 4/5732.
The mean and median of the posterior distribution are nearly $\bar{x}_n$ regardless of the
prior distribution. ◄

Figure 7.6 Histogram of chest measurements of Scottish militiamen in Example 7.4.7.
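To get a feel for how little the prior matters with 5732 observations, one can evaluate the posterior mean for a few quite different normal priors. The sketch below uses the posterior-mean expression from this example together with the standard conjugate-normal posterior variance $\sigma^2 v_0^2 / (\sigma^2 + n v_0^2)$, which is quoted here as an assumption since it is not restated in this section. The three priors are hypothetical choices made only for illustration.

```python
def normal_posterior(mu0: float, v0_sq: float, sigma_sq: float, n: int, xbar: float):
    """Posterior mean and variance of a normal mean with known variance
    sigma_sq, under a normal prior with mean mu0 and variance v0_sq."""
    denom = sigma_sq + n * v0_sq
    mu1 = (sigma_sq * mu0 + n * v0_sq * xbar) / denom
    v1_sq = sigma_sq * v0_sq / denom
    return mu1, v1_sq


n, xbar, sigma_sq = 5732, 39.85, 4.0

# Hypothetical priors (mean, variance), deliberately far apart.
priors = [(30.0, 25.0), (45.0, 100.0), (39.0, 1.0)]

results = [normal_posterior(mu0, v0_sq, sigma_sq, n, xbar) for mu0, v0_sq in priors]
for (mu0, v0_sq), (mu1, v1_sq) in zip(priors, results):
    print(f"prior N({mu0}, {v0_sq}): posterior mean {mu1:.4f}, variance {v1_sq:.6f}")

print(f"4/5732 = {sigma_sq / n:.6f}")
```

All three posterior means should land within a small fraction of an inch of 39.85, and the posterior variances should be close to 4/5732.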


Consistency of the Bayes Estimator. Let $X_1, \ldots, X_n$ be a random sample (given $\theta$)
from the Bernoulli distribution with parameter $\theta$. Suppose that we use a conjugate
prior for $\theta$. Since $\theta$ is the mean of the distribution from which the sample is being
taken, it follows from the law of large numbers discussed in Sec. 6.2 that $\bar{X}_n$ converges
in probability to $\theta$ as $n \to \infty$. Since the difference between the Bayes estimator $\delta^*(\mathbf{X})$
and $\bar{X}_n$ converges in probability to 0 as $n \to \infty$, it can also be concluded that $\delta^*(\mathbf{X})$
converges in probability to the unknown value of $\theta$ as $n \to \infty$.
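The claim that $\delta^*(\mathbf{X}) - \bar{X}_n$ becomes negligible can be checked directly, assuming squared error loss and a beta prior with parameters $\alpha$ and $\beta$, so that $\delta^*(\mathbf{X}) = (\alpha + Y)/(\alpha + \beta + n)$ with $Y = \sum_{i=1}^{n} X_i$. A short calculation gives

$$\delta^*(\mathbf{X}) - \bar{X}_n = \frac{\alpha + Y}{\alpha + \beta + n} - \frac{Y}{n} = \frac{\alpha (n - Y) - \beta Y}{n(\alpha + \beta + n)}.$$

Since $0 \le Y \le n$, the numerator lies between $-\beta n$ and $\alpha n$, and therefore

$$|\delta^*(\mathbf{X}) - \bar{X}_n| \le \frac{\max(\alpha, \beta)}{\alpha + \beta + n},$$

which tends to 0 as $n \to \infty$ whatever the observed data are.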


Definition 7.4.6. Consistent Estimator. A sequence of estimators that converges in probability to the
unknown value of the parameter being estimated, as $n \to \infty$, is called a consistent
sequence of estimators.


Thus, we have shown that the Bayes estimators $\delta^*(\mathbf{X})$ form a consistent sequence of
estimators in the problem considered here. The practical interpretation of this result
is as follows: When large numbers of observations are taken, there is high probability
that the Bayes estimator will be very close to the unknown value of $\theta$.

The results that have just been presented for estimating the parameter of a
Bernoulli distribution are also true for other estimation problems. Under fairly
general conditions and for a wide class of loss functions, the Bayes estimators of
some parameters $\theta$ will form a consistent sequence of estimators as the sample size
$n \to \infty$. In particular, for random samples from any one of the various families of
distributions discussed in Sec. 7.3, if a conjugate prior distribution is assigned to the
parameters and the squared error loss function is used, the Bayes estimators will
form a consistent sequence of estimators.

For example, consider again the conditions of Example 7.4.4. In that example, a
random sample is taken from a normal distribution for which the value of the mean
$\theta$ is unknown, and the Bayes estimator $\delta^*(\mathbf{X})$ is specified by Eq. (7.4.6). By the law
of large numbers, $\bar{X}_n$ will converge to the unknown value of the mean $\theta$ as $n \to \infty$. It
can now be seen from Eq. (7.4.6) that $\delta^*(\mathbf{X})$ will also converge to $\theta$ as $n \to \infty$. Thus,
the Bayes estimators again form a consistent sequence of estimators. Other examples
are given in Exercises 7 and 11 at the end of this section.
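One way to read the convergence off Eq. (7.4.6) is to rewrite the estimator as a weighted average. The weight $w_n$ below is notation introduced only for this sketch:

$$\delta^*(\mathbf{X}) = \frac{\sigma^2 \mu_0 + n v_0^2 \bar{X}_n}{\sigma^2 + n v_0^2} = w_n \bar{X}_n + (1 - w_n)\mu_0, \qquad w_n = \frac{n v_0^2}{\sigma^2 + n v_0^2}.$$

Because $w_n \to 1$ as $n \to \infty$, the difference $\delta^*(\mathbf{X}) - \bar{X}_n = (1 - w_n)(\mu_0 - \bar{X}_n)$ converges in probability to 0, so $\delta^*(\mathbf{X})$ has the same limit in probability as $\bar{X}_n$, namely $\theta$.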


More General Parameters and Estimators 


So far in this section, we have considered only real-valued parameters and estima- 
tors of those parameters. There are two very common generalizations of this situation 
that are easy to handle with the same techniques described above. The first general- 
ization is to multidimensional parameters such as the two-dimensional parameter of 
a normal distribution with unknown mean and variance. The second generalization 
is to functions of the parameter rather than the parameter itself. For example, if $\theta$ is
the failure rate in Example 7.1.1, we might be interested in estimating $1/\theta$, the mean
time to failure. As another example, if our data arise from a normal distribution with 
unknown mean and variance, we might wish to estimate the mean only rather than 
the entire parameter. 

The necessary changes to Definition 7.4.1 in order to handle both of the gener- 
alizations just mentioned are given in Definition 7.4.7. 


Definition 7.4.7. Estimator/Estimate. Let $X_1, \ldots, X_n$ be observable data whose joint distribution is
indexed by a parameter $\theta$ taking values in a subset $\Omega$ of $k$-dimensional space. Let
$h$ be a function from $\Omega$ into $d$-dimensional space. Define $\psi = h(\theta)$. An estimator
of $\psi$ is a function $\delta(X_1, \ldots, X_n)$ that takes values in $d$-dimensional space. If $X_1 =
x_1, \ldots, X_n = x_n$ are observed, then $\delta(x_1, \ldots, x_n)$ is called the estimate of $\psi$.

When $h$ in Definition 7.4.7 is the identity function $h(\theta) = \theta$, then $\psi = \theta$ and we are
estimating the original parameter $\theta$. When $h(\theta)$ is one coordinate of $\theta$, then the $\psi$
that we are estimating is just that one coordinate.

There will be a number of examples of multidimensional parameters in later 
sections and chapters of this book. Here is an example of estimating a function of a 
parameter. 


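Example 7.4.8. Lifetimes of Electronic Components. In Example 7.3.12, suppose that we want to estimate
$\psi = 1/\theta$, the mean time to failure of the electronic components. The posterior
distribution of $\theta$ is the gamma distribution with parameters 4 and 8.6. If we use the
squared error loss $L(\theta, a) = (\psi - a)^2$, Theorem 4.7.3 says that the Bayes estimate is
the mean of the posterior distribution of $\psi$. That is,

$$\begin{aligned}
\delta^*(\mathbf{x}) = E(\psi|\mathbf{x}) &= E\left(\frac{1}{\theta}\,\Big|\,\mathbf{x}\right) \\
&= \int_0^\infty \frac{1}{\theta}\, \xi(\theta|\mathbf{x})\, d\theta \\
&= \int_0^\infty \frac{1}{\theta}\, \frac{8.6^4}{6}\, \theta^3 e^{-8.6\theta}\, d\theta \\
&= \frac{8.6^4}{6} \int_0^\infty \theta^2 e^{-8.6\theta}\, d\theta \\
&= \frac{8.6^4}{6} \cdot \frac{2}{8.6^3} \\
&= 2.867,
\end{aligned}$$

where the final equality follows from Theorem 5.7.3. The mean of $1/\theta$ is slightly higher
than $1/E(\theta|\mathbf{x}) = 8.6/4 = 2.15$. ◄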
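The integral in this example can be checked numerically. The sketch below approximates $E(1/\theta|\mathbf{x})$ for the gamma posterior with parameters 4 and 8.6 by a midpoint rule, and compares it with the closed form $\beta/(\alpha - 1)$, which holds for a gamma distribution with $\alpha > 1$. The truncation point and step count are arbitrary choices made for illustration.

```python
from math import exp, gamma


def gamma_pdf(theta: float, alpha: float, beta: float) -> float:
    """P.d.f. of the gamma distribution with parameters alpha and beta (rate)."""
    return beta**alpha / gamma(alpha) * theta ** (alpha - 1) * exp(-beta * theta)


def posterior_mean_of_inverse(
    alpha: float, beta: float, upper: float = 10.0, steps: int = 20_000
) -> float:
    """Midpoint-rule approximation of the integral of (1/theta) * pdf(theta)
    over (0, upper); the tail beyond `upper` is assumed to be negligible."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        total += gamma_pdf(theta, alpha, beta) / theta
    return total * h


numeric = posterior_mean_of_inverse(4, 8.6)
closed_form = 8.6 / (4 - 1)    # beta / (alpha - 1)
reciprocal_of_mean = 8.6 / 4   # 1 / E(theta | x)

print(round(numeric, 3), round(closed_form, 3), reciprocal_of_mean)
```

The first two printed numbers should agree with the 2.867 in the text, and both exceed the third, 2.15, which illustrates that the posterior mean of $1/\theta$ is not the reciprocal of the posterior mean of $\theta$.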


Note: Loss Functions and Utility. In Sec. 4.8, we introduced the concept of utility 
to measure the values to a decision maker of various random outcomes. The concept 
of loss function is closely related to that of utility. In a sense, a loss function is like 
the negative of a utility. Indeed, Example 4.8.8 shows how to convert absolute error 
loss into a utility. In that example, Y plays the role of the parameter and d(W) plays 
the role of the estimator. In a similar manner, one can convert other loss functions 
into utilities. Hence, it is not surprising that the goal of maximizing expected utility 
in Sec. 4.8 has been replaced by the goal of minimizing expected loss in the present 
section. 


Limitations of Bayes Estimators 


The theory of Bayes estimators, as described in this section, provides a satisfactory 
and coherent theory for the estimation of parameters. Indeed, according to statisti- 
cians who adhere to the Bayesian philosophy, it provides the only coherent theory of 
estimation that can possibly be developed. Nevertheless, there are certain limitations 
to the applicability of this theory in practical statistical problems. To apply the the- 
ory, it is necessary to specify a particular loss function, such as the squared error or 
absolute error function, and also a prior distribution for the parameter. Meaningful 
specifications may exist, in principle, but it may be very difficult and time-consuming 
to determine them. In some problems, the statistician must determine the specifi- 
cations that would be appropriate for clients or employers who are unavailable or 
otherwise unable to communicate their preferences and knowledge. In other prob- 
lems, it may be necessary for an estimate to be made jointly by members of a group or 
committee, and it may be difficult for the members of the group to reach agreement 
about an appropriate loss function and prior distribution. 

Another possible difficulty is that in a particular problem the parameter $\theta$ may
actually be a vector of real-valued parameters for which all the values are unknown.
The theory of Bayes estimation, which has been developed in the preceding sections,
can easily be generalized to include the estimation of a vector parameter $\theta$. However,
to apply this theory in such a problem it is necessary to specify a multivariate prior
distribution for the vector $\theta$ and also to specify a loss function $L(\theta, a)$ that is a function
of the vector $\theta$ and the vector $a$, which will be used to estimate $\theta$. Even though
the statistician may be interested in estimating only one or two components of the
vector $\theta$ in a given problem, he must still assign a multivariate prior distribution to
the entire vector $\theta$. In many important statistical problems, some of which will be
discussed later in this book, $\theta$ may have a large number of components. In such a
problem, it is especially difficult to specify a meaningful prior distribution on the
multidimensional parameter space $\Omega$.

It should be emphasized that there is no simple way to resolve these difficulties.
Other methods of estimation that are not based on prior distributions and loss
functions typically have practical limitations, also. These other methods also typically
have serious defects in their theoretical structure as well.


Summary 


An estimator of a parameter $\theta$ is a function $\delta$ of the data $\mathbf{X}$. If $\mathbf{X} = \mathbf{x}$ is observed, the
value $\delta(\mathbf{x})$ is called our estimate, the observed value of the estimator $\delta(\mathbf{X})$. A loss
function $L(\theta, a)$ is designed to measure how costly it is to use the value $a$ to estimate
$\theta$. A Bayes estimator $\delta^*(\mathbf{X})$ is chosen so that $a = \delta^*(\mathbf{x})$ provides the minimum value
of the posterior mean of $L(\theta, a)$. That is,

$$E[L(\theta, \delta^*(\mathbf{x}))|\mathbf{x}] = \min_{a} E[L(\theta, a)|\mathbf{x}].$$

If the loss is squared error, $L(\theta, a) = (\theta - a)^2$, then $\delta^*(\mathbf{x})$ is the posterior mean of
$\theta$, $E(\theta|\mathbf{x})$. If the loss is absolute error, $L(\theta, a) = |\theta - a|$, then $\delta^*(\mathbf{x})$ is a median of
the posterior distribution of $\theta$. For other loss functions, locating the minimum might
have to be done numerically.


Exercises 


1. In a clinical trial, let the probability of successful outcome
$\theta$ have a prior distribution that is the uniform distribution
on the interval [0, 1], which is also the beta distribution
with parameters 1 and 1. Suppose that the first
patient has a successful outcome. Find the Bayes estimates
of $\theta$ that would be obtained for both the squared error and
absolute error loss functions.

2. Suppose that the proportion $\theta$ of defective items in a
large shipment is unknown, and the prior distribution of
$\theta$ is the beta distribution for which the parameters are
$\alpha = 5$ and $\beta = 10$. Suppose also that 20 items are selected at
random from the shipment, and that exactly one of these
items is found to be defective. If the squared error loss
function is used, what is the Bayes estimate of $\theta$?

3. Consider again the conditions of Exercise 2. Suppose
that the prior distribution of $\theta$ is as given in Exercise 2,
and suppose again that 20 items are selected at random
from the shipment.


a. For what number of defective items in the sample 
will the mean squared error of the Bayes estimate be 
a maximum? 


b. For what number will the mean squared error of the 
Bayes estimate be a minimum? 


4. Suppose that a random sample of size $n$ is taken from
the Bernoulli distribution with parameter $\theta$, which is unknown,
and that the prior distribution of $\theta$ is a beta distribution
for which the mean is $\mu_0$. Show that the mean of
the posterior distribution of $\theta$ will be a weighted average
having the form $\gamma_n \bar{X}_n + (1 - \gamma_n)\mu_0$, and show that $\gamma_n \to 1$
as $n \to \infty$.

5. Suppose that the number of defects in a 1200-foot roll
of magnetic recording tape has a Poisson distribution for
which the value of the mean $\theta$ is unknown, and the prior
distribution of $\theta$ is the gamma distribution with parameters
$\alpha = 3$ and $\beta = 1$. When five rolls of this tape are
selected at random and inspected, the numbers of defects
found on the rolls are 2, 2, 6, 0, and 3. If the squared error
loss function is used, what is the Bayes estimate of $\theta$? (See
Exercise 5 of Sec. 7.3.)

6. Suppose that a random sample of size $n$ is taken from
a Poisson distribution for which the value of the mean $\theta$ is
unknown, and the prior distribution of $\theta$ is a gamma distribution
for which the mean is $\mu_0$. Show that the mean of
the posterior distribution of $\theta$ will be a weighted average
having the form $\gamma_n \bar{X}_n + (1 - \gamma_n)\mu_0$, and show that $\gamma_n \to 1$
as $n \to \infty$.

7. Consider again the conditions of Exercise 6, and suppose
that the value of $\theta$ must be estimated by using the
squared error loss function. Show that the Bayes estimators,
for $n = 1, 2, \ldots$, form a consistent sequence of estimators
of $\theta$.


8. Suppose that the heights of the individuals in a certain
population have a normal distribution for which the value
of the mean $\theta$ is unknown and the standard deviation is
2 inches. Suppose also that the prior distribution of $\theta$ is a
normal distribution for which the mean is 68 inches and
the standard deviation is 1 inch. Suppose finally that 10
people are selected at random from the population, and
their average height is found to be 69.5 inches.

a. If the squared error loss function is used, what is the
Bayes estimate of $\theta$?

b. If the absolute error loss function is used, what is the
Bayes estimate of $\theta$? (See Exercise 7 of Sec. 7.3.)

9. Suppose that a random sample is to be taken from a
normal distribution for which the value of the mean $\theta$ is
unknown and the standard deviation is 2, the prior distribution
of $\theta$ is a normal distribution for which the standard
deviation is 1, and the value of $\theta$ must be estimated by using
the squared error loss function. What is the smallest
random sample that must be taken in order for the mean
squared error of the Bayes estimator of $\theta$ to be 0.01 or
less? (See Exercise 10 of Sec. 7.3.)

10. Suppose that the time in minutes required to serve a
customer at a certain facility has an exponential distribution
for which the value of the parameter $\theta$ is unknown,
the prior distribution of $\theta$ is a gamma distribution for
which the mean is 0.2 and the standard deviation is 1, and
the average time required to serve a random sample of
20 customers is observed to be 3.8 minutes. If the squared
error loss function is used, what is the Bayes estimate of
$\theta$? (See Exercise 12 of Sec. 7.3.)

11. Suppose that a random sample of size $n$ is taken from
an exponential distribution for which the value of the
parameter $\theta$ is unknown, the prior distribution of $\theta$ is
a specified gamma distribution, and the value of $\theta$ must
be estimated by using the squared error loss function.
Show that the Bayes estimators, for $n = 1, 2, \ldots$, form a
consistent sequence of estimators of $\theta$.


12. Let $\theta$ denote the proportion of registered voters in a
large city who are in favor of a certain proposition. Suppose
that the value of $\theta$ is unknown, and two statisticians
A and B assign to $\theta$ the following different prior p.d.f.'s
$\xi_A(\theta)$ and $\xi_B(\theta)$, respectively:

$$\xi_A(\theta) = 2\theta \quad \text{for } 0 < \theta < 1,$$
$$\xi_B(\theta) = 4\theta^3 \quad \text{for } 0 < \theta < 1.$$

In a random sample of 1000 registered voters from the city,
it is found that 710 are in favor of the proposition.

a. Find the posterior distribution that each statistician
assigns to $\theta$.

b. Find the Bayes estimate for each statistician based
on the squared error loss function.

c. Show that after the opinions of the 1000 registered
voters in the random sample had been obtained, the
Bayes estimates for the two statisticians could not
possibly differ by more than 0.002, regardless of the
number in the sample who were in favor of the proposition.


13. Suppose that $X_1, \ldots, X_n$ form a random sample from 
the uniform distribution on the interval $[0, \theta]$, where the 
value of the parameter $\theta$ is unknown. Suppose also that 
the prior distribution of $\theta$ is the Pareto distribution with 
parameters $x_0$ and $\alpha$ ($x_0 > 0$ and $\alpha > 0$), as defined in 
Exercise 16 of Sec. 5.7. If the value of $\theta$ is to be estimated 
by using the squared error loss function, what is the Bayes 
estimator of $\theta$? (See Exercise 18 of Sec. 7.3.) 


14. Suppose that $X_1, \ldots, X_n$ form a random sample from 
an exponential distribution for which the value of the 
parameter $\theta$ is unknown ($\theta > 0$). Let $\xi(\theta)$ denote the prior 
p.d.f. of $\theta$, and let $\hat{\theta}$ denote the Bayes estimator of $\theta$ with 
respect to the prior p.d.f. $\xi(\theta)$ when the squared error loss 
function is used. Let $\psi = \theta^2$, and suppose that instead of 
estimating $\theta$, it is desired to estimate the value of $\psi$ subject 
to the following squared error loss function: 

$$L(\psi, a) = (\psi - a)^2 \quad \text{for } \psi > 0 \text{ and } a > 0.$$

Let $\hat{\psi}$ denote the Bayes estimator of $\psi$. Explain why 
$\hat{\psi} > \hat{\theta}^2$. Hint: Look at Exercise 4 in Sec. 4.4. 

15. Let $c > 0$ and consider the loss function 

$$L(\theta, a) = \begin{cases} c\,|\theta - a| & \text{if } \theta < a, \\ |\theta - a| & \text{if } \theta \ge a. \end{cases}$$

Assume that $\theta$ has a continuous distribution. Prove that a 
Bayes estimator of $\theta$ will be any $1/(1 + c)$ quantile of the 
posterior distribution of $\theta$. Hint: The proof is a lot like the 
proof of Theorem 4.5.3. The result holds even if $\theta$ does 
not have a continuous distribution, but the proof is more 
cumbersome. 


7.5 Maximum Likelihood Estimators 


Maximum likelihood estimation is a method for choosing estimators of parameters 
that avoids using prior distributions and loss functions. It chooses as the estimate 
of $\theta$ the value of $\theta$ that provides the largest value of the likelihood function. 


Introduction 


Example 7.5.1 Lifetimes of Electronic Components. Suppose that we observe the data in 
Example 7.3.11 consisting of the lifetimes of three electronic components. Is there a method 
for estimating the failure rate $\theta$ without first constructing a prior distribution and a 
loss function? ◄ 


In this section, we shall develop a relatively simple method of constructing an 
estimator without having to specify a loss function and a prior distribution. It is called 
the method of maximum likelihood, and it was introduced by R. A. Fisher in 1912. 
Maximum likelihood estimation can be applied in most problems, it has a strong 
intuitive appeal, and it will often yield a reasonable estimator of $\theta$. Furthermore, if 
the sample is large, the method will typically yield an excellent estimator of $\theta$. For 
these reasons, the method of maximum likelihood is probably the most widely used 
method of estimation in statistics. 


Note: Terminology. Because maximum likelihood estimation, as well as many other 
procedures to be introduced later in the text, does not involve the specification of a prior 
distribution of the parameter, some different terminology is often used in describing 
the statistical models to which these procedures are applied. Rather than saying that 
$X_1, \ldots, X_n$ are i.i.d. with p.f. or p.d.f. $f(x|\theta)$ conditional on $\theta$, we might say that 
$X_1, \ldots, X_n$ form a random sample from a distribution with p.f. or p.d.f. $f(x|\theta)$ where 
$\theta$ is unknown. More specifically, in Example 7.5.1, we could say that the lifetimes form 
a random sample from the exponential distribution with unknown parameter $\theta$. 


Definition of a Maximum Likelihood Estimator 


Let the random variables $X_1, \ldots, X_n$ form a random sample from a discrete 
distribution or a continuous distribution for which the p.f. or the p.d.f. is $f(x|\theta)$, where the 
parameter $\theta$ belongs to some parameter space $\Omega$. Here, $\theta$ can be either a real-valued 
parameter or a vector. For every observed vector $\mathbf{x} = (x_1, \ldots, x_n)$ in the sample, the 
value of the joint p.f. or joint p.d.f. will, as usual, be denoted by $f_n(\mathbf{x}|\theta)$. Because of 
its importance in this section, we repeat Definition 7.2.3. 


Definition 7.5.1 Likelihood Function. When the joint p.d.f. or the joint p.f. $f_n(\mathbf{x}|\theta)$ of the 
observations in a random sample is regarded as a function of $\theta$ for given values of 
$x_1, \ldots, x_n$, it is called the likelihood function. 


Consider first the case in which the observed vector $\mathbf{x}$ came from a discrete 
distribution. If an estimate of $\theta$ must be selected, we would certainly not consider 
any value of $\theta \in \Omega$ for which it would be impossible to obtain the vector $\mathbf{x}$ that was 
actually observed. Furthermore, suppose that the probability $f_n(\mathbf{x}|\theta)$ of obtaining the 
actual observed vector $\mathbf{x}$ is very high when $\theta$ has a particular value, say, $\theta = \theta_0$, and is 
very small for every other value of $\theta \in \Omega$. Then we would naturally estimate the value 
of $\theta$ to be $\theta_0$ (unless we had strong prior information that outweighed the evidence in 
the sample and pointed toward some other value). When the sample comes from a 
continuous distribution, it would again be natural to try to find a value of $\theta$ for which 
the probability density $f_n(\mathbf{x}|\theta)$ is large and to use this value as an estimate of $\theta$. For 
each possible observed vector $\mathbf{x}$, we are led by this reasoning to consider a value of 
$\theta$ for which the likelihood function $f_n(\mathbf{x}|\theta)$ is a maximum and to use this value as an 
estimate of $\theta$. This concept is formalized in the following definition. 


Definition 7.5.2 Maximum Likelihood Estimator/Estimate. For each possible observed vector 
$\mathbf{x}$, let $\delta(\mathbf{x}) \in \Omega$ denote a value of $\theta \in \Omega$ for which the likelihood function 
$f_n(\mathbf{x}|\theta)$ is a maximum, and let $\hat{\theta} = \delta(\mathbf{X})$ be the estimator of $\theta$ defined in this way. 
The estimator $\hat{\theta}$ is called a maximum likelihood estimator of $\theta$. After $\mathbf{X} = \mathbf{x}$ is 
observed, the value $\delta(\mathbf{x})$ is called a maximum likelihood estimate of $\theta$. 


The expressions maximum likelihood estimator and maximum likelihood estimate are 
abbreviated M.L.E. One must rely on context to determine whether the abbreviation 
refers to an estimator or to an estimate. Note that the M.L.E. is required to be an 
element of the parameter space $\Omega$, unlike general estimators/estimates for which no 
such requirement exists. 


Examples of Maximum Likelihood Estimators 


Example 7.5.2 Lifetimes of Electronic Components. In Example 7.3.11, the observed data are 
$X_1 = 3$, $X_2 = 1.5$, and $X_3 = 2.1$. The random variables had been modeled as a random sample 
of size 3 from the exponential distribution with parameter $\theta$. The likelihood function 
is, for $\theta > 0$, 

$$f_3(\mathbf{x}|\theta) = \theta^3 \exp(-6.6\,\theta),$$

where $\mathbf{x} = (3, 1.5, 2.1)$, so that $x_1 + x_2 + x_3 = 6.6$. The value of $\theta$ that maximizes the 
likelihood function $f_3(\mathbf{x}|\theta)$ will be the same as the value of $\theta$ that maximizes 
$\log f_3(\mathbf{x}|\theta)$, since log is an increasing function. Therefore, it will be convenient to 
determine the M.L.E. by finding the value of $\theta$ that maximizes 

$$L(\theta) = \log f_3(\mathbf{x}|\theta) = 3 \log(\theta) - 6.6\,\theta.$$

Taking the derivative $dL(\theta)/d\theta = 3/\theta - 6.6$, setting the derivative to 0, and solving for 
$\theta$ yields $\theta = 3/6.6 = 0.455$. The second derivative, $-3/\theta^2$, is negative at this value of 
$\theta$, so it provides a maximum. The maximum likelihood estimate is then 0.455. ◄ 
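
The calculation in Example 7.5.2 can be checked numerically. The following is a minimal Python sketch, not part of the original text: the function names are illustrative, and the grid search is only a crude sanity check on the closed-form value $n / \sum x_i$.

```python
import math

def exponential_log_likelihood(theta, data):
    """Log-likelihood n*log(theta) - theta*sum(data) of an exponential sample."""
    return len(data) * math.log(theta) - theta * sum(data)

def exponential_mle(data):
    """Closed-form maximizer obtained by setting dL/dtheta = 0."""
    return len(data) / sum(data)

lifetimes = [3.0, 1.5, 2.1]          # data of Example 7.5.2
theta_hat = exponential_mle(lifetimes)

# Crude grid search over (0, 3] as an independent check of the calculus.
grid = [k / 1000 for k in range(1, 3001)]
theta_grid = max(grid, key=lambda t: exponential_log_likelihood(t, lifetimes))

print(theta_hat, theta_grid)
```

The two printed values should agree to within the grid spacing of 0.001.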


It should be noted that in some problems, for certain observed vectors $\mathbf{x}$, the 
maximum value of $f_n(\mathbf{x}|\theta)$ may not actually be attained for any point $\theta \in \Omega$. In such 
a case, an M.L.E. of $\theta$ does not exist. For certain other observed vectors $\mathbf{x}$, the 
maximum value of $f_n(\mathbf{x}|\theta)$ may actually be attained at more than one point in the 
space $\Omega$. In such a case, the M.L.E. is not uniquely defined, and any one of these 
points can be chosen as the value of the estimator $\hat{\theta}$. In many practical problems, 
however, the M.L.E. exists and is uniquely defined. 

We shall now illustrate the method of maximum likelihood and these various 
possibilities by considering several examples. In each example, we shall attempt to 
determine an M.L.E. 


Example 7.5.3 Test for a Disease. Suppose that you are walking down the street and notice 
that the Department of Public Health is giving a free medical test for a certain disease. The 
test is 90 percent reliable in the following sense: If a person has the disease, there is a 
probability of 0.9 that the test will give a positive response; whereas, if a person does 
not have the disease, there is a probability of only 0.1 that the test will give a positive 
response. This same test was considered in Example 2.3.1. We shall let $X$ stand for 
the result of the test, where $X = 1$ means that the test is positive and $X = 0$ means 
that the test is negative. Let the parameter space be $\Omega = \{0.1, 0.9\}$, where $\theta = 0.1$ 
means that the person tested does not have the disease, and $\theta = 0.9$ means that the 
person has the disease. This parameter space was chosen so that, given $\theta$, $X$ has the 
Bernoulli distribution with parameter $\theta$. The likelihood function is 

$$f(x|\theta) = \theta^x (1 - \theta)^{1-x}.$$

If $x = 0$ is observed, then 

$$f(0|\theta) = \begin{cases} 0.9 & \text{if } \theta = 0.1, \\ 0.1 & \text{if } \theta = 0.9. \end{cases}$$

Clearly, $\theta = 0.1$ maximizes the likelihood when $x = 0$ is observed. If $x = 1$ is observed, 
then 

$$f(1|\theta) = \begin{cases} 0.1 & \text{if } \theta = 0.1, \\ 0.9 & \text{if } \theta = 0.9. \end{cases}$$

Clearly, $\theta = 0.9$ maximizes the likelihood when $x = 1$ is observed. Hence, we have 
that the M.L.E. is 

$$\hat{\theta} = \begin{cases} 0.1 & \text{if } X = 0, \\ 0.9 & \text{if } X = 1. \end{cases}$$ ◄ 
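
When the parameter space is a finite set, as in Example 7.5.3, the M.L.E. can be found by evaluating the likelihood at every point of $\Omega$ and keeping the largest. A minimal Python sketch under that assumption follows; the helper names are illustrative only.

```python
def bernoulli_pf(x, theta):
    """p.f. of one Bernoulli observation: theta^x * (1 - theta)^(1 - x)."""
    return theta ** x * (1 - theta) ** (1 - x)

def mle_over_finite_space(x, parameter_space):
    """Return the point of a finite parameter space with the largest likelihood."""
    return max(parameter_space, key=lambda theta: bernoulli_pf(x, theta))

omega = [0.1, 0.9]  # parameter space of Example 7.5.3

for x in (0, 1):
    print(x, mle_over_finite_space(x, omega))
```

No calculus is involved here: the derivative-based approach used in the other examples does not apply to a two-point parameter space.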


Example 7.5.4 Sampling from a Bernoulli Distribution. Suppose that the random variables 
$X_1, \ldots, X_n$ form a random sample from the Bernoulli distribution with parameter $\theta$, which is 
unknown ($0 \le \theta \le 1$). For all observed values $x_1, \ldots, x_n$, where each $x_i$ is either 0 or 
1, the likelihood function is 

$$f_n(\mathbf{x}|\theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i}. \tag{7.5.1}$$

Instead of maximizing the likelihood function $f_n(\mathbf{x}|\theta)$ directly, it is again easier to 
maximize $\log f_n(\mathbf{x}|\theta)$: 

$$L(\theta) = \log f_n(\mathbf{x}|\theta) = \sum_{i=1}^{n} \left[ x_i \log \theta + (1 - x_i) \log(1 - \theta) \right]
= \left( \sum_{i=1}^{n} x_i \right) \log \theta + \left( n - \sum_{i=1}^{n} x_i \right) \log(1 - \theta).$$

Now calculate the derivative $dL(\theta)/d\theta$, set this derivative equal to 0, and solve 
the resulting equation for $\theta$. If $\sum_{i=1}^{n} x_i \notin \{0, n\}$, we find that the derivative is 0 at 
$\theta = \bar{x}_n$, and it can be verified (for example, by examining the second derivative) 
that this value does indeed maximize $L(\theta)$ and the likelihood function defined by 
Eq. (7.5.1). If $\sum_{i=1}^{n} x_i = 0$, then $L(\theta)$ is a decreasing function of $\theta$ for all $\theta$, and hence 
$L$ achieves its maximum at $\theta = 0$. Similarly, if $\sum_{i=1}^{n} x_i = n$, $L$ is an increasing function, 
and it achieves its maximum at $\theta = 1$. In these last two cases, note that the maximum 
of the likelihood occurs at $\theta = \bar{x}_n$. It follows, therefore, that the M.L.E. of $\theta$ is $\hat{\theta} = \bar{X}_n$. ◄ 
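
The three cases of Example 7.5.4 (some successes and some failures, all failures, all successes) can be checked with a short Python sketch. This is an illustration rather than part of the original text; the samples are made up, and the convention $0 \cdot \log 0 = 0$ is coded explicitly so that the endpoints $\theta = 0$ and $\theta = 1$ of the parameter space can be evaluated.

```python
import math

def bernoulli_log_likelihood(theta, xs):
    """Log of Eq. (7.5.1), using the convention 0 * log(0) = 0 at the endpoints."""
    successes = sum(xs)
    failures = len(xs) - successes
    total = 0.0
    if successes > 0:
        total += successes * math.log(theta) if theta > 0 else -math.inf
    if failures > 0:
        total += failures * math.log(1 - theta) if theta < 1 else -math.inf
    return total

def bernoulli_mle(xs):
    """M.L.E. over the closed parameter space [0, 1]: the sample proportion."""
    return sum(xs) / len(xs)

grid = [k / 100 for k in range(101)]  # includes both endpoints 0 and 1
samples = {"mixed": [1, 0, 1, 1, 0], "all zeros": [0, 0, 0], "all ones": [1, 1, 1, 1]}

for name, xs in samples.items():
    best_on_grid = max(grid, key=lambda t: bernoulli_log_likelihood(t, xs))
    print(name, bernoulli_mle(xs), best_on_grid)
```

For each sample the grid maximizer should coincide with the sample proportion, including the two boundary cases.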


It follows from Example 7.5.4 that if $X_1, \ldots, X_n$ are regarded as $n$ Bernoulli trials 
and if the parameter space is $\Omega = [0, 1]$, then the M.L.E. of the unknown probability 
of success on any given trial is simply the proportion of successes observed in the 
$n$ trials. In Example 7.5.3, we have $n = 1$ Bernoulli trial, but the parameter space 
is $\Omega = \{0.1, 0.9\}$ rather than $[0, 1]$, and the M.L.E. differs from the proportion of 
successes. 


Example 7.5.5 Sampling from a Normal Distribution with Unknown Mean. Suppose that 
$X_1, \ldots, X_n$ form a random sample from a normal distribution for which the mean $\mu$ is unknown 
and the variance $\sigma^2$ is known. For all observed values $x_1, \ldots, x_n$, the likelihood 
function $f_n(\mathbf{x}|\mu)$ will be 

$$f_n(\mathbf{x}|\mu) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right]. \tag{7.5.2}$$

It can be seen from Eq. (7.5.2) that $f_n(\mathbf{x}|\mu)$ will be maximized by the value of $\mu$ that 
minimizes 

$$Q(\mu) = \sum_{i=1}^{n} (x_i - \mu)^2 = \sum_{i=1}^{n} x_i^2 - 2\mu \sum_{i=1}^{n} x_i + n\mu^2.$$

We see that $Q$ is a quadratic in $\mu$ with positive coefficient on $\mu^2$. It follows that 
$Q$ will be minimized where its derivative is 0. If we now calculate the derivative 
$dQ(\mu)/d\mu$, set this derivative equal to 0, and solve the resulting equation for $\mu$, we 
find that $\mu = \bar{x}_n$. It follows, therefore, that the M.L.E. of $\mu$ is $\hat{\mu} = \bar{X}_n$. ◄ 


It can be seen in Example 7.5.5 that the estimator $\hat{\mu}$ is not affected by the value 
of the variance $\sigma^2$, which we assumed was known. The M.L.E. of the unknown mean 
$\mu$ is simply the sample mean $\bar{X}_n$, regardless of the value of $\sigma^2$. We shall see this again 
in the next example, in which both $\mu$ and $\sigma^2$ must be estimated. 


Example 7.5.6 Sampling from a Normal Distribution with Unknown Mean and Variance. Suppose 
again that $X_1, \ldots, X_n$ form a random sample from a normal distribution, but suppose 
now that both the mean $\mu$ and the variance $\sigma^2$ are unknown. The parameter is then 
$\theta = (\mu, \sigma^2)$. For all observed values $x_1, \ldots, x_n$, the likelihood function $f_n(\mathbf{x}|\mu, \sigma^2)$ 
will again be given by the right side of Eq. (7.5.2). This function must now be 
maximized over all possible values of $\mu$ and $\sigma^2$, where $-\infty < \mu < \infty$ and $\sigma^2 > 0$. 
Instead of maximizing the likelihood function $f_n(\mathbf{x}|\mu, \sigma^2)$ directly, it is again easier 
to maximize $\log f_n(\mathbf{x}|\mu, \sigma^2)$. We have 

$$L(\theta) = \log f_n(\mathbf{x}|\mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2. \tag{7.5.3}$$

We shall find the value of $\theta = (\mu, \sigma^2)$ for which $L(\theta)$ is maximum in three 
stages. First, for each fixed $\sigma^2$, we shall find the value $\hat{\mu}(\sigma^2)$ that maximizes the right 
side of (7.5.3). Second, we shall find the value $\hat{\sigma}^2$ of $\sigma^2$ that maximizes $L(\theta')$ when 
$\theta' = (\hat{\mu}(\sigma^2), \sigma^2)$. Finally, the M.L.E. of $\theta$ will be the random vector whose observed 
value is $(\hat{\mu}(\hat{\sigma}^2), \hat{\sigma}^2)$. The first stage has already been solved in Example 7.5.5. There, 
we obtained $\hat{\mu}(\sigma^2) = \bar{x}_n$. For the second stage, we set $\theta' = (\bar{x}_n, \sigma^2)$ and maximize 

$$L(\theta') = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2. \tag{7.5.4}$$

This can be maximized by setting its derivative with respect to $\sigma^2$ equal to 0 and 
solving for $\sigma^2$. The derivative is 

$$\frac{d}{d\sigma^2} L(\theta') = -\frac{n}{2} \frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2.$$

Setting this to 0 yields 

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2. \tag{7.5.5}$$

The second derivative of (7.5.4) is negative at the value of $\sigma^2$ in (7.5.5), so we have 
found the maximum. Therefore, the M.L.E. of $\theta = (\mu, \sigma^2)$ is 

$$\hat{\theta} = (\hat{\mu}, \hat{\sigma}^2) = \left( \bar{X}_n, \; \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \right). \tag{7.5.6}$$

Notice that the first coordinate of the M.L.E. in Eq. (7.5.6) is called the sample 
mean of the data. Likewise, we call the second coordinate of this M.L.E. the sample 
variance. It is not difficult to see that the observed value of the sample variance is 
the variance of a distribution that assigns probability $1/n$ to each of the $n$ observed 
values $x_1, \ldots, x_n$ in the sample. (See Exercise 1.) ◄ 
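
The closed forms in Eq. (7.5.6) are easy to check on a small data set. The sketch below is illustrative only: the data are made up, and the comparison with `statistics.pvariance` relies on the fact that this standard-library function divides by $n$ rather than $n - 1$, which matches the definition of sample variance used here.

```python
import math
import statistics

def normal_mle(xs):
    """M.L.E. (mu_hat, sigma2_hat) of Eq. (7.5.6); the variance divides by n."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat

def normal_log_likelihood(xs, mu, sigma2):
    """Right side of Eq. (7.5.3)."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

data = [4.1, 5.3, 3.8, 6.0, 5.2]  # hypothetical observations
mu_hat, sigma2_hat = normal_mle(data)

best = normal_log_likelihood(data, mu_hat, sigma2_hat)
# Nearby parameter values should never give a larger log-likelihood.
nearby = [(mu_hat + 0.1, sigma2_hat), (mu_hat - 0.1, sigma2_hat),
          (mu_hat, 1.2 * sigma2_hat), (mu_hat, 0.8 * sigma2_hat)]

print(mu_hat, sigma2_hat, statistics.pvariance(data))
print(all(normal_log_likelihood(data, m, s2) < best for m, s2 in nearby))
```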


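Example 7.5.7 Sampling from a Uniform Distribution. Suppose that $X_1, \ldots, X_n$ form a random 
sample from the uniform distribution on the interval $[0, \theta]$, where the value of the parameter 
$\theta$ is unknown ($\theta > 0$). The p.d.f. $f(x|\theta)$ of each observation has the following form: 

$$f(x|\theta) = \begin{cases} \dfrac{1}{\theta} & \text{for } 0 \le x \le \theta, \\ 0 & \text{otherwise.} \end{cases} \tag{7.5.7}$$

Therefore, the joint p.d.f. $f_n(\mathbf{x}|\theta)$ of $X_1, \ldots, X_n$ has the form 

$$f_n(\mathbf{x}|\theta) = \begin{cases} \dfrac{1}{\theta^n} & \text{for } 0 \le x_i \le \theta \ (i = 1, \ldots, n), \\ 0 & \text{otherwise.} \end{cases} \tag{7.5.8}$$

It can be seen from Eq. (7.5.8) that the M.L.E. of $\theta$ must be a value of $\theta$ for 
which $\theta \ge x_i$ for $i = 1, \ldots, n$ and that maximizes $1/\theta^n$ among all such values. Since 
$1/\theta^n$ is a decreasing function of $\theta$, the estimate will be the smallest value of $\theta$ such 
that $\theta \ge x_i$ for $i = 1, \ldots, n$. Since this value is $\theta = \max\{x_1, \ldots, x_n\}$, the M.L.E. of $\theta$ 
is $\hat{\theta} = \max\{X_1, \ldots, X_n\}$. ◄ 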
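
A small simulation illustrates how the estimator of Example 7.5.7 behaves. This Python sketch is an illustration, not part of the original text: the true value $\theta = 2$, the sample size, and the seed are arbitrary choices, and the comparison value $n\theta/(n+1)$ is the standard formula for the mean of the maximum of $n$ uniform observations, which is not derived in this section.

```python
import random

def uniform_mle(xs):
    """M.L.E. of theta for a sample from the uniform distribution on [0, theta]."""
    return max(xs)

rng = random.Random(0)   # fixed seed so that the run is repeatable
theta_true = 2.0         # hypothetical true parameter value
n = 5                    # sample size
reps = 20000             # number of simulated samples

estimates = [uniform_mle([rng.uniform(0.0, theta_true) for _ in range(n)])
             for _ in range(reps)]
average_estimate = sum(estimates) / reps

# The estimate can never exceed theta, so on average it falls below it.
print(max(estimates) <= theta_true)
print(average_estimate, n * theta_true / (n + 1))
```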


Limitations of Maximum Likelihood Estimation 


Despite its intuitive appeal, the method of maximum likelihood is not necessarily 
appropriate in all problems. For instance, in Example 7.5.7, the M.L.E. $\hat{\theta}$ does not 
seem to be a suitable estimator of $\theta$. Since $\max\{X_1, \ldots, X_n\} < \theta$ with probability 1, it 
follows that $\hat{\theta}$ surely underestimates the value of $\theta$. Indeed, if any prior distribution 
is assigned to $\theta$, then the Bayes estimator of $\theta$ will surely be greater than $\hat{\theta}$. The 
actual amount by which the Bayes estimator exceeds $\hat{\theta}$ will, of course, depend on the 
particular prior distribution that is used and on the observed values of $X_1, \ldots, X_n$. 
Example 7.5.7 also raises another difficulty with maximum likelihood, as we illustrate 
in Example 7.5.8. 


Example 7.5.8 Nonexistence of an M.L.E. Suppose again that $X_1, \ldots, X_n$ form a random 
sample from the uniform distribution on the interval $[0, \theta]$. However, suppose now that instead of 
writing the p.d.f. $f(x|\theta)$ of the uniform distribution in the form given in Eq. (7.5.7), 
we write it in the following form: 

$$f(x|\theta) = \begin{cases} \dfrac{1}{\theta} & \text{for } 0 < x < \theta, \\ 0 & \text{otherwise.} \end{cases} \tag{7.5.9}$$

The only difference between Eq. (7.5.7) and Eq. (7.5.9) is that the value of 
the p.d.f. at each of the two endpoints 0 and $\theta$ has been changed by replacing the 
weak inequalities in Eq. (7.5.7) with strict inequalities in Eq. (7.5.9). Therefore, 
either equation could be used as the p.d.f. of the uniform distribution. However, 
if Eq. (7.5.9) is used as the p.d.f., then an M.L.E. of $\theta$ will be a value of $\theta$ for which 
$\theta > x_i$ for $i = 1, \ldots, n$ and which maximizes $1/\theta^n$ among all such values. It should be 
noted that the possible values of $\theta$ no longer include the value $\theta = \max\{x_1, \ldots, x_n\}$, 
because $\theta$ must be strictly greater than each observed value $x_i$ ($i = 1, \ldots, n$). Because 
$\theta$ can be chosen arbitrarily close to the value $\max\{x_1, \ldots, x_n\}$ but cannot be chosen 
equal to this value, it follows that the M.L.E. of $\theta$ does not exist. ◄ 


In all of our previous discussions about p.d.f.'s, we emphasized the fact that it is 
irrelevant whether the p.d.f. of the uniform distribution is chosen to be equal to $1/\theta$ 
over the open interval $0 < x < \theta$ or over the closed interval $0 \le x \le \theta$. Now, however, 
we see that the existence of an M.L.E. depends on this irrelevant and unimportant 
choice. This difficulty is easily avoided in Example 7.5.8 by using the p.d.f. given by 
Eq. (7.5.7) rather than that given by Eq. (7.5.9). In many other problems as well, a 
difficulty of this type can be avoided simply by choosing one particular appropriate 
version of the p.d.f. to represent the given distribution. However, as we shall see in 
Example 7.5.10, the difficulty cannot always be avoided. 


Example 7.5.9 Non-uniqueness of an M.L.E. Suppose that $X_1, \ldots, X_n$ form a random sample 
from the uniform distribution on the interval $[\theta, \theta + 1]$, where the value of the parameter 
$\theta$ is unknown ($-\infty < \theta < \infty$). In this example, the joint p.d.f. $f_n(\mathbf{x}|\theta)$ has the form 

$$f_n(\mathbf{x}|\theta) = \begin{cases} 1 & \text{for } \theta \le x_i \le \theta + 1 \ (i = 1, \ldots, n), \\ 0 & \text{otherwise.} \end{cases} \tag{7.5.10}$$

The condition that $\theta \le x_i$ for $i = 1, \ldots, n$ is equivalent to the condition that 
$\theta \le \min\{x_1, \ldots, x_n\}$. Similarly, the condition that $x_i \le \theta + 1$ for $i = 1, \ldots, n$ is equivalent 
to the condition that $\theta \ge \max\{x_1, \ldots, x_n\} - 1$. Therefore, instead of writing $f_n(\mathbf{x}|\theta)$ 
in the form given in Eq. (7.5.10), we can use the following form: 

$$f_n(\mathbf{x}|\theta) = \begin{cases} 1 & \text{for } \max\{x_1, \ldots, x_n\} - 1 \le \theta \le \min\{x_1, \ldots, x_n\}, \\ 0 & \text{otherwise.} \end{cases} \tag{7.5.11}$$

Thus, it is possible to select as an M.L.E. any value of $\theta$ in the interval 

$$\max\{x_1, \ldots, x_n\} - 1 \le \theta \le \min\{x_1, \ldots, x_n\}. \tag{7.5.12}$$

In this example, the M.L.E. is not uniquely specified. In fact, the method of 
maximum likelihood provides very little help in choosing an estimate of $\theta$. The 
likelihood of every value of $\theta$ outside the interval (7.5.12) is actually 0. Therefore, 
no value $\theta$ outside this interval would ever be estimated, and all values inside the 
interval are M.L.E.'s. ◄ 
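
The interval (7.5.12) can be computed directly from the data. The following Python sketch is illustrative: the observations are made up, and the function names are not from the text.

```python
def shifted_uniform_likelihood(theta, xs):
    """Eq. (7.5.10): 1 if every x_i lies in [theta, theta + 1], else 0."""
    return 1.0 if all(theta <= x <= theta + 1 for x in xs) else 0.0

def mle_interval(xs):
    """Endpoints of the interval (7.5.12); every point inside is an M.L.E."""
    lower, upper = max(xs) - 1, min(xs)
    if lower > upper:
        raise ValueError("data are not consistent with any interval of length 1")
    return lower, upper

observations = [2.3, 2.9, 2.5]          # hypothetical sample
lower, upper = mle_interval(observations)
print(lower, upper)

# The likelihood equals 1 inside the interval and 0 outside it.
for theta in (1.8, 2.0, 2.2, 2.4):
    print(theta, shifted_uniform_likelihood(theta, observations))
```

Because the likelihood is flat on the whole interval, the data alone give no reason to prefer one point of it over another.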


Example 7.5.10 Sampling from a Mixture of Two Distributions. Consider a random variable $X$ 
that can come with equal probability either from the normal distribution with mean 0 and 
variance 1 or from another normal distribution with mean $\mu$ and variance $\sigma^2$, where 
both $\mu$ and $\sigma^2$ are unknown. Under these conditions, the p.d.f. $f(x|\mu, \sigma^2)$ of $X$ will 
be the average of the p.d.f.'s of the two different normal distributions. Thus, 

$$f(x|\mu, \sigma^2) = \frac{1}{2} \left\{ \frac{1}{(2\pi)^{1/2}} \exp\left( -\frac{x^2}{2} \right) + \frac{1}{(2\pi)^{1/2} \sigma} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right] \right\}. \tag{7.5.13}$$

Suppose now that $X_1, \ldots, X_n$ form a random sample from the distribution for 
which the p.d.f. is given by Eq. (7.5.13). As usual, the likelihood function $f_n(\mathbf{x}|\mu, \sigma^2)$ 
has the form 

$$f_n(\mathbf{x}|\mu, \sigma^2) = \prod_{i=1}^{n} f(x_i|\mu, \sigma^2). \tag{7.5.14}$$

To find the M.L.E. of $\theta = (\mu, \sigma^2)$, we must find values of $\mu$ and $\sigma^2$ for which 
$f_n(\mathbf{x}|\mu, \sigma^2)$ is maximized. 

Let $x_k$ denote any one of the observed values $x_1, \ldots, x_n$. If we let $\mu = x_k$ and let 
$\sigma^2 \to 0$, then the factor $f(x_k|\mu, \sigma^2)$ on the right side of Eq. (7.5.14) will grow large 
without bound, while each factor $f(x_i|\mu, \sigma^2)$ for $x_i \ne x_k$ will approach the value 

$$\frac{1}{2(2\pi)^{1/2}} \exp\left( -\frac{x_i^2}{2} \right),$$

which is the first term of Eq. (7.5.13) and is strictly positive. 
Hence, when $\mu = x_k$ and $\sigma^2 \to 0$, we find that $f_n(\mathbf{x}|\mu, \sigma^2) \to \infty$. 

The value 0 is not a permissible estimate of $\sigma^2$, because we know in advance that 
$\sigma^2 > 0$. Since the likelihood function can be made arbitrarily large by choosing $\mu = x_k$ 
and choosing $\sigma^2$ arbitrarily close to 0, it follows that the M.L.E. does not exist. 

If we try to correct this difficulty by allowing the value 0 to be a permissible 
estimate of $\sigma^2$, then we find that there are $n$ different M.L.E.'s of $\mu$ and $\sigma^2$; namely, 

$$\hat{\theta}_k = (\hat{\mu}, \hat{\sigma}^2) = (X_k, 0) \quad \text{for } k = 1, \ldots, n.$$

None of these estimators seems appropriate. Consider again the description, given 
at the beginning of this example, of the two normal distributions from which each 
observation might come. Suppose, for example, that $n = 1000$, and we use the 
estimator $\hat{\theta}_3 = (X_3, 0)$. Then, we would be estimating the value of the unknown variance 
to be 0; also, in effect, we would be behaving as if exactly one of the $X_i$'s (namely, 
$X_3$) comes from the given unknown normal distribution, whereas all the other 999 
observation values come from the normal distribution with mean 0 and variance 1. 
In fact, however, since each observation was equally likely to come from either of the 
two distributions, it is much more probable that hundreds of observations, rather than 
just one, come from the unknown normal distribution. In this example, the method of 
maximum likelihood is obviously unsatisfactory. A Bayesian solution to this problem 
is outlined in Exercise 10 in Sec. 12.5. ◄ 
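
The unbounded likelihood of Example 7.5.10 can be seen numerically. In the Python sketch below, which is an illustration with made-up data, $\mu$ is fixed at the first observation and $\sigma^2$ is moved toward 0; the log-likelihood keeps growing instead of levelling off.

```python
import math

def mixture_pdf(x, mu, sigma2):
    """Eq. (7.5.13): equal-weight mixture of N(0, 1) and N(mu, sigma2)."""
    standard = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    other = math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    return 0.5 * (standard + other)

def mixture_log_likelihood(xs, mu, sigma2):
    """Logarithm of Eq. (7.5.14)."""
    return sum(math.log(mixture_pdf(x, mu, sigma2)) for x in xs)

observations = [0.3, -1.2, 2.5, 1.1]   # hypothetical sample
mu_fixed = observations[0]             # put mu exactly on one data point

shrinking_variances = [1e-2, 1e-4, 1e-8]
log_likelihoods = [mixture_log_likelihood(observations, mu_fixed, s2)
                   for s2 in shrinking_variances]

for s2, ll in zip(shrinking_variances, log_likelihoods):
    print(s2, ll)
```

Each hundredfold reduction of $\sigma^2$ adds roughly $\log 10$ to the log-likelihood, because the factor for the chosen observation grows like $1/\sigma$ while the other factors stay bounded away from 0.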


Finally, we shall mention one point concerning the interpretation of the M.L.E. 
The M.L.E. is the value of $\theta$ that maximizes the conditional p.f. or p.d.f. of the data $\mathbf{X}$ 
given $\theta$. Therefore, the maximum likelihood estimate is the value of $\theta$ that assigned 
the highest probability to seeing the observed data. It is not necessarily the value of 
the parameter that appears to be most likely given the data. To say how likely are 
different values of the parameter, one would need a probability distribution for the 
parameter. Of course, the posterior distribution of the parameter (Sec. 7.2) would 
serve this purpose, but no posterior distribution is involved in the calculation of the 
M.L.E. Hence, it is not legitimate to interpret the M.L.E. as the most likely value of 
the parameter after having seen the data. 

For example, consider a situation covered by Example 7.5.4. Suppose that we 
are going to flip a coin a few times, and we are concerned with whether or not it 
has a slight bias toward heads or toward tails. Let $X_i = 1$ if the $i$th flip is heads and 
$X_i = 0$ if not. If we obtain four heads and one tail in the first five flips, the observed 
value of the M.L.E. will be 0.8. But it would be difficult to imagine a situation in 
which we would feel that the most likely value of $\theta$, the probability of heads, is as 
large as 0.8 based on just five tosses of what appeared a priori to be a typical coin. 
Treating the M.L.E. as if it were the most likely value of the parameter is very much 
the same as ignoring the prior information about the rare disease in the medical test 
of Examples 2.3.1 and 2.3.3. If the test is positive in these examples, we found (in 
Example 7.5.3) that the M.L.E. takes the value $\hat{\theta} = 0.9$, which corresponds to having 
the disease. However, if the prior probability that you have the disease is as small 
as in Example 2.3.1, the posterior probability that you have the disease ($\theta = 0.9$) 
is still small even after the positive test result. The test is not accurate enough to 
completely overcome the prior information. So too with our coin tossing; five tosses 
are not enough information to overcome prior beliefs about the coin being typical. 
Only when the data contain much more information than is available a priori would 
it be approximately correct to think of the M.L.E. as the value that we believe the 
parameter is most likely to be near. This could happen either when the M.L.E. is 
based on a lot of data or when there is very little prior information. 


Summary 


The maximum likelihood estimate of a parameter $\theta$ is that value of $\theta$ that provides 
the largest value of the likelihood function $f_n(\mathbf{x}|\theta)$ for fixed data $\mathbf{x}$. If $\delta(\mathbf{x})$ denotes the 
maximum likelihood estimate, then $\hat{\theta} = \delta(\mathbf{X})$ is the maximum likelihood estimator 
(M.L.E.). We have computed the M.L.E. when the data comprise a random sample 
from a Bernoulli distribution, a normal distribution with known variance, a normal 
distribution with both parameters unknown, or the uniform distribution on the 
interval $[0, \theta]$ or on the interval $[\theta, \theta + 1]$. 


Exercises 


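1. Let $x_1, \ldots, x_n$ be distinct numbers. Let $Y$ be a discrete 
random variable with the following p.f.: 

$$f(y) = \begin{cases} \dfrac{1}{n} & \text{if } y \in \{x_1, \ldots, x_n\}, \\ 0 & \text{otherwise.} \end{cases}$$

Prove that $\operatorname{Var}(Y)$ is given by Eq. (7.5.5). 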


2. It is not known what proportion p of the purchases of a 
certain brand of breakfast cereal are made by women and 
what proportion are made by men. In a random sample of 
70 purchases of this cereal, it was found that 58 were made 
by women and 12 were made by men. Find the M.L.E. of p. 


3. Consider again the conditions in Exercise 2, but suppose 
also that it is known that $\frac{1}{2} \le p \le \frac{2}{3}$. If the observations 
in the random sample of 70 purchases are as given 
in Exercise 2, what is the M.L.E. of $p$? 


4. Suppose that $X_1, \ldots, X_n$ form a random sample from 
the Bernoulli distribution with parameter $\theta$, which is 
unknown, but it is known that $\theta$ lies in the open interval 
$0 < \theta < 1$. Show that the M.L.E. of $\theta$ does not exist if every 
observed value is 0 or if every observed value is 1. 


5. Suppose that $X_1, \ldots, X_n$ form a random sample from 
a Poisson distribution for which the mean $\theta$ is unknown 
($\theta > 0$). 

a. Determine the M.L.E. of $\theta$, assuming that at least 
one of the observed values is different from 0. 


b. Show that the M.L.E. of $\theta$ does not exist if every 
observed value is 0. 


6. Suppose that $X_1, \ldots, X_n$ form a random sample from 
a normal distribution for which the mean $\mu$ is known, but 
the variance $\sigma^2$ is unknown. Find the M.L.E. of $\sigma^2$. 


7. Suppose that $X_1, \ldots, X_n$ form a random sample from 
an exponential distribution for which the value of the 
parameter $\beta$ is unknown ($\beta > 0$). Find the M.L.E. of $\beta$. 


8. Suppose that $X_1, \ldots, X_n$ form a random sample from 
a distribution for which the p.d.f. $f(x|\theta)$ is as follows: 

$$f(x|\theta) = \begin{cases} e^{\theta - x} & \text{for } x > \theta, \\ 0 & \text{for } x \le \theta. \end{cases}$$

Also, suppose that the value of $\theta$ is unknown ($-\infty < \theta < \infty$). 

a. Show that the M.L.E. of $\theta$ does not exist. 

b. Determine another version of the p.d.f. of this same 
distribution for which the M.L.E. of $\theta$ will exist, and 
find this estimator. 


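9. Suppose that $X_1, \ldots, X_n$ form a random sample from a 
distribution for which the p.d.f. $f(x|\theta)$ is as follows: 

$$f(x|\theta) = \begin{cases} \theta x^{\theta - 1} & \text{for } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases}$$

Also, suppose that the value of $\theta$ is unknown ($\theta > 0$). Find 
the M.L.E. of $\theta$. 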


10. Suppose that $X_1, \ldots, X_n$ form a random sample from 
a distribution for which the p.d.f. $f(x|\theta)$ is as follows: 

$$f(x|\theta) = \frac{1}{2} e^{-|x - \theta|} \quad \text{for } -\infty < x < \infty.$$

Also, suppose that the value of $\theta$ is unknown ($-\infty < \theta < \infty$). 
Find the M.L.E. of $\theta$. Hint: Compare this to the 
problem of minimizing M.A.E. as in Theorem 4.5.3. 


11. Suppose that $X_1, \ldots, X_n$ form a random sample from 
the uniform distribution on the interval $[\theta_1, \theta_2]$, where 
both $\theta_1$ and $\theta_2$ are unknown ($-\infty < \theta_1 < \theta_2 < \infty$). Find the 
M.L.E.'s of $\theta_1$ and $\theta_2$. 


12. Suppose that a certain large population contains $k$ 
different types of individuals ($k \ge 2$), and let $\theta_i$ denote 
the proportion of individuals of type $i$, for $i = 1, \ldots, k$. 
Here, $0 \le \theta_i \le 1$ and $\theta_1 + \cdots + \theta_k = 1$. Suppose also that 
in a random sample of $n$ individuals from this population, 
exactly $n_i$ individuals are of type $i$, where $n_1 + \cdots + n_k = n$. 
Find the M.L.E.'s of $\theta_1, \ldots, \theta_k$. 


13. Suppose that the two-dimensional vectors $(X_1, Y_1)$, 
$(X_2, Y_2), \ldots, (X_n, Y_n)$ form a random sample from a 
bivariate normal distribution for which the means of $X$ and 
$Y$ are unknown but the variances of $X$ and $Y$ and the 
correlation between $X$ and $Y$ are known. Find the M.L.E.'s of 
the means. 


7.6 Properties of Maximum Likelihood Estimators 


In this section, we explore several properties of M.L.E.’s, including: 


• The relationship between the M.L.E. of a parameter and the M.L.E. of a 
function of that parameter 
• The need for computational algorithms 
• The behavior of the M.L.E. as the sample size increases 
• The lack of dependence of the M.L.E. on the sampling plan 


We also introduce a popular alternative method of estimation (method of mo- 
ments) that sometimes agrees with maximum likelihood, but can sometimes be 
computationally simpler. 


Invariance 


Example 7.6.1 Lifetimes of Electronic Components. In Example 7.1.1, the parameter $\theta$ was 
interpreted as the failure rate of electronic components. In Example 7.4.8, we found a Bayes 
estimate of $\psi = 1/\theta$, the average lifetime. Is there a corresponding method for 
computing the M.L.E. of $\psi$? ◄ 


Suppose that $X_1, \ldots, X_n$ form a random sample from a distribution for which 
either the p.f. or the p.d.f. is $f(x|\theta)$, where the value of the parameter $\theta$ is unknown. 
The parameter may be one-dimensional or a vector of parameters. Let $\hat{\theta}$ denote the 
M.L.E. of $\theta$. Thus, for all observed values $x_1, \ldots, x_n$, the likelihood function $f_n(\mathbf{x}|\theta)$ 
is maximized when $\theta = \hat{\theta}$. 

Suppose now that we change the parameter in the distribution as follows: Instead 
of expressing the p.f. or the p.d.f. $f(x|\theta)$ in terms of the parameter $\theta$, we shall express 
it in terms of a new parameter $\psi = g(\theta)$, where $g$ is a one-to-one function of $\theta$. Is 
there a relationship between the M.L.E. of $\theta$ and the M.L.E. of $\psi$? 


Theorem 7.6.1 Invariance Property of M.L.E.'s. If $\hat{\theta}$ is the maximum likelihood estimator 
of $\theta$ and if $g$ is a one-to-one function, then $g(\hat{\theta})$ is the maximum likelihood estimator of $g(\theta)$. 


Proof The new parameter space is $\Gamma$, the image of $\Omega$ under the function $g$. We 
shall let $\theta = h(\psi)$ denote the inverse function. Then, expressed in terms of the new 
parameter $\psi$, the p.f. or p.d.f. of each observed value will be $f[x|h(\psi)]$, and the 
likelihood function will be $f_n[\mathbf{x}|h(\psi)]$. 

The M.L.E. $\hat{\psi}$ of $\psi$ will be equal to the value of $\psi$ for which $f_n[\mathbf{x}|h(\psi)]$ 
is maximized. Since $f_n(\mathbf{x}|\theta)$ is maximized when $\theta = \hat{\theta}$, it follows that $f_n[\mathbf{x}|h(\psi)]$ is 
maximized when $h(\psi) = \hat{\theta}$. Hence, the M.L.E. $\hat{\psi}$ must satisfy the relation $h(\hat{\psi}) = \hat{\theta}$ 
or, equivalently, $\hat{\psi} = g(\hat{\theta})$. 


Example 7.6.2 Lifetimes of Electronic Components. According to Theorem 7.6.1, the M.L.E. 
of $\psi$ is one over the M.L.E. of $\theta$. In Example 7.5.2, we computed the observed value of 
$\hat{\theta} = 0.455$. The observed value of $\hat{\psi}$ would then be $1/0.455 = 2.2$. This is a bit smaller 
than the Bayes estimate using squared error loss of 2.867 found in Example 7.4.8. ◄ 


The invariance property can be extended to functions that are not one-to-one. 
For example, suppose that we wish to estimate the mean $\mu$ of a normal distribution 
when both the mean and the variance are unknown. Then $\mu$ is not a one-to-one 
function of the parameter $\theta = (\mu, \sigma^2)$. In this case, the function we wish to estimate 
is $g(\theta) = \mu$. There is a way to define the M.L.E. of a function of $\theta$ that is not necessarily 
one-to-one. One popular way is the following. 


Definition 7.6.1 M.L.E. of a Function. Let $g(\theta)$ be an arbitrary function of the parameter, and let $G$ be the image of $\Omega$ under the function $g$. For each $t \in G$, define $G_t = \{\theta : g(\theta) = t\}$ and define

$$L^*(t) = \max_{\theta \in G_t} \log f_n(\boldsymbol{x}|\theta).$$

Finally, define the M.L.E. of $g(\theta)$ to be $\hat{t}$ where

$$L^*(\hat{t}) = \max_{t \in G} L^*(t). \tag{7.6.1}$$


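The following result shows how to find the M.L.E. of $g(\theta)$ based on Definition 7.6.1.

Theorem 7.6.2 Let $\hat{\theta}$ be an M.L.E. of $\theta$, and let $g(\theta)$ be a function of $\theta$. Then an M.L.E. of $g(\theta)$ is $g(\hat{\theta})$.

Proof We shall prove that $\hat{t} = g(\hat{\theta})$ satisfies (7.6.1). Since $L^*(t)$ is the maximum of $\log f_n(\boldsymbol{x}|\theta)$ over $\theta$ in a subset of $\Omega$, and since $\log f_n(\boldsymbol{x}|\hat{\theta})$ is the maximum over all $\theta$, we know that $L^*(t) \le \log f_n(\boldsymbol{x}|\hat{\theta})$ for all $t \in G$. Let $\hat{t} = g(\hat{\theta})$. We are done if we can show that $L^*(\hat{t}) = \log f_n(\boldsymbol{x}|\hat{\theta})$. Note that $\hat{\theta} \in G_{\hat{t}}$. Since $\hat{\theta}$ maximizes $f_n(\boldsymbol{x}|\theta)$ over all $\theta$, it also maximizes $f_n(\boldsymbol{x}|\theta)$ over $\theta \in G_{\hat{t}}$. Hence, $L^*(\hat{t}) = \log f_n(\boldsymbol{x}|\hat{\theta})$ and $\hat{t} = g(\hat{\theta})$ is an M.L.E. of $g(\theta)$. ■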


Example 7.6.3 Estimating the Standard Deviation and the Second Moment. Suppose that $X_1, \ldots, X_n$ form a random sample from a normal distribution for which both the mean $\mu$ and the variance $\sigma^2$ are unknown. We shall determine the M.L.E. of the standard deviation $\sigma$ and the M.L.E. of the second moment of the normal distribution $E(X^2)$. It was found in Example 7.5.6 that the M.L.E. of $\theta = (\mu, \sigma^2)$ is $\hat{\theta} = (\hat{\mu}, \widehat{\sigma^2})$. From the invariance property, we can conclude that the M.L.E. $\hat{\sigma}$ of the standard deviation is simply the square root of the sample variance. In symbols, $\hat{\sigma} = (\widehat{\sigma^2})^{1/2}$. Also, since $E(X^2) = \sigma^2 + \mu^2$, the M.L.E. of $E(X^2)$ will be $\widehat{\sigma^2} + \hat{\mu}^2$. ◄
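The invariance property in the preceding example can be checked numerically. The following is a minimal sketch, assuming a small made-up sample; the function names `normal_mle` and `invariance_estimates` are illustrative and do not come from the text.

```python
import math

def normal_mle(xs):
    """Return the M.L.E.s (mu_hat, sigma2_hat) for a normal sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    # The M.L.E. of the variance divides by n, not n - 1.
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat

def invariance_estimates(xs):
    """Apply invariance: plug the M.L.E.s into sigma and E(X^2)."""
    mu_hat, sigma2_hat = normal_mle(xs)
    sigma_hat = math.sqrt(sigma2_hat)            # M.L.E. of sigma
    second_moment_hat = sigma2_hat + mu_hat ** 2  # M.L.E. of E(X^2)
    return sigma_hat, second_moment_hat

sample = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
sigma_hat, second_moment_hat = invariance_estimates(sample)
print(sigma_hat, second_moment_hat)
```

For this sample the M.L.E. of $E(X^2)$ coincides with the average of the squared observations, as one would expect from the identity $\widehat{\sigma^2} + \hat{\mu}^2 = \frac{1}{n}\sum x_i^2$.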


Consistency 


Consider an estimation problem in which a random sample is to be taken from a distribution involving a parameter $\theta$. Suppose that for every sufficiently large sample size $n$, that is, for every value of $n$ greater than some given minimum number, there exists a unique M.L.E. of $\theta$. Then, under certain conditions, which are typically satisfied in practical problems, the sequence of M.L.E.’s is a consistent sequence of estimators of $\theta$. In other words, in such problems the sequence of M.L.E.’s converges in probability to the unknown value of $\theta$ as $n \to \infty$.

We have remarked in Sec. 7.4 that under certain general conditions the sequence of Bayes estimators of a parameter $\theta$ is also a consistent sequence of estimators. Therefore, for a given prior distribution and a sufficiently large sample size $n$, the Bayes estimator and the M.L.E. of $\theta$ will typically be very close to each other, and both will be very close to the unknown value of $\theta$.

We shall not present any formal details of the conditions that are needed to prove this result. (Details can be found in chapter 7 of Schervish, 1995.) We shall, however, illustrate the result by considering again a random sample $X_1, \ldots, X_n$ from the Bernoulli distribution with parameter $\theta$, which is unknown ($0 \le \theta \le 1$). It was shown in Sec. 7.4 that if the given prior distribution of $\theta$ is a beta distribution, then the difference between the Bayes estimator of $\theta$ and the sample mean $\bar{X}_n$ converges to 0 as $n \to \infty$. Furthermore, it was shown in Example 7.5.4 that the M.L.E. of $\theta$ is $\bar{X}_n$. Thus, as $n \to \infty$, the difference between the Bayes estimator and the M.L.E. will converge to 0. Finally, the law of large numbers (Theorem 6.2.4) says that the sample mean $\bar{X}_n$ converges in probability to $\theta$ as $n \to \infty$. Therefore, both the sequence of Bayes estimators and the sequence of M.L.E.’s are consistent sequences.
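The Bernoulli illustration can be explored with a short simulation. This is only a sketch under stated assumptions: the true value 0.3, the beta prior parameters 2 and 5, and the function name `bernoulli_estimates` are all hypothetical choices, and the Bayes estimator is taken to be the posterior mean (squared error loss).

```python
import random

def bernoulli_estimates(xs, alpha=1.0, beta=1.0):
    """Return (M.L.E., Bayes estimate) for Bernoulli data.

    The Bayes estimate is the posterior mean under a beta(alpha, beta)
    prior, that is, (alpha + sum x_i) / (alpha + beta + n).
    """
    n = len(xs)
    s = sum(xs)
    mle = s / n
    bayes = (alpha + s) / (alpha + beta + n)
    return mle, bayes

random.seed(0)
theta = 0.3  # hypothetical true parameter
for n in (10, 100, 1000, 10000):
    xs = [1 if random.random() < theta else 0 for _ in range(n)]
    mle, bayes = bernoulli_estimates(xs, alpha=2.0, beta=5.0)
    # The gap between the two estimates shrinks at rate 1/n.
    print(n, round(mle, 4), round(bayes, 4), round(abs(mle - bayes), 6))
```

The gap between the two estimates is bounded by $(\alpha + \beta)/(\alpha + \beta + n)$ whatever the data, which is why it vanishes as $n$ grows.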


Numerical Computation 


In many problems there exists a unique M.L.E. $\hat{\theta}$ of a given parameter $\theta$, but this M.L.E. cannot be expressed in closed form as a function of the observations in the sample. In such a problem, for a given set of observed values, it is necessary to determine the value of $\hat{\theta}$ by numerical computation. We shall illustrate this situation by two examples.


Example 7.6.4 Sampling from a Gamma Distribution. Suppose that $X_1, \ldots, X_n$ form a random sample from the gamma distribution for which the p.d.f. is as follows:

$$f(x|\alpha) = \frac{1}{\Gamma(\alpha)} x^{\alpha-1} e^{-x} \quad \text{for } x > 0. \tag{7.6.2}$$

Suppose also that the value of $\alpha$ is unknown ($\alpha > 0$) and is to be estimated. The likelihood function is

$$f_n(\boldsymbol{x}|\alpha) = \frac{1}{\Gamma^n(\alpha)} \left(\prod_{i=1}^{n} x_i\right)^{\alpha-1} \exp\left(-\sum_{i=1}^{n} x_i\right). \tag{7.6.3}$$

The M.L.E. of $\alpha$ will be the value of $\alpha$ that satisfies the equation

$$\frac{\partial \log f_n(\boldsymbol{x}|\alpha)}{\partial \alpha} = 0. \tag{7.6.4}$$

When we apply Eq. (7.6.4) in this example, we obtain the following equation:

$$\frac{\Gamma'(\alpha)}{\Gamma(\alpha)} = \frac{1}{n} \sum_{i=1}^{n} \log x_i. \tag{7.6.5}$$

Tables of the function $\Gamma'(\alpha)/\Gamma(\alpha)$, which is called the digamma function, are included in various published collections of mathematical tables. The digamma function is also available in several mathematical software packages. For all given values of $x_1, \ldots, x_n$, the unique value of $\alpha$ that satisfies Eq. (7.6.5) must be determined either by referring to these tables or by carrying out a numerical analysis of the digamma function. This value will be the M.L.E. of $\alpha$. ◄

Figure 7.7 Newton’s method to approximate the solution to $f(\theta) = 0$. The initial guess is $\theta_0$, and the revised guess is $\theta_1$.


Example 7.6.5 Sampling from a Cauchy Distribution. Suppose that $X_1, \ldots, X_n$ form a random sample from a Cauchy distribution centered at an unknown point $\theta$ ($-\infty < \theta < \infty$), for which the p.d.f. is as follows:

$$f(x|\theta) = \frac{1}{\pi\left[1 + (x - \theta)^2\right]} \quad \text{for } -\infty < x < \infty. \tag{7.6.6}$$

Suppose also that the value of $\theta$ is to be estimated. The likelihood function is

$$f_n(\boldsymbol{x}|\theta) = \frac{1}{\pi^n \prod_{i=1}^{n}\left[1 + (x_i - \theta)^2\right]}. \tag{7.6.7}$$

Therefore, the M.L.E. of $\theta$ will be the value that minimizes

$$\prod_{i=1}^{n}\left[1 + (x_i - \theta)^2\right]. \tag{7.6.8}$$

For most values of $x_1, \ldots, x_n$, the value of $\theta$ that minimizes the expression (7.6.8) must be determined by a numerical computation. ◄
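One simple way to carry out such a computation is sketched below. It is not a method given in the text: it minimizes the logarithm of (7.6.8) by a coarse grid search over the range of the data followed by a ternary search around the best grid point, which assumes the objective has a single minimum inside that small bracket. The function names and the data are hypothetical.

```python
import math

def cauchy_objective(theta, xs):
    """Logarithm of expression (7.6.8): sum of log(1 + (x_i - theta)^2)."""
    return sum(math.log1p((x - theta) ** 2) for x in xs)

def cauchy_mle(xs, grid_points=2001, refinements=60):
    """Approximate the minimizer of (7.6.8) for the sample xs."""
    lo, hi = min(xs), max(xs)
    if lo == hi:
        return lo
    # Outside [min, max] every factor of (7.6.8) grows, so search inside.
    step = (hi - lo) / (grid_points - 1)
    grid = [lo + k * step for k in range(grid_points)]
    best = min(grid, key=lambda t: cauchy_objective(t, xs))
    a, b = max(lo, best - step), min(hi, best + step)
    # Ternary search: valid if the objective is unimodal on [a, b].
    for _ in range(refinements):
        m1 = a + (b - a) / 3
        m2 = b - (b - a) / 3
        if cauchy_objective(m1, xs) < cauchy_objective(m2, xs):
            b = m2
        else:
            a = m1
    return (a + b) / 2

data = [-1.0, 0.0, 1.0]  # hypothetical, symmetric about 0
theta_hat = cauchy_mle(data)
print(theta_hat)
```

Because the Cauchy likelihood can have several local maxima, the grid stage matters: a purely local method started from a poor guess may settle on the wrong one.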


An alternative to exact solution of Eq. (7.6.4) is to start with a heuristic estimator of $\alpha$ and then apply Newton’s method.

Definition 7.6.2 Newton’s Method. Let $f(\theta)$ be a real-valued function of a real variable, and suppose that we wish to solve the equation $f(\theta) = 0$. Let $\theta_0$ be an initial guess at the solution. Newton’s method replaces the initial guess with the updated guess

$$\theta_1 = \theta_0 - \frac{f(\theta_0)}{f'(\theta_0)}.$$

The rationale behind Newton’s method is illustrated in Fig. 7.7. The function $f(\theta)$ is the solid curve. Newton’s method approximates the curve by a line tangent to the curve, that is, the dashed line passing through the point $(\theta_0, f(\theta_0))$, indicated by the circle. The approximating line crosses the horizontal axis at the revised guess $\theta_1$. Typically, one replaces the initial guess with the revised guess and iterates Newton’s method until the results stabilize.


Illustration of Newton’s Method

Example 7.6.6 Sampling from a Gamma Distribution. In Example 7.6.4, suppose that we observe $n = 20$ gamma random variables $X_1, \ldots, X_{20}$ with parameters $\alpha$ and 1. Suppose that the observed values are such that $\frac{1}{20}\sum_{i=1}^{20} \log(x_i) = 1.220$ and $\frac{1}{20}\sum_{i=1}^{20} x_i = 3.679$. We wish to use Newton’s method to approximate the M.L.E. A sensible initial guess is based on the fact that $E(X_i) = \alpha$. This suggests using $\alpha_0 = 3.679$, the sample mean. The function $f(\alpha)$ is $\psi(\alpha) - 1.220$, where $\psi$ is the digamma function. The derivative $f'(\alpha)$ is $\psi'(\alpha)$, which is known as the trigamma function. Newton’s method updates the initial guess $\alpha_0$ to

$$\alpha_1 = \alpha_0 - \frac{\psi(\alpha_0) - 1.220}{\psi'(\alpha_0)} = 3.679 - \frac{1.1607 - 1.220}{0.3120} = 3.871.$$

Here, we have used statistical software that computes both the digamma and trigamma functions. After two more iterations, the approximation stabilizes at 3.876. ◄
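The iteration in this example can be reproduced with a few lines of code. The sketch below does not use the statistical software mentioned in the text; instead it assumes the standard recurrence and asymptotic series for the digamma and trigamma functions, so its digits may differ slightly from the rounded values quoted above. The function names are illustrative.

```python
import math

def digamma(x):
    """Approximate psi(x) = Gamma'(x)/Gamma(x) for x > 0."""
    result = 0.0
    while x < 6.0:  # recurrence psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # Asymptotic series: log x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def trigamma(x):
    """Approximate psi'(x) for x > 0."""
    result = 0.0
    while x < 6.0:  # recurrence psi'(x) = psi'(x + 1) + 1/x^2
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # Asymptotic series: 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7)
    result += inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))
    return result

def newton_gamma_mle(mean_log_x, alpha0, tol=1e-10, max_iter=50):
    """Solve psi(alpha) = mean_log_x by Newton's method from alpha0."""
    alpha = alpha0
    path = [alpha]
    for _ in range(max_iter):
        step = (digamma(alpha) - mean_log_x) / trigamma(alpha)
        alpha -= step
        path.append(alpha)
        if abs(step) < tol:
            break
    return alpha, path

alpha_hat, path = newton_gamma_mle(1.220, 3.679)
print(path)
```

Starting from the sample mean, the sequence settles within a handful of steps, in line with the behavior described in the example.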


Newton’s method can fail terribly if $f'(\theta)/f(\theta)$ gets close to 0 between $\theta_0$ and the actual solution to $f(\theta) = 0$. There is a multidimensional version of Newton’s method, which we will not present here. There are also many other numerical methods for maximizing functions. Any text on numerical optimization, such as Nocedal and Wright (2006), will describe some of them.


Method of Moments 


Example 7.6.7 Sampling from a Gamma Distribution. Suppose that $X_1, \ldots, X_n$ form a random sample from the gamma distribution with parameters $\alpha$ and $\beta$. In Example 7.6.4, we explained how one could find the M.L.E. of $\alpha$ if $\beta$ were known. The method involved the digamma function, which is unfamiliar to many people. A Bayes estimate would also be difficult to find in this example because we would have to integrate a function that includes a factor of $1/\Gamma(\alpha)^n$. Is there no other way to estimate the vector parameter $\theta$ in this example? ◄


The method of moments is an intuitive method for estimating parameters when 
other, more attractive, methods may be too difficult. It can also be used to obtain an 
initial guess for applying Newton’s method. 


Definition 7.6.3 Method of Moments. Assume that $X_1, \ldots, X_n$ form a random sample from a distribution that is indexed by a $k$-dimensional parameter $\theta$ and that has at least $k$ finite moments. For $j = 1, \ldots, k$, let $\mu_j(\theta) = E(X_1^j|\theta)$. Suppose that the function $\mu(\theta) = (\mu_1(\theta), \ldots, \mu_k(\theta))$ is a one-to-one function of $\theta$. Let $M(\mu_1, \ldots, \mu_k)$ denote the inverse function, that is, for all $\theta$,

$$\theta = M(\mu_1(\theta), \ldots, \mu_k(\theta)).$$

Define the sample moments by $m_j = \frac{1}{n}\sum_{i=1}^{n} X_i^j$ for $j = 1, \ldots, k$. The method of moments estimator of $\theta$ is $M(m_1, \ldots, m_k)$.

The usual way of implementing the method of moments is to set up the $k$ equations $m_j = \mu_j(\theta)$ and then solve for $\theta$.
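As an illustration of setting up and solving the moment equations, consider a gamma distribution with parameters $\alpha$ and $\beta$, whose first two moments are $\alpha/\beta$ and $\alpha(\alpha+1)/\beta^2$. Equating these to $m_1$ and $m_2$ gives $\beta = m_1/(m_2 - m_1^2)$ and $\alpha = m_1^2/(m_2 - m_1^2)$. The sketch below codes this under that assumption; the function names and the two-point data set are hypothetical.

```python
def sample_moments(xs, k):
    """Return the first k sample moments m_1, ..., m_k."""
    n = len(xs)
    return [sum(x ** j for x in xs) / n for j in range(1, k + 1)]

def gamma_method_of_moments(xs):
    """Solve m_1 = alpha/beta and m_2 = alpha(alpha + 1)/beta^2."""
    m1, m2 = sample_moments(xs, 2)
    spread = m2 - m1 ** 2  # sample variance (divisor n)
    if spread <= 0:
        raise ValueError("sample variance must be positive")
    alpha_hat = m1 ** 2 / spread
    beta_hat = m1 / spread
    return alpha_hat, beta_hat

# Hypothetical data chosen so that m_1 = 2 and m_2 = 6.
data = [2 - 2 ** 0.5, 2 + 2 ** 0.5]
alpha_hat, beta_hat = gamma_method_of_moments(data)
print(alpha_hat, beta_hat)
```

The same pattern (compute sample moments, then apply the inverse map $M$) carries over to any family for which $M$ is available in closed form.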


Example 7.6.8 Sampling from a Gamma Distribution. In Example 7.6.4, we considered a sample of size $n$ from the gamma distribution with parameters $\alpha$ and 1. The mean of each such random variable is $\mu_1(\alpha) = \alpha$. The method of moments estimator is then $\hat{\alpha} = m_1$, the sample mean. This was the initial guess used to start Newton’s method in Example 7.6.6. ◄


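Example 7.6.9 Sampling from a Gamma Distribution with Both Parameters Unknown. Theorem 5.7.5 tells us that the first two moments of the gamma distribution with parameters $\alpha$ and $\beta$ are

$$\mu_1(\theta) = \frac{\alpha}{\beta}, \qquad \mu_2(\theta) = \frac{\alpha(\alpha + 1)}{\beta^2}.$$

The method of moments says to replace $\mu_1(\theta)$ and $\mu_2(\theta)$ in these equations by the sample moments $m_1$ and $m_2$ and then solve for $\alpha$ and $\beta$. In this case, we get

$$\hat{\alpha} = \frac{m_1^2}{m_2 - m_1^2}, \qquad \hat{\beta} = \frac{m_1}{m_2 - m_1^2}$$

as the method of moments estimators. Note that $m_2 - m_1^2$ is just the sample variance. ◄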


Example 7.6.10 Sampling from a Uniform Distribution. Suppose that $X_1, \ldots, X_n$ form a random sample from the uniform distribution on the interval $[\theta, \theta + 1]$, as in Example 7.5.9. In that example, we found that the M.L.E. is not unique and there is an interval of M.L.E.’s

$$\max\{x_1, \ldots, x_n\} - 1 \le \theta \le \min\{x_1, \ldots, x_n\}. \tag{7.6.9}$$

This interval contains all of the possible values of $\theta$ that are consistent with the observed data. We shall now apply the method of moments, which will produce a single estimator. The mean of each $X_i$ is $\theta + 1/2$, so the method of moments estimator is $\bar{X}_n - 1/2$. Typically, one would expect the observed value of the method of moments estimator to be a number in the interval (7.6.9). However, that is not always the case. For example, if $n = 3$ and $X_1 = 0.2$, $X_2 = 0.99$, $X_3 = 0.01$ are observed, then (7.6.9) is the interval $[-0.01, 0.01]$, while $\bar{X}_3 = 0.4$. The method of moments estimate is then $-0.1$, which could not possibly be the true value of $\theta$. ◄
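The arithmetic in this example is easy to verify directly. The short sketch below recomputes the interval (7.6.9) and the method of moments estimate for the three observations given above; the function names are illustrative only.

```python
def mle_interval(xs):
    """Endpoints of the interval (7.6.9) of M.L.E.s for uniform[theta, theta + 1]."""
    return max(xs) - 1.0, min(xs)

def method_of_moments_estimate(xs):
    """Sample mean minus 1/2, since each observation has mean theta + 1/2."""
    return sum(xs) / len(xs) - 0.5

xs = [0.2, 0.99, 0.01]
low, high = mle_interval(xs)
estimate = method_of_moments_estimate(xs)
inside = low <= estimate <= high  # is the estimate a possible value of theta?
print(low, high, estimate, inside)
```

The flag `inside` comes out false here, which is exactly the failure the example describes.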


There are several examples in which method of moments estimators are also 
M.L.E.’s. Some of these are the subjects of exercises at the end of this section. 

Despite occasional problems such as Example 7.6.10, the method of moments 
estimators will typically be consistent in the sense of Definition 7.4.6. 


Theorem 7.6.3 Suppose that $X_1, X_2, \ldots$ are i.i.d. with a distribution indexed by a $k$-dimensional parameter vector $\theta$. Suppose that the first $k$ moments of that distribution exist and are finite for all $\theta$. Suppose also that the inverse function $M$ in Definition 7.6.3 is continuous. Then the sequence of method of moments estimators based on $X_1, \ldots, X_n$ is a consistent sequence of estimators of $\theta$.

Proof The law of large numbers says that the sample moments converge in probability to the moments $\mu_1(\theta), \ldots, \mu_k(\theta)$. The generalization of Theorem 6.2.5 to functions of $k$ variables implies that $M$ evaluated at the sample moments (i.e., the method of moments estimator) converges in probability to $\theta$. ■


M.L.E.’s and Bayes Estimators 


Bayes estimators and M.L.E.’s depend on the data solely through the likelihood function. They use the likelihood function in different ways, but in many problems they will be very similar. When the function $f(x|\theta)$ satisfies certain smoothness conditions (as a function of $\theta$), it can be shown that the likelihood function will tend to look more and more like a normal p.d.f. as the sample size increases. More specifically, as $n$ increases, the likelihood function starts to look like a constant (not depending on $\theta$, but possibly depending on the data) times

$$\exp\left[-\frac{1}{2V_n(\theta)/n}(\theta - \hat{\theta})^2\right], \tag{7.6.10}$$

where $\hat{\theta}$ is the M.L.E. and $V_n(\theta)$ is a sequence of random variables that typically converges as $n \to \infty$ to a limit that we shall call $v_\infty(\theta)$. When $n$ is large, the function in (7.6.10) rises quickly to its peak as $\theta$ approaches $\hat{\theta}$ and then drops just as quickly as $\theta$ moves away from $\hat{\theta}$. Under these conditions, so long as the prior p.d.f. of $\theta$ is relatively flat compared to the very peaked likelihood function, the posterior p.d.f. will look a lot like the likelihood multiplied by the constant needed to turn it into a p.d.f. The posterior mean of $\theta$ will then be approximately $\hat{\theta}$. In fact, the posterior distribution of $\theta$ will be approximately the normal distribution with mean $\hat{\theta}$ and variance $V_n(\hat{\theta})/n$. In similar fashion, the distribution of the maximum likelihood estimator (given $\theta$) will be approximately the normal distribution with mean $\theta$ and variance $v_\infty(\theta)/n$. The conditions and proofs needed to make these claims precise are beyond the scope of this text but can be found in chapter 7 of Schervish (1995).


Example 7.6.11 Sampling from an Exponential Distribution. Suppose that $X_1, X_2, \ldots$ are i.i.d. having the exponential distribution with parameter $\theta$. Let $T_n = \sum_{i=1}^{n} X_i$. Then the M.L.E. of $\theta$ is $\hat{\theta}_n = n/T_n$. (This was found in Exercise 7 in Sec. 7.5.) Because $1/\hat{\theta}_n$ is an average of i.i.d. random variables with finite variance, the central limit theorem tells us that the distribution of $1/\hat{\theta}_n$ is approximately normal. The mean and variance, in this case, of that approximate normal distribution are, respectively, $1/\theta$ and $1/(\theta^2 n)$. The delta method (Theorem 6.3.2) says that $\hat{\theta}_n$ then has approximately the normal distribution with mean $\theta$ and variance $\theta^2/n$. In the notation above, we have $V_n(\theta) = \theta^2$.

Next, let the prior distribution of $\theta$ be the gamma distribution with parameters $\alpha$ and $\beta$. Theorem 7.3.4 says that the posterior distribution of $\theta$ will be the gamma distribution with parameters $\alpha + n$ and $\beta + t_n$. We conclude by showing that this gamma distribution is approximately a normal distribution. Assume for simplicity that $\alpha$ is an integer. Then the posterior distribution of $\theta$ is the same as the distribution of the sum of $\alpha + n$ i.i.d. exponential random variables with parameter $\beta + t_n$. Such a sum has approximately the normal distribution with mean $(\alpha + n)/(\beta + t_n)$ and variance $(\alpha + n)/(\beta + t_n)^2$. If $\alpha$ and $\beta$ are small, the approximate mean is then nearly $n/t_n = \hat{\theta}$, and the approximate variance is then nearly $n/t_n^2 = \hat{\theta}^2/n = V_n(\hat{\theta})/n$. ◄
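The closing comparison in this example can be tabulated numerically. The sketch below is illustrative only: the prior parameters 1 and 2 and the data summaries (sample mean held at 2 while $n$ grows) are hypothetical, and the function names are not from the text.

```python
def gamma_posterior_summary(n, t_n, alpha, beta):
    """Mean and variance of the gamma(alpha + n, beta + t_n) posterior."""
    shape = alpha + n
    rate = beta + t_n
    return shape / rate, shape / rate ** 2

def mle_normal_approximation(n, t_n):
    """M.L.E. n/t_n and its approximate variance theta_hat^2 / n."""
    theta_hat = n / t_n
    return theta_hat, theta_hat ** 2 / n

for n in (3, 30, 300, 3000):
    t_n = 2.0 * n  # hypothetical total, i.e., sample mean 2
    post_mean, post_var = gamma_posterior_summary(n, t_n, alpha=1.0, beta=2.0)
    mle, mle_var = mle_normal_approximation(n, t_n)
    # The ratios approach 1 as the prior is swamped by the data.
    print(n, post_mean / mle, post_var / mle_var)
```

For small $n$ the prior still matters, which is the situation taken up later with the electronic components data.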


Example 7.6.12 Prussian Army Deaths. In Example 7.3.14, we found the posterior distribution of $\theta$, the mean number of deaths per year by horsekick in Prussian army units based on a sample of 280 observations. The posterior distribution was found to be the gamma distribution with parameters 196 and 280. By the same argument used in Example 7.6.11, this gamma distribution is approximately the distribution of the sum of 196 i.i.d. exponential random variables with parameter 280. The distribution of this sum is approximately the normal distribution with mean $196/280$ and variance $196/280^2$.

Using the same data as in Example 7.3.14, we can find the M.L.E. of $\theta$, which is the average of the 280 observations (according to Exercise 5 in Sec. 7.5). The distribution of the average of 280 i.i.d. Poisson random variables with mean $\theta$ is approximately the normal distribution with mean $\theta$ and variance $\theta/280$ according to the central limit theorem. We then have $V_n(\theta) = \theta$ in the earlier notation. The maximum likelihood estimate with the observed data is $\hat{\theta} = 196/280$, the mean of the posterior distribution. The variance of the posterior distribution is also $V_n(\hat{\theta})/n = \hat{\theta}/280$. ◄

Figure 7.8 Posterior p.d.f. together with p.d.f. of M.L.E. and approximating normal p.d.f. in Example 7.6.13. For the p.d.f. of the M.L.E., the value of $\theta = 3/6.6$ is used to make the p.d.f.’s as similar as possible.


There are two common situations in which posterior distributions and distri- 
butions of M.L.E.’s are not such similar normal distributions as in the preceding 
discussion. One is when the sample size is not very large, and the other is when the 
likelihood function is not smooth. An example with small sample size is our electronic 
components example. 


Example 7.6.13 Lifetimes of Electronic Components. In Example 7.3.12, we have a sample of $n = 3$ exponential random variables with parameter $\theta$. The posterior distribution found there was the gamma distribution with parameters 4 and 8.6. The M.L.E. is $\hat{\theta} = 3/(X_1 + X_2 + X_3)$, which has the distribution of 1 over a gamma random variable with parameters 3 and $3\theta$. Figure 7.8 shows the posterior p.d.f. along with the p.d.f. of the M.L.E. assuming that $\theta = 3/6.6$, the observed value of the M.L.E. The two p.d.f.’s, although similar, are still different. Also, both p.d.f.’s are similar to, but still different from, the normal p.d.f. with the same mean and variance as the posterior, which also appears on the plot. ◄


An example of an unsmooth likelihood function involves the uniform distribution on the interval $[0, \theta]$.

Example 7.6.14 Sampling from a Uniform Distribution. In Example 7.5.7, we found the M.L.E. of $\theta$ based on a sample of size $n$ from the uniform distribution on the interval $[0, \theta]$. The M.L.E. is $\hat{\theta} = \max\{X_1, \ldots, X_n\}$. We can find the exact distribution of $\hat{\theta}$ using the result in Example 3.9.6. The p.d.f. of $Y = \hat{\theta}$ is

$$g_n(y|\theta) = n\left[F(y|\theta)\right]^{n-1} f(y|\theta), \tag{7.6.11}$$

where $f(\cdot|\theta)$ is the p.d.f. of the uniform distribution on $[0, \theta]$ and $F(\cdot|\theta)$ is the corresponding c.d.f. Substituting these well-known functions into Eq. (7.6.11) yields the p.d.f. of $Y = \hat{\theta}$:

$$g_n(y|\theta) = n\left[\frac{y}{\theta}\right]^{n-1}\frac{1}{\theta} = n\frac{y^{n-1}}{\theta^n},$$

for $0 < y < \theta$. This p.d.f. is not the least bit like a normal p.d.f. It is very asymmetric and has its maximum at the largest possible value of the M.L.E. In fact, one can compute the mean and variance of $\hat{\theta}$, respectively, as

$$E(\hat{\theta}) = \frac{n}{n+1}\theta, \qquad \operatorname{Var}(\hat{\theta}) = \frac{n}{(n+1)^2(n+2)}\theta^2.$$

The variance goes down like $1/n^2$ instead of like $1/n$ in the approximately normal examples we saw earlier.

If $n$ is large, the posterior distribution of $\theta$ will have a p.d.f. that is approximately the likelihood function times the constant needed to make it into a p.d.f. The likelihood is in Eq. (7.5.8). Integrating that function over $\theta$ to obtain the needed constant leads to the following approximate posterior p.d.f. of $\theta$:

$$\xi(\theta|\boldsymbol{x}) \approx \begin{cases} \dfrac{(n-1)\hat{\theta}^{n-1}}{\theta^n} & \text{for } \theta > \hat{\theta}, \\ 0 & \text{otherwise.} \end{cases}$$

The mean and variance of this approximate posterior distribution are, respectively, $(n-1)\hat{\theta}/(n-2)$ and $(n-1)\hat{\theta}^2/[(n-2)^2(n-3)]$. The posterior mean is still nearly equal to the M.L.E. (but a little larger), and the posterior variance decreases at a rate like $1/n^2$, as does the variance of the M.L.E. But the posterior distribution is not the least bit normal, as the p.d.f. has its maximum at the smallest possible value of $\theta$ and decreases from there. ◄
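The stated mean and variance of the M.L.E. in this example can be checked against a direct numerical integration of the p.d.f. $n y^{n-1}/\theta^n$. The sketch below uses a midpoint rule; the values $\theta = 2$ and $n = 5$ and the function names are hypothetical choices made for illustration.

```python
def max_pdf(y, theta, n):
    """p.d.f. of the largest of n uniform[0, theta] observations."""
    if 0.0 < y < theta:
        return n * y ** (n - 1) / theta ** n
    return 0.0

def moments_by_midpoint_rule(theta, n, cells=20000):
    """Numerically integrate y*g(y) and y^2*g(y) over (0, theta)."""
    width = theta / cells
    mean = second = 0.0
    for k in range(cells):
        y = (k + 0.5) * width
        weight = max_pdf(y, theta, n) * width
        mean += y * weight
        second += y * y * weight
    return mean, second - mean ** 2

def closed_form_moments(theta, n):
    """Mean and variance of the M.L.E. as given in the example."""
    return n * theta / (n + 1), n * theta ** 2 / ((n + 1) ** 2 * (n + 2))

numeric = moments_by_midpoint_rule(2.0, 5)
exact = closed_form_moments(2.0, 5)
print(numeric, exact)
```

Doubling $n$ in `closed_form_moments` cuts the variance by roughly a factor of four, which is the $1/n^2$ behavior noted above.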


The EM Algorithm 


There are a number of complicated situations in which it is difficult to compute the 
M.L.E. Many of these situations involve forms of missing data. The term “missing 
data” can refer to several different types of information. The most obvious would be 
observations that we had planned or hoped to observe but were not observed. For 
example, imagine that we planned to collect both heights and weights for a sample of 
athletes. For reasons that might be beyond our control, it is possible that we observed 
both heights and weights for most of the athletes, but only heights for one subset of
athletes and only weights for another subset. If we model the heights and weights
as having a bivariate normal distribution, we might want to compute the M.L.E. of 
the parameters of that distribution. For a complete collection of pairs, Exercise 24 
in this section gives formulas for the M.L.E. It is not difficult to see how much more 
complicated it would be to compute the M.L.E. in the situation described above with 
missing data. 

The EM algorithm is an iterative method for approximating M.L.E.’s when missing data are making it difficult to find the M.L.E.’s in closed form. One begins (as in most iterative procedures) at stage 0 with an initial parameter vector $\theta^{(0)}$. To move from stage $j$ to stage $j + 1$, one first writes the full-data log-likelihood, which is what the logarithm of the likelihood function would be if we had observed the missing data. The values of the missing data appear in the full-data log-likelihood as random variables rather than as observed values. The “E” step of the EM algorithm is the following: Compute the conditional distribution of the missing data given the observed data as if the parameter $\theta$ were equal to $\theta^{(j)}$, and then compute the conditional mean of the full-data log-likelihood treating $\theta$ as constant and the missing data as random variables. The E step gets rid of the unobserved random variables from the full-data log-likelihood and leaves $\theta$ where it was. For the “M” step, choose $\theta^{(j+1)}$ to maximize the expected value of the full-data log-likelihood that you just computed. The M step takes you to stage $j + 1$. Ideally, the maximization step is no harder than it would be if the missing data had actually been observed.


Example 7.6.15 Heights and Weights. Suppose that we try to observe $n = 6$ pairs of heights and weights, but we get only three complete vectors plus one lone weight and two lone heights. We model the pairs as bivariate normal random vectors, and we want to find the M.L.E. of the parameter vector $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$. (This example is for illustrative purposes. One cannot expect to get a good estimate of a five-dimensional parameter vector with only nine observed values and no prior information.) The data are in Table 7.1. The missing weights are $X_{4,2}$ and $X_{5,2}$. The missing height is $X_{6,1}$. The full-data log-likelihood is the sum of the logarithms of six expressions of the form Eq. (5.10.2) each with one of the rows of Table 7.1 substituted for the dummy variables $(x_1, x_2)$. For example, the term corresponding to the fourth row of Table 7.1 is

$$-\log(2\pi\sigma_1\sigma_2) - \frac{1}{2}\log(1 - \rho^2) - \frac{1}{2(1 - \rho^2)}\left[\left(\frac{68 - \mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{68 - \mu_1}{\sigma_1}\right)\left(\frac{X_{4,2} - \mu_2}{\sigma_2}\right) + \left(\frac{X_{4,2} - \mu_2}{\sigma_2}\right)^2\right]. \tag{7.6.12}$$

Table 7.1 Heights and weights for Example 7.6.15. The missing values are given random variable names.

Height      Weight
72          197
70          204
73          208
68          $X_{4,2}$
65          $X_{5,2}$
$X_{6,1}$   170

As an initial parameter vector we choose a naive estimate computed from the observed data, listed here with the standard deviations $\sigma_1$ and $\sigma_2$ in the third and fourth positions:

$$\theta^{(0)} = (\mu_1^{(0)}, \mu_2^{(0)}, \sigma_1^{(0)}, \sigma_2^{(0)}, \rho^{(0)}) = (69.60, 194.75, 2.87, 14.82, 0.1764).$$

This consists of the M.L.E.’s based on the marginal distributions of the two coordinates, together with the sample correlation computed from the three complete observations.

The E step pretends that $\theta = \theta^{(0)}$ and computes the conditional mean of the full-data log-likelihood given the observed data. For the fourth row of Table 7.1, the conditional distribution of $X_{4,2}$ given the observed data and $\theta = \theta^{(0)}$ can be found from Theorem 5.10.4 to be the normal distribution with mean

$$194.75 + 0.1764 \times 14.82 \times \left(\frac{68 - 69.60}{2.87}\right) = 193.3$$

and variance $(1 - 0.1764^2)14.82^2 = 212.8$. The conditional mean of $(X_{4,2} - \mu_2)^2$ would then be $212.8 + (193.3 - \mu_2)^2$. The conditional mean of the expression in (7.6.12) would then be

$$-\log(2\pi\sigma_1\sigma_2) - \frac{1}{2}\log(1 - \rho^2) - \frac{1}{2(1 - \rho^2)}\left[\left(\frac{68 - \mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{68 - \mu_1}{\sigma_1}\right)\left(\frac{193.3 - \mu_2}{\sigma_2}\right) + \left(\frac{193.3 - \mu_2}{\sigma_2}\right)^2 + \frac{212.8}{\sigma_2^2}\right].$$

The point to notice about this last expression is that, except for the last term $212.8/\sigma_2^2$, it is exactly the contribution to the log-likelihood that we would have obtained if $X_{4,2}$ had been observed to equal 193.3, its conditional mean. Similar calculations can be done for the other two observations with missing coordinates. Each will produce a contribution to the log-likelihood that is the conditional variance of the missing coordinate divided by its variance plus what the log-likelihood would have been if the missing value had been observed to equal its conditional mean. This makes the M step almost identical to finding the M.L.E. for a completely observed data set. The only difference from the formulas in Exercise 24 is the following: For each observation that is missing $X$, add the conditional variance of $X$ given $Y$ to $\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$ in both the formula for $\hat{\sigma}_1^2$ and $\hat{\rho}$. Similarly, for each observation that is missing $Y$, add the conditional variance of $Y$ given $X$ to $\sum_{i=1}^{n}(Y_i - \bar{Y}_n)^2$ in both the formula for $\hat{\sigma}_2^2$ and $\hat{\rho}$.

We now illustrate the first iteration of the EM algorithm with the data of this example. We already have $\theta^{(0)}$, and we can compute the log-likelihood function from the observed data at $\theta^{(0)}$ as $-31.359$. To begin the algorithm, we have already computed the conditional mean and variance of the missing second coordinate from the fourth row of Table 7.1. The corresponding conditional means and variances for the fifth and sixth rows are 190.6 and 212.8 for the fifth row and 68.76 and 7.98 for the sixth row. For the E step, we replace the missing observations by their conditional means and add the conditional variances to the sums of squared deviations. For the M step, we insert the values just computed into the formulas of Exercise 24 as described above. The new vector is

$$\theta^{(1)} = (69.46, 193.81, 2.88, 14.83, 0.3742),$$

and the log-likelihood is $-31.03$. After 32 iterations, the estimate and log-likelihood stop changing. The final estimate is

$$\theta^{(32)} = (68.86, 189.71, 3.15, 15.03, 0.8965),$$

with log-likelihood $-29.66$. ◄
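The E-step quantities for the first iteration can be recomputed from Table 7.1. The sketch below covers only part of the iteration: it forms the naive starting values, the conditional means and variances of the three missing coordinates, and the updated means. It does not reproduce the variance and correlation updates, which rely on the formulas of Exercise 24. All function and variable names are illustrative, and small differences from the rounded figures in the text are to be expected.

```python
import math

heights = [72, 70, 73, 68, 65, None]        # None marks a missing value
weights = [197, 204, 208, None, None, 170]

def mean(values):
    return sum(values) / len(values)

def mle_sd(values):
    """Standard deviation with divisor n (the marginal M.L.E.)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

obs_h = [h for h in heights if h is not None]
obs_w = [w for w in weights if w is not None]
mu1, mu2 = mean(obs_h), mean(obs_w)
sd1, sd2 = mle_sd(obs_h), mle_sd(obs_w)

# Sample correlation from the complete pairs only.
pairs = [(h, w) for h, w in zip(heights, weights) if h is not None and w is not None]
mh, mw = mean([h for h, _ in pairs]), mean([w for _, w in pairs])
rho = sum((h - mh) * (w - mw) for h, w in pairs) / math.sqrt(
    sum((h - mh) ** 2 for h, _ in pairs) * sum((w - mw) ** 2 for _, w in pairs))

def conditional_weight(h):
    """Conditional mean and variance of weight given height h."""
    return mu2 + rho * sd2 * (h - mu1) / sd1, (1 - rho ** 2) * sd2 ** 2

def conditional_height(w):
    """Conditional mean and variance of height given weight w."""
    return mu1 + rho * sd1 * (w - mu2) / sd2, (1 - rho ** 2) * sd1 ** 2

# Fill each missing coordinate with its conditional mean, then average.
filled_h = [h if h is not None else conditional_height(w)[0]
            for h, w in zip(heights, weights)]
filled_w = [w if w is not None else conditional_weight(h)[0]
            for h, w in zip(heights, weights)]
new_mu1, new_mu2 = mean(filled_h), mean(filled_w)
print(conditional_weight(68), conditional_weight(65), conditional_height(170))
print(new_mu1, new_mu2)
```

The two updated means agree with the first two coordinates of the vector reported after the first iteration.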


Example 7.6.16 Mixture of Normal Distributions. A very popular use of the EM algorithm is in fitting mixture distributions. Let $X_1, \ldots, X_n$ be random variables such that each one is sampled either from the normal distribution with mean $\mu_1$ and variance $\sigma^2$ (with probability $p$) or from the normal distribution with mean $\mu_2$ and variance $\sigma^2$ (with probability $1 - p$), where $\mu_1 < \mu_2$. The restriction that $\mu_1 < \mu_2$ is to make the model identifiable in the following sense. If $\mu_1 = \mu_2$ is allowed, then every value of $p$ leads to the same joint distribution of the observable data. Also, if neither mean is constrained to be below the other, then switching the two means and changing $p$ to $1 - p$ will produce the same joint distribution for the observable data. The restriction $\mu_1 < \mu_2$ ensures that every distinct parameter vector produces a different joint distribution for the observable data.

The data in Fig. 7.4 have the typical appearance of a distribution that is a mixture 
of two normals with means not very far apart. Because we have assumed that the 
variances of the two distributions are the same, we will not have the problem that 
arose in Example 7.5.10. 

The likelihood function from observations $X_1 = x_1, \ldots, X_n = x_n$ is

$$\prod_{i=1}^{n}\left[\frac{p}{(2\pi)^{1/2}\sigma}\exp\left(-\frac{(x_i - \mu_1)^2}{2\sigma^2}\right) + \frac{1 - p}{(2\pi)^{1/2}\sigma}\exp\left(-\frac{(x_i - \mu_2)^2}{2\sigma^2}\right)\right]. \tag{7.6.13}$$

The parameter vector is $\theta = (\mu_1, \mu_2, \sigma^2, p)$, and maximizing the likelihood as written is a challenge. However, we can introduce missing observations $Y_1, \ldots, Y_n$, where $Y_i = 1$ if $X_i$ was sampled from the distribution with mean $\mu_1$ and $Y_i = 0$ if $X_i$ was sampled from the distribution with mean $\mu_2$. The full-data log-likelihood can be written as the sum of the logarithm of the marginal p.f. of the missing $Y$ data plus the logarithm of the conditional p.d.f. of the observed $X$ data given the $Y$ data. That is,

$$\sum_{i=1}^{n} Y_i \log(p) + \left(n - \sum_{i=1}^{n} Y_i\right)\log(1 - p) - \frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left[Y_i(x_i - \mu_1)^2 + (1 - Y_i)(x_i - \mu_2)^2\right]. \tag{7.6.14}$$

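At stage $j$ with estimate $\theta^{(j)}$ of $\theta$, the E step first finds the conditional distribution of $Y_1, \ldots, Y_n$ given the observed data and $\theta = \theta^{(j)}$. Since $(X_1, Y_1), \ldots, (X_n, Y_n)$ are independent pairs, we can find the conditional distribution separately for each pair. The joint distribution of $(X_i, Y_i)$ is a mixed distribution with p.f./p.d.f.

$$f(x_i, y_i|\theta^{(j)}) = \frac{\left[p^{(j)}\right]^{y_i}\left[1 - p^{(j)}\right]^{1 - y_i}}{(2\pi)^{1/2}\sigma^{(j)}}\exp\left(-\frac{1}{2\sigma^{2(j)}}\left[y_i\left(x_i - \mu_1^{(j)}\right)^2 + (1 - y_i)\left(x_i - \mu_2^{(j)}\right)^2\right]\right).$$

The marginal p.d.f. of $X_i$ is the $i$th factor in (7.6.13). It is straightforward to determine that the conditional distribution of $Y_i$ given the observed data is the Bernoulli distribution with parameter

$$q_i^{(j)} = \frac{p^{(j)}\exp\left(-\dfrac{\left(x_i - \mu_1^{(j)}\right)^2}{2\sigma^{2(j)}}\right)}{p^{(j)}\exp\left(-\dfrac{\left(x_i - \mu_1^{(j)}\right)^2}{2\sigma^{2(j)}}\right) + \left(1 - p^{(j)}\right)\exp\left(-\dfrac{\left(x_i - \mu_2^{(j)}\right)^2}{2\sigma^{2(j)}}\right)}. \tag{7.6.15}$$

Because the full-data log-likelihood is a linear function of the $Y_i$’s, the E step simply replaces each $Y_i$ in (7.6.14) by $q_i^{(j)}$. The result is

$$\sum_{i=1}^{n} q_i^{(j)}\log(p) + \left(n - \sum_{i=1}^{n} q_i^{(j)}\right)\log(1 - p) - \frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left[q_i^{(j)}(x_i - \mu_1)^2 + \left(1 - q_i^{(j)}\right)(x_i - \mu_2)^2\right]. \tag{7.6.16}$$

Maximizing (7.6.16) is straightforward. Since $p$ appears in only the first two terms, we see that $p^{(j+1)}$ is just the average of the $q_i^{(j)}$’s. Also, $\mu_1^{(j+1)}$ is the weighted average of the $X_i$’s with weights $q_i^{(j)}$. Similarly, $\mu_2^{(j+1)}$ is the weighted average of the $X_i$’s with weights $1 - q_i^{(j)}$. Finally,

$$\sigma^{2(j+1)} = \frac{1}{n}\sum_{i=1}^{n}\left[q_i^{(j)}\left(x_i - \mu_1^{(j+1)}\right)^2 + \left(1 - q_i^{(j)}\right)\left(x_i - \mu_2^{(j+1)}\right)^2\right]. \tag{7.6.17}$$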
We will illustrate the first E and M steps using the data in Example 7.3.10. For 
the initial parameter vector 6, we will let ce be the average of the 10 lowest 
observations and a be the average of the 10 highest observations. We set p = 1/2, 
and o? is the average of the sample variance of the 10 lowest observations and the 
sample variance of the 10 highest observations. This makes 


0 = (WO, uw, 0? p) = (-7.65, 7.36, 46.28, 0.5). 


For each of the 20 observed values x;, we compute qe For example, x19 = —4.0. 
According to (7.6.15), 


(—4.0+7.65)? 
(0) 0.5 exp (- 2x46.28 ) 


No = 2 2 
(—4.0+7.65) (—4.0-7.36) 
0.5 exp (- 2x 46.28 ) + 0.5 exp (- 2x 46.28 ) 


= 0.7774. 


A similar calculation for xg = 9.0 yields i” = 0.0489. The initial log-likelihood, cal- 


culated as the logarithm of (7.6.13), is —75.98. The average of the 20 a values is 
(0), 


i 


p) = 0.4402. The weighted average of the data values using the q 
(1) 
hyo 


s as weights is 
—7.736, and the weighted average using the 1 — qo”s is us” = 6.3068. Using 


i 

(7.6.17), we get o* = 56.5491. The log-likelihood rises to —75.19. After 25 iter- 
ations, the results settle on 9?) = (—21.9715, 2.6802, 48.6864, 0.1037) with a final 
log-likelihood of —72.84. The histogram from Fig. 7.4 is reproduced in Fig. 7.9 to- 


gether with the p.d.f. of an observation from the fitted mixture distribution, namely, 


0.1037 21.9715)? 
fx)= ( ee) 


ex 
(Qn x 48.6864)172 “*P \ 9 x 48.6864 


1 — 0.1037 a (x — 2.6802)2 
(22 x 48.6864)!/2 2 x 48.6864 } © 
In addition, the fitted p.d.f. based on a single normal distribution is also shown in
Fig. 7.9. The mean and variance of that single normal distribution are 0.1250 and
110.6809, respectively. ◄
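
The E-step arithmetic above can be checked with a short script. This is only a sketch: the function name `e_step_weight` is made up for illustration, and the formula is the ratio displayed in the example (a two-component normal mixture with a common variance). Because the initial parameter values are quoted to two decimals, the printed weights should agree with 0.7774 and 0.0489 only approximately.

```python
import math

def e_step_weight(x, mu1, mu2, sigma2, p):
    """Weight q for one observation x: the conditional probability that x
    came from the first normal component, given the current parameters."""
    first = p * math.exp(-(x - mu1) ** 2 / (2 * sigma2))
    second = (1 - p) * math.exp(-(x - mu2) ** 2 / (2 * sigma2))
    return first / (first + second)

# Initial parameter vector quoted in the example.
theta0 = dict(mu1=-7.65, mu2=7.36, sigma2=46.28, p=0.5)

q_a = e_step_weight(-4.0, **theta0)  # observation equal to -4.0
q_b = e_step_weight(9.0, **theta0)   # observation equal to 9.0
print(q_a, q_b)
```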
[Figure 7.9: histogram with two fitted curves. Horizontal axis: laboratory calories minus label calories. Vertical axis: number of foods.]


Figure 7.9 Histogram of data from Example 7.3.10 together 
with fitted p.d.f. from Example 7.6.16 (solid curve). The p.d.f. 
has been scaled up to match the fact that the histogram gives 
counts rather than an estimated p.d.f. Also, the dashed curve 
gives the estimated p.d.f. for a single normal distribution. 


One can prove that the log-likelihood increases with each iteration of the EM 
algorithm and that the algorithm converges to a local maximum of the likelihood 
function. As with other numerical maximization routines, it is difficult to guarantee 
convergence to a global maximum. 
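
The monotonicity claim can be observed numerically. The sketch below runs E and M steps for a two-component normal mixture with a common variance, using the updates described around Eq. (7.6.17). The data values and the starting point are hypothetical (they are not the data of Example 7.3.10), and the function names are chosen here for illustration.

```python
import math

def log_lik(xs, mu1, mu2, s2, p):
    """Log-likelihood of a two-component normal mixture with common variance s2."""
    c = 1.0 / math.sqrt(2 * math.pi * s2)
    return sum(
        math.log(p * c * math.exp(-(x - mu1) ** 2 / (2 * s2))
                 + (1 - p) * c * math.exp(-(x - mu2) ** 2 / (2 * s2)))
        for x in xs
    )

def em_step(xs, mu1, mu2, s2, p):
    """One E step followed by one M step."""
    # E step: weight of the first component for each observation.
    q = []
    for x in xs:
        a = p * math.exp(-(x - mu1) ** 2 / (2 * s2))
        b = (1 - p) * math.exp(-(x - mu2) ** 2 / (2 * s2))
        q.append(a / (a + b))
    n, sq = len(xs), sum(q)
    # M step: average of the weights, two weighted means, pooled variance.
    p_new = sq / n
    mu1_new = sum(qi * x for qi, x in zip(q, xs)) / sq
    mu2_new = sum((1 - qi) * x for qi, x in zip(q, xs)) / (n - sq)
    s2_new = sum(qi * (x - mu1_new) ** 2 + (1 - qi) * (x - mu2_new) ** 2
                 for qi, x in zip(q, xs)) / n
    return mu1_new, mu2_new, s2_new, p_new

# Hypothetical data and starting values, for illustration only.
data = [-12.0, -9.5, -8.0, -3.0, -1.0, 0.5, 2.0, 4.0, 6.5, 9.0, 11.0, 13.5]
theta = (-5.0, 5.0, 25.0, 0.5)  # (mu1, mu2, sigma^2, p)

history = [log_lik(data, *theta)]
for _ in range(30):
    theta = em_step(data, *theta)
    history.append(log_lik(data, *theta))

print(history[0], history[-1])
```

The list `history` should be nondecreasing; where it levels off depends on the starting values, which is the local-maximum caveat mentioned above.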




Sampling Plans 


Suppose that an experimenter wishes to take observations from a distribution for
which the p.f. or the p.d.f. is $f(x|\theta)$ in order to gain information about the value
of the parameter $\theta$. The experimenter could simply take a random sample of a
predetermined size from the distribution. Instead, however, he may begin by first
observing a few values at random from the distribution and noting the cost and the
time spent in taking these observations. He may then decide to observe a few more
values at random from the distribution and to study all the values thus far obtained.
At some point, the experimenter will decide to stop taking observations and will
estimate the value of $\theta$ from all the observed values that have been obtained up
to that point. He might decide to stop because either he feels that he has enough
information to be able to make a good estimate of $\theta$ or he cannot afford to spend
any more money or time on sampling.

In this experiment, the number n of observations in the sample is not fixed 
beforehand. It is a random variable whose value may very well depend on the 
magnitudes of the observations as they are obtained. 

Suppose that an experimenter contemplates using a sampling plan in which, for
every $n$, the decision of whether or not to stop sampling after $n$ observations have
been collected is a function of the $n$ observations seen so far. Regardless of whether
the experimenter chooses such a sampling plan or decides to fix the value of $n$ before
any observations are taken, it can be shown that the likelihood function based on the
observed values is proportional (as a function of $\theta$) to

\[ f(x_1|\theta) \cdots f(x_n|\theta). \]

In such a situation, the M.L.E. of $\theta$ will depend only on the likelihood function and
not on what type of sampling plan is used. In other words, the value of $\hat{\theta}$ depends
only on the values $x_1, \ldots, x_n$ that are actually observed and does not depend on the
plan (if there was one) that was used by the experimenter to decide when to stop
sampling.


To illustrate this property, suppose that the intervals of time, in minutes, between
arrivals of successive customers at a certain service facility are i.i.d. random variables.
Suppose also that each interval has the exponential distribution with parameter $\theta$,
and that a set of observed intervals $X_1, \ldots, X_n$ form a random sample from this
distribution. It follows from Exercise 7 of Sec. 7.5 that the M.L.E. of $\theta$ will be
$\hat{\theta} = 1/\bar{X}_n$. Also, since the mean $\mu$ of the exponential distribution is $1/\theta$, it follows
from the invariance property of M.L.E.’s that $\hat{\mu} = \bar{X}_n$. In other words, the M.L.E. of
the mean is the average of the observations in the sample.

Consider now the following three sampling plans: 


1. An experimenter decides in advance to take exactly 20 observations, and the
average of these 20 observations turns out to be 6. Then the M.L.E. of $\mu$ is
$\hat{\mu} = 6$.

2. An experimenter decides to take observations $X_1, X_2, \ldots$ until she obtains a
value greater than 10. She finds that $X_i < 10$ for $i = 1, \ldots, 19$ and that $X_{20} > 10$.
Hence, sampling terminates after 20 observations. If the average of these 20
observations is 6, then the M.L.E. is again $\hat{\mu} = 6$.

3. An experimenter takes observations one at a time, with no particular plan in
mind, until either she is forced to stop sampling or she gets tired of sampling.
She is certain that neither of these causes (being forced to stop or getting tired)
depends in any way on $\mu$. If for either reason she stops as soon as she has taken
20 observations and if the average of the 20 observations is 6, then the M.L.E.
is again $\hat{\mu} = 6$.


Sometimes, an experiment of this type must be terminated during an interval
when the experimenter is waiting for the next customer to arrive. If a certain amount
of time has elapsed since the arrival of the last customer, this time should not be
omitted from the sample data, even though the full interval to the arrival of the next
customer has not been observed. Suppose, for example, that the average of the first 20
observations is 6, the experimenter waits another 15 minutes but no other customer
arrives, and then she terminates the experiment. In this case, we know that the M.L.E.
of $\mu$ would have to be greater than 6, since the value of the 21st observation must
be greater than 15, even though its exact value is unknown. The new M.L.E. can
be obtained by multiplying the likelihood function for the first 20 observations by
the probability that the 21st observation is greater than 15, namely, $\exp(-15\theta)$, and
finding the value of $\theta$ that maximizes this new likelihood function (see Exercise 15).
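
A minimal numerical sketch of this censored-data calculation follows. It assumes exactly the likelihood described in the paragraph: the product of 20 exponential densities with total observed time $20 \times 6 = 120$, multiplied by $\exp(-15\theta)$. The constant and function names are illustrative only. The closed-form maximizer comes from setting the derivative of $20 \log\theta - 135\theta$ to zero, and a coarse grid search serves as a cross-check.

```python
import math

N_COMPLETE = 20           # fully observed interarrival times
SUM_COMPLETE = 20 * 6.0   # their total, since their average is 6
CENSOR_TIME = 15.0        # waiting time during which no customer arrived

def log_likelihood(theta):
    """log of theta^20 * exp(-120 * theta) * exp(-15 * theta)."""
    return N_COMPLETE * math.log(theta) - theta * (SUM_COMPLETE + CENSOR_TIME)

# Closed form: the derivative 20/theta - 135 vanishes at theta = 20/135.
theta_hat = N_COMPLETE / (SUM_COMPLETE + CENSOR_TIME)
# Invariance property: the M.L.E. of the mean mu = 1/theta.
mu_hat = (SUM_COMPLETE + CENSOR_TIME) / N_COMPLETE

# Cross-check by brute force over a grid of candidate theta values.
grid = [k / 10000 for k in range(1, 10001)]
theta_grid = max(grid, key=log_likelihood)

print(theta_hat, mu_hat, theta_grid)
```

The resulting estimate of $\mu$ exceeds 6, as the paragraph argues it must.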

Remember that the M.L.E. is determined by the likelihood function. The only
way in which the M.L.E. is allowed to depend on the sampling plan is through the
likelihood function. If the decision about when to stop observing data is based solely
on the observations seen so far, then this information has already been included in
the likelihood function. If the decision to stop is based on something else, one needs
to evaluate the probability of that “something else” given each possible value of $\theta$
and include that probability in the likelihood.

Other properties of M.L.E.’s will be discussed later in this chapter and in Chapter 8.


Summary 


The M.L.E. of a function $g(\theta)$ is $g(\hat{\theta})$, where $\hat{\theta}$ is the M.L.E. of $\theta$. For example, if $\theta$ is
the rate at which customers are served in a queue, then $1/\theta$ is the average service time.
The M.L.E. of $1/\theta$ is 1 over the M.L.E. of $\theta$. Sometimes we cannot find a closed form
expression for the M.L.E. of a parameter and we must resort to numerical methods to
find or approximate the M.L.E. In most problems, the sequence of M.L.E.’s, as sample
size increases, converges in probability to the parameter. When data are collected in
such a way that the decision to stop collecting data is based solely on the data already
observed or on other considerations that are not related to the parameter, then the
M.L.E. will not depend on the sampling plan. That is, if two different sampling plans
lead to proportional likelihood functions, then the value of $\theta$ that maximizes one
likelihood will also maximize the other.


Exercises 


1. Suppose that X,,..., X,, form arandom sample from a 
distribution with the p.d.f. given in Exercise 10 of Sec. 7.5. 
Find the M.L.E. of e~”*. 


2. Suppose that X,,..., X, form a random sample from 
a Poisson distribution for which the mean is unknown. 
Determine the M.L.E. of the standard deviation of the 
distribution. 


3. Suppose that X,..., X, form a random sample from 
an exponential distribution for which the value of the 
parameter 6 is unknown. Determine the M.L.E. of the 
median of the distribution. 


4. Suppose that the lifetime of a certain type of lamp 
has an exponential distribution for which the value of the 
parameter 6 is unknown. A random sample of n lamps 
of this type are tested for a period of T hours and the 
number X of lamps that fail during this period is observed, 
but the times at which the failures occurred are not noted. 
Determine the M.L.E. of 6 based on the observed value 
of X. 


5. Suppose that X,,..., X, form a random sample from 
the uniform distribution on the interval [a, b], where both 
endpoints a and b are unknown. Find the M.L.E. of the 
mean of the distribution. 


6. Suppose that $X_1, \ldots, X_n$ form a random sample from
a normal distribution for which both the mean and the
variance are unknown. Find the M.L.E. of the 0.95 quantile
of the distribution, that is, of the point $\theta$ such that
$\Pr(X < \theta) = 0.95$.


7. For the conditions of Exercise 6, find the M.L.E. of
$\nu = \Pr(X > 2)$.


8. Suppose that $X_1, \ldots, X_n$ form a random sample from
a gamma distribution for which the p.d.f. is given by
Eq. (7.6.2). Find the M.L.E. of $\Gamma'(\alpha)/\Gamma(\alpha)$.


9. Suppose that $X_1, \ldots, X_n$ form a random sample from
a gamma distribution for which both parameters $\alpha$ and $\beta$
are unknown. Find the M.L.E. of $\alpha/\beta$.


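10. Suppose that $X_1, \ldots, X_n$ form a random sample from
a beta distribution for which both parameters $\alpha$ and $\beta$ are
unknown. Show that the M.L.E.’s of $\alpha$ and $\beta$ satisfy the
following equation:

\[ \frac{\Gamma'(\hat{\alpha})}{\Gamma(\hat{\alpha})} - \frac{\Gamma'(\hat{\beta})}{\Gamma(\hat{\beta})} = \frac{1}{n} \sum_{i=1}^{n} \log \frac{X_i}{1 - X_i}. \]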


11. Suppose that $X_1, \ldots, X_n$ form a random sample of
size $n$ from the uniform distribution on the interval $[0, \theta]$,
where the value of $\theta$ is unknown. Show that the sequence
of M.L.E.’s of $\theta$ is a consistent sequence.


12. Suppose that $X_1, \ldots, X_n$ form a random sample from
an exponential distribution for which the value of the
parameter $\beta$ is unknown. Show that the sequence of M.L.E.’s
of $\beta$ is a consistent sequence.


13. Suppose that $X_1, \ldots, X_n$ form a random sample from
a distribution for which the p.d.f. is as specified in
Exercise 9 of Section 7.5. Show that the sequence of M.L.E.’s
of $\theta$ is a consistent sequence.


14. Suppose that a scientist desires to estimate the pro- 
portion p of monarch butterflies that have a special type 
of marking on their wings. 


a. Suppose that he captures monarch butterflies one at 
a time until he has found five that have this special 
marking. If he must capture a total of 43 butterflies, 
what is the M.L.E. of p? 


b. Suppose that at the end of a day the scientist had 
captured 58 monarch butterflies and had found only 
three with the special marking. What is the M.L.E. 
of p? 


15. Suppose that 21 observations are taken at random
from an exponential distribution for which the mean $\mu$ is
unknown ($\mu > 0$), the average of 20 of these observations
is 6, and although the exact value of the other observation
could not be determined, it was known to be greater than
15. Determine the M.L.E. of $\mu$.


16. Suppose that each of two statisticians A and B must
estimate a certain parameter $\theta$ whose value is unknown
($\theta > 0$). Statistician A can observe the value of a random
variable $X$, which has the gamma distribution with
parameters $\alpha$ and $\beta$, where $\alpha = 3$ and $\beta = \theta$; statistician B
can observe the value of a random variable $Y$, which has
the Poisson distribution with mean $2\theta$. Suppose that the
value observed by statistician A is $X = 2$ and the value
observed by statistician B is $Y = 3$. Show that the likelihood
functions determined by these observed values are
proportional, and find the common value of the M.L.E. of $\theta$
obtained by each statistician.


17. Suppose that each of two statisticians A and B must 
estimate a certain parameter p whose value is unknown 
(0 < p <1). Statistician A can observe the value of a ran- 
dom variable X, which has the binomial distribution with 
parameters n = 10 and p; statistician B can observe the 
value of a random variable Y, which has the negative bi- 
nomial distribution with parameters r = 4 and p. Suppose 
that the value observed by statistician A is X = 4 and the 
value observed by statistician B is Y = 6. Show that the 
likelihood functions determined by these observed val- 
ues are proportional, and find the common value of the 
M.L.E. of p obtained by each statistician. 


18. Prove that the method of moments estimator for the 
parameter of a Bernoulli distribution is the M.L.E. 


19. Prove that the method of moments estimator for the 
parameter of an exponential distribution is the M.L.E. 


20. Prove that the method of moments estimator of the 
mean of a Poisson distribution is the M.L.E. 


21. Prove that the method of moments estimators of the 
mean and variance of a normal distribution are also the 
M.L.E.’s. 


22. Let $X_1, \ldots, X_n$ be a random sample from the uniform
distribution on the interval $[0, \theta]$.


a. Find the method of moments estimator of $\theta$.


b. Show that the method of moments estimator is not
the M.L.E.


23. Suppose that $X_1, \ldots, X_n$ form a random sample from
the beta distribution with parameters $\alpha$ and $\beta$. Let $\theta =
(\alpha, \beta)$ be the vector parameter.


a. Find the method of moments estimator for $\theta$.


b. Show that the method of moments estimator is not
the M.L.E.


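24. Suppose that the two-dimensional vectors $(X_1, Y_1)$,
$(X_2, Y_2), \ldots, (X_n, Y_n)$ form a random sample from a
bivariate normal distribution for which the means of $X$ and
$Y$, the variances of $X$ and $Y$, and the correlation between
$X$ and $Y$ are unknown. Show that the M.L.E.’s of these five
parameters are as follows:

\[ \hat{\mu}_1 = \bar{X}_n \quad \text{and} \quad \hat{\mu}_2 = \bar{Y}_n, \]
\[ \hat{\sigma}_1^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \quad \text{and} \quad \hat{\sigma}_2^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2, \]
\[ \hat{\rho} = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\left[\sum_{i=1}^{n} (X_i - \bar{X}_n)^2\right]^{1/2} \left[\sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2\right]^{1/2}}. \]

Hint: First, rewrite the joint p.d.f. of each pair $(X_i, Y_i)$ as
the product of the marginal p.d.f. of $X_i$ and the conditional
p.d.f. of $Y_i$ given $X_i$. Second, transform the parameters to
$\mu_1$, $\sigma_1^2$, and

\[ \alpha = \mu_2 - \frac{\rho \sigma_2 \mu_1}{\sigma_1}, \qquad \beta = \frac{\rho \sigma_2}{\sigma_1}, \qquad \sigma_{2.1}^2 = (1 - \rho^2) \sigma_2^2. \]

Third, maximize the likelihood function as a function of
the new parameters. Finally, apply the invariance property
of M.L.E.’s to find the M.L.E.’s of the original parameters.
The above transformation greatly simplifies the
maximization of the likelihood.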


25. Consider again the situation described in Exercise 24.
This time, suppose that, for reasons unrelated to the values
of the parameters, we cannot observe the values of
$Y_{n-k+1}, \ldots, Y_n$. That is, we will be able to observe all of
$X_1, \ldots, X_n$ and $Y_1, \ldots, Y_{n-k}$, but not the last $k$ $Y$ values.
Using the hint given in Exercise 24, find the M.L.E.’s of
$\mu_1$, $\mu_2$, $\sigma_1^2$, $\sigma_2^2$, and $\rho$.


* 7.7 Sufficient Statistics


In the first six sections of this chapter, we presented some inference methods that 
are based on the posterior distribution of the parameter or on the likelihood 
function alone. There are other inference methods that are based neither on the 
posterior distribution nor on the likelihood function. These methods are based on 
the conditional distributions of various functions of the data (i.e., statistics) given 
the parameter. There are many statistics available in a given problem, some more 
useful than others. Sufficient statistics turn out to be the most useful in some sense. 


Definition of a Sufficient Statistic 


Example
7.7.1


Lifetimes of Electronic Components. In Examples 7.4.8 and 7.5.2, we computed
estimates of the mean lifetime for electronic components based on a sample of size three
from the distribution of lifetimes. The two estimates we computed were a Bayes
estimate (Example 7.4.8) and an M.L.E. (Example 7.5.2). Both estimates made use of
the observed data solely through the value of the statistic $X_1 + X_2 + X_3$. Is there
anything special about this statistic, and if so, do such statistics exist in other problems? ◄


In many problems in which a parameter $\theta$ must be estimated, it is possible to
find either an M.L.E. or a Bayes estimator that will be suitable. In some problems,
however, neither of these estimators may be suitable or available. There may not
be any M.L.E., or there may be more than one. Even when an M.L.E. is unique,
it may not be a suitable estimator, as in Example 7.5.7, where the M.L.E. always
underestimates the value of $\theta$. Reasons why there may not be a suitable Bayes
estimator were presented at the end of Sec. 7.4. In such problems, the search for
a good estimator must be extended beyond the methods that have been introduced
thus far. In this section, we shall define the concept of a sufficient statistic, which was
introduced by R. A. Fisher in 1922, and we shall show how this concept can be used
to simplify the search for a good estimator in many problems.

Suppose that in a specific estimation problem, two statisticians A and B must
estimate the value of the parameter $\theta$. Statistician A can observe the values of the
observations $X_1, \ldots, X_n$ in a random sample, and statistician B cannot observe the
individual values of $X_1, \ldots, X_n$ but can learn the value of a certain statistic $T =
r(X_1, \ldots, X_n)$. In this case, statistician A can choose any function of the observations
$X_1, \ldots, X_n$ as an estimator of $\theta$ (including a function of $T$). But statistician B can use
only a function of $T$. Hence, it follows that A will generally be able to find a better
estimator than will B.

In some problems, however, B will be able to do just as well as A. In such a
problem, the single function $T = r(X_1, \ldots, X_n)$ will in some sense summarize all
the information contained in the random sample, and knowledge of the individual
values of $X_1, \ldots, X_n$ will be irrelevant in the search for a good estimator of $\theta$. A
statistic $T$ having this property is called a sufficient statistic. The formal definition of
a sufficient statistic is based on the following intuition. Suppose that one could learn
$T$ and were then able to simulate random variables $X_1', \ldots, X_n'$ such that, for every
$\theta$, the joint distribution of $X_1', \ldots, X_n'$ was exactly the same as the joint distribution
of $X_1, \ldots, X_n$. Such a statistic $T$ is sufficient in the sense that one could, if one felt
the need, use $X_1', \ldots, X_n'$ in the same way that one would have used $X_1, \ldots, X_n$. The
process of simulating $X_1', \ldots, X_n'$ is called an auxiliary randomization.
Definition
7.7.1


Sufficient Statistic. Let $X_1, \ldots, X_n$ be a random sample from a distribution indexed
by a parameter $\theta$. Let $T$ be a statistic. Suppose that, for every $\theta$ and every possible
value $t$ of $T$, the conditional joint distribution of $X_1, \ldots, X_n$ given that $T = t$ (and
$\theta$) depends only on $t$ but not on $\theta$. That is, for each $t$, the conditional distribution of
$X_1, \ldots, X_n$ given $T = t$ and $\theta$ is the same for all $\theta$. Then we say that $T$ is a sufficient
statistic for the parameter $\theta$.


Return now to the intuition introduced right before Definition 7.7.1. When
one simulates $X_1', \ldots, X_n'$ in accordance with the conditional joint distribution of
$X_1, \ldots, X_n$ given $T = t$, it follows that for each given value of $\theta \in \Omega$, the joint
distribution of $T, X_1', \ldots, X_n'$ will be the same as the joint distribution of $T, X_1, \ldots, X_n$. By
integrating out (or summing out) $T$ from the joint distribution, we see that the joint
distribution of $X_1, \ldots, X_n$ is the same as the joint distribution of $X_1', \ldots, X_n'$. Hence,
if statistician B can observe the value of a sufficient statistic $T$, then she can generate
$n$ random variables $X_1', \ldots, X_n'$, which have the same joint distribution as the
original random sample $X_1, \ldots, X_n$. The property that distinguishes a sufficient statistic
$T$ from a statistic that is not sufficient may be described as follows: The auxiliary
randomization used to generate the random variables $X_1', \ldots, X_n'$ after the sufficient
statistic $T$ has been observed does not require any knowledge about the value of $\theta$,
since the conditional joint distribution of $X_1, \ldots, X_n$ when $T$ is given does not depend
on the value of $\theta$. If the statistic $T$ were not sufficient, this auxiliary randomization
could not be carried out, because the conditional joint distribution of $X_1, \ldots, X_n$ for
a given value of $T$ would involve the value of $\theta$, and this value is unknown.
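
As a concrete sketch of an auxiliary randomization, consider a Bernoulli random sample with $T = \sum_{i=1}^n X_i$ (the setting of Exercise 1 of this section). Given $T = t$, every arrangement of $t$ ones among the $n$ positions is equally likely, so a sample $X_1', \ldots, X_n'$ can be generated without knowing the parameter $p$. The function name and the values $n = 10$, $t = 4$ below are illustrative assumptions.

```python
import random

def simulate_given_sum(n, t, rng):
    """Draw (X_1', ..., X_n') for a Bernoulli sample conditional on T = t.

    The conditional distribution is uniform over all arrangements of t ones
    in n positions, so the unknown parameter p is never needed.
    """
    ones = set(rng.sample(range(n), t))  # positions of the ones
    return [1 if i in ones else 0 for i in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
x_prime = simulate_given_sum(10, 4, rng)
print(x_prime, sum(x_prime))
```

Note that `simulate_given_sum` takes no argument for $p$, which is exactly the property that distinguishes a sufficient statistic.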

If statistician B is concerned solely with the distribution of the estimator she
uses, we can now see why she can estimate $\theta$ just as well as can statistician A,
who observes the values of $X_1, \ldots, X_n$. Suppose that A plans to use a particular
estimator $\delta(X_1, \ldots, X_n)$ to estimate $\theta$, and B observes the value of $T$ and generates
$X_1', \ldots, X_n'$, which have the same joint distribution as $X_1, \ldots, X_n$. If B uses the
estimator $\delta(X_1', \ldots, X_n')$, then it follows that the probability distribution of B’s
estimator will be the same as the probability distribution of A’s estimator. This
discussion illustrates why, when searching for a good estimator, a statistician can
restrict the search to estimators that are functions of a sufficient statistic $T$. We shall
return to this point in Sec. 7.9.

On the other hand, if statistician B is interested in basing her estimator on
the posterior distribution of $\theta$, we have not yet shown why she can do just as well
as statistician A. The next result (the factorization criterion) shows why even this
is true. A sufficient statistic is sufficient for being able to compute the likelihood
function, and hence it is sufficient for performing any inference that depends on the
data only through the likelihood function. M.L.E.’s and anything based on posterior
distributions depend on the data only through the likelihood function.


The Factorization Criterion 


Immediately after Example 7.2.7 and Theorems 7.3.2 and 7.3.3, we pointed out that 
a particular statistic was used to compute the posterior distribution being discussed. 
These statistics all had the property that they were all that was needed from the 
data to be able to compute the likelihood function. This property is another way to 
characterize sufficient statistics. We shall now present a simple method for finding a 
sufficient statistic that can be applied in many problems. This method is based on the 
following result, which was developed with increasing generality by R. A. Fisher in 
1922, J. Neyman in 1935, and P. R. Halmos and L. J. Savage in 1949. 


Theorem
7.7.1


Factorization Criterion. Let $X_1, \ldots, X_n$ form a random sample from either a
continuous distribution or a discrete distribution for which the p.d.f. or the p.f. is $f(x|\theta)$,
where the value of $\theta$ is unknown and belongs to a given parameter space $\Omega$. A
statistic $T = r(X_1, \ldots, X_n)$ is a sufficient statistic for $\theta$ if and only if the joint p.d.f.
or the joint p.f. $f_n(\mathbf{x}|\theta)$ of $X_1, \ldots, X_n$ can be factored as follows for all values of
$\mathbf{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$ and all values of $\theta \in \Omega$:

\[ f_n(\mathbf{x}|\theta) = u(\mathbf{x})\, v[r(\mathbf{x}), \theta]. \tag{7.7.1} \]

Here, the functions $u$ and $v$ are nonnegative, the function $u$ may depend on $\mathbf{x}$ but does
not depend on $\theta$, and the function $v$ will depend on $\theta$ but depends on the observed
value $\mathbf{x}$ only through the value of the statistic $r(\mathbf{x})$.


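Proof We shall give the proof only when the random vector $\mathbf{X} = (X_1, \ldots, X_n)$ has
a discrete distribution, in which case

\[ f_n(\mathbf{x}|\theta) = \Pr(\mathbf{X} = \mathbf{x}|\theta). \]

Suppose first that $f_n(\mathbf{x}|\theta)$ can be factored as in Eq. (7.7.1) for all values of $\mathbf{x} \in \mathbb{R}^n$
and $\theta \in \Omega$. For each possible value $t$ of $T$, let $A(t)$ denote the set of all points $\mathbf{x} \in \mathbb{R}^n$
such that $r(\mathbf{x}) = t$. For each given value of $\theta \in \Omega$, we shall determine the conditional
distribution of $\mathbf{X}$ given that $T = t$. For every point $\mathbf{x} \in A(t)$,

\[ \Pr(\mathbf{X} = \mathbf{x}|T = t, \theta) = \frac{\Pr(\mathbf{X} = \mathbf{x}|\theta)}{\Pr(T = t|\theta)} = \frac{f_n(\mathbf{x}|\theta)}{\sum_{\mathbf{y} \in A(t)} f_n(\mathbf{y}|\theta)}. \]

Since $r(\mathbf{y}) = t$ for every point $\mathbf{y} \in A(t)$, and since $\mathbf{x} \in A(t)$, it follows from Eq. (7.7.1)
that

\[ \Pr(\mathbf{X} = \mathbf{x}|T = t, \theta) = \frac{u(\mathbf{x})}{\sum_{\mathbf{y} \in A(t)} u(\mathbf{y})}. \tag{7.7.2} \]

Finally, for every point $\mathbf{x}$ that does not belong to $A(t)$,

\[ \Pr(\mathbf{X} = \mathbf{x}|T = t, \theta) = 0. \tag{7.7.3} \]

It can be seen from Eqs. (7.7.2) and (7.7.3) that the conditional distribution of $\mathbf{X}$ does
not depend on $\theta$. Therefore, $T$ is a sufficient statistic.

Conversely, suppose that $T$ is a sufficient statistic. Then, for every given value
$t$ of $T$, every point $\mathbf{x} \in A(t)$, and every value of $\theta \in \Omega$, the conditional probability
$\Pr(\mathbf{X} = \mathbf{x}|T = t, \theta)$ will not depend on $\theta$ and will therefore have the form

\[ \Pr(\mathbf{X} = \mathbf{x}|T = t, \theta) = u(\mathbf{x}). \]

If we let $v(t, \theta) = \Pr(T = t|\theta)$, it follows that

\[ f_n(\mathbf{x}|\theta) = \Pr(\mathbf{X} = \mathbf{x}|\theta) = \Pr(\mathbf{X} = \mathbf{x}|T = t, \theta) \Pr(T = t|\theta) = u(\mathbf{x})\, v(t, \theta). \]

Hence, $f_n(\mathbf{x}|\theta)$ has been factored in the form specified in Eq. (7.7.1).

The proof for a random sample $X_1, \ldots, X_n$ from a continuous distribution
requires somewhat different methods and will not be given here. ∎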


One way to read Theorem 7.7.1 is that $T = r(\mathbf{X})$ is sufficient if and only if the
likelihood function is proportional (as a function of $\theta$) to a function that depends on the
data only through $r(\mathbf{x})$. That function would be $v[r(\mathbf{x}), \theta]$. When using the likelihood
function for finding posterior distributions, we saw that any factor not depending on
$\theta$ (such as $u(\mathbf{x})$ in Eq. (7.7.1)) can be removed from the likelihood without affecting
the calculation of the posterior distribution. So, we have the following corollary to
Theorem 7.7.1.


Corollary
7.7.1


A statistic $T = r(\mathbf{X})$ is sufficient if and only if, no matter what prior distribution we
use, the posterior distribution of $\theta$ depends on the data only through the value of $T$.
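
The corollary can be illustrated with a small numerical sketch. The setting is an assumption chosen for illustration: a Bernoulli random sample with parameter $p$, the sufficient statistic $T = \sum X_i$, and a hypothetical discrete prior on five candidate values of $p$. Two data sets with the same value of $T$ should produce the same posterior.

```python
import math

def posterior(xs, grid, prior):
    """Posterior probabilities over a grid of candidate p values for
    Bernoulli observations xs, computed from the full data."""
    weights = []
    for p, w in zip(grid, prior):
        like = 1.0
        for x in xs:
            like *= p if x == 1 else (1 - p)
        weights.append(w * like)
    total = sum(weights)
    return [w / total for w in weights]

grid = [0.1, 0.3, 0.5, 0.7, 0.9]    # hypothetical candidate values of p
prior = [0.4, 0.2, 0.2, 0.1, 0.1]   # hypothetical prior probabilities

post_a = posterior([1, 1, 0, 0, 0, 1], grid, prior)  # T = 3
post_b = posterior([0, 1, 0, 1, 1, 0], grid, prior)  # T = 3, different order

same = all(math.isclose(a, b, rel_tol=1e-12) for a, b in zip(post_a, post_b))
print(same)
```

Changing `prior` to any other set of positive weights leaves `same` true, in line with the phrase "no matter what prior distribution we use".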


For each value of $\mathbf{x}$ for which $f_n(\mathbf{x}|\theta) = 0$ for all values of $\theta \in \Omega$, the value of the
function $u(\mathbf{x})$ in Eq. (7.7.1) can be chosen to be 0. Therefore, when the factorization
criterion is being applied, it is sufficient to verify that a factorization of the form
given in Eq. (7.7.1) is satisfied for every value of $\mathbf{x}$ such that $f_n(\mathbf{x}|\theta) > 0$ for at least
one value of $\theta \in \Omega$.

We shall now illustrate the use of the factorization criterion by giving four
examples.


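Example
7.7.2


Sampling from a Poisson Distribution. Suppose that $\mathbf{X} = (X_1, \ldots, X_n)$ form a random
sample from a Poisson distribution for which the value of the mean $\theta$ is unknown
($\theta > 0$). Let $r(\mathbf{x}) = \sum_{i=1}^n x_i$. We shall show that $T = r(\mathbf{X}) = \sum_{i=1}^n X_i$ is a sufficient
statistic for $\theta$.

For every set of nonnegative integers $x_1, \ldots, x_n$, the joint p.f. $f_n(\mathbf{x}|\theta)$ of $X_1, \ldots,
X_n$ is as follows:

\[ f_n(\mathbf{x}|\theta) = \prod_{i=1}^{n} \frac{e^{-\theta} \theta^{x_i}}{x_i!} = \left( \prod_{i=1}^{n} \frac{1}{x_i!} \right) e^{-n\theta} \theta^{r(\mathbf{x})}. \]

Let $u(\mathbf{x}) = \prod_{i=1}^n (1/x_i!)$ and $v(t, \theta) = e^{-n\theta} \theta^t$. We now see that $f_n(\mathbf{x}|\theta)$ has been
factored as in Eq. (7.7.1). It follows that $T = \sum_{i=1}^n X_i$ is a sufficient statistic for $\theta$. ◄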
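
The factorization for the Poisson case can be checked numerically. The sketch below evaluates the joint p.f. directly and compares it with the product $u(\mathbf{x}) v[r(\mathbf{x}), \theta]$; the sample values and the function names are hypothetical and chosen only for illustration.

```python
import math

def joint_pf(xs, theta):
    """Joint p.f. of i.i.d. Poisson observations, evaluated term by term."""
    return math.prod(math.exp(-theta) * theta ** x / math.factorial(x) for x in xs)

def u(xs):
    """Factor that depends on the data but not on theta."""
    return math.prod(1 / math.factorial(x) for x in xs)

def v(t, theta, n):
    """Factor that depends on the data only through t = sum of the x_i."""
    return math.exp(-n * theta) * theta ** t

xs = [3, 0, 2, 5]  # hypothetical observed counts
for theta in (0.5, 2.0, 7.3):
    lhs = joint_pf(xs, theta)
    rhs = u(xs) * v(sum(xs), theta, len(xs))
    print(theta, math.isclose(lhs, rhs, rel_tol=1e-12))
```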


Example
7.7.3


Applying the Factorization Criterion to a Continuous Distribution. Suppose that $\mathbf{X} =
(X_1, \ldots, X_n)$ form a random sample from a continuous distribution with the
following p.d.f.:

\[ f(x|\theta) = \begin{cases} \theta x^{\theta - 1} & \text{for } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases} \]

It is assumed that the value of the parameter $\theta$ is unknown ($\theta > 0$). Let $r(\mathbf{x}) = \prod_{i=1}^n x_i$.
We shall show that $T = r(\mathbf{X}) = \prod_{i=1}^n X_i$ is a sufficient statistic for $\theta$.

For $0 < x_i < 1$ ($i = 1, \ldots, n$), the joint p.d.f. $f_n(\mathbf{x}|\theta)$ of $X_1, \ldots, X_n$ is as follows:

\[ f_n(\mathbf{x}|\theta) = \theta^n \left( \prod_{i=1}^{n} x_i \right)^{\theta - 1} = \theta^n [r(\mathbf{x})]^{\theta - 1}. \tag{7.7.4} \]

Furthermore, if at least one value of $x_i$ is outside the interval $0 < x_i < 1$, then $f_n(\mathbf{x}|\theta) =
0$ for every value of $\theta \in \Omega$. The right side of Eq. (7.7.4) depends on $\mathbf{x}$ only through
the value of $r(\mathbf{x})$. Therefore, if we let $u(\mathbf{x}) = 1$ and $v(t, \theta) = \theta^n t^{\theta - 1}$, then $f_n(\mathbf{x}|\theta)$ in
Eq. (7.7.4) can be considered to be factored in the form specified in Eq. (7.7.1). It
follows from the factorization criterion that the statistic $T = \prod_{i=1}^n X_i$ is a sufficient
statistic for $\theta$. ◄


Example
7.7.4


Sampling from a Normal Distribution. Suppose that $\mathbf{X} = (X_1, \ldots, X_n)$ form a random
sample from a normal distribution for which the mean $\mu$ is unknown and the variance
$\sigma^2$ is known. Let $r(\mathbf{x}) = \sum_{i=1}^n x_i$. We shall show that $T = r(\mathbf{X}) = \sum_{i=1}^n X_i$ is a sufficient
statistic for $\mu$.

For $-\infty < x_i < \infty$ ($i = 1, \ldots, n$), the joint p.d.f. of $\mathbf{X}$ is as follows:

\[ f_n(\mathbf{x}|\mu) = \prod_{i=1}^{n} \frac{1}{(2\pi)^{1/2} \sigma} \exp\left[ -\frac{(x_i - \mu)^2}{2\sigma^2} \right]. \tag{7.7.5} \]

This equation can be rewritten in the form

\[ f_n(\mathbf{x}|\mu) = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} x_i^2 \right) \exp\left( \frac{\mu}{\sigma^2} \sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2} \right). \tag{7.7.6} \]

Let $u(\mathbf{x})$ be the constant factor and the first exponential factor in Eq. (7.7.6). Let
$v(t, \mu) = \exp(\mu t/\sigma^2 - n\mu^2/(2\sigma^2))$. Then $f_n(\mathbf{x}|\mu)$ has now been factored as in Eq. (7.7.1).
It follows from the factorization criterion that $T = \sum_{i=1}^n X_i$ is a sufficient statistic for
$\mu$. ◄


Since $\sum_{i=1}^n x_i = n\bar{x}_n$, we can state equivalently that the final factor in Eq. (7.7.6)
depends on $x_1, \ldots, x_n$ only through the value of $\bar{x}_n$. Therefore, in Example 7.7.4
the statistic $\bar{X}_n$ is also a sufficient statistic for $\mu$. More generally (see Exercise 13 at
the end of this section), every one-to-one function of a sufficient statistic is also a
sufficient statistic.

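Example
7.7.5


Sampling from a Uniform Distribution. Suppose that $\mathbf{X} = (X_1, \ldots, X_n)$ form a random
sample from the uniform distribution on the interval $[0, \theta]$, where the value of the
parameter $\theta$ is unknown ($\theta > 0$). Let $r(\mathbf{x}) = \max\{x_1, \ldots, x_n\}$. We shall show that
$T = r(\mathbf{X}) = \max\{X_1, \ldots, X_n\}$ is a sufficient statistic for $\theta$.

The p.d.f. $f(x|\theta)$ of each individual observation $X_i$ is

\[ f(x|\theta) = \begin{cases} \dfrac{1}{\theta} & \text{for } 0 \le x \le \theta, \\ 0 & \text{otherwise.} \end{cases} \]

Therefore, the joint p.d.f. $f_n(\mathbf{x}|\theta)$ of $X_1, \ldots, X_n$ is

\[ f_n(\mathbf{x}|\theta) = \begin{cases} \dfrac{1}{\theta^n} & \text{for } 0 \le x_i \le \theta \ (i = 1, \ldots, n), \\ 0 & \text{otherwise.} \end{cases} \]

It can be seen that if $x_i < 0$ for at least one value of $i$ ($i = 1, \ldots, n$), then $f_n(\mathbf{x}|\theta) = 0$
for every value of $\theta > 0$. Therefore, it is only necessary to consider the factorization
of $f_n(\mathbf{x}|\theta)$ for values of $x_i \ge 0$ ($i = 1, \ldots, n$).

Let $v[t, \theta]$ be defined as follows:

\[ v[t, \theta] = \begin{cases} \dfrac{1}{\theta^n} & \text{if } t \le \theta, \\ 0 & \text{if } t > \theta. \end{cases} \]

Notice that $x_i \le \theta$ for $i = 1, \ldots, n$ if and only if $\max\{x_1, \ldots, x_n\} \le \theta$. Therefore, for
$x_i \ge 0$ ($i = 1, \ldots, n$), we can rewrite $f_n(\mathbf{x}|\theta)$ as follows:

\[ f_n(\mathbf{x}|\theta) = v[r(\mathbf{x}), \theta]. \tag{7.7.7} \]

Letting $u(\mathbf{x}) = 1$, we see that the right side of Eq. (7.7.7) is in the form of Eq. (7.7.1).
It follows that $T = \max\{X_1, \ldots, X_n\}$ is a sufficient statistic for $\theta$. ◄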
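
The uniform case can be checked in the same way. In this sketch the joint p.d.f. is computed from its definition and compared with $v[r(\mathbf{x}), \theta]$, where $r(\mathbf{x})$ is the sample maximum. The sample values and function names are hypothetical; the values of $\theta$ are chosen to fall on both sides of the maximum.

```python
def joint_pdf(xs, theta):
    """Joint p.d.f. of i.i.d. uniform[0, theta] observations."""
    if all(0 <= x <= theta for x in xs):
        return theta ** (-len(xs))
    return 0.0

def v(t, theta, n):
    """The function v[t, theta]: theta^(-n) if t <= theta, and 0 otherwise."""
    return theta ** (-n) if t <= theta else 0.0

xs = [0.4, 2.5, 1.1]  # hypothetical nonnegative observations
for theta in (1.0, 2.5, 3.0, 10.0):
    print(theta, joint_pdf(xs, theta), v(max(xs), theta, len(xs)))
```
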
Summary 


A statistic $T = r(\mathbf{X})$ is sufficient if, for each $t$, the conditional distribution of $\mathbf{X}$ given
$T = t$ and $\theta$ is the same for all values of $\theta$. So, if $T$ is sufficient, and one observed only
$T$ instead of $\mathbf{X}$, one could, at least in principle, simulate random variables $\mathbf{X}'$ with
the same joint distribution given $\theta$ as $\mathbf{X}$. In this sense, $T$ is sufficient for obtaining
as much information about $\theta$ as one could get from $\mathbf{X}$. The factorization criterion
says that $T = r(\mathbf{X})$ is sufficient if and only if the joint p.f. or p.d.f. can be factored as
$f_n(\mathbf{x}|\theta) = u(\mathbf{x})\, v[r(\mathbf{x}), \theta]$ for some functions $u$ and $v$. This is the most convenient way
to identify whether or not a statistic is sufficient.


Exercises 


Instructions for Exercises 1 to 10: In each of these exercises,
assume that the random variables $X_1, \ldots, X_n$
form a random sample of size $n$ from the distribution
specified in that exercise, and show that the statistic $T$
specified in the exercise is a sufficient statistic for the
parameter.


1. The Bernoulli distribution with parameter $p$, which is
unknown ($0 < p < 1$); $T = \sum_{i=1}^n X_i$.


2. The geometric distribution with parameter $p$, which is
unknown ($0 < p < 1$); $T = \sum_{i=1}^n X_i$.


3. The negative binomial distribution with parameters $r$
and $p$, where $r$ is known and $p$ is unknown ($0 < p < 1$);
$T = \sum_{i=1}^n X_i$.


4. The normal distribution for which the mean $\mu$ is known
and the variance $\sigma^2 > 0$ is unknown; $T = \sum_{i=1}^n (X_i - \mu)^2$.


5. The gamma distribution with parameters $\alpha$ and $\beta$,
where the value of $\alpha$ is known and the value of $\beta$ is
unknown ($\beta > 0$); $T = \bar{X}_n$.


6. The gamma distribution with parameters $\alpha$ and $\beta$,
where the value of $\beta$ is known and the value of $\alpha$ is
unknown ($\alpha > 0$); $T = \prod_{i=1}^n X_i$.


7. The beta distribution with parameters $\alpha$ and $\beta$, where
the value of $\beta$ is known and the value of $\alpha$ is unknown
($\alpha > 0$); $T = \prod_{i=1}^n X_i$.


8. The uniform distribution on the integers $1, 2, \ldots, \theta$,
as defined in Sec. 3.1, where the value of $\theta$ is unknown
($\theta = 1, 2, \ldots$); $T = \max\{X_1, \ldots, X_n\}$.


9. The uniform distribution on the interval $[a, b]$, where
the value of $a$ is known and the value of $b$ is unknown
($b > a$); $T = \max\{X_1, \ldots, X_n\}$.


10. The uniform distribution on the interval $[a, b]$, where
the value of $b$ is known and the value of $a$ is unknown
($a < b$); $T = \min\{X_1, \ldots, X_n\}$.


11. Assume that $X_1, \ldots, X_n$ form a random sample from
a distribution that belongs to an exponential family of
distributions as defined in Exercise 23 of Sec. 7.3. Prove
that $T = \sum_{i=1}^n d(X_i)$ is a sufficient statistic for $\theta$.


12. Suppose that a random sample $X_1, \ldots, X_n$ is drawn
from the Pareto distribution with parameters $x_0$ and $\alpha$.
(See Exercise 16 in Sec. 5.7.)


a. If $x_0$ is known and $\alpha > 0$ unknown, find a sufficient
statistic.


b. If $\alpha$ is known and $x_0$ unknown, find a sufficient
statistic.


13. Suppose that $X_1, \ldots, X_n$ form a random sample from
a distribution for which the p.d.f. is $f(x|\theta)$, where the value
of the parameter $\theta$ belongs to a given parameter space $\Omega$.
Suppose that $T = r(X_1, \ldots, X_n)$ and $T' = r'(X_1, \ldots, X_n)$
are two statistics such that $T'$ is a one-to-one function of
$T$; that is, the value of $T'$ can be determined from the
value of $T$ without knowing the values of $X_1, \ldots, X_n$, and
the value of $T$ can be determined from the value of $T'$
without knowing the values of $X_1, \ldots, X_n$. Show that $T'$
is a sufficient statistic for $\theta$ if and only if $T$ is a sufficient
statistic for $\theta$.


14. Suppose that $X_1, \ldots, X_n$ form a random sample from
the gamma distribution specified in Exercise 6. Show that
the statistic $T = \sum_{i=1}^n \log X_i$ is a sufficient statistic for the
parameter $\alpha$.


15. Suppose that $X_1, \ldots, X_n$ form a random sample from
the beta distribution with parameters $\alpha$ and $\beta$, where the
value of $\alpha$ is known and the value of $\beta$ is unknown ($\beta > 0$).
Show that the following statistic $T$ is a sufficient statistic
for $\beta$:

$$T = \frac{1}{n}\left(\sum_{i=1}^n \log \frac{1}{1 - X_i}\right)^4.$$


16. Let $\theta$ be a parameter with parameter space $\Omega$ equal
to an interval of real numbers (possibly unbounded). Let
$\mathbf{X}$ have p.d.f. or p.f. $f_n(\mathbf{x}|\theta)$ conditional on $\theta$. Let $T = r(\mathbf{X})$
be a statistic. Assume that $T$ is sufficient. Prove that, for
every possible prior p.d.f. for $\theta$, the posterior p.d.f. of $\theta$
given $\mathbf{X} = \mathbf{x}$ depends on $\mathbf{x}$ only through $r(\mathbf{x})$.


17. Let $\theta$ be a parameter, and let $\mathbf{X}$ be discrete with p.f.
$f_n(\mathbf{x}|\theta)$ conditional on $\theta$. Let $T = r(\mathbf{X})$ be a statistic. Prove
that $T$ is sufficient if and only if, for every $t$ and every $\mathbf{x}$
such that $t = r(\mathbf{x})$, the likelihood function from observing
$T = t$ is proportional to the likelihood function from
observing $\mathbf{X} = \mathbf{x}$.




*7.8 Jointly Sufficient Statistics 


When a parameter $\theta$ is multidimensional, sufficient statistics will typically need to
be multidimensional as well. Sometimes, no one-dimensional statistic is sufficient
even when $\theta$ is one-dimensional. In either case, we need to extend the concept of
sufficient statistic to deal with cases in which more than one statistic is needed in
order to be sufficient.


Definition of Jointly Sufficient Statistics 


Example 7.8.1. Sampling from a Normal Distribution. Return to Example 7.7.4, in which
$\mathbf{X} = (X_1, \ldots, X_n)$ form a random sample from the normal distribution with mean $\mu$
and variance $\sigma^2$. This time, assume that both coordinates of the parameter
$\theta = (\mu, \sigma^2)$ are unknown. The joint p.d.f. of $\mathbf{X}$ is still given by the right side of
Eq. (7.7.5). But now, we would refer to the joint p.d.f. as $f_n(\mathbf{x}|\theta)$. With both $\mu$ and
$\sigma^2$ unknown, there no longer appears to be a single statistic that is sufficient. ◄


We shall continue to suppose that the variables $X_1, \ldots, X_n$ form a random sample
from a distribution for which the p.d.f. or the p.f. is $f(x|\theta)$, where the parameter $\theta$
must belong to some parameter space $\Omega$. However, we shall now explicitly consider
the possibility that $\theta$ may be a vector of real-valued parameters. For example, if the
sample comes from a normal distribution for which both the mean $\mu$ and the variance
$\sigma^2$ are unknown, then $\theta$ would be a two-dimensional vector whose components
are $\mu$ and $\sigma^2$. Similarly, if the sample comes from a uniform distribution on some
interval $[a, b]$ for which both endpoints $a$ and $b$ are unknown, then $\theta$ would be a
two-dimensional vector whose components are $a$ and $b$. We shall, of course, continue to
include the possibility that $\theta$ is a one-dimensional parameter.

In almost every problem in which $\theta$ is a vector, as well as in some problems in
which $\theta$ is one-dimensional, there does not exist a one-dimensional statistic $T$ that is
sufficient. In such a problem it is necessary to find two or more statistics $T_1, \ldots, T_k$
that together are jointly sufficient statistics in a sense that will now be described.

Suppose that in a given problem the statistics $T_1, \ldots, T_k$ are defined by $k$ different
functions of the vector of observations $\mathbf{X} = (X_1, \ldots, X_n)$. Specifically, let $T_i = r_i(\mathbf{X})$
for $i = 1, \ldots, k$. Loosely speaking, the statistics $T_1, \ldots, T_k$ are jointly sufficient
statistics for $\theta$ if a statistician who learns only the values of the $k$ functions
$r_1(\mathbf{X}), \ldots, r_k(\mathbf{X})$ can estimate every component of $\theta$, and every function of the
components of $\theta$, as well as one who observes the $n$ individual values of
$X_1, \ldots, X_n$. More formally, we have the following definition.


Definition 7.8.1. Jointly Sufficient Statistics. Suppose that for each $\theta$ and each possible
value $(t_1, \ldots, t_k)$ of $(T_1, \ldots, T_k)$, the conditional joint distribution of
$(X_1, \ldots, X_n)$ given $(T_1, \ldots, T_k) = (t_1, \ldots, t_k)$ does not depend on $\theta$. Then
$T_1, \ldots, T_k$ are called jointly sufficient statistics for $\theta$.


A version of the factorization criterion exists for jointly sufficient statistics. The 
proof will not be given, but it is similar to the proof of Theorem 7.7.1. 


Theorem 7.8.1. Factorization Criterion for Jointly Sufficient Statistics. Let $r_1, \ldots, r_k$ be
functions of $n$ real variables. The statistics $T_i = r_i(\mathbf{X})$, $i = 1, \ldots, k$, are jointly
sufficient statistics for $\theta$ if and only if the joint p.d.f. or the joint p.f. $f_n(\mathbf{x}|\theta)$ can be
factored as follows for all values of $\mathbf{x} \in R^n$ and all values of $\theta \in \Omega$:

$$f_n(\mathbf{x}|\theta) = u(\mathbf{x})v[r_1(\mathbf{x}), \ldots, r_k(\mathbf{x}), \theta]. \tag{7.8.1}$$

Here the functions $u$ and $v$ are nonnegative, the function $u$ may depend on $\mathbf{x}$ but does
not depend on $\theta$, and the function $v$ will depend on $\theta$ but depends on $\mathbf{x}$ only through
the $k$ functions $r_1(\mathbf{x}), \ldots, r_k(\mathbf{x})$.


Example 7.8.2. Jointly Sufficient Statistics for the Parameters of a Normal Distribution.
Suppose that $X_1, \ldots, X_n$ form a random sample from a normal distribution for which
both the mean $\mu$ and the variance $\sigma^2$ are unknown. The joint p.d.f. of $X_1, \ldots, X_n$ is
given by Eq. (7.7.6), and it can be seen that this joint p.d.f. depends on $\mathbf{x}$ only through
the values of $\sum_{i=1}^n x_i$ and $\sum_{i=1}^n x_i^2$. Therefore, by the factorization criterion, the
statistics $T_1 = \sum_{i=1}^n X_i$ and $T_2 = \sum_{i=1}^n X_i^2$ are jointly sufficient statistics for
$\mu$ and $\sigma^2$. ◄
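The claim in Example 7.8.2 can be checked numerically. The following is a minimal sketch, assuming the usual normal log-likelihood; the function names and the sample values are hypothetical and chosen only for illustration. It evaluates the log-likelihood once from the raw data and once from $n$, $\sum x_i$, and $\sum x_i^2$ alone.

```python
import math

def loglik_full(xs, mu, sigma2):
    """Normal log-likelihood summed term by term over the raw observations."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
        for x in xs
    )

def loglik_from_stats(n, t1, t2, mu, sigma2):
    """Same log-likelihood using only n, t1 = sum(x_i), and t2 = sum(x_i**2)."""
    # Expand sum((x_i - mu)**2) = t2 - 2*mu*t1 + n*mu**2.
    sum_sq_dev = t2 - 2 * mu * t1 + n * mu ** 2
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sum_sq_dev / (2 * sigma2)

xs = [1.2, -0.4, 3.1, 0.7, 2.0]  # hypothetical sample
n, t1, t2 = len(xs), sum(xs), sum(x * x for x in xs)

params = [(0.0, 1.0), (1.5, 0.5), (-2.0, 4.0)]  # (mu, sigma2) pairs to try
gaps = [
    abs(loglik_full(xs, m, s2) - loglik_from_stats(n, t1, t2, m, s2))
    for m, s2 in params
]
print(max(gaps))  # should be at the level of floating-point rounding
```

Agreement at a handful of parameter values is only a sanity check; the factorization criterion is what establishes the result for all $\mu$ and $\sigma^2$.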


Suppose now that in a given problem the statistics $T_1, \ldots, T_k$ are jointly sufficient
statistics for some parameter vector $\theta$. If $k$ other statistics $T_1', \ldots, T_k'$ are obtained
from $T_1, \ldots, T_k$ by a one-to-one transformation, then it can be shown that
$T_1', \ldots, T_k'$ will also be jointly sufficient statistics for $\theta$.


Example 7.8.3. Another Pair of Jointly Sufficient Statistics for the Parameters of a Normal
Distribution. Suppose again that $X_1, \ldots, X_n$ form a random sample from a normal
distribution for which both the mean $\mu$ and the variance $\sigma^2$ are unknown. Let
$T_1' = \hat{\mu}$, the sample mean, and let $T_2' = \hat{\sigma}^2$, the sample variance. Thus,

$$T_1' = \frac{1}{n}\sum_{i=1}^n X_i \quad \text{and} \quad T_2' = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2.$$

We shall show that $T_1'$ and $T_2'$ are jointly sufficient statistics for $\mu$ and $\sigma^2$.

Let $T_1$ and $T_2$ be the jointly sufficient statistics for $\mu$ and $\sigma^2$ derived in
Example 7.8.2. Then

$$T_1' = \frac{1}{n}T_1 \quad \text{and} \quad T_2' = \frac{1}{n}T_2 - \frac{1}{n^2}T_1^2.$$

Also, equivalently,

$$T_1 = nT_1' \quad \text{and} \quad T_2 = n(T_2' + T_1'^2).$$

Hence, the statistics $T_1'$ and $T_2'$ are obtained from the jointly sufficient statistics $T_1$ and
$T_2$ by a one-to-one transformation. It follows, therefore, that $T_1'$ and $T_2'$ themselves
are jointly sufficient statistics for $\mu$ and $\sigma^2$. ◄
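The one-to-one transformation in Example 7.8.3 can be written out directly. This is a small sketch with hypothetical function names and a made-up sample; it maps the pair of sums to the sample mean and the sample variance (with divisor $n$, as in the example) and back again.

```python
def sums_to_mean_var(n, t1, t2):
    """Map (T1, T2) = (sum x_i, sum x_i**2) to (mean, variance with divisor n)."""
    mean = t1 / n
    return mean, t2 / n - mean ** 2

def mean_var_to_sums(n, mean, var):
    """Inverse map: recover (T1, T2) from the sample mean and sample variance."""
    return n * mean, n * (var + mean ** 2)

xs = [2.0, 4.0, 4.0, 6.0]  # hypothetical sample
n = len(xs)
t1, t2 = sum(xs), sum(x * x for x in xs)

mean, var = sums_to_mean_var(n, t1, t2)
recovered = mean_var_to_sums(n, mean, var)
print(mean, var, recovered)
```

Because each pair can be computed from the other without looking at the individual observations, either pair carries the same information about $\mu$ and $\sigma^2$.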


We have now shown that the jointly sufficient statistics for the unknown mean
and variance of a normal distribution can be chosen to be either $T_1$ and $T_2$, as given
in Example 7.8.2, or $T_1'$ and $T_2'$, as given in Example 7.8.3.


Example 7.8.4. Jointly Sufficient Statistics for the Parameters of a Uniform Distribution.
Suppose that $X_1, \ldots, X_n$ form a random sample from the uniform distribution on the
interval $[a, b]$, where the values of both endpoints $a$ and $b$ are unknown ($a < b$). The
joint p.d.f. $f_n(\mathbf{x}|a, b)$ of $X_1, \ldots, X_n$ will be 0 unless all the observed values
$x_1, \ldots, x_n$ lie between $a$ and $b$; that is, $f_n(\mathbf{x}|a, b) = 0$ unless
$\min\{x_1, \ldots, x_n\} \ge a$ and $\max\{x_1, \ldots, x_n\} \le b$. Furthermore, for every vector $\mathbf{x}$
such that $\min\{x_1, \ldots, x_n\} \ge a$ and $\max\{x_1, \ldots, x_n\} \le b$, we have

$$f_n(\mathbf{x}|a, b) = \frac{1}{(b - a)^n}.$$

For each two numbers $y$ and $z$, we shall let $h(y, z)$ be defined as follows:

$$h(y, z) = \begin{cases} 1 & \text{for } y \le z, \\ 0 & \text{for } y > z. \end{cases}$$

For every value of $\mathbf{x} \in R^n$, we can then write

$$f_n(\mathbf{x}|a, b) = \frac{h[a, \min\{x_1, \ldots, x_n\}]\, h[\max\{x_1, \ldots, x_n\}, b]}{(b - a)^n}.$$

Since this expression depends on $\mathbf{x}$ only through the values of $\min\{x_1, \ldots, x_n\}$
and $\max\{x_1, \ldots, x_n\}$, it follows that the statistics $T_1 = \min\{X_1, \ldots, X_n\}$ and
$T_2 = \max\{X_1, \ldots, X_n\}$ are jointly sufficient statistics for $a$ and $b$. ◄
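The factorization in Example 7.8.4 can also be checked with a short sketch. The function names and the sample are hypothetical; `h` follows the indicator defined in the example, and the two functions compute the joint density from the raw data and from the minimum and maximum alone.

```python
import math

def h(y, z):
    """Indicator from the example: 1 if y <= z, otherwise 0."""
    return 1.0 if y <= z else 0.0

def lik_full(xs, a, b):
    """Uniform[a, b] joint density as a product over the raw observations."""
    dens = 1.0
    for x in xs:
        dens *= (1.0 / (b - a)) if a <= x <= b else 0.0
    return dens

def lik_from_stats(n, x_min, x_max, a, b):
    """Same joint density using only n, the sample minimum, and the sample maximum."""
    return h(a, x_min) * h(x_max, b) / (b - a) ** n

xs = [0.8, 2.5, 1.1, 3.9]  # hypothetical sample
n, x_min, x_max = len(xs), min(xs), max(xs)

endpoints = [(0.0, 4.0), (0.5, 5.0), (1.0, 4.0), (0.0, 3.0)]  # (a, b) pairs to try
agree = all(
    math.isclose(lik_full(xs, a, b), lik_from_stats(n, x_min, x_max, a, b),
                 rel_tol=1e-12, abs_tol=0.0)
    for a, b in endpoints
)
print(agree)
```

The last two endpoint pairs exclude part of the sample, so both versions return 0 there; the first two give $1/(b - a)^n$.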


Minimal Sufficient Statistics 


In a given problem, we want to try to find a sufficient statistic or a set of jointly
sufficient statistics for $\theta$, because the values of such statistics summarize all the
relevant information about $\theta$ contained in the random sample. When a set of jointly
sufficient statistics is known, the search for a good estimator of $\theta$ is simplified
because we need consider only functions of these statistics as possible estimators.
Therefore, in a given problem it is desirable to find, not merely any set of jointly
sufficient statistics, but the simplest set of jointly sufficient statistics. That is, we want
the set of sufficient statistics that requires us to consider the smallest collection of
possible estimators. (We make this more precise in Definition 7.8.3.) For example, it
is correct but completely useless to say that in every problem the $n$ observations
$X_1, \ldots, X_n$ are jointly sufficient statistics.

We shall now describe another set of jointly sufficient statistics that exist in every
problem and are slightly more useful.


Definition 7.8.2. Order Statistics. Suppose that $X_1, \ldots, X_n$ form a random sample from
some distribution. Let $Y_1$ denote the smallest value in the random sample, let $Y_2$ denote
the next smallest value, let $Y_3$ denote the third smallest value, and so on. In this way,
$Y_n$ denotes the largest value in the sample, and $Y_{n-1}$ denotes the next largest value. The
random variables $Y_1, \ldots, Y_n$ are called the order statistics of the sample.


Now let $y_1 \le y_2 \le \cdots \le y_n$ denote the values of the order statistics for a given
sample. If we are told the values of $y_1, \ldots, y_n$, then we know that these $n$ values
were obtained in the sample. However, we do not know which one of the observations
$X_1, \ldots, X_n$ actually yielded the value $y_1$, which one actually yielded the value $y_2$, and
so on. All we know is that the smallest of the values of $X_1, \ldots, X_n$ was $y_1$, the next
smallest value was $y_2$, and so on.


Theorem 7.8.2. Order Statistics Are Sufficient in Random Samples. Let $X_1, \ldots, X_n$ form a
random sample from a distribution for which the p.d.f. or the p.f. is $f(x|\theta)$. Then the
order statistics $Y_1, \ldots, Y_n$ are jointly sufficient for $\theta$.

Proof. Let $y_1 \le y_2 \le \cdots \le y_n$ denote the values of the order statistics. The joint p.d.f.
or joint p.f. of $X_1, \ldots, X_n$ has the following form:

$$f_n(\mathbf{x}|\theta) = \prod_{i=1}^n f(x_i|\theta). \tag{7.8.2}$$

Since the order of the factors in the product on the right side of Eq. (7.8.2) is
irrelevant, Eq. (7.8.2) could just as well be rewritten in the form

$$f_n(\mathbf{x}|\theta) = \prod_{i=1}^n f(y_i|\theta).$$

Hence, $f_n(\mathbf{x}|\theta)$ depends on $\mathbf{x}$ only through the values of $y_1, \ldots, y_n$. It follows,
therefore, that the order statistics $Y_1, \ldots, Y_n$ are jointly sufficient statistics for $\theta$. ■


In words, Theorem 7.8.2 says that it is sufficient to know the set of $n$ numbers that
were obtained in the sample, and it is not necessary to know which particular one of
these numbers was, for example, the value of $X_3$.
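The argument behind Theorem 7.8.2 is that reordering the observations leaves the joint density unchanged. Here is a minimal sketch of that check, assuming a Cauchy density centered at `theta` with unit scale (the usual form of that distribution); the sample and the grid of `theta` values are hypothetical.

```python
import itertools
import math

def cauchy_lik(xs, theta):
    """Joint density of a Cauchy sample centered at theta (unit scale assumed)."""
    return math.prod(1.0 / (math.pi * (1.0 + (x - theta) ** 2)) for x in xs)

xs = [0.3, -1.7, 2.4, 5.0]  # hypothetical sample
ys = sorted(xs)             # observed values of the order statistics
theta_grid = [-1.0, 0.0, 0.5, 2.0]

# Largest discrepancy between any reordering of the data and the sorted data.
max_gap = max(
    abs(cauchy_lik(perm, th) - cauchy_lik(ys, th))
    for perm in itertools.permutations(xs)
    for th in theta_grid
)
print(max_gap)  # should be at the level of floating-point rounding
```

The same check would work for any density in place of the Cauchy one, since only the order of the factors changes.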

To see how the order statistic is simpler than the full data vector in the sense
of having fewer possible estimators, note that $X_3$ is an estimator based on the full
data vector, but $X_3$ cannot be determined from the order statistics. Hence $X_3$ is not
an estimator that we would need to consider if we based our inference on the order
statistics. The same is true of all of the averages of the form $(X_{i_1} + \cdots + X_{i_k})/k$ for
$\{i_1, \ldots, i_k\}$ a proper subset of $\{1, \ldots, n\}$, as well as many other functions. On the
other hand, every estimator based on the order statistics is also a function of the full
data.

In each of the examples that have been given in this section and in Sec. 7.7, we
considered a distribution for which either there was a single sufficient statistic or there
were two statistics that were jointly sufficient. For some distributions, however, the
order statistics $Y_1, \ldots, Y_n$ are the simplest set of jointly sufficient statistics that exist,
and no further reduction in terms of sufficient statistics is possible.


Example 7.8.5. Sufficient Statistics for the Parameter of a Cauchy Distribution. Suppose that
$X_1, \ldots, X_n$ form a random sample from a Cauchy distribution centered at an unknown
point $\theta$ ($-\infty < \theta < \infty$). The p.d.f. $f(x|\theta)$ of this distribution is given by Eq. (7.6.6),
and the joint p.d.f. $f_n(\mathbf{x}|\theta)$ of $X_1, \ldots, X_n$ is given by Eq. (7.6.7). It can be shown that
the only jointly sufficient statistics that exist in this problem are the order statistics
$Y_1, \ldots, Y_n$ or some other set of $n$ statistics $T_1, \ldots, T_n$ that can be derived from the
order statistics by a one-to-one transformation. The details of the argument will not be
given here. ◄


These considerations lead us to the concepts of a minimal sufficient statistic and a
minimal set of jointly sufficient statistics. A sufficient statistic $T$ is a minimal sufficient
statistic if every function of $T$ that is itself a sufficient statistic is a one-to-one
function of $T$. Formally, we shall use the following definition, which is equivalent to
the informal definition just given.


Definition 7.8.3. Minimal (Jointly) Sufficient Statistic(s). A statistic $T$ is a minimal
sufficient statistic if $T$ is sufficient and is a function of every other sufficient statistic.
The coordinates of a vector $\mathbf{T} = (T_1, \ldots, T_k)$ of statistics are minimal jointly sufficient
statistics if they are jointly sufficient statistics and $\mathbf{T}$ is a function of every other
set of jointly sufficient statistics.

In Example 7.8.5, the order statistics $Y_1, \ldots, Y_n$ are minimal jointly sufficient
statistics.


Maximum Likelihood Estimators and Bayes Estimators 
as Sufficient Statistics 


For the next two theorems, let $X_1, \ldots, X_n$ form a random sample from a distribution
for which the p.f. or the p.d.f. is $f(x|\theta)$, where the value of the parameter $\theta$ is unknown
and one-dimensional.


Theorem 7.8.3. M.L.E. and Sufficient Statistics. Let $T = r(X_1, \ldots, X_n)$ be a sufficient
statistic for $\theta$. Then the M.L.E. $\hat{\theta}$ of $\theta$ depends on the observations $X_1, \ldots, X_n$ only
through the statistic $T$. Furthermore, if $\hat{\theta}$ is itself sufficient, then it is minimal
sufficient.

Proof. We show first that $\hat{\theta}$ is a function of every sufficient statistic. Let $T = r(\mathbf{X})$ be a
sufficient statistic. The factorization criterion, Theorem 7.7.1, says that the likelihood
function $f_n(\mathbf{x}|\theta)$ can be written in the form

$$f_n(\mathbf{x}|\theta) = u(\mathbf{x})v[r(\mathbf{x}), \theta].$$

The M.L.E. $\hat{\theta}$ is the value of $\theta$ for which $f_n(\mathbf{x}|\theta)$ is a maximum. It follows, therefore,
that $\hat{\theta}$ will be the value of $\theta$ for which $v[r(\mathbf{x}), \theta]$ is a maximum. Since $v[r(\mathbf{x}), \theta]$
depends on the observed vector $\mathbf{x}$ only through the function $r(\mathbf{x})$, it follows that $\hat{\theta}$ will
also depend on $\mathbf{x}$ only through the function $r(\mathbf{x})$. Thus, the estimator $\hat{\theta}$ is a function of
$T = r(\mathbf{X})$.

Since the estimator $\hat{\theta}$ is a function of the observations $X_1, \ldots, X_n$ and is not a
function of the parameter $\theta$, the estimator is itself a statistic. If $\hat{\theta}$ is actually a sufficient
statistic, then it is minimal sufficient because we just showed that it is a function of
every other sufficient statistic. ■
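The first part of Theorem 7.8.3 can be illustrated with a Poisson sample, for which the sum of the observations is a sufficient statistic. The sketch below is only an illustration: the grid search, the function names, and the two samples are hypothetical choices. Two different samples with the same size and the same sum are maximized from their raw data, and the resulting estimates coincide.

```python
import math

def poisson_loglik(xs, theta):
    """Poisson log-likelihood computed from the raw observations."""
    return sum(-theta + x * math.log(theta) - math.lgamma(x + 1) for x in xs)

def grid_mle(xs, grid):
    """Crude M.L.E.: the grid point with the largest log-likelihood."""
    return max(grid, key=lambda theta: poisson_loglik(xs, theta))

grid = [k / 100 for k in range(1, 1001)]  # theta in {0.01, 0.02, ..., 10.00}
sample_a = [0, 3, 1, 4]                   # hypothetical samples with the same sum
sample_b = [2, 2, 2, 2]

mle_a = grid_mle(sample_a, grid)
mle_b = grid_mle(sample_b, grid)
print(mle_a, mle_b)
```

The two log-likelihoods differ only by a term that does not involve `theta`, which is the role played by $u(\mathbf{x})$ in the factorization, so the maximizer is the same.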


Theorem 7.8.3 can be extended easily to the case in which the parameter $\theta$ is
multidimensional. If $\theta = (\theta_1, \ldots, \theta_k)$ is a vector of $k$ real-valued parameters, then the
M.L.E. vector $(\hat{\theta}_1, \ldots, \hat{\theta}_k)$ will depend on the observations $X_1, \ldots, X_n$ only through
the functions in a set of jointly sufficient statistics. If the vector of the estimators
$\hat{\theta}_1, \ldots, \hat{\theta}_k$ is a set of jointly sufficient statistics, then they are minimal jointly sufficient
statistics because they are functions of every set of jointly sufficient statistics.


Example 7.8.6. Minimal Jointly Sufficient Statistics for the Parameters of a Normal
Distribution. Suppose that $X_1, \ldots, X_n$ form a random sample from a normal distribution
for which both the mean $\mu$ and the variance $\sigma^2$ are unknown. It was shown in Example
7.5.6 that the M.L.E.'s $\hat{\mu}$ and $\hat{\sigma}^2$ are the sample mean and the sample variance. Also,
it was shown in Example 7.8.3 that $\hat{\mu}$ and $\hat{\sigma}^2$ are jointly sufficient statistics. Hence,
$\hat{\mu}$ and $\hat{\sigma}^2$ are minimal jointly sufficient statistics. ◄


The statistician in Example 7.8.6 can restrict the search for good estimators of $\mu$
and $\sigma^2$ to functions of minimal jointly sufficient statistics. It follows, therefore, from
Example 7.8.6 that if the M.L.E.'s $\hat{\mu}$ and $\hat{\sigma}^2$ themselves are not used as estimators
of $\mu$ and $\sigma^2$, the only other estimators that need to be considered are functions of $\hat{\mu}$
and $\hat{\sigma}^2$.

The results above concerning M.L.E.'s also pertain to Bayes estimators.


Theorem 7.8.4. Bayes Estimator and Sufficient Statistics. Let $T = r(\mathbf{X})$ be a sufficient
statistic for $\theta$. Then every Bayes estimator $\hat{\theta}$ of $\theta$ depends on the observations
$X_1, \ldots, X_n$ only through the statistic $T$. Furthermore, if $\hat{\theta}$ is itself sufficient, then it is
minimal sufficient.


Proof. Let the prior p.d.f. or p.f. of $\theta$ be $\xi(\theta)$. It follows from relation (7.2.10) and
the factorization criterion that the posterior p.d.f. $\xi(\theta|\mathbf{x})$ will satisfy the following
relation:

$$\xi(\theta|\mathbf{x}) \propto v[r(\mathbf{x}), \theta]\xi(\theta).$$

It can be seen from this relation that the posterior p.d.f. of $\theta$ will depend on
the observed vector $\mathbf{x}$ only through the value of $r(\mathbf{x})$. Since the Bayes estimator of
$\theta$ with respect to a specified loss function is calculated from this posterior p.d.f., the
estimator also will depend on the observed vector $\mathbf{x}$ only through the value of $r(\mathbf{x})$. In
other words, the Bayes estimator is a function of $T = r(\mathbf{X})$. Since the Bayes estimator
$\hat{\theta}$ is itself a statistic and is a function of every sufficient statistic $T$, if $\hat{\theta}$ is also sufficient,
then it is minimal sufficient. ■
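A numerical illustration of Theorem 7.8.4, under stated assumptions: Bernoulli observations, a beta prior with hypothetical parameters 2 and 3, and squared error loss, for which the Bayes estimator is the posterior mean. The sketch integrates likelihood times prior on a grid using only the raw data, so nothing in the code uses the sufficient statistic directly; the function name and the samples are made up for the example.

```python
import math

def posterior_mean(xs, prior_a, prior_b, grid_size=2000):
    """Posterior mean of p by midpoint-rule integration of likelihood * prior."""
    num = den = 0.0
    for k in range(grid_size):
        p = (k + 0.5) / grid_size
        lik = math.prod(p if x == 1 else 1.0 - p for x in xs)
        prior = p ** (prior_a - 1) * (1.0 - p) ** (prior_b - 1)  # unnormalized beta
        num += p * lik * prior
        den += lik * prior
    return num / den

sample_a = [1, 1, 1, 0, 0]  # hypothetical samples with the same number of ones
sample_b = [0, 1, 0, 1, 1]

est_a = posterior_mean(sample_a, 2, 3)
est_b = posterior_mean(sample_b, 2, 3)

# Known closed form for a beta prior: (alpha + t) / (alpha + beta + n).
closed_form = (2 + 3) / (2 + 3 + 5)
print(est_a, est_b, closed_form)
```

Both samples give the same estimate, up to rounding, because the posterior depends on the data only through the number of ones.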


Theorem 7.8.4 also extends to vector parameters and jointly sufficient statistics. 


Summary 
Statistics $T_1 = r_1(\mathbf{X}), \ldots, T_k = r_k(\mathbf{X})$ are jointly sufficient if and only if the joint
p.f. or p.d.f. can be factored as $f_n(\mathbf{x}|\theta) = u(\mathbf{x})v[r_1(\mathbf{x}), \ldots, r_k(\mathbf{x}), \theta]$, for some
functions $u$ and $v$. It is clear from this factorization that the original data
$X_1, \ldots, X_n$ are jointly sufficient. In order to be useful, a sufficient statistic should be
a simpler function than the entire data. A minimal sufficient statistic is the simplest
function that is still sufficient; that is, it is a sufficient statistic that is a function of every
sufficient statistic. Since the likelihood function is a function of every sufficient
statistic, according to the factorization criterion, a sufficient statistic that can be
determined from the likelihood function is minimal sufficient. In particular, if an
M.L.E. or Bayes estimator is sufficient, then it is minimal sufficient.


Exercises


Instructions for Exercises 1 to 4: In each exercise, assume
that the random variables $X_1, \ldots, X_n$ form a random sample
of size $n$ from the distribution specified in the exercise,
and show that the statistics $T_1$ and $T_2$ specified in the
exercise are jointly sufficient statistics.


1. A gamma distribution for which both parameters $\alpha$
and $\beta$ are unknown ($\alpha > 0$ and $\beta > 0$); $T_1 = \prod_{i=1}^n X_i$ and
$T_2 = \sum_{i=1}^n X_i$.


2. A beta distribution for which both parameters $\alpha$ and
$\beta$ are unknown ($\alpha > 0$ and $\beta > 0$); $T_1 = \prod_{i=1}^n X_i$ and
$T_2 = \prod_{i=1}^n (1 - X_i)$.


3. A Pareto distribution (see Exercise 16 of Sec. 5.7)
for which both parameters $x_0$ and $\alpha$ are unknown ($x_0 > 0$
and $\alpha > 0$); $T_1 = \min\{X_1, \ldots, X_n\}$ and $T_2 = \prod_{i=1}^n X_i$.


4. The uniform distribution on the interval $[\theta, \theta + 3]$,
where the value of $\theta$ is unknown ($-\infty < \theta < \infty$);
$T_1 = \min\{X_1, \ldots, X_n\}$ and $T_2 = \max\{X_1, \ldots, X_n\}$.


5. Suppose that the vectors $(X_1, Y_1), (X_2, Y_2), \ldots,
(X_n, Y_n)$ form a random sample of two-dimensional vectors
from a bivariate normal distribution for which the
means, the variances, and the correlation are unknown.
Show that the following five statistics are jointly sufficient:
$\sum_{i=1}^n X_i$, $\sum_{i=1}^n Y_i$, $\sum_{i=1}^n X_i^2$, $\sum_{i=1}^n Y_i^2$, and $\sum_{i=1}^n X_iY_i$.


6. Consider a distribution for which the p.d.f. or the p.f.
is $f(x|\theta)$, where the parameter $\theta$ is a $k$-dimensional vector
belonging to some parameter space $\Omega$. It is said that
the family of distributions indexed by the values of $\theta$ in
$\Omega$ is a $k$-parameter exponential family, or a $k$-parameter
Koopman-Darmois family, if $f(x|\theta)$ can be written as follows
for $\theta \in \Omega$ and all values of $x$:

$$f(x|\theta) = a(\theta)b(x)\exp\left[\sum_{i=1}^k c_i(\theta)d_i(x)\right].$$

Here, $a$ and $c_1, \ldots, c_k$ are arbitrary functions of $\theta$, and $b$
and $d_1, \ldots, d_k$ are arbitrary functions of $x$. Suppose now
that $X_1, \ldots, X_n$ form a random sample from a distribution
which belongs to a $k$-parameter exponential family of this
type, and define the $k$ statistics $T_1, \ldots, T_k$ as follows:

$$T_i = \sum_{j=1}^n d_i(X_j) \quad \text{for } i = 1, \ldots, k.$$

Show that the statistics $T_1, \ldots, T_k$ are jointly sufficient
statistics for $\theta$.


7. Show that each of the following families of distributions
is a two-parameter exponential family as defined in
Exercise 6:


a. The family of all normal distributions for which both
the mean and the variance are unknown


b. The family of all gamma distributions for which both
$\alpha$ and $\beta$ are unknown


c. The family of all beta distributions for which both $\alpha$
and $\beta$ are unknown


8. Suppose that $X_1, \ldots, X_n$ form a random sample from
an exponential distribution for which the value of the
parameter $\beta$ is unknown ($\beta > 0$). Is the M.L.E. of $\beta$ a
minimal sufficient statistic?


9. Suppose that $X_1, \ldots, X_n$ form a random sample from
the Bernoulli distribution with parameter $p$, which is
unknown ($0 \le p \le 1$). Is the M.L.E. of $p$ a minimal sufficient
statistic?


10. Suppose that $X_1, \ldots, X_n$ form a random sample from
the uniform distribution on the interval $[0, \theta]$, where the
value of $\theta$ is unknown ($\theta > 0$). Is the M.L.E. of $\theta$ a minimal
sufficient statistic?


11. Suppose that $X_1, \ldots, X_n$ form a random sample from
a Cauchy distribution centered at an unknown point $\theta$
($-\infty < \theta < \infty$). Is the M.L.E. of $\theta$ a minimal sufficient
statistic?


12. Suppose that $X_1, \ldots, X_n$ form a random sample from
a distribution for which the p.d.f. is as follows:

$$f(x|\theta) = \begin{cases} \dfrac{2x}{\theta^2} & \text{for } 0 \le x \le \theta, \\ 0 & \text{otherwise.} \end{cases}$$

Here, the value of the parameter $\theta$ is unknown ($\theta > 0$).
Determine the M.L.E. of the median of this distribution,
and show that this estimator is a minimal sufficient statistic
for $\theta$.


13. Suppose that $X_1, \ldots, X_n$ form a random sample from
the uniform distribution on the interval $[a, b]$, where both
endpoints $a$ and $b$ are unknown. Are the M.L.E.'s of $a$ and
$b$ minimal jointly sufficient statistics?


14. For the conditions of Exercise 5, the M.L.E.'s of the
means, the variances, and the correlation are given in
Exercise 24 of Sec. 7.6. Are these five estimators minimal
jointly sufficient statistics?


15. Suppose that $X_1, \ldots, X_n$ form a random sample from
the Bernoulli distribution with parameter $p$, which is
unknown, and that the prior distribution of $p$ is a certain
specified beta distribution. Is the Bayes estimator of $p$
with respect to the squared error loss function a minimal
sufficient statistic?


16. Suppose that $X_1, \ldots, X_n$ form a random sample from
a Poisson distribution for which the value of the mean $\lambda$ is
unknown, and that the prior distribution of $\lambda$ is a certain
specified gamma distribution. Is the Bayes estimator of $\lambda$
with respect to the squared error loss function a minimal
sufficient statistic?


17. Suppose that $X_1, \ldots, X_n$ form a random sample from
a normal distribution for which the value of the mean $\mu$
is unknown and the value of the variance is known, and
the prior distribution of $\mu$ is a certain specified normal
distribution. Is the Bayes estimator of $\mu$ with respect to the
squared error loss function a minimal sufficient statistic?


* 7.9 Improving an Estimator 


In this section, we show how to improve upon an estimator that is not a function of 
a sufficient statistic by using an estimator that is a function of a sufficient statistic. 


The Mean Squared Error of an Estimator 


Example 7.9.1. Customer Arrivals. A store owner is interested in the probability $p$ that
exactly one customer will arrive during a typical hour. She models customer arrivals as a
Poisson process with rate $\theta$ per hour and observes how many customers arrive during each
of $n$ hours, $X_1, \ldots, X_n$. She converts each $X_i$ to $Y_i = 1$ if $X_i = 1$ and $Y_i = 0$ if $X_i \ne 1$.
Then $Y_1, \ldots, Y_n$ is a random sample from the Bernoulli distribution with parameter
$p$. The store owner then estimates $p$ by $\delta(\mathbf{X}) = \sum_{i=1}^n Y_i/n$. Is this a good estimator? In
particular, if the store owner wants to minimize mean squared error, is there another
estimator that we can show is better? ◄


In general, suppose that $\mathbf{X} = (X_1, \ldots, X_n)$ form a random sample from a distribution
for which the p.d.f. or the p.f. is $f(x|\theta)$, where the parameter $\theta$ must belong
to some parameter space $\Omega$. In this section, $\theta$ can be a one-dimensional parameter
or a vector of parameters. For each random variable $Z = g(X_1, \ldots, X_n)$, we shall let
$E_\theta(Z)$ denote the expectation of $Z$ calculated with respect to the joint p.d.f. or joint
p.f. $f_n(\mathbf{x}|\theta)$. If we were thinking of $\theta$ as a random variable, then $E_\theta(Z) = E(Z|\theta)$. For
example, if $f_n(\mathbf{x}|\theta)$ is a p.d.f., then

$$E_\theta(Z) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(\mathbf{x}) f_n(\mathbf{x}|\theta)\, dx_1 \cdots dx_n.$$

We shall suppose that the value of $\theta$ is unknown and that we want to estimate
some function $h(\theta)$. If $\theta$ is a vector, $h(\theta)$ might be one of the coordinates or a function
of all coordinates, and so on. We shall assume that the squared error loss function is
to be used. Also, for each given estimator $\delta(\mathbf{X})$ and every given value of $\theta \in \Omega$, we
shall let $R(\theta, \delta)$ denote the M.S.E. of $\delta$ calculated with respect to the given value of
$\theta$. Thus,

$$R(\theta, \delta) = E_\theta([\delta(\mathbf{X}) - h(\theta)]^2). \tag{7.9.1}$$

If we do not assign a prior distribution to $\theta$, then it is desired to find an estimator $\delta$
for which the M.S.E. $R(\theta, \delta)$ is small for every value of $\theta \in \Omega$ or, at least, for a wide
range of values of $\theta$.

Suppose now that $\mathbf{T}$ is a vector of jointly sufficient statistics for $\theta$. In the
remainder of this section we shall refer to $\mathbf{T}$ simply as the sufficient statistic. If $\mathbf{T}$ is
one-dimensional, just pretend that we wrote it as $T$. Consider a statistician A who
plans to use a particular estimator $\delta(\mathbf{X})$. In Sec. 7.7 we remarked that another
statistician B who learns only the value of the sufficient statistic $\mathbf{T}$ can generate, by means of
an auxiliary randomization, an estimator that will have exactly the same distribution
as $\delta(\mathbf{X})$ and, in particular, will have the same mean squared error as $\delta(\mathbf{X})$ for every
value of $\theta \in \Omega$. We shall now show that even without using an auxiliary randomization,
statistician B can find an estimator $\delta_0$ that depends on the observations $\mathbf{X}$ only
through the sufficient statistic $\mathbf{T}$ and is at least as good an estimator as $\delta$ in the sense
that $R(\theta, \delta_0) \le R(\theta, \delta)$, for every value of $\theta \in \Omega$.


Conditional Expectation When a Sufficient Statistic Is Known 


We shall define the estimator $\delta_0(\mathbf{T})$ by the following conditional expectation:

$$\delta_0(\mathbf{T}) = E_\theta[\delta(\mathbf{X})|\mathbf{T}]. \tag{7.9.2}$$

Since $\mathbf{T}$ is a sufficient statistic, the conditional joint distribution of $X_1, \ldots, X_n$ for
each given value of $\mathbf{T}$ is the same for every value of $\theta \in \Omega$. Therefore, for any given
value of $\mathbf{T}$, the conditional expectation of the function $\delta(\mathbf{X})$ will be the same for
every value of $\theta \in \Omega$. It follows that the conditional expectation in Eq. (7.9.2) will
depend on the value of $\mathbf{T}$ but will not actually depend on the value of $\theta$. In other
words, the function $\delta_0(\mathbf{T})$ is indeed an estimator of $\theta$ because it depends only on the
observations $\mathbf{X}$ and does not depend on the unknown value of $\theta$. For this reason, we
can omit the subscript $\theta$ on the expectation symbol $E$ in Eq. (7.9.2), and we can write
the relation as follows:

$$\delta_0(\mathbf{T}) = E[\delta(\mathbf{X})|\mathbf{T}]. \tag{7.9.3}$$


We can now prove the following theorem, which was established independently 
by D. Blackwell and C. R. Rao in the late 1940s. 


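Theorem 7.9.1. Let $\delta(\mathbf{X})$ be an estimator, let $\mathbf{T}$ be a sufficient statistic for $\theta$, and let
the estimator $\delta_0(\mathbf{T})$ be defined as in Eq. (7.9.3). Then for every value of $\theta \in \Omega$,

$$R(\theta, \delta_0) \le R(\theta, \delta). \tag{7.9.4}$$

Furthermore, if $R(\theta, \delta) < \infty$, there is strict inequality in (7.9.4) unless $\delta(\mathbf{X})$ is a
function of $\mathbf{T}$.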


Proof. If the M.S.E. $R(\theta, \delta)$ is infinite for a given value of $\theta \in \Omega$, then the relation
(7.9.4) is automatically satisfied. We shall assume, therefore, that $R(\theta, \delta) < \infty$. It
follows from part (a) of Exercise 4 in Sec. 4.4 that

$$E_\theta([\delta(\mathbf{X}) - \theta]^2) \ge (E_\theta[\delta(\mathbf{X})] - \theta)^2,$$

and it can be shown that this same relationship must also hold if the expectations are
replaced by conditional expectations given $\mathbf{T}$. Therefore,

$$E_\theta([\delta(\mathbf{X}) - \theta]^2|\mathbf{T}) \ge (E_\theta[\delta(\mathbf{X})|\mathbf{T}] - \theta)^2 = [\delta_0(\mathbf{T}) - \theta]^2. \tag{7.9.5}$$

It now follows from relation (7.9.5) that

$$R(\theta, \delta_0) = E_\theta[\{\delta_0(\mathbf{T}) - \theta\}^2] \le E_\theta\{E_\theta[\{\delta(\mathbf{X}) - \theta\}^2|\mathbf{T}]\}
= E_\theta[\{\delta(\mathbf{X}) - \theta\}^2] = R(\theta, \delta),$$

where the next-to-last equality follows from Theorem 4.7.1, the law of total
probability for expectations. Hence, $R(\theta, \delta_0) \le R(\theta, \delta)$ for every value of $\theta \in \Omega$.

Finally, suppose that $R(\theta, \delta) < \infty$ and that $\delta(\mathbf{X})$ is not a function of $\mathbf{T}$. That is,
there is no function $g(\mathbf{T})$ such that $\Pr(\delta(\mathbf{X}) = g(\mathbf{T})|\mathbf{T}) = 1$. Then part (b) of Exercise 4
in Sec. 4.4 (conditional on $\mathbf{T}$) says that there is strict inequality in (7.9.4). ■
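A related way to see how large the improvement is, assuming $R(\theta, \delta) < \infty$, is the standard conditional variance decomposition. It is not part of the proof above, and the notation $\mathrm{Var}_\theta(\cdot \mid \mathbf{T})$ for a conditional variance is introduced here only for illustration.

```latex
\begin{align*}
E_\theta\{[\delta(\mathbf{X}) - \theta]^2 \mid \mathbf{T}\}
  &= \mathrm{Var}_\theta[\delta(\mathbf{X}) \mid \mathbf{T}]
     + \{E_\theta[\delta(\mathbf{X}) \mid \mathbf{T}] - \theta\}^2 \\
  &= \mathrm{Var}_\theta[\delta(\mathbf{X}) \mid \mathbf{T}]
     + [\delta_0(\mathbf{T}) - \theta]^2 .
\end{align*}
% Taking expectations of both sides with respect to the distribution of T:
\begin{equation*}
R(\theta, \delta)
  = E_\theta\{\mathrm{Var}_\theta[\delta(\mathbf{X}) \mid \mathbf{T}]\}
    + R(\theta, \delta_0).
\end{equation*}
```

Under this reading, the gap between the two mean squared errors is the expected conditional variance of $\delta(\mathbf{X})$ given $\mathbf{T}$, which vanishes only when $\delta(\mathbf{X})$ is, with probability 1, a function of $\mathbf{T}$.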


Example 7.9.2. Customer Arrivals. Return now to Example 7.9.1. Let $\theta$ stand for the rate of
customer arrivals in units per hour. Then $\mathbf{X}$ forms a random sample from the Poisson
distribution with mean $\theta$. Example 7.7.2 shows us that a sufficient statistic is
$T = \sum_{i=1}^n X_i$. The distribution of $T$ is the Poisson distribution with mean $n\theta$. We shall
now compute

$$\delta_0(T) = E[\delta(\mathbf{X})|T],$$

where $\delta(\mathbf{X}) = \sum_{i=1}^n Y_i/n$ was defined in Example 7.9.1. (Recall that $Y_i = 1$ if $X_i = 1$
and $Y_i = 0$ if $X_i \ne 1$ so that $\delta(\mathbf{X})$ is the proportion of hours in which exactly one
customer arrives.) For each $i$ and each possible value $t$ of $T$, it is easy to see that

$$E(Y_i|T = t) = \Pr(X_i = 1|T = t) = \frac{\Pr(X_i = 1, T = t)}{\Pr(T = t)}
= \frac{\Pr\left(X_i = 1, \sum_{j \ne i} X_j = t - 1\right)}{\Pr(T = t)}.$$

For $t = 0$, $\Pr(X_i = 1|T = 0) = 0$ trivially. For $t > 0$, we see that

$$\Pr(T = t) = \frac{e^{-n\theta}(n\theta)^t}{t!},$$

$$\Pr\left(X_i = 1, \sum_{j \ne i} X_j = t - 1\right)
= e^{-\theta}\theta \times \frac{e^{-(n-1)\theta}([n - 1]\theta)^{t-1}}{(t - 1)!}
= \frac{e^{-n\theta}[n - 1]^{t-1}\theta^t}{(t - 1)!}.$$

The ratio of these two probabilities is

$$E(Y_i|T = t) = \frac{t}{n}\left(1 - \frac{1}{n}\right)^{t-1}. \tag{7.9.6}$$

It follows that

$$\delta_0(t) = E[\delta(\mathbf{X})|T = t] = E\left[\frac{1}{n}\sum_{i=1}^n Y_i \,\middle|\, T = t\right]
= \frac{1}{n}\sum_{i=1}^n E(Y_i|T = t).$$

According to Eq. (7.9.6), all $E(Y_i|T = t)$ are the same, so $\delta_0(t)$ is the right-hand side
of Eq. (7.9.6). That $\delta_0(T)$ is better than $\delta(\mathbf{X})$ under squared error loss follows from
Theorem 7.9.1. ◄
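The computation in Example 7.9.2 can be checked numerically. The sketch below is an illustration under stated assumptions: $n \ge 2$, hypothetical function names, and a truncated sum (`t_max`) in place of the infinite sum over the values of $T$. It compares the conditional probability computed from Poisson probabilities with the closed form in Eq. (7.9.6), and then compares the two mean squared errors, using the fact that both estimators are unbiased for $p = \theta e^{-\theta}$.

```python
import math

def pois_pmf(k, mean):
    """Poisson p.f. at k for a strictly positive mean."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def cond_prob(n, t, theta):
    """Pr(X_i = 1 | T = t) as a ratio of Poisson probabilities (requires n >= 2)."""
    if t == 0:
        return 0.0
    joint = pois_pmf(1, theta) * pois_pmf(t - 1, (n - 1) * theta)
    return joint / pois_pmf(t, n * theta)

def delta0(n, t):
    """Closed form from Eq. (7.9.6): (t/n) * (1 - 1/n)**(t - 1)."""
    return 0.0 if t == 0 else (t / n) * (1 - 1 / n) ** (t - 1)

def mse_delta(n, theta):
    """M.S.E. of the sample proportion of hours with exactly one arrival."""
    p = theta * math.exp(-theta)
    return p * (1 - p) / n

def mse_delta0(n, theta, t_max=400):
    """M.S.E. of delta0(T), summing over T = 0, ..., t_max (truncation assumed)."""
    p = theta * math.exp(-theta)
    return sum(pois_pmf(t, n * theta) * (delta0(n, t) - p) ** 2
               for t in range(t_max + 1))

n = 10  # hypothetical number of hours
for theta in (0.5, 1.0, 2.0):
    print(theta, mse_delta(n, theta), mse_delta0(n, theta))
```

For these hypothetical values the conditioned estimator should show the smaller mean squared error at each rate, as Theorem 7.9.1 predicts.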


A result similar to Theorem 7.9.1 holds if $R(\theta, \delta)$ is defined as the M.A.E. of
an estimator for a given value of $\theta \in \Omega$ instead of the M.S.E. of $\delta$. In other words,
suppose that $R(\theta, \delta)$ is defined as follows:

$$R(\theta, \delta) = E_\theta(|\delta(\mathbf{X}) - \theta|). \tag{7.9.7}$$

Then it can be shown (see Exercise 10 at the end of this section) that Theorem 7.9.1
is still true.


Definition 7.9.1. Inadmissible/Admissible/Dominates. Suppose that $R(\theta, \delta)$ is defined by
either Eq. (7.9.1) or Eq. (7.9.7). It is said that an estimator $\delta$ is inadmissible if there
exists another estimator $\delta_0$ such that $R(\theta, \delta_0) \le R(\theta, \delta)$ for every value of $\theta \in \Omega$ and
there is strict inequality in this relation for at least one value of $\theta \in \Omega$. Under these
conditions, it is also said that the estimator $\delta_0$ dominates the estimator $\delta$. An estimator
$\delta_0$ is admissible if there is no other estimator that dominates $\delta_0$.


In the terminology of Definition 7.9.1, Theorem 7.9.1 can be summarized as
follows: An estimator $\delta$ that is not a function of the sufficient statistic $T$ alone must
be inadmissible. Theorem 7.9.1 also explicitly identifies an estimator $\delta_0 = E(\delta(\mathbf{X}) \mid T)$
that dominates $\delta$. However, this part of the theorem is somewhat less useful in a
practical problem, because it is usually very difficult to calculate the conditional
expectation $E(\delta(\mathbf{X}) \mid T)$. Theorem 7.9.1 is valuable mainly because it provides further
strong evidence that we can restrict our search for a good estimator of $\theta$ to those
estimators that depend on the observations only through a sufficient statistic.


Estimating the Mean of a Normal Distribution. Suppose that $X_1, \ldots, X_n$ form a random
sample from a normal distribution for which the mean $\mu$ is unknown and the variance
is known, and let $Y_1 \le \cdots \le Y_n$ denote the order statistics of the sample, as defined
in Sec. 7.8. If $n$ is an odd number, then the middle observation $Y_{(n+1)/2}$ is called the
sample median. If $n$ is an even number, then each value between the two middle
observations $Y_{n/2}$ and $Y_{(n/2)+1}$ is a sample median, but the particular value
$\frac{1}{2}[Y_{n/2} + Y_{(n/2)+1}]$ is often referred to as the sample median.

Since the normal distribution from which the sample is drawn is symmetric with
respect to the point $\mu$, the median of the normal distribution is $\mu$. Therefore, we might
consider the use of the sample median, or a simple function of the sample median,
as an estimator of $\mu$. However, it was shown in Example 7.7.4 that the sample mean
$\bar{X}_n$ is a sufficient statistic for $\mu$. It follows from Theorem 7.9.1 that every function
of the sample median that might be used as an estimator of $\mu$ will be dominated by
some other function of $\bar{X}_n$. In searching for an estimator of $\mu$, we need consider only
functions of $\bar{X}_n$. ◄
Example 7.9.4 Estimating the Standard Deviation of a Normal Distribution. Suppose that
$X_1, \ldots, X_n$ form a random sample from a normal distribution for which both the mean $\mu$ and
the variance $\sigma^2$ are unknown, and again let $Y_1 \le \cdots \le Y_n$ denote the order statistics
of the sample. The difference $Y_n - Y_1$ is called the range of the sample, and we might
consider using some simple function of the range as an estimator of the standard
deviation $\sigma$. However, it was shown in Example 7.8.2 that the statistics $\sum_{i=1}^n X_i$ and
$\sum_{i=1}^n X_i^2$ are jointly sufficient for the parameters $\mu$ and $\sigma^2$. Therefore, every function
of the range that might be used as an estimator of $\sigma$ will be dominated by a function
of $\sum_{i=1}^n X_i$ and $\sum_{i=1}^n X_i^2$. ◄

Example 7.9.5 Failure Times of Ball Bearings. Suppose that we wish to estimate the mean failure
time of the ball bearings described in Example 5.6.9 based on the sample of 23 observed
failure times. Let $Y_1, \ldots, Y_{23}$ be the observed failure times (not the logarithms). We
might consider using the average $\bar{Y}_n = \frac{1}{23}\sum_{i=1}^{23} Y_i$ as an estimator. Suppose that we
continue to model the logarithms $X_i = \log(Y_i)$ as normal random variables with mean
$\theta$ and variance 0.25. Then $Y_i$ has the lognormal distribution with parameters $\theta$ and
0.25. From Eq. (5.6.15), the mean of $Y_i$ is $\exp(\theta + 0.125)$, the mean failure time.
However, we know that $\bar{X}_n$ is sufficient. Since $\bar{Y}_n$ is not a function of $\bar{X}_n$, there is
a function of $\bar{X}_n$ that improves on $\bar{Y}_n$ as an estimator of the mean failure time. We
can actually find which function that is. First, write

$$E(\bar{Y}_n \mid \bar{X}_n) = \frac{1}{n}\sum_{i=1}^n E(Y_i \mid \bar{X}_n). \qquad (7.9.8)$$

In Exercise 15 of Sec. 5.10, you proved that the conditional distribution of $X_i$ given
$\bar{X}_n = \bar{x}_n$ is the normal distribution with mean $\bar{x}_n$ and variance $0.25(1 - 1/n)$ for every
$i$. It follows that, for each $i$, the conditional distribution of $Y_i$ given $\bar{X}_n$ is the lognormal
distribution with parameters $\bar{X}_n$ and $0.25(1 - 1/n)$. Hence, it follows from Eq. (5.6.15)
that the conditional mean of $Y_i$ given $\bar{X}_n$ is $\exp[\bar{X}_n + 0.125(1 - 1/n)]$ for all $i$, and
Eq. (7.9.8) equals $\exp[\bar{X}_n + 0.125(1 - 1/n)]$ as well. ◄
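
The improvement described in the ball-bearing example can be illustrated by simulation. The sketch below is not from the original text and uses simulated data rather than the ball-bearing sample. It draws lognormal samples of size 23 with log-variance 0.25 and compares the M.S.E. of the plain average of the failure times with that of exp[X-bar + 0.125(1 - 1/n)]. The function name, the value theta = 4.0, the number of replications, and the seed are assumptions chosen only for illustration.

```python
import math
import random


def compare_estimators(theta, n=23, var=0.25, reps=20000, seed=1):
    """Monte Carlo M.S.E. of the average of the Y_i and of E(average | X-bar)."""
    rng = random.Random(seed)
    target = math.exp(theta + var / 2)   # mean failure time, exp(theta + 0.125)
    sd = math.sqrt(var)
    shift = (var / 2) * (1 - 1 / n)      # 0.125 * (1 - 1/n) when var = 0.25
    se_avg = se_cond = 0.0
    for _ in range(reps):
        xs = [rng.gauss(theta, sd) for _ in range(n)]   # log failure times
        y_bar = sum(math.exp(x) for x in xs) / n        # average failure time
        x_bar = sum(xs) / n                             # sufficient statistic
        cond = math.exp(x_bar + shift)                  # conditional mean from Eq. (7.9.8)
        se_avg += (y_bar - target) ** 2
        se_cond += (cond - target) ** 2
    return se_avg / reps, se_cond / reps


if __name__ == "__main__":
    mse_avg, mse_cond = compare_estimators(theta=4.0)
    print(mse_avg, mse_cond)
```

Both estimators are unbiased for the mean failure time, so the comparison is between their variances. The simulated values depend on the seed and illustrate the ordering only.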


Limitation of the Use of Sufficient Statistics 


When the foregoing theory of sufficient statistics is applied in a statistical problem, 
it is important to keep in mind the following limitation. The existence and the form 
of a sufficient statistic in a particular problem depend critically on the form of the 
function assumed for the p.d.f. or the p.f. A statistic that is a sufficient statistic when it
is assumed that the p.d.f. is $f(x \mid \theta)$ may not be a sufficient statistic when it is assumed
that the p.d.f. is $g(x \mid \theta)$, even though $g(x \mid \theta)$ may be quite similar to $f(x \mid \theta)$ for every
value of $\theta \in \Omega$. Suppose that a statistician is in doubt about the exact form of the p.d.f.
in a specific problem but assumes for convenience that the p.d.f. is $f(x \mid \theta)$; suppose
also that the statistic $T$ is a sufficient statistic under this assumption. Because of the
statistician’s uncertainty about the exact form of the p.d.f., he may wish to use an
estimator of $\theta$ that performs reasonably well for a wide variety of possible p.d.f.’s,
even though the selected estimator may not meet the requirement that it should
depend on the observations only through the statistic $T$.

An estimator that performs reasonably well for a wide variety of possible p.d.f.’s,
even though it may not necessarily be the best available estimator for any particular
family of p.d.f.’s, is often called a robust estimator. We shall consider robust estimators
further in Chapter 10.

The preceding discussion also raises another useful point to keep in mind. In
Sec. 7.2, we introduced sensitivity analysis as a way to study the effect of the choice
of prior distribution on an inference. The same idea can be applied to any feature of
a statistical model that is chosen by a statistician. In particular, the distribution for
the observations given the parameters, as defined through $f(x \mid \theta)$, is often chosen for
convenience rather than through a careful analysis. One can perform an inference
repeatedly using different distributions for the observable data. The comparison of
the resulting inferences from each choice is another form of sensitivity analysis.


Summary 


Suppose that $T$ is a sufficient statistic, and we are trying to estimate a parameter with
squared error loss. Suppose that an estimator $\delta(\mathbf{X})$ is not a function of $T$. Then $\delta$ can
be improved by using $\delta_0(T)$, the conditional mean of $\delta(\mathbf{X})$ given $T$. Because $\delta_0(T)$ has
the same mean as $\delta(\mathbf{X})$ and its variance is no larger, it follows that $\delta_0(T)$ has M.S.E.
that is no larger than that of $\delta(\mathbf{X})$.


Exercises 


1. Suppose that the random variables $X_1, \ldots, X_n$ form a
random sample of size $n$ ($n \ge 2$) from the normal distribution
with mean 0 and unknown variance $\theta$. Suppose also
that for every estimator $\delta(X_1, \ldots, X_n)$, the M.S.E. $R(\theta, \delta)$
is defined by Eq. (7.9.1). Explain why the sample variance
is an inadmissible estimator of $\theta$.

2. Suppose that the random variables $X_1, \ldots, X_n$ form
a random sample of size $n$ ($n \ge 2$) from the uniform distribution
on the interval $[0, \theta]$, where the value of the
parameter $\theta$ is unknown ($\theta > 0$) and must be estimated.
Suppose also that for every estimator $\delta(X_1, \ldots, X_n)$, the
M.S.E. $R(\theta, \delta)$ is defined by Eq. (7.9.1). Explain why the
estimator $\delta_1(X_1, \ldots, X_n) = 2\bar{X}_n$ is inadmissible.

3. Consider again the conditions of Exercise 2, and let the
estimator $\delta_1$ be as defined in that exercise. Determine the
value of the M.S.E. $R(\theta, \delta_1)$ for $\theta > 0$.

4. Consider again the conditions of Exercise 2. Let $Y_n =
\max\{X_1, \ldots, X_n\}$ and consider the estimator
$\delta_2(X_1, \ldots, X_n) = Y_n$.

a. Determine the M.S.E. $R(\theta, \delta_2)$ for $\theta > 0$.

b. Show that for $n = 2$, $R(\theta, \delta_2) = R(\theta, \delta_1)$ for $\theta > 0$.

c. Show that for $n \ge 3$, the estimator $\delta_2$ dominates the
estimator $\delta_1$.

5. Consider again the conditions of Exercises 2 and 4.
Show that there exists a constant $c^*$ such that the estimator
$c^* Y_n$ dominates every other estimator having the form $cY_n$
for $c \neq c^*$.


6. Suppose that $X_1, \ldots, X_n$ form a random sample of size
$n$ ($n \ge 2$) from the gamma distribution with parameters $\alpha$
and $\beta$, where the value of $\alpha$ is unknown ($\alpha > 0$) and the
value of $\beta$ is known. Explain why $\bar{X}_n$ is an inadmissible
estimator of the mean of this distribution when the squared
error loss function is used.

7. Suppose that $X_1, \ldots, X_n$ form a random sample from
an exponential distribution for which the value of the
parameter $\beta$ is unknown ($\beta > 0$) and must be estimated by
using the squared error loss function. Let $\delta$ be the estimator
such that $\delta(X_1, \ldots, X_n) = 3$ for all possible values of
$X_1, \ldots, X_n$.

a. Determine the value of the M.S.E. $R(\beta, \delta)$ for $\beta > 0$.
b. Explain why the estimator $\delta$ must be admissible.

8. Suppose that a random sample of $n$ observations is
taken from a Poisson distribution for which the value of
the mean $\theta$ is unknown ($\theta > 0$), and the value of $\beta = e^{-\theta}$
must be estimated by using the squared error loss function.
Since $\beta$ is equal to the probability that an observation from
this Poisson distribution will have the value 0, a natural
estimator of $\beta$ is the proportion $\hat{\beta}$ of observations in the
random sample that have the value 0. Explain why $\hat{\beta}$ is an
inadmissible estimator of $\beta$.

9. For every random variable $X$, show that $|E(X)| \le
E(|X|)$.


10. Let $X_1, \ldots, X_n$ form a random sample from a distribution
for which the p.d.f. or the p.f. is $f(x \mid \theta)$, where $\theta \in \Omega$.
Suppose that the value of $\theta$ must be estimated, and that
$T$ is a sufficient statistic for $\theta$. Let $\delta$ be an arbitrary estimator
of $\theta$, and let $\delta_0$ be another estimator defined by the
relation $\delta_0 = E(\delta \mid T)$. Show that for every value of $\theta \in \Omega$,

$$E_\theta(|\delta_0 - \theta|) \le E_\theta(|\delta - \theta|).$$

11. Suppose that the variables $X_1, \ldots, X_n$ form a random
sample from a distribution for which the p.d.f. or the p.f.
is $f(x \mid \theta)$, where $\theta \in \Omega$, and let $\hat{\theta}$ denote the M.L.E. of
$\theta$. Suppose also that the statistic $T$ is a sufficient statistic
for $\theta$, and let the estimator $\delta_0$ be defined by the relation
$\delta_0 = E(\hat{\theta} \mid T)$. Compare the estimators $\hat{\theta}$ and $\delta_0$.

12. Suppose that $X_1, \ldots, X_n$ form a sequence of $n$ Bernoulli
trials for which the probability $p$ of success on any
given trial is unknown ($0 \le p \le 1$), and let $T = \sum_{i=1}^n X_i$.
Determine the form of the estimator $E(X_1 \mid T)$.

13. Suppose that $X_1, \ldots, X_n$ form a random sample from
a Poisson distribution for which the value of the mean $\theta$ is
unknown ($\theta > 0$). Let $T = \sum_{i=1}^n X_i$, and for $i = 1, \ldots, n$,
let the statistic $Y_i$ be defined as follows:

$$Y_i = \begin{cases} 1 & \text{if } X_i = 0, \\ 0 & \text{if } X_i > 0. \end{cases}$$

Determine the form of the estimator $E(Y_i \mid T)$.

14. Consider again the conditions of Exercise 8. Determine
the form of the estimator $E(\hat{\beta} \mid T)$. You may wish to
use results obtained while solving Exercise 13.


15. Find the M.L.E. of $\exp(\theta + 0.125)$ in Example 7.9.5.
Both the M.L.E. and the estimator in Example 7.9.5 have
the form $\exp(\bar{X}_n + c)$ for some constant $c$. Find the value $c$
so that the estimator $\exp(\bar{X}_n + c)$ has the smallest possible
M.S.E.

16. In Example 7.9.1, find the formula for $p$ in terms of
$\theta$, the mean of each $X_i$. Also find the M.L.E. of $p$ and
show that the estimator $\delta_0(T)$ in Example 7.9.2 is nearly
the same as the M.L.E. if $n$ is large.
1. A program will be run with 25 different sets of input.
Let $\theta$ stand for the probability that an execution error will
occur during a single run. We believe that, conditional on
$\theta$, each run of the program will encounter an error with
probability $\theta$ and that the different runs are independent.
Prior to running the program, we believe that $\theta$ has the
uniform distribution on the interval $[0, 1]$. Suppose that
we get errors during 10 of the 25 runs.

a. Find the posterior distribution of $\theta$.
b. If we wanted to estimate $\theta$ by $\hat{\theta}$ using squared error
loss, what would our estimate $\hat{\theta}$ be?

2. Suppose that $X_1, \ldots, X_n$ are i.i.d. with $\Pr(X_i = 1) = \theta$
and $\Pr(X_i = 0) = 1 - \theta$, where $\theta$ is unknown ($0 \le \theta \le 1$). Find
the M.L.E. of $\theta^2$.

3. Suppose that the proportion $\theta$ of bad apples in a large
lot is unknown and has the following prior p.d.f.:

$$\xi(\theta) = \begin{cases} 60\theta^2(1 - \theta)^3 & \text{for } 0 < \theta < 1, \\ 0 & \text{otherwise.} \end{cases}$$

Suppose that a random sample of 10 apples is drawn from
the lot, and it is found that three are bad. Find the Bayes
estimate of $\theta$ with respect to the squared error loss function.

4. Suppose that $X_1, \ldots, X_n$ form a random sample from
a uniform distribution with the following p.d.f.:

$$f(x \mid \theta) = \begin{cases} \frac{1}{\theta} & \text{for } \theta \le x \le 2\theta, \\ 0 & \text{otherwise.} \end{cases}$$

Assuming that the value of $\theta$ is unknown ($\theta > 0$), determine
the M.L.E. of $\theta$.


5. Suppose that $X_1$ and $X_2$ are independent random variables,
and that $X_i$ has the normal distribution with mean
$b_i\mu$ and variance $\sigma_i^2$ for $i = 1, 2$. Suppose also that $b_1$, $b_2$,
$\sigma_1^2$, and $\sigma_2^2$ are known positive constants, and that $\mu$ is an
unknown parameter. Determine the M.L.E. of $\mu$ based on
$X_1$ and $X_2$.

6. Let $\psi(\alpha) = \Gamma'(\alpha)/\Gamma(\alpha)$ for $\alpha > 0$ (the digamma function).
Show that

$$\psi(\alpha + 1) = \psi(\alpha) + \frac{1}{\alpha}.$$

7. Suppose that a regular light bulb, a long-life light bulb,
and an extra-long-life light bulb are being tested. The lifetime
$X_1$ of the regular bulb has the exponential distribution
with mean $\theta$, the lifetime $X_2$ of the long-life bulb has
the exponential distribution with mean $2\theta$, and the lifetime
$X_3$ of the extra-long-life bulb has the exponential
distribution with mean $3\theta$.

a. Determine the M.L.E. of $\theta$ based on the observations
$X_1$, $X_2$, and $X_3$.

b. Let $\psi = 1/\theta$, and suppose that the prior distribution
of $\psi$ is the gamma distribution with parameters $\alpha$ and
$\beta$. Determine the posterior distribution of $\psi$ given
$X_1$, $X_2$, and $X_3$.


8. Consider a Markov chain with two possible states $s_1$
and $s_2$ and with stationary transition probabilities as given
in the following transition matrix $\mathbf{P}$:

$$\mathbf{P} = \begin{array}{c|cc} & s_1 & s_2 \\ \hline s_1 & \theta & 1 - \theta \\ s_2 & 3/4 & 1/4 \end{array}$$

where the value of $\theta$ is unknown ($0 \le \theta \le 1$). Suppose that
the initial state $X_1$ of the chain is $s_1$, and let $X_2, \ldots, X_{n+1}$
denote the state of the chain at each of the next $n$ successive
periods. Determine the M.L.E. of $\theta$ based on the
observations $X_2, \ldots, X_{n+1}$.


9. Suppose that an observation $X$ is drawn from a distribution
with the following p.d.f.:

$$f(x \mid \theta) = \begin{cases} \frac{1}{\theta} & \text{for } 0 < x < \theta, \\ 0 & \text{otherwise.} \end{cases}$$

Also, suppose that the prior p.d.f. of $\theta$ is

$$\xi(\theta) = \begin{cases} \theta e^{-\theta} & \text{for } \theta > 0, \\ 0 & \text{otherwise.} \end{cases}$$

Determine the Bayes estimator of $\theta$ with respect to (a) the
mean squared error loss function and (b) the absolute
error loss function.

10. Suppose that $X_1, \ldots, X_n$ form $n$ Bernoulli trials with
parameter $\theta = (1/3)(1 + \beta)$, where the value of $\beta$ is unknown
($0 \le \beta \le 1$). Determine the M.L.E. of $\beta$.


11. The method of randomized response is sometimes 
used to conduct surveys on sensitive topics. A simple ver- 
sion of the method can be described as follows: A random 
sample of n persons is drawn from a large population. For 
each person in the sample there is probability 1/2 that the 
person will be asked a standard question and probability 
1/2 that the person will be asked a sensitive question. Fur- 
thermore, this selection of the standard or the sensitive 
question is made independently from person to person. 
If a person is asked the standard question, then there is 
probability 1/2 that she will give a positive response; how- 
ever if she is asked the sensitive question, then there is 


an unknown probability p that she will give a positive re- 
sponse. The statistician can observe only the total number 
X of positive responses that were given by the n persons 
in the sample. He cannot observe which of these persons 
were asked the sensitive question or how many persons in 
the sample were asked the sensitive question. Determine 
the M.L.E. of p based on the observation X. 


12. Suppose that a random sample of four observations is
to be drawn from the uniform distribution on the interval
$[0, \theta]$, and that the prior distribution of $\theta$ has the following
p.d.f.:

$$\xi(\theta) = \begin{cases} \frac{1}{\theta^2} & \text{for } \theta \ge 1, \\ 0 & \text{otherwise.} \end{cases}$$

Suppose that the values of the observations in the sample
are found to be 0.6, 0.4, 0.8, and 0.9. Determine the
Bayes estimate of $\theta$ with respect to the squared error loss
function.


13. For the conditions of Exercise 12, determine the
Bayes estimate of $\theta$ with respect to the absolute error loss
function.


14. Suppose that $X_1, \ldots, X_n$ form a random sample from
a distribution with the following p.d.f.:

$$f(x \mid \beta, \theta) = \begin{cases} \beta e^{-\beta(x - \theta)} & \text{for } x \ge \theta, \\ 0 & \text{otherwise,} \end{cases}$$

where $\beta$ and $\theta$ are unknown ($\beta > 0$, $-\infty < \theta < \infty$). Determine
a pair of jointly sufficient statistics.

15. Suppose that $X_1, \ldots, X_n$ form a random sample from
the Pareto distribution with parameters $x_0$ and $\alpha$ (see Exercise 16
of Sec. 5.7), where $x_0$ is unknown and $\alpha$ is known.
Determine the M.L.E. of $x_0$.


16. Determine whether the estimator found in Exer- 
cise 15 is a minimal sufficient statistic. 


17. Consider again the conditions of Exercise 15, but suppose
now that both parameters $x_0$ and $\alpha$ are unknown.
Determine the M.L.E.’s of $x_0$ and $\alpha$.


18. Determine whether the estimators found in Exer- 
cise 17 are minimal jointly sufficient statistics. 


19. Suppose that the random variable $X$ has a binomial
distribution with an unknown value of $n$ and a known
value of $p$ ($0 < p < 1$). Determine the M.L.E. of $n$ based
on the observation $X$. Hint: Consider the ratio

$$\frac{f(x \mid n + 1, p)}{f(x \mid n, p)}.$$

20. Suppose that two observations $X_1$ and $X_2$ are drawn
at random from a uniform distribution with the following
p.d.f.:

$$f(x \mid \theta) = \begin{cases} \frac{1}{2\theta} & \text{for } 0 \le x \le \theta \text{ or } 2\theta \le x \le 3\theta, \\ 0 & \text{otherwise,} \end{cases}$$

where the value of $\theta$ is unknown ($\theta > 0$). Determine the
M.L.E. of $\theta$ for each of the following pairs of observed
values of $X_1$ and $X_2$:

a. $X_1 = 7$ and $X_2 = 9$
b. $X_1 = 4$ and $X_2 = 9$
c. $X_1 = 5$ and $X_2 = 9$

21. Suppose that a random sample $X_1, \ldots, X_n$ is to be
taken from the normal distribution with unknown mean
$\theta$ and variance 100, and the prior distribution of $\theta$ is the
normal distribution with specified mean $\mu$ and variance
25. Suppose that $\theta$ is to be estimated using the squared
error loss function, and the sampling cost of each observation
is 0.25 (in appropriate units). If the total cost of the
estimation procedure is equal to the expected loss of the
Bayes estimator plus the sampling cost $(0.25)n$, what is the
sample size $n$ for which the total cost will be a minimum?

22. Suppose that $X_1, \ldots, X_n$ form a random sample from
the Poisson distribution with unknown mean $\theta$, and the
variance of this distribution is to be estimated using the
squared error loss function. Determine whether or not the
sample variance is an admissible estimator.


23. The formulas (7.5.6) for the sample mean and sample
variance are of theoretical importance, but they can
be inefficient or produce inaccurate results if used for numerical
calculation with very large samples. For example,
let $x_1, x_2, \ldots$ be a sequence of real numbers. Computing
$\sum_{i=1}^n (x_i - \bar{x}_n)^2$ directly requires that we first compute $\bar{x}_n$
and then still have all $n$ observations available so that we
can compute $x_i - \bar{x}_n$ for each $i$. Also, if $n$ is very large,
then computing $\bar{x}_n$ by adding the $x_i$’s together can produce
large rounding errors once the next $x_i$ becomes very
small relative to the accumulated sum.

a. Prove the seemingly more efficient formula

$$\sum_{i=1}^n (x_i - \bar{x}_n)^2 = \sum_{i=1}^n x_i^2 - n\bar{x}_n^2.$$

With this formula, we could accumulate the sum
of the $x_i$’s and $x_i^2$’s separately and forget each observation
afterward. We would still suffer the rounding
error problem mentioned above.

b. Prove the following formulas that reduce the rounding
error problem in accumulating a sum. For each
integer $n$,

$$\bar{x}_{n+1} = \bar{x}_n + \frac{1}{n+1}(x_{n+1} - \bar{x}_n),$$

$$\sum_{i=1}^{n+1} (x_i - \bar{x}_{n+1})^2 = \sum_{i=1}^n (x_i - \bar{x}_n)^2 + \frac{n}{n+1}(x_{n+1} - \bar{x}_n)^2.$$

These formulas allow us to forget each $x_i$ after we use
it to update the two formulas.
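
The two updating formulas in part (b) of Exercise 23 translate directly into a one-pass computation. The following Python sketch is an illustration rather than part of the exercise: it keeps only the current count, the running mean, and the running sum of squared deviations, and discards each observation after using it. The function name and the sample data are assumptions made for the example.

```python
def running_mean_and_ss(xs):
    """Return (n, mean, sum of squared deviations) using the one-pass updates."""
    n = 0
    mean = 0.0
    ss = 0.0
    for x in xs:
        diff = x - mean                  # x_{n+1} minus the current mean
        ss += diff * diff * n / (n + 1)  # adds (n / (n + 1)) * diff**2
        mean += diff / (n + 1)           # adds diff / (n + 1)
        n += 1
    return n, mean, ss


if __name__ == "__main__":
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    n, mean, ss = running_mean_and_ss(data)
    print(n, mean, ss)
```

The sum of squared deviations is updated before the mean because the second formula uses the mean of the first n observations, not the updated one.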
8.1 The Sampling Distribution of a Statistic


A statistic is a function of some observable random variables, and hence is itself a 
random variable with a distribution. That distribution is its sampling distribution, 
and it tells us what values the statistic is likely to assume and how likely it is to 
assume those values prior to observing our data. When the distribution of the 
observable data is indexed by a parameter, the sampling distribution is specified 
as the distribution of the statistic for a given value of the parameter. 


Statistics and Estimators 


A Clinical Trial. In the clinical trial first introduced in Example 2.1.4, let $\theta$ stand for
the proportion who do not relapse among all possible imipramine patients. We could
use the observed proportion of patients without relapse in the imipramine group to
estimate $\theta$. Prior to observing the data, the proportion of sampled patients with no
relapse is a random variable $T$ that has a distribution and will not exactly equal the
parameter $\theta$. However, we hope that $T$ will be close to $\theta$ with high probability. For
example, we could try to compute the probability that $|T - \theta| < 0.1$. Such calculations
require that we know the distribution of the random variable $T$. In the clinical trial,
we modeled the responses of the 40 patients in the imipramine group as conditionally
(given $\theta$) i.i.d. Bernoulli random variables with parameter $\theta$. It follows that the
conditional distribution of $40T$ given $\theta$ is the binomial distribution with parameters
40 and $\theta$. The distribution of $T$ can be derived easily from this. Indeed, $T$ has the
following p.f. given $\theta$:

$$f(t \mid \theta) = \binom{40}{40t}\theta^{40t}(1 - \theta)^{40(1 - t)}, \quad \text{for } t = 0, \tfrac{1}{40}, \tfrac{2}{40}, \ldots, \tfrac{39}{40}, 1,$$

and $f(t \mid \theta) = 0$ otherwise. ◄


The distribution at the end of Example 8.1.1 is called the sampling distribution of
the statistic $T$, and we can use it to help address questions such as how close we expect
$T$ to be to $\theta$ prior to observing the data. We can also use the sampling distribution
of $T$ to help to determine how much we will learn about $\theta$ by observing $T$. If we are
trying to decide which of two different statistics to use as an estimator, their sampling
distributions can be useful for helping us to compare them.

The concept of sampling distribution applies to a larger class of random variables 
than statistics. 


Sampling Distribution. Suppose that the random variables $\mathbf{X} = (X_1, \ldots, X_n)$ form a
random sample from a distribution involving a parameter $\theta$ whose value is unknown.
Let $T$ be a function of $\mathbf{X}$ and possibly $\theta$. That is, $T = r(X_1, \ldots, X_n, \theta)$. The distribution
of $T$ (given $\theta$) is called the sampling distribution of $T$. We will use the notation
$E_\theta(T)$ to denote the mean of $T$ calculated from its sampling distribution.


The name “sampling distribution” comes from the fact that T depends on a random 
sample and so its distribution is derived from the distribution of the sample. 

Often, the random variable $T$ in Definition 8.1.1 will not depend on $\theta$, and hence
will be a statistic as defined in Definition 7.1.4. In particular, if $T$ is an estimator
of $\theta$ (as defined in Definition 7.4.1), then $T$ is also a statistic because it is a function
of $\mathbf{X}$. Therefore, in principle, it is possible to derive the sampling distribution of each
estimator of $\theta$. In fact, the distributions of many estimators and statistics have already
been found in previous parts of this book.


Sampling Distribution of the M.L.E. of the Mean of a Normal Distribution. Suppose
that $X_1, \ldots, X_n$ form a random sample from the normal distribution with mean $\mu$
and variance $\sigma^2$. We found in Examples 7.5.5 and 7.5.6 that the sample mean $\bar{X}_n$ is
the M.L.E. of $\mu$. Furthermore, it was found in Corollary 5.6.2 that the distribution of
$\bar{X}_n$ is the normal distribution with mean $\mu$ and variance $\sigma^2/n$. ◄


In this chapter, we shall derive, for random samples from a normal distribution, 
the distribution of the sample variance and the distributions of various functions 
of the sample mean and the sample variance. These derivations will lead us to 
the definitions of some new distributions that play important roles in problems 
of statistical inference. In addition, we shall study certain general properties of 
estimators and their sampling distributions. 


Purpose of the Sampling Distribution 


Lifetimes of Electronic Components. Consider the company in Example 7.1.1 that
sells electronic components. They model the lifetimes of these components as i.i.d.
exponential random variables with parameter $\theta$ conditional on $\theta$. They model $\theta$ as
having the gamma distribution with parameters 1 and 2. Now, suppose that they are
about to observe $n = 3$ lifetimes, and they will use the posterior mean of $\theta$ as an
estimator. According to Theorem 7.3.4, the posterior distribution of $\theta$ will be the
gamma distribution with parameters $1 + 3 = 4$ and $2 + \sum_{i=1}^3 X_i$. The posterior mean
will then be $\hat{\theta} = 4/(2 + \sum_{i=1}^3 X_i)$.

Prior to observing the three lifetimes, the company may want to know how likely
it is that $\hat{\theta}$ will be close to $\theta$. For example, they may want to compute $\Pr(|\hat{\theta} - \theta| < 0.1)$.
In addition, other interested parties such as customers might be interested in how
close the estimator is going to be to $\theta$. But these others might not wish to assign
the same prior distribution to $\theta$. Indeed, some of them may wish to assign no prior
distribution at all. We shall soon see that all of these people will find it useful to
determine the sampling distribution of $\hat{\theta}$. What they do with that sampling distribution
will differ, but they will all be able to make use of the sampling distribution. ◄
In Example 8.1.3, after the company observes the three lifetimes, they will be
interested only in the posterior distribution of $\theta$. They could then compute the
posterior probability that $|\hat{\theta} - \theta| < 0.1$. However, before the sample is taken, both $\hat{\theta}$
and $\theta$ are random and $\Pr(|\hat{\theta} - \theta| < 0.1)$ involves the joint distribution of $\hat{\theta}$ and $\theta$. The
sampling distribution is merely the conditional distribution of $\hat{\theta}$ given $\theta$. Hence, the
law of total probability says that

$$\Pr(|\hat{\theta} - \theta| < 0.1) = E\left[\Pr(|\hat{\theta} - \theta| < 0.1 \mid \theta)\right].$$

In this way, the company makes use of the sampling distribution of $\hat{\theta}$ as an intermediate
calculation on the way to computing $\Pr(|\hat{\theta} - \theta| < 0.1)$.


Lifetimes of Electronic Components. In Example 8.1.3, the sampling distribution of $\hat{\theta}$
does not have a name, but it is easy to see that $\hat{\theta}$ is a monotone function of the statistic
$T = \sum_{i=1}^3 X_i$ that has the gamma distribution with parameters 3 and $\theta$ (conditional
on $\theta$). So, we can compute the c.d.f. $F(\cdot \mid \theta)$ for the sampling distribution of $\hat{\theta}$ from the
c.d.f. $G(\cdot \mid \theta)$ of the distribution of $T$. Argue as follows. For $t > 0$,

$$F(t \mid \theta) = \Pr(\hat{\theta} \le t \mid \theta) = \Pr\left(\frac{4}{2 + T} \le t \,\Big|\, \theta\right) = \Pr\left(T \ge \frac{4}{t} - 2 \,\Big|\, \theta\right) = 1 - G\left(\frac{4}{t} - 2 \,\Big|\, \theta\right).$$

For $t \le 0$, $F(t \mid \theta) = 0$. Most statistical computer packages include the function $G$,
which is the c.d.f. of a gamma distribution. The company can now compute, for each
$\theta$,

$$\Pr(|\hat{\theta} - \theta| < 0.1 \mid \theta) = F(\theta + 0.1 \mid \theta) - F(\theta - 0.1 \mid \theta). \qquad (8.1.1)$$

Figure 8.1 shows a graph of this probability as a function of $\theta$. To complete the
calculation of $\Pr(|\hat{\theta} - \theta| < 0.1)$, we must integrate (8.1.1) with respect to the distribution
of $\theta$, that is, the gamma distribution with parameters 1 and 2. This integral cannot
be performed in closed form and requires a numerical approximation. One such
approximation would be a simulation, which will be discussed in Chapter 12. In this
example, the approximation yields $\Pr(|\hat{\theta} - \theta| < 0.1) \approx 0.478$.

Also included in Fig. 8.1 is the calculation of $\Pr(|\hat{\theta} - \theta| < 0.1 \mid \theta)$ using $\hat{\theta} = 3/T$, the
M.L.E. of $\theta$. The sampling distribution of the M.L.E. can be derived in Exercise 9 at
the end of this section. Notice that the posterior mean has higher probability of being
close to $\theta$ than does the M.L.E. when $\theta$ is near the mean of the prior distribution.
When $\theta$ is far from the prior mean, the M.L.E. has higher probability of being close
to $\theta$. ◄
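
The calculation in this example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used in the text: the text mentions simulation, whereas the sketch below swaps in a simple midpoint-rule integration. It uses the closed form of the gamma c.d.f. that holds when the first parameter is the integer 3, so no statistical package is needed. The function names, the truncation point 20, and the number of grid steps are all assumptions made for the example.

```python
import math


def gamma3_cdf(x, rate):
    """G(x | rate): c.d.f. of the gamma distribution with parameters 3 and rate."""
    if x <= 0:
        return 0.0
    z = rate * x
    return 1.0 - math.exp(-z) * (1.0 + z + z * z / 2.0)


def post_mean_cdf(t, theta):
    """F(t | theta) = Pr(4 / (2 + T) <= t | theta) = 1 - G(4/t - 2 | theta)."""
    if t <= 0:
        return 0.0
    return 1.0 - gamma3_cdf(4.0 / t - 2.0, theta)


def prob_close(theta, eps=0.1):
    """Right-hand side of Eq. (8.1.1) for the posterior mean."""
    return post_mean_cdf(theta + eps, theta) - post_mean_cdf(theta - eps, theta)


def marginal_prob(eps=0.1, upper=20.0, steps=20000):
    """Midpoint-rule integral of Eq. (8.1.1) against the prior density 2*exp(-2*theta)."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        theta = (k + 0.5) * h
        total += prob_close(theta, eps) * 2.0 * math.exp(-2.0 * theta)
    return total * h


if __name__ == "__main__":
    print(prob_close(0.5))   # probability at the prior mean
    print(marginal_prob())   # compare with the value of about 0.478 reported in the text
```

Truncating the integral at 20 discards prior probability of only exp(-40), so the truncation error is negligible next to the grid error.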


Another case in which the sampling distribution of an estimator is needed is 
when the statistician must decide which one of two or more available experiments 
should be performed in order to obtain the best estimator of $\theta$. For example, if she
must choose which sample size to use for an experiment, then she will typically base 
her decision on the sampling distributions of the different estimators that might be 
used for each sample size. 


Figure 8.1 Plot of $\Pr(|\hat{\theta} - \theta| < 0.1 \mid \theta)$ for both $\hat{\theta}$ equal to the posterior mean and
$\hat{\theta}$ equal to the M.L.E. in Example 8.1.4.


As mentioned at the end of Example 8.1.3, there are statisticians who do not wish
to assign a prior distribution to $\theta$. Those statisticians would not be able to calculate a
posterior distribution for $\theta$. Instead, they would base all of their statistical inferences
on the sampling distribution of whatever estimators they chose. For example, a
statistician who chose to use the M.L.E. of $\theta$ in Example 8.1.4 would need to deal
with the entire curve in Fig. 8.1 corresponding to the M.L.E. in order to decide how
likely it is that the M.L.E. will be closer to $\theta$ than 0.1. Alternatively, she might choose
a different measure of how close the M.L.E. is to $\theta$.


Example 8.1.5 Lifetimes of Electronic Components. Suppose that a statistician chooses to
estimate $\theta$ by the M.L.E., $\hat{\theta} = 3/T$, instead of the posterior mean in Example 8.1.4. This
statistician may not find the graph in Fig. 8.1 very useful unless she can decide which
$\theta$ values are most important to consider. Instead of calculating $\Pr(|\hat{\theta} - \theta| < 0.1 \mid \theta)$,
she might compute

$$\Pr\left(\left|\frac{\hat{\theta}}{\theta} - 1\right| < 0.1 \,\Big|\, \theta\right). \qquad (8.1.2)$$

This is the probability that $\hat{\theta}$ is within 10% of the value of $\theta$. The probability in (8.1.2)
could be computed from the sampling distribution of the M.L.E. Alternatively, one
can notice that $\hat{\theta}/\theta = 3/(\theta T)$, and the distribution of $\theta T$ is the gamma distribution
with parameters 3 and 1. Hence, $\hat{\theta}/\theta$ has a distribution that does not depend on $\theta$.
It follows that $\Pr(|\hat{\theta}/\theta - 1| < 0.1 \mid \theta)$ is the same number for all $\theta$. In the notation of
Example 8.1.4, the c.d.f. of $\theta T$ is $G(\cdot \mid 1)$, and hence

$$\Pr\left(\left|\frac{\hat{\theta}}{\theta} - 1\right| < 0.1 \,\Big|\, \theta\right) = \Pr\left(\left|\frac{3}{\theta T} - 1\right| < 0.1 \,\Big|\, \theta\right) = \Pr\left(0.9 < \frac{3}{\theta T} < 1.1 \,\Big|\, \theta\right)$$

$$= \Pr(2.73 < \theta T < 3.33 \mid \theta) = G(3.33 \mid 1) - G(2.73 \mid 1) = 0.134.$$

The statistician can now claim that the probability is 0.134 that the M.L.E. of $\theta$ will
be within 10% of the value of $\theta$, no matter what $\theta$ is. ◄
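
The value 0.134 and its independence from the parameter can be reproduced with a short Python sketch. It is offered as an illustration only: the function names are hypothetical, and it uses the closed form of the gamma c.d.f. for first parameter 3 and the unrounded endpoints 3/1.1 and 3/0.9 rather than 2.73 and 3.33.

```python
import math


def gamma3_cdf(x, rate=1.0):
    """G(x | rate): c.d.f. of the gamma distribution with parameters 3 and rate."""
    if x <= 0:
        return 0.0
    z = rate * x
    return 1.0 - math.exp(-z) * (1.0 + z + z * z / 2.0)


def prob_within_fraction(frac=0.1, theta=1.0):
    """Pr(|3/(theta*T) - 1| < frac | theta), where T is gamma with parameters 3 and theta."""
    lower = 3.0 / ((1.0 + frac) * theta)   # smallest T with 3/(theta*T) < 1 + frac
    upper = 3.0 / ((1.0 - frac) * theta)   # largest T with 3/(theta*T) > 1 - frac
    return gamma3_cdf(upper, theta) - gamma3_cdf(lower, theta)


if __name__ == "__main__":
    for theta in (0.2, 1.0, 5.0):
        print(theta, round(prob_within_fraction(0.1, theta), 3))
```

Changing `theta` rescales the endpoints and the rate by reciprocal factors, so the product stays fixed and the probability is unchanged. That is the sense in which the ratio of the M.L.E. to the parameter has a distribution free of the parameter.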


The random variable $\hat{\theta}/\theta$ in Example 8.1.5 is an example of a pivotal quantity,
which will be defined and used extensively in Sec. 8.5.
Figure 8.2 Plot of $\Pr(|T - \theta| < 0.1 \mid \theta)$ in Example 8.1.6.

Example 8.1.6 A Clinical Trial. In Example 8.1.1, we found the sampling distribution of $T$, the
proportion of patients without relapse in the imipramine group. Using that distribution,
we can draw a plot similar to that in Fig. 8.1. That is, for each $\theta$, we can compute
$\Pr(|T - \theta| < 0.1 \mid \theta)$. The plot appears in Fig. 8.2. The jumps and cyclic nature of the
plot are due to the discreteness of the distribution of $T$. The smallest probability is
0.7318 at $\theta = 0.5$. (The isolated points that appear below the main part of the graph
at $\theta$ equal to each multiple of 1/40 would appear equally far above the main part of
the graph, if we had plotted $\Pr(|T - \theta| \le 0.1 \mid \theta)$ instead of $\Pr(|T - \theta| < 0.1 \mid \theta)$.) ◄
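
The probabilities plotted in Fig. 8.2 can be computed directly from the binomial p.f. of Example 8.1.1. The following Python sketch is an illustration, and the function name and arguments are assumptions. It uses exact rational arithmetic for the comparison of |T - θ| with 0.1, because boundary cases such as T = 16/40 at θ = 1/2 can be misclassified in floating point.

```python
from fractions import Fraction
from math import comb


def prob_close(theta, n=40, eps=Fraction(1, 10), strict=True):
    """Pr(|T - theta| < eps | theta), or with <= if strict is False, where n*T is binomial.

    Pass theta as a Fraction (or an exactly representable float) so that
    boundary comparisons are exact.
    """
    theta = Fraction(theta)
    p = float(theta)
    total = 0.0
    for x in range(n + 1):
        dist = abs(Fraction(x, n) - theta)          # exact |t - theta| for t = x/n
        inside = dist < eps if strict else dist <= eps
        if inside:
            total += comb(n, x) * p**x * (1.0 - p) ** (n - x)
    return total


if __name__ == "__main__":
    half = Fraction(1, 2)
    print(round(prob_close(half), 4))                # strict inequality
    print(round(prob_close(half, strict=False), 4))  # non-strict inequality
```

At θ = 1/2 the two versions differ by the probability of the two boundary values T = 16/40 and T = 24/40. This is the jump between the isolated points and the main part of the graph described in the example.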


Summary 


The sampling distribution of an estimator $\hat{\theta}$ is the conditional distribution of the
estimator given the parameter. The sampling distribution can be used as an intermediate
calculation in assessing the properties of a Bayes estimator prior to observing data.
More commonly, the sampling distribution is used by those statisticians who prefer
not to use prior and posterior distributions. For example, before the sample has been
taken, the statistician can use the sampling distribution of $\hat{\theta}$ to calculate the probability
that $\hat{\theta}$ will be close to $\theta$. If this probability is high for every possible value of
$\theta$, then the statistician can feel confident that the observed value of $\hat{\theta}$ will be close
to $\theta$. After the data are observed and a particular estimate is obtained, the statistician
would like to continue feeling confident that the particular estimate is likely to
be close to $\theta$, even though explicit posterior probabilities cannot be given. It is not
always safe to draw such a conclusion, however, as we shall illustrate at the end of
Example 8.5.11.


1. Suppose that a random sample X;,..., X,, is to be 2. Suppose that a random sample is to be taken from the 
taken from the uniform distribution on the interval [0, 6] normal distribution with unknown mean @ and standard 
and that @ is unknown. How large a random sample must deviation 2. How large a random sample must be taken 
be taken in order that 


Pr(| max{X,,... 


for all possible 6? 


in order that E,(|X,, — 9|*) < 0.1 for every possible value 
of 6? 


, X,} — | < 0.16) > 0.95, 3. For the conditions of Exercise 2, how large a random 


sample must be taken in order that E,(|X,, — |) < 0.1 for 
every possible value of 6? 


4. For the conditions of Exercise 2, how large a random 
sample must be taken in order that Pr(|X,, — 6| < 0.1) = 
0.95 for every possible value of 6? 


5. Suppose that a random sample is to be taken from the 
Bernoulli distribution with unknown parameter $p$. Suppose 
also that it is believed that the value of $p$ is in the 
neighborhood of 0.2. How large a random sample must 
be taken in order that $\Pr(|\overline{X}_n - p| \le 0.1) \ge 0.75$ when 
$p = 0.2$? 


6. For the conditions of Exercise 5, use the central limit 
theorem in Sec. 6.3 to find approximately the size of a 
random sample that must be taken in order that 
$\Pr(|\overline{X}_n - p| \le 0.1) \ge 0.95$ when $p = 0.2$. 


7. For the conditions of Exercise 5, how large a random 
sample must be taken in order that $E_p(|\overline{X}_n - p|^2) \le 0.01$ 
when $p = 0.2$? 


8. For the conditions of Exercise 5, how large a random 
sample must be taken in order that $E_p(|\overline{X}_n - p|^2) \le 0.01$ 
for every possible value of $p$ ($0 \le p \le 1$)? 


9. Let $X_1, \ldots, X_n$ be a random sample from the 
exponential distribution with parameter $\theta$. Find the c.d.f. for 
the sampling distribution of the M.L.E. of $\theta$. (The M.L.E. 
itself was found in Exercise 7 in Sec. 7.5.) 


8.2 The Chi-Square Distributions 


The family of chi-square ($\chi^2$) distributions is a subcollection of the family of 
gamma distributions. These special gamma distributions arise as sampling 
distributions of variance estimators based on random samples from a normal 
distribution. 


Definition of the Distributions 


Example 8.2.1 


M.L.E. of the Variance of a Normal Distribution. Suppose that $X_1, \ldots, X_n$ form a 
random sample from the normal distribution with known mean $\mu$ and unknown 
variance $\sigma^2$. The M.L.E. of $\sigma^2$ is found in Exercise 6 in Sec. 7.5. It is 

$$\hat{\sigma}_0^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2.$$

The distributions of $\hat{\sigma}_0^2$ and $\hat{\sigma}_0^2/\sigma^2$ are useful in several statistical problems, and we 
shall derive them in this section. 


In this section, we shall introduce and discuss a particular class of gamma 
distributions known as the chi-square ($\chi^2$) distributions. These distributions, which are 
closely related to random samples from a normal distribution, are widely applied in 
the field of statistics. In the remainder of this book, we shall see how they are applied 
in many important problems of statistical inference. In this section, we shall present 
the definition of the $\chi^2$ distributions and some of their basic mathematical properties. 


Definition 8.2.1 


$\chi^2$ Distributions. For each positive number $m$, the gamma distribution with 
parameters $\alpha = m/2$ and $\beta = 1/2$ is called the $\chi^2$ distribution with $m$ degrees of freedom. (See 
Definition 5.7.2 for the definition of the gamma distribution with parameters $\alpha$ and 
$\beta$.) 


It is common to restrict the degrees of freedom $m$ in Definition 8.2.1 to be an integer. 
However, there are situations in which it will be useful for the degrees of freedom to 
not be integers, so we will not make that restriction. 
If a random variable $X$ has the $\chi^2$ distribution with $m$ degrees of freedom, it 
follows from Eq. (5.7.13) that the p.d.f. of $X$ for $x > 0$ is 

$$f(x) = \frac{1}{2^{m/2}\,\Gamma(m/2)}\, x^{(m/2)-1} e^{-x/2}. \tag{8.2.1}$$

Also, $f(x) = 0$ for $x \le 0$. 

A short table of $p$ quantiles for the $\chi^2$ distribution for various values of $p$ and 
various degrees of freedom is given at the end of this book. Most statistical software 
packages include functions to compute the c.d.f. and the quantile function of an 
arbitrary $\chi^2$ distribution. 

It follows from Definition 8.2.1, and it can be seen from Eq. (8.2.1), that the 
$\chi^2$ distribution with two degrees of freedom is the exponential distribution with 
parameter 1/2 or, equivalently, the exponential distribution for which the mean is 
2. Thus, the following three distributions are all the same: the gamma distribution 
with parameters $\alpha = 1$ and $\beta = 1/2$, the $\chi^2$ distribution with two degrees of freedom, 
and the exponential distribution for which the mean is 2. 


Properties of the Distributions 


The means and variances of $\chi^2$ distributions follow immediately from Theorem 5.7.5, 
and are given here without proof. 


Theorem 8.2.1 


Mean and Variance. If a random variable $X$ has the $\chi^2$ distribution with $m$ degrees of 
freedom, then $E(X) = m$ and $\mathrm{Var}(X) = 2m$. 


Furthermore, it follows from the moment generating function given in Eq. 
(5.7.15) that the m.g.f. of $X$ is 

$$\psi(t) = \left(\frac{1}{1 - 2t}\right)^{m/2} \quad \text{for } t < \frac{1}{2}.$$

The additivity property of the $\chi^2$ distribution, which is presented without proof 
in the next theorem, follows directly from Theorem 5.7.7. 


Theorem 8.2.2 


If the random variables $X_1, \ldots, X_k$ are independent and if $X_i$ has the $\chi^2$ distribution 
with $m_i$ degrees of freedom ($i = 1, \ldots, k$), then the sum $X_1 + \cdots + X_k$ has the $\chi^2$ 
distribution with $m_1 + \cdots + m_k$ degrees of freedom. 


We shall now establish the basic relation between the $\chi^2$ distributions and the 
standard normal distribution. 


Theorem 8.2.3 


Let $X$ have the standard normal distribution. Then the random variable $Y = X^2$ has 
the $\chi^2$ distribution with one degree of freedom. 


Proof Let $f(y)$ and $F(y)$ denote, respectively, the p.d.f. and the c.d.f. of $Y$. Also, 
since $X$ has the standard normal distribution, we shall let $\phi(x)$ and $\Phi(x)$ denote the 
p.d.f. and the c.d.f. of $X$. Then for $y > 0$, 

$$
\begin{aligned}
F(y) = \Pr(Y \le y) = \Pr(X^2 \le y) &= \Pr(-y^{1/2} \le X \le y^{1/2}) \\
&= \Phi(y^{1/2}) - \Phi(-y^{1/2}).
\end{aligned}
$$

Since $f(y) = F'(y)$ and $\phi(x) = \Phi'(x)$, it follows from the chain rule for derivatives 
that 

$$f(y) = \phi(y^{1/2}) \left(\frac{1}{2} y^{-1/2}\right) + \phi(-y^{1/2}) \left(\frac{1}{2} y^{-1/2}\right).$$

Furthermore, since $\phi(y^{1/2}) = \phi(-y^{1/2}) = (2\pi)^{-1/2} e^{-y/2}$, it now follows that 

$$f(y) = \frac{1}{(2\pi)^{1/2}}\, y^{-1/2} e^{-y/2} \quad \text{for } y > 0.$$

By comparing this equation with Eq. (8.2.1), it is seen that the p.d.f. of $Y$ is indeed 
the p.d.f. of the $\chi^2$ distribution with one degree of freedom. 


We can now combine Theorem 8.2.3 with Theorem 8.2.2 to obtain the following 
result, which provides the main reason that the $\chi^2$ distribution is important in 
statistics. 


Corollary 8.2.1 


If the random variables $X_1, \ldots, X_m$ are i.i.d. with the standard normal distribution, 
then the sum of squares $X_1^2 + \cdots + X_m^2$ has the $\chi^2$ distribution with $m$ degrees of 
freedom. 


Example 8.2.2 


M.L.E. of the Variance of a Normal Distribution. In Example 8.2.1, the random variables 
$Z_i = (X_i - \mu)/\sigma$ for $i = 1, \ldots, n$ form a random sample from the standard normal 
distribution. It follows from Corollary 8.2.1 that the distribution of $\sum_{i=1}^{n} Z_i^2$ is the 
$\chi^2$ distribution with $n$ degrees of freedom. It is easy to see that $\sum_{i=1}^{n} Z_i^2$ is precisely 
the same as $n\hat{\sigma}_0^2/\sigma^2$, which appears in Example 8.2.1. So the distribution of $n\hat{\sigma}_0^2/\sigma^2$ 
is the $\chi^2$ distribution with $n$ degrees of freedom. The reader should also be able to 
see that the distribution of $\hat{\sigma}_0^2$ itself is the gamma distribution with parameters $n/2$ 
and $n/(2\sigma^2)$ (Exercise 13). 


Example 8.2.3 


Acid Concentration in Cheese. Moore and McCabe (1999, p. D-1) describe an 
experiment conducted in Australia to study the relationship between taste and the chemical 
composition of cheese. One chemical whose concentration can affect taste is lactic 
acid. Cheese manufacturers who want to establish a loyal customer base would like 
the taste to be about the same each time a customer purchases the cheese. The 
variation in concentrations of chemicals like lactic acid can lead to variation in the taste 
of cheese. Suppose that we model the concentration of lactic acid in several chunks 
of cheese as independent normal random variables with mean $\mu$ and variance $\sigma^2$. 
We are interested in how much these concentrations differ from the value $\mu$. Let 
$X_1, \ldots, X_k$ be the concentrations in $k$ chunks, and let $Z_i = (X_i - \mu)/\sigma$. Then 

$$Y = \frac{1}{k} \sum_{i=1}^{k} (X_i - \mu)^2 = \frac{\sigma^2}{k} \sum_{i=1}^{k} Z_i^2$$

is one measure of how much the $k$ concentrations differ from $\mu$. Suppose that a 
difference of $u$ or more in lactic acid concentration is enough to cause a noticeable 
difference in taste. We might then wish to calculate $\Pr(Y \le u^2)$. According to 
Corollary 8.2.1, the distribution of $W = kY/\sigma^2$ is $\chi^2$ with $k$ degrees of freedom. Hence, 
$\Pr(Y \le u^2) = \Pr(W \le ku^2/\sigma^2)$. 

For example, suppose that $\sigma^2 = 0.09$, and we are interested in $k = 10$ cheese 
chunks. Furthermore, suppose that $u = 0.3$ is the critical difference of interest. We 
can write 

$$\Pr(Y \le 0.3^2) = \Pr\left(W \le \frac{10 \times 0.09}{0.09}\right) = \Pr(W \le 10). \tag{8.2.2}$$

Using the table of quantiles of the $\chi^2$ distribution with 10 degrees of freedom, we 
see that 10 is between the 0.5 and 0.6 quantiles. In fact, the probability in Eq. (8.2.2) 
can be found by computer software to equal 0.56, so there is a 44 percent chance 
that the average squared difference between lactic acid concentration and mean 
concentration in 10 chunks will be more than the desired amount. If this probability is 
too large, the manufacturer might wish to invest some effort in reducing the variance 
of lactic acid concentration. 
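
The software calculation mentioned in Example 8.2.3 can be sketched as follows. This is one possible approach, not the method of any particular package: the helper `chi2_cdf` (a hypothetical name) sums the standard power series for the regularized lower incomplete gamma function, which is an assumption of this sketch and is adequate for moderate arguments such as the ones used here.

```python
import math

def chi2_cdf(x, m):
    """Approximate c.d.f. at x of the chi-square distribution with m degrees of freedom.

    With a = m/2 and y = x/2, sums y^a e^{-y} * sum_{n>=0} y^n / Gamma(a + n + 1).
    """
    if x <= 0:
        return 0.0
    a, y = m / 2.0, x / 2.0
    # First term y^a e^{-y} / Gamma(a + 1), computed on the log scale.
    term = math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
    total = 0.0
    for n in range(1, 2000):
        total += term
        term *= y / (a + n)      # ratio of consecutive series terms
        if term < 1e-16 * total:
            break
    return min(total, 1.0)

def prob_y_at_most(u, sigma2, k):
    """Pr(Y <= u^2) = Pr(W <= k u^2 / sigma^2), where W is chi-square with k d.f."""
    return chi2_cdf(k * u * u / sigma2, k)

p = prob_y_at_most(u=0.3, sigma2=0.09, k=10)
print(round(p, 2), round(1.0 - p, 2))
```

With two degrees of freedom the same helper reduces to $1 - e^{-x/2}$, the c.d.f. of the exponential distribution with mean 2, which gives a quick consistency check.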


Summary 


The chi-square distribution with $m$ degrees of freedom is the same as the gamma 
distribution with parameters $m/2$ and $1/2$. It is the distribution of the sum of squares 
of a sample of $m$ independent standard normal random variables. The mean of the 
$\chi^2$ distribution with $m$ degrees of freedom is $m$, and the variance is $2m$. 
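
The facts in this summary can be illustrated by simulation. The sketch below draws sums of squares of $m$ independent standard normal variables and compares the sample mean and variance with $m$ and $2m$. The seed, the number of draws, and the function names are arbitrary choices made for illustration.

```python
import random

def simulate_sum_of_squares(m, draws, seed=0):
    """Simulate `draws` values of Z_1^2 + ... + Z_m^2 with Z_i standard normal."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m)) for _ in range(draws)]

def mean_and_variance(values):
    """Sample mean and (divisor-n) sample variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

m = 5
mean, var = mean_and_variance(simulate_sum_of_squares(m, draws=20000))
print(mean, var)  # expected to be near m = 5 and 2m = 10, up to simulation error
```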


Exercises 


1. Suppose that we will sample 20 chunks of cheese in 
Example 8.2.3. Let $T = \sum_{i=1}^{20} (X_i - \mu)^2/20$, where $X_i$ is the 
concentration of lactic acid in the $i$th chunk. Assume that 
$\sigma^2 = 0.09$. What number $c$ satisfies $\Pr(T \le c) = 0.9$? 


2. Find the mode of the $\chi^2$ distribution with $m$ degrees of 
freedom ($m = 1, 2, \ldots$). 


3. Sketch the p.d.f. of the $\chi^2$ distribution with $m$ degrees of 
freedom for each of the following values of $m$. Locate the 
mean, the median, and the mode on each sketch. (a) $m = 1$; 
(b) $m = 2$; (c) $m = 3$; (d) $m = 4$. 


4. Suppose that a point (X, Y) is to be chosen at random 
in the xy-plane, where X and Y are independent random 
variables and each has the standard normal distribution. 
If a circle is drawn in the xy-plane with its center at the 
origin, what is the radius of the smallest circle that can be 
chosen in order for there to be probability 0.99 that the 
point (X, Y) will lie inside the circle? 


5. Suppose that a point (X, Y, Z) is to be chosen at ran- 
dom in three-dimensional space, where X, Y, and Z are 
independent random variables and each has the standard 
normal distribution. What is the probability that the dis- 
tance from the origin to the point will be less than 1 unit? 


6. When the motion of a microscopic particle in a liquid 
or a gas is observed, it is seen that the motion is irregular 
because the particle collides frequently with other 
particles. The probability model for this motion, which is called 
Brownian motion, is as follows: A coordinate system is 
chosen in the liquid or gas. Suppose that the particle is 
at the origin of this coordinate system at time $t = 0$, and 
let $(X, Y, Z)$ denote the coordinates of the particle at any 
time $t > 0$. The random variables $X$, $Y$, and $Z$ are i.i.d., 
and each of them has the normal distribution with mean 
0 and variance $\sigma^2 t$. Find the probability that at time $t = 2$ 
the particle will lie within a sphere whose center is at the 
origin and whose radius is $4\sigma$. 


7. Suppose that the random variables $X_1, \ldots, X_n$ are 
independent, and each random variable $X_i$ has a continuous 
c.d.f. $F_i$. Also, let the random variable $Y$ be defined by the 
relation $Y = -2 \sum_{i=1}^{n} \log F_i(X_i)$. Show that $Y$ has the $\chi^2$ 
distribution with $2n$ degrees of freedom. 


8. Suppose that $X_1, \ldots, X_n$ form a random sample from 
the uniform distribution on the interval $[0, 1]$, and let 
$W$ denote the range of the sample, as defined in 
Example 3.9.7. Also, let $g_n(x)$ denote the p.d.f. of the random 
variable $2n(1 - W)$, and let $g(x)$ denote the p.d.f. of the 
$\chi^2$ distribution with four degrees of freedom. Show that 

$$\lim_{n \to \infty} g_n(x) = g(x) \quad \text{for } x > 0.$$


9. Suppose that $X_1, \ldots, X_n$ form a random sample from 
the normal distribution with mean $\mu$ and variance $\sigma^2$. Find 
the distribution of 

$$\frac{n(\overline{X}_n - \mu)^2}{\sigma^2}.$$


10. Suppose that six random variables $X_1, \ldots, X_6$ form 
a random sample from the standard normal distribution, 
and let 

Determine a value of $c$ such that the random variable $cY$ 
will have a $\chi^2$ distribution. 


11. If a random variable $X$ has the $\chi^2$ distribution with $m$ 
degrees of freedom, then the distribution of $X^{1/2}$ is called a 
chi ($\chi$) distribution with $m$ degrees of freedom. Determine 
the mean of this distribution. 


12. Consider again the situation described in Example 
8.2.3. How small would $\sigma^2$ need to be in order for 
$\Pr(Y \le 0.09) \ge 0.9$? 


13. Prove that the distribution of $\hat{\sigma}_0^2$ in Examples 8.2.1 
and 8.2.2 is the gamma distribution with parameters $n/2$ 
and $n/(2\sigma^2)$. 


8.3 Joint Distribution of the Sample Mean and Sample Variance 


Suppose that our data form a random sample from a normal distribution. The 
sample mean $\hat{\mu}$ and sample variance $\hat{\sigma}^2$ are important statistics that are computed 
in order to estimate the parameters of the normal distribution. Their marginal 
distributions help us to understand how good each of them is as an estimator of 
the corresponding parameter. However, the marginal distribution of $\hat{\mu}$ depends 
on $\sigma$. The joint distribution of $\hat{\mu}$ and $\hat{\sigma}^2$ will allow us to make inferences about 
$\mu$ without reference to $\sigma$. 


Independence of the Sample Mean and Sample Variance 


Example 8.3.1 


Rain from Seeded Clouds. Simpson, Olsen, and Eden (1975) describe an experiment 
in which a random sample of 26 clouds were seeded with silver nitrate to see if they 
produced more rain than unseeded clouds. Suppose that, on a log scale, unseeded 
clouds typically produced a mean rainfall of 4. In comparing the mean of the seeded 
clouds to the unseeded mean, one might naturally see how far the average log-rainfall 
of the seeded clouds $\hat{\mu}$ is from 4. But the variation in rainfall within the sample is also 
important. For example, if one compared two different samples of seeded clouds, 
one would expect the average rainfalls in the two samples to be different just due 
to variation between clouds. In order to be confident that seeding the clouds really 
produced more rain, we would want the average log-rainfall to exceed 4 by a large 
amount compared to the variation between samples, which is closely related to the 
variation within samples. Since we do not know the variance for seeded clouds, we 
compute the sample variance $\hat{\sigma}^2$. Comparing $\hat{\mu} - 4$ to $\hat{\sigma}^2$ requires us to consider the 
joint distribution of the sample mean and the sample variance. 


Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution 
with unknown mean $\mu$ and unknown variance $\sigma^2$. Then, as was shown in 
Example 7.5.6, the M.L.E.'s of $\mu$ and $\sigma^2$ are the sample mean $\overline{X}_n$ and the sample variance 
$(1/n) \sum_{i=1}^{n} (X_i - \overline{X}_n)^2$. In this section, we shall derive the joint distribution of these 
two estimators. 

We already know from Corollary 5.6.2 that the sample mean itself has the normal 
distribution with mean $\mu$ and variance $\sigma^2/n$. We shall establish the noteworthy 
property that the sample mean and the sample variance are independent random 
variables, even though both are functions of the same random variables $X_1, \ldots, X_n$. 
Furthermore, we shall show that, except for a scale factor, the sample variance has 
the $\chi^2$ distribution with $n - 1$ degrees of freedom. More precisely, we shall show that 
the random variable $\sum_{i=1}^{n} (X_i - \overline{X}_n)^2/\sigma^2$ has the $\chi^2$ distribution with $n - 1$ degrees 
of freedom. This result is also a rather striking property of random samples from a 
normal distribution, as the following discussion indicates. 

Because the random variables $X_1, \ldots, X_n$ are independent, and because each 
has the normal distribution with mean $\mu$ and variance $\sigma^2$, the random variables 
$(X_1 - \mu)/\sigma, \ldots, (X_n - \mu)/\sigma$ are also independent, and each of these variables has 
the standard normal distribution. It follows from Corollary 8.2.1 that the sum of their 
squares $\sum_{i=1}^{n} (X_i - \mu)^2/\sigma^2$ has the $\chi^2$ distribution with $n$ degrees of freedom. Hence, 
the striking property mentioned in the previous paragraph is that if the population 
mean $\mu$ is replaced by the sample mean $\overline{X}_n$ in this sum of squares, the effect is simply 
to reduce the degrees of freedom in the $\chi^2$ distribution from $n$ to $n - 1$. In summary, 
we shall establish the following theorem. 


Theorem 8.3.1 


Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution 
with mean $\mu$ and variance $\sigma^2$. Then the sample mean $\overline{X}_n$ and the sample variance 
$(1/n) \sum_{i=1}^{n} (X_i - \overline{X}_n)^2$ are independent random variables, $\overline{X}_n$ has the normal 
distribution with mean $\mu$ and variance $\sigma^2/n$, and $\sum_{i=1}^{n} (X_i - \overline{X}_n)^2/\sigma^2$ has the $\chi^2$ distribution 
with $n - 1$ degrees of freedom. 


Furthermore, it can be shown that the sample mean and the sample variance are 
independent only when the random sample is drawn from a normal distribution. We 
shall not consider this result further in this book. However, it does emphasize the 
fact that the independence of the sample mean and the sample variance is indeed a 
noteworthy property of samples from a normal distribution. 

The proof of Theorem 8.3.1 makes use of transformations of several variables as 
described in Sec. 3.9 and of the properties of orthogonal matrices. The proof appears 
at the end of this section. 


Example 8.3.2 


Rain from Seeded Clouds. Figure 8.3 is a histogram of the logarithms of the rainfalls 
from the seeded clouds in Example 8.3.1. Suppose that these logarithms $X_1, \ldots, X_{26}$ 
are modeled as i.i.d. normal random variables with mean $\mu$ and variance $\sigma^2$. If we 
are interested in how much variation there is in rainfall among the seeded clouds, 
we can compute the sample variance $\hat{\sigma}^2 = \sum_{i=1}^{26} (X_i - \overline{X}_n)^2/26$. The distribution of 
$U = 26\hat{\sigma}^2/\sigma^2$ is the $\chi^2$ distribution with 25 degrees of freedom. We can use this 
distribution to tell us how likely it is that $\hat{\sigma}^2$ will overestimate or underestimate $\sigma^2$ 
by various amounts. For example, the $\chi^2$ table in this book says that the 0.25 quantile 
of the $\chi^2$ distribution with 25 degrees of freedom is 19.94, so $\Pr(U \le 19.94) = 0.25$. 
It follows that 

$$0.25 = \Pr\left(\frac{\hat{\sigma}^2}{\sigma^2} \le \frac{19.94}{26}\right) = \Pr(\hat{\sigma}^2 \le 0.77\sigma^2). \tag{8.3.1}$$

That is, there is probability 0.25 that $\hat{\sigma}^2$ will underestimate $\sigma^2$ by 23 percent or more. 
The observed value of $\hat{\sigma}^2$ is 2.460 in this example. The probability calculated in 
Eq. (8.3.1) has nothing to do with how far 2.460 is from $\sigma^2$. Eq. (8.3.1) tells us the 
probability (prior to observing the data) that $\hat{\sigma}^2$ would be at least 23% below $\sigma^2$. 


Figure 8.3 Histogram of log-rainfalls from seeded clouds. (Horizontal axis: log(rainfall).) 
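
The table lookup in Example 8.3.2 can be cross-checked numerically. The sketch below evaluates the c.d.f. of the $\chi^2$ distribution with 25 degrees of freedom at 19.94 using a series for the incomplete gamma function. The helper `chi2_cdf` is a hypothetical name, and the series method is an assumption of this sketch rather than anything prescribed by the text.

```python
import math

def chi2_cdf(x, m):
    """Approximate c.d.f. at x of the chi-square distribution with m degrees of freedom."""
    if x <= 0:
        return 0.0
    a, y = m / 2.0, x / 2.0
    # Series y^a e^{-y} * sum_{n>=0} y^n / Gamma(a + n + 1), first term on the log scale.
    term = math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
    total = 0.0
    for n in range(1, 2000):
        total += term
        term *= y / (a + n)
        if term < 1e-16 * total:
            break
    return min(total, 1.0)

n = 26                      # number of seeded clouds
quantile = 19.94            # tabulated 0.25 quantile for n - 1 = 25 degrees of freedom
print(round(chi2_cdf(quantile, n - 1), 2))   # probability that U <= 19.94
print(round(quantile / n, 2))                # the factor multiplying sigma^2 in Eq. (8.3.1)
```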


Estimation of the Mean and Standard Deviation 


We shall assume that $X_1, \ldots, X_n$ form a random sample from the normal distribution 
with unknown mean $\mu$ and unknown standard deviation $\sigma$. Also, as usual, we shall 
denote the M.L.E.'s of $\mu$ and $\sigma$ by $\hat{\mu}$ and $\hat{\sigma}$. Thus, 

$$\hat{\mu} = \overline{X}_n \quad \text{and} \quad \hat{\sigma} = \left[\frac{1}{n} \sum_{i=1}^{n} (X_i - \overline{X}_n)^2\right]^{1/2}.$$

Notice that $\hat{\sigma}^2 = \widehat{\sigma^2}$, the M.L.E. of $\sigma^2$. For the remainder of this book, when referring 
to the M.L.E. of $\sigma^2$, we shall use whichever symbol $\hat{\sigma}^2$ or $\widehat{\sigma^2}$ is more convenient. As an 
illustration of the application of Theorem 8.3.1, we shall now determine the smallest 
possible sample size $n$ such that the following relation will be satisfied: 

$$\Pr\left(|\hat{\mu} - \mu| \le \frac{1}{5}\sigma \ \text{ and } \ |\hat{\sigma} - \sigma| \le \frac{1}{5}\sigma\right) \ge \frac{1}{2}. \tag{8.3.2}$$

In other words, we shall determine the minimum sample size $n$ for which the 
probability will be at least 1/2 that neither $\hat{\mu}$ nor $\hat{\sigma}$ will differ from the unknown value it is 
estimating by more than $(1/5)\sigma$. 

Because of the independence of $\hat{\mu}$ and $\hat{\sigma}$, the relation (8.3.2) can be rewritten as 
follows: 

$$\Pr\left(|\hat{\mu} - \mu| \le \frac{1}{5}\sigma\right) \Pr\left(|\hat{\sigma} - \sigma| \le \frac{1}{5}\sigma\right) \ge \frac{1}{2}. \tag{8.3.3}$$

If we let $p_1$ denote the first probability on the left side of the relation (8.3.3), and let 
$U$ be a random variable that has the standard normal distribution, this probability 
can be written in the following form: 

$$p_1 = \Pr\left(\frac{\sqrt{n}\,|\hat{\mu} - \mu|}{\sigma} \le \frac{1}{5}\sqrt{n}\right) = \Pr\left(|U| \le \frac{1}{5}\sqrt{n}\right).$$

Similarly, if we let $p_2$ denote the second probability on the left side of the relation 
(8.3.3), and let $V = n\hat{\sigma}^2/\sigma^2$, this probability can be written in the following form: 

$$
\begin{aligned}
p_2 = \Pr\left(0.8 < \frac{\hat{\sigma}}{\sigma} < 1.2\right) &= \Pr\left(0.64n < \frac{n\hat{\sigma}^2}{\sigma^2} < 1.44n\right) \\
&= \Pr(0.64n < V < 1.44n).
\end{aligned}
$$

By Theorem 8.3.1, the random variable $V$ has the $\chi^2$ distribution with $n - 1$ degrees 
of freedom. 

For each specific value of $n$, the values of $p_1$ and $p_2$ can be found, at least 
approximately, from the table of the standard normal distribution and the table of 
the $\chi^2$ distribution given at the end of this book. In particular, after various values 
of $n$ have been tried, it will be found that for $n = 21$ the values of $p_1$ and $p_2$ are 
$p_1 = 0.64$ and $p_2 = 0.78$. Hence, $p_1 p_2 = 0.50$, and it follows that the relation (8.3.2) 
will be satisfied for $n = 21$. 
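
The trial-and-error search described above can be automated. The following sketch computes $p_1$ from the error function and $p_2$ from a series approximation to the $\chi^2$ c.d.f., then returns the first $n$ for which $p_1 p_2 \ge 1/2$. The function names and the series method are assumptions made for illustration. Values obtained this way may differ slightly in the second decimal place from values read off printed tables.

```python
import math

def chi2_cdf(x, m):
    """Approximate c.d.f. at x of the chi-square distribution with m degrees of freedom."""
    if x <= 0:
        return 0.0
    a, y = m / 2.0, x / 2.0
    term = math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
    total = 0.0
    for n in range(1, 2000):
        total += term
        term *= y / (a + n)
        if term < 1e-16 * total:
            break
    return min(total, 1.0)

def p1(n):
    """Pr(|U| <= sqrt(n)/5) for a standard normal U, via the error function."""
    return math.erf(math.sqrt(n) / (5.0 * math.sqrt(2.0)))

def p2(n):
    """Pr(0.64 n < V < 1.44 n), where V is chi-square with n - 1 degrees of freedom."""
    return chi2_cdf(1.44 * n, n - 1) - chi2_cdf(0.64 * n, n - 1)

def smallest_n(target=0.5, n_max=200):
    """First sample size n >= 2 with p1(n) * p2(n) >= target, or None if not found."""
    for n in range(2, n_max + 1):
        if p1(n) * p2(n) >= target:
            return n
    return None

n = smallest_n()
print(n, round(p1(n), 2), round(p2(n), 2))
```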


Proof of Theorem 8.3.1 


We already knew from Corollary 5.6.2 that the distribution of the sample mean was 
as stated in Theorem 8.3.1. What remains to prove is the stated distribution of the 
sample variance and the independence of the sample mean and sample variance. 


Orthogonal Matrices 


We begin with some properties of orthogonal matrices that are essential for the proof. 


Definition 8.3.1 


Orthogonal Matrix. It is said that an $n \times n$ matrix $\boldsymbol{A}$ is orthogonal if $\boldsymbol{A}^{-1} = \boldsymbol{A}'$, where 
$\boldsymbol{A}'$ is the transpose of $\boldsymbol{A}$. 


In other words, a matrix $\boldsymbol{A}$ is orthogonal if and only if $\boldsymbol{A}\boldsymbol{A}' = \boldsymbol{A}'\boldsymbol{A} = \boldsymbol{I}$, where $\boldsymbol{I}$ is the 
$n \times n$ identity matrix. It follows from this latter property that a matrix is orthogonal 
if and only if the sum of the squares of the elements in each row is 1 and the sum 
of the products of the corresponding elements in every pair of different rows is 0. 
Alternatively, a matrix is orthogonal if and only if the sum of the squares of the 
elements in each column is 1 and the sum of the products of the corresponding 
elements in every pair of different columns is 0. 


Properties of Orthogonal Matrices We shall now derive two important properties 
of orthogonal matrices. 


Theorem 8.3.2 


Determinant is 1. If $\boldsymbol{A}$ is orthogonal, then $|\det \boldsymbol{A}| = 1$. 


Proof To prove this result, it should be recalled that $\det \boldsymbol{A} = \det \boldsymbol{A}'$ for every square 
matrix $\boldsymbol{A}$. Also recall that $\det \boldsymbol{A}\boldsymbol{B} = (\det \boldsymbol{A})(\det \boldsymbol{B})$ for square matrices $\boldsymbol{A}$ and $\boldsymbol{B}$. 
Therefore, 

$$\det(\boldsymbol{A}\boldsymbol{A}') = (\det \boldsymbol{A})(\det \boldsymbol{A}') = (\det \boldsymbol{A})^2.$$

Also, if $\boldsymbol{A}$ is orthogonal, then $\boldsymbol{A}\boldsymbol{A}' = \boldsymbol{I}$, and it follows that 

$$\det(\boldsymbol{A}\boldsymbol{A}') = \det \boldsymbol{I} = 1.$$

Hence $(\det \boldsymbol{A})^2 = 1$ or, equivalently, $|\det \boldsymbol{A}| = 1$. 


Theorem 8.3.3 


Squared Length Is Preserved. Consider two $n$-dimensional random vectors 

$$\boldsymbol{X} = \begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix} \quad \text{and} \quad \boldsymbol{Y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}, \tag{8.3.4}$$

and suppose that $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X}$, where $\boldsymbol{A}$ is an orthogonal matrix. Then 

$$\sum_{i=1}^{n} Y_i^2 = \sum_{i=1}^{n} X_i^2. \tag{8.3.5}$$


Proof This result follows from the fact that $\boldsymbol{A}'\boldsymbol{A} = \boldsymbol{I}$, because 

$$\sum_{i=1}^{n} Y_i^2 = \boldsymbol{Y}'\boldsymbol{Y} = \boldsymbol{X}'\boldsymbol{A}'\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}'\boldsymbol{X} = \sum_{i=1}^{n} X_i^2.$$

Multiplication of a vector $\boldsymbol{X}$ by an orthogonal matrix $\boldsymbol{A}$ corresponds to a rotation 
of $\boldsymbol{X}$ in $n$-dimensional space possibly followed by changing the signs of some 
coordinates. Neither of these operations can change the length of the original vector $\boldsymbol{X}$, 
and that length equals $\left(\sum_{i=1}^{n} X_i^2\right)^{1/2}$. 

Together, these two properties of orthogonal matrices imply that if a random 
vector $\boldsymbol{Y}$ is obtained from a random vector $\boldsymbol{X}$ by an orthogonal linear transformation 
$\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X}$, then the absolute value of the Jacobian of the transformation is 1 and 
$\sum_{i=1}^{n} Y_i^2 = \sum_{i=1}^{n} X_i^2$. 

We combine Theorems 8.3.2 and 8.3.3 to obtain a useful fact about orthogonal 
transformations of a random sample of standard normal random variables. 


Theorem 8.3.4 


Suppose that the random variables $X_1, \ldots, X_n$ are i.i.d. and each has the standard 
normal distribution. Suppose also that $\boldsymbol{A}$ is an orthogonal $n \times n$ matrix, and $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X}$. 
Then the random variables $Y_1, \ldots, Y_n$ are also i.i.d., each also has the standard normal 
distribution, and $\sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} Y_i^2$. 


Proof The joint p.d.f. of $X_1, \ldots, X_n$ is as follows, for $-\infty < x_i < \infty$ ($i = 1, \ldots, n$): 

$$f_n(\boldsymbol{x}) = \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{1}{2} \sum_{i=1}^{n} x_i^2\right). \tag{8.3.6}$$

Suppose that $\boldsymbol{A}$ is an orthogonal $n \times n$ matrix, and that the random variables $Y_1, \ldots, Y_n$ are defined by 
the relation $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X}$, where the vectors $\boldsymbol{X}$ and $\boldsymbol{Y}$ are as specified in Eq. (8.3.4). This is 
a linear transformation, so the joint p.d.f. of $Y_1, \ldots, Y_n$ is obtained from Eq. (3.9.20) 
and equals 

$$g_n(\boldsymbol{y}) = \frac{1}{|\det \boldsymbol{A}|} f_n(\boldsymbol{A}^{-1}\boldsymbol{y}).$$

Let $\boldsymbol{x} = \boldsymbol{A}^{-1}\boldsymbol{y}$. Since $\boldsymbol{A}$ is orthogonal, $|\det \boldsymbol{A}| = 1$ and $\sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} x_i^2$, as we just 
proved. So, 

$$g_n(\boldsymbol{y}) = \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{1}{2} \sum_{i=1}^{n} y_i^2\right). \tag{8.3.7}$$

It can be seen from Eq. (8.3.7) that the joint p.d.f. of $Y_1, \ldots, Y_n$ is exactly the 
same as the joint p.d.f. of $X_1, \ldots, X_n$. 


Proof of Theorem 8.3.1 


Random Samples from the Standard Normal Distribution We shall begin by 
proving Theorem 8.3.1 under the assumption that $X_1, \ldots, X_n$ form a random sample 
from the standard normal distribution. Consider the $n$-dimensional row vector $\boldsymbol{u}$, in 
which each of the $n$ components has the value $1/\sqrt{n}$: 

$$\boldsymbol{u} = \left[\frac{1}{\sqrt{n}} \ \cdots \ \frac{1}{\sqrt{n}}\right]. \tag{8.3.8}$$

Since the sum of the squares of the $n$ components of the vector $\boldsymbol{u}$ is 1, it is possible 
to construct an orthogonal matrix $\boldsymbol{A}$ such that the components of the vector $\boldsymbol{u}$ form 
the first row of $\boldsymbol{A}$. This construction, called the Gram-Schmidt method, is described 
in textbooks on linear algebra such as Cullen (1972) and will not be discussed here. 
We shall assume that such a matrix $\boldsymbol{A}$ has been constructed, and we shall again define 
the random variables $Y_1, \ldots, Y_n$ by the transformation $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X}$. 

Since the components of $\boldsymbol{u}$ form the first row of $\boldsymbol{A}$, it follows that 

$$Y_1 = \boldsymbol{u}\boldsymbol{X} = \sum_{i=1}^{n} \frac{1}{\sqrt{n}} X_i = \sqrt{n}\,\overline{X}_n. \tag{8.3.9}$$

Furthermore, by Theorem 8.3.4, $\sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} Y_i^2$. Therefore, 

$$\sum_{i=2}^{n} Y_i^2 = \sum_{i=1}^{n} Y_i^2 - Y_1^2 = \sum_{i=1}^{n} X_i^2 - n\overline{X}_n^2 = \sum_{i=1}^{n} (X_i - \overline{X}_n)^2.$$

We have thus obtained the relation 

$$\sum_{i=2}^{n} Y_i^2 = \sum_{i=1}^{n} (X_i - \overline{X}_n)^2. \tag{8.3.10}$$

It is known from Theorem 8.3.4 that the random variables $Y_1, \ldots, Y_n$ are 
independent. Therefore, the two random variables $Y_1$ and $\sum_{i=2}^{n} Y_i^2$ are independent, 
and it follows from Eqs. (8.3.9) and (8.3.10) that $\overline{X}_n$ and $\sum_{i=1}^{n} (X_i - \overline{X}_n)^2$ are 
independent. Furthermore, it is known from Theorem 8.3.4 that the $n - 1$ random 
variables $Y_2, \ldots, Y_n$ are i.i.d., and that each of these random variables has the 
standard normal distribution. Hence, by Corollary 8.2.1 the random variable $\sum_{i=2}^{n} Y_i^2$ 
has the $\chi^2$ distribution with $n - 1$ degrees of freedom. It follows from Eq. (8.3.10) 
that $\sum_{i=1}^{n} (X_i - \overline{X}_n)^2$ also has the $\chi^2$ distribution with $n - 1$ degrees of freedom. 
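
The relations (8.3.9) and (8.3.10) can be checked numerically for a concrete orthogonal matrix. The text builds $\boldsymbol{A}$ by the Gram-Schmidt method. The sketch below instead uses the Helmert matrix, which is one explicit orthogonal matrix whose first row is $\boldsymbol{u}$. That substitution, the sample size, and the function names are choices made here for illustration.

```python
import math
import random

def helmert_matrix(n):
    """An orthogonal n x n matrix whose first row is (1/sqrt(n), ..., 1/sqrt(n))."""
    rows = [[1.0 / math.sqrt(n)] * n]
    for k in range(2, n + 1):
        scale = 1.0 / math.sqrt(k * (k - 1))
        # Row k: k-1 equal entries, then -(k-1)*scale, then zeros.
        rows.append([scale] * (k - 1) + [-(k - 1) * scale] + [0.0] * (n - k))
    return rows

def mat_vec(a, x):
    """Matrix-vector product A x for A given as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def max_orthogonality_error(a):
    """Largest absolute deviation of A A' from the identity matrix."""
    n = len(a)
    return max(
        abs(sum(a[i][k] * a[j][k] for k in range(n)) - (1.0 if i == j else 0.0))
        for i in range(n) for j in range(n)
    )

random.seed(0)
n = 6
x = [random.gauss(0.0, 1.0) for _ in range(n)]
a = helmert_matrix(n)
y = mat_vec(a, x)
x_bar = sum(x) / n

lhs_839 = y[0]                                 # Y_1
rhs_839 = math.sqrt(n) * x_bar                 # sqrt(n) times the sample mean
lhs_8310 = sum(y_i ** 2 for y_i in y[1:])      # sum of Y_i^2 for i >= 2
rhs_8310 = sum((x_i - x_bar) ** 2 for x_i in x)
print(max_orthogonality_error(a), lhs_839 - rhs_839, lhs_8310 - rhs_8310)
```

All three printed numbers should be zero up to floating-point rounding, whatever values of $X_1, \ldots, X_n$ are used.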


Random Samples from an Arbitrary Normal Distribution Thus far, in proving 
Theorem 8.3.1, we have considered only random samples from the standard normal 
distribution. Suppose now that the random variables $X_1, \ldots, X_n$ form a random 
sample from an arbitrary normal distribution with mean $\mu$ and variance $\sigma^2$. 

If we let $Z_i = (X_i - \mu)/\sigma$ for $i = 1, \ldots, n$, then the random variables $Z_1, \ldots, Z_n$ 
are independent, and each has the standard normal distribution. In other words, the 
joint distribution of $Z_1, \ldots, Z_n$ is the same as the joint distribution of a random 
sample from the standard normal distribution. It follows from the results that have 
just been obtained that $\overline{Z}_n$ and $\sum_{i=1}^{n} (Z_i - \overline{Z}_n)^2$ are independent, and $\sum_{i=1}^{n} (Z_i - \overline{Z}_n)^2$ 
has the $\chi^2$ distribution with $n - 1$ degrees of freedom. However, $\overline{Z}_n = (\overline{X}_n - \mu)/\sigma$ 
and 

$$\sum_{i=1}^{n} (Z_i - \overline{Z}_n)^2 = \frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \overline{X}_n)^2. \tag{8.3.11}$$

We now conclude that the sample mean $\overline{X}_n$ and the sample variance 
$(1/n) \sum_{i=1}^{n} (X_i - \overline{X}_n)^2$ are independent, and that the random variable on the right side of Eq. (8.3.11) 
has the $\chi^2$ distribution with $n - 1$ degrees of freedom. All the results stated in 
Theorem 8.3.1 have now been established. 
Summary 


Let $X_1, \ldots, X_n$ be a random sample from the normal distribution with mean $\mu$ 
and variance $\sigma^2$. Then the sample mean $\hat{\mu} = \overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ and sample variance 
$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \overline{X}_n)^2$ are independent random variables. Furthermore, $\hat{\mu}$ has the 
normal distribution with mean $\mu$ and variance $\sigma^2/n$, and $n\hat{\sigma}^2/\sigma^2$ has a chi-square 
distribution with $n - 1$ degrees of freedom. 
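
A small simulation gives a rough illustration of this summary. The sketch below repeatedly draws normal samples, records $\overline{X}_n$ and $n\hat{\sigma}^2/\sigma^2$, and reports their sample correlation together with the average of $n\hat{\sigma}^2/\sigma^2$. A correlation near zero is only consistent with independence and does not prove it. The parameter values, the seed, and the names are arbitrary assumptions.

```python
import random

def sample_mean_and_variance(xs):
    """Sample mean and the M.L.E. form of the sample variance (divisor n)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var

def correlation(us, vs):
    """Sample correlation coefficient of two equally long lists."""
    k = len(us)
    mu_u, mu_v = sum(us) / k, sum(vs) / k
    cov = sum((u - mu_u) * (v - mu_v) for u, v in zip(us, vs))
    var_u = sum((u - mu_u) ** 2 for u in us)
    var_v = sum((v - mu_v) ** 2 for v in vs)
    return cov / (var_u * var_v) ** 0.5

rng = random.Random(1)
mu, sigma, n, reps = 2.0, 3.0, 5, 4000
means, scaled_vars = [], []
for _ in range(reps):
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    m, v = sample_mean_and_variance(xs)
    means.append(m)
    scaled_vars.append(n * v / sigma ** 2)   # should behave like chi-square with n - 1 d.f.

corr = correlation(means, scaled_vars)
mean_v = sum(scaled_vars) / reps
print(corr, mean_v)  # expected: corr near 0 and mean_v near n - 1 = 4, up to simulation error
```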


Exercises 
1. Assume that $X_1, \ldots, X_n$ form a random sample from 
the normal distribution with mean $\mu$ and variance $\sigma^2$. 
Show that $\hat{\sigma}^2$ has the gamma distribution with parameters 
$(n - 1)/2$ and $n/(2\sigma^2)$. 


2. Determine whether or not each of the five following 
matrices is orthogonal: 


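a. $\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}$ 
b. $\begin{bmatrix} 0.8 & 0 & 0.6 \\ -0.6 & 0 & 0.8 \\ 0 & -1 & 0 \end{bmatrix}$ 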


1 1 1 
08 0 0.6 v3 v3 v3 
1 1 1 
«| -06 0 os |a AA OW 
005 O : ' fi 
V3 v3 OB 
JL 4 2 
2: 2 2 2 
fot fl 
2 2 2 2 
ol a oa oi 
2 2 2 2 
at 4 ‘tf A 
2 2 2 2, 


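3. a. Construct a $2 \times 2$ orthogonal matrix for which the 
first row is as follows: 

$$\left[\frac{1}{\sqrt{2}} \quad \frac{1}{\sqrt{2}}\right].$$

b. Construct a $3 \times 3$ orthogonal matrix for which the 
first row is as follows: 

$$\left[\frac{1}{\sqrt{3}} \quad \frac{1}{\sqrt{3}} \quad \frac{1}{\sqrt{3}}\right].$$


4. Suppose that the random variables $X_1$, $X_2$, and $X_3$ are 
i.i.d., and that each has the standard normal distribution. 
Also, suppose that 

$$
\begin{aligned}
Y_1 &= 0.8X_1 + 0.6X_2, \\
Y_2 &= \sqrt{2}\,(0.3X_1 - 0.4X_2 - 0.5X_3), \\
Y_3 &= \sqrt{2}\,(0.3X_1 - 0.4X_2 + 0.5X_3).
\end{aligned}
$$

Find the joint distribution of $Y_1$, $Y_2$, and $Y_3$. 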


5. Suppose that the random variables $X_1$ and $X_2$ are 
independent, and that each has the normal distribution with 
mean $\mu$ and variance $\sigma^2$. Prove that the random variables 
$X_1 + X_2$ and $X_1 - X_2$ are independent. 


6. Suppose that $X_1, \ldots, X_n$ form a random sample from 
the normal distribution with mean $\mu$ and variance $\sigma^2$. 
Assuming that the sample size $n$ is 16, determine the values 
of the following probabilities: 

a. Pr| 307 < By 


n 


(Xi — w)? < 20? 


b. Prf3o2 <4 


n 


"(k= 2, 2 207] 


7. Suppose that $X_1, \ldots, X_n$ form a random sample from 
the normal distribution with mean $\mu$ and variance $\sigma^2$, and 
let $\hat{\sigma}^2$ denote the sample variance. Determine the smallest 
values of $n$ for which the following relations are satisfied: 
2 


a. BS < 13) > 0.95 


b. Pr(Io? ~o\< $0?) > 0.8 


8. Suppose that $X$ has the $\chi^2$ distribution with 200 degrees 
of freedom. Explain why the central limit theorem can be 
used to determine the approximate value of $\Pr(160 < X < 
240)$ and find this approximate value. 


9. Suppose that each of two statisticians, A and B, 
independently takes a random sample of 20 observations from 
the normal distribution with unknown mean $\mu$ and known 
variance 4. Suppose also that statistician A finds the 
sample variance in his random sample to be 3.8, and 
statistician B finds the sample variance in her random sample 
to be 9.4. For which random sample is the sample mean 
likely to be closer to the unknown value of $\mu$? 
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Example 
8.4.1 


Definition 
8.4.1 


Theorem 
8.4.1 


8.4 The t Distributions


When our data are a sample from the normal distribution with mean $\mu$ and
variance $\sigma^2$, the distribution of $Z = n^{1/2}(\hat\mu - \mu)/\sigma$ is the standard normal distribution,
where $\hat\mu$ is the sample mean. If $\sigma^2$ is unknown, we can replace $\sigma$ by an estimator
(similar to the M.L.E.) in the formula for Z. The resulting random variable has
the t distribution with $n - 1$ degrees of freedom and is useful for making inferences
about $\mu$ alone even when both $\mu$ and $\sigma^2$ are unknown.


Definition of the Distributions 


Example 8.4.1 Rain from Seeded Clouds. Consider the same sample of log-rainfall measurements
from 26 seeded clouds from Example 8.3.2. Suppose now that we are interested in
how far the sample average $\bar{X}_n$ of those measurements is from the mean $\mu$. We know
that $n^{1/2}(\bar{X}_n - \mu)/\sigma$ has the standard normal distribution, but we do not know $\sigma$. If
we replace $\sigma$ by an estimator $\hat\sigma$ such as the M.L.E., or something similar, what is the
distribution of $n^{1/2}(\bar{X}_n - \mu)/\hat\sigma$, and how can we make use of this random variable to
make inferences about $\mu$? ◁


In this section, we shall introduce and discuss another family of distributions,
called the t distributions, which are closely related to random samples from a normal
distribution. The t distributions, like the $\chi^2$ distributions, have been widely applied in
important problems of statistical inference. The t distributions are also known as
Student's distributions (see Student, 1908), in honor of W. S. Gosset, who published his
studies of this distribution in 1908 under the pen name "Student." The distributions
are defined as follows.


Definition 8.4.1 t Distributions. Consider two independent random variables Y and Z, such that Y
has the $\chi^2$ distribution with m degrees of freedom and Z has the standard normal
distribution. Suppose that a random variable X is defined by the equation

    $X = \dfrac{Z}{\left( Y/m \right)^{1/2}}.$    (8.4.1)

Then the distribution of X is called the t distribution with m degrees of freedom.


The derivation of the p.d.f. of the t distribution with m degrees of freedom makes
use of the methods of Sec. 3.9 and will be given at the end of this section. But we state
the result here.


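Theorem 8.4.1 Probability Density Function. The p.d.f. of the t distribution with m degrees of freedom
is

    $\dfrac{\Gamma\left( \frac{m+1}{2} \right)}{(m\pi)^{1/2}\, \Gamma\left( \frac{m}{2} \right)} \left( 1 + \dfrac{x^2}{m} \right)^{-(m+1)/2}$  for $-\infty < x < \infty$.    (8.4.2)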
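
As a quick numerical check of Eq. (8.4.2), the following sketch evaluates the p.d.f. directly from the formula. The function names `t_pdf`, `cauchy_pdf`, and `total_mass` are illustrative choices, not notation from the text, and the integration range and step count are arbitrary assumptions. With m = 1 the formula should reduce to the Cauchy p.d.f. $1/[\pi(1 + x^2)]$, and for any m the density should integrate to about 1.

```python
import math


def t_pdf(x: float, m: float) -> float:
    """p.d.f. of the t distribution with m degrees of freedom, Eq. (8.4.2)."""
    # Log of Gamma((m+1)/2) / ((m*pi)^(1/2) * Gamma(m/2)); lgamma avoids overflow.
    log_const = (math.lgamma((m + 1) / 2) - math.lgamma(m / 2)
                 - 0.5 * math.log(m * math.pi))
    # Log of (1 + x^2/m)^(-(m+1)/2).
    log_kernel = -(m + 1) / 2 * math.log1p(x * x / m)
    return math.exp(log_const + log_kernel)


def cauchy_pdf(x: float) -> float:
    """p.d.f. of the Cauchy distribution, 1 / (pi * (1 + x^2))."""
    return 1.0 / (math.pi * (1.0 + x * x))


def total_mass(m: float, half_width: float = 50.0, steps: int = 100_000) -> float:
    """Midpoint-rule integral of t_pdf over [-half_width, half_width]."""
    h = 2 * half_width / steps
    return h * sum(t_pdf(-half_width + (i + 0.5) * h, m) for i in range(steps))


if __name__ == "__main__":
    for x in (0.0, 1.0, 2.5):
        print(x, t_pdf(x, 1), cauchy_pdf(x), t_pdf(x, 5))
    print("approximate total mass for m = 5:", total_mass(5))
```

Working on the log scale is a design choice made here so that the gamma functions do not overflow when m is large.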


Moments of the t Distributions Although the mean of the t distribution does not
exist when $m \le 1$, the mean does exist for every value of $m > 1$. Of course, whenever
the mean does exist, its value is 0 because of the symmetry of the t distribution.

In general, if a random variable X has the t distribution with m degrees of
freedom ($m > 1$), then it can be shown that $E(|X|^k) < \infty$ for $k < m$ and that
$E(|X|^k) = \infty$ for $k \ge m$. If m is an integer, the first $m - 1$ moments of X exist,
but no moments of higher order exist. It follows, therefore, that the m.g.f. of X does
not exist.

It can be shown (see Exercise 1 at the end of this section) that if X has the t
distribution with m degrees of freedom ($m > 2$), then $\mathrm{Var}(X) = m/(m - 2)$.


Relation to Random Samples from a Normal Distribution 


Example 8.4.2 Rain from Seeded Clouds. Return to Example 8.4.1. We have already seen that
$Z = n^{1/2}(\bar{X}_n - \mu)/\sigma$ has the standard normal distribution. Furthermore, Theorem 8.3.1
says that $\bar{X}_n$ (and hence Z) is independent of $Y = n\hat\sigma^2/\sigma^2$, which has the $\chi^2$
distribution with $n - 1$ degrees of freedom. It follows that $Z/(Y/[n - 1])^{1/2}$ has the t
distribution with $n - 1$ degrees of freedom. We shall show how to use this fact after
stating the general version of this result. ◁


Theorem 8.4.2 Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with
mean $\mu$ and variance $\sigma^2$. Let $\bar{X}_n$ denote the sample mean, and define

    $\sigma' = \left[ \dfrac{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2}{n - 1} \right]^{1/2}.$    (8.4.3)

Then $n^{1/2}(\bar{X}_n - \mu)/\sigma'$ has the t distribution with $n - 1$ degrees of freedom.


Proof Define $S_n^2 = \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$. Next, define $Z = n^{1/2}(\bar{X}_n - \mu)/\sigma$ and $Y = S_n^2/\sigma^2$.
It follows from Theorem 8.3.1 that Y and Z are independent, Y has the $\chi^2$ distribution
with $n - 1$ degrees of freedom, and Z has the standard normal distribution. Finally,
define U by

    $U = \dfrac{Z}{\left( \dfrac{Y}{n - 1} \right)^{1/2}}.$

It follows from the definition of the t distribution that U has the t distribution with
$n - 1$ degrees of freedom. It is easily seen that U can be rewritten as

    $U = \dfrac{n^{1/2}(\bar{X}_n - \mu)}{\left( \dfrac{S_n^2}{n - 1} \right)^{1/2}}.$    (8.4.4)

The denominator of the expression on the right side of Eq. (8.4.4) is easily recognized
as $\sigma'$ defined in Eq. (8.4.3). ■


The first rigorous proof of Theorem 8.4.2 was given by R. A. Fisher in 1923. 

One important aspect of Eq. (8.4.4) is that neither the value of U nor the
distribution of U depends on the value of the variance $\sigma^2$. In Example 8.4.1, we tried
replacing $\sigma$ in the random variable $Z = n^{1/2}(\bar{X}_n - \mu)/\sigma$ by $\hat\sigma$. Instead, Theorem 8.4.2
suggests that we should replace $\sigma$ by $\sigma'$ defined in Eq. (8.4.3). If we replace $\sigma$ by $\sigma'$,
we produce the random variable U in Eq. (8.4.4) that does not involve $\sigma$ and also
has a distribution that does not depend on $\sigma$.

The reader should notice that $\sigma'$ differs from the M.L.E. $\hat\sigma$ of $\sigma$ by a constant
factor,

    $\sigma' = \left( \dfrac{S_n^2}{n - 1} \right)^{1/2} = \left( \dfrac{n}{n - 1} \right)^{1/2} \hat\sigma.$    (8.4.5)

It can be seen from Eq. (8.4.5) that for large values of n the estimators $\sigma'$ and $\hat\sigma$ will
be very close to each other. The estimator $\sigma'$ will be discussed further in Sec. 8.7.

If the sample size n is large, the probability that the estimator $\sigma'$ will be close to $\sigma$
is high. Hence, replacing $\sigma$ by $\sigma'$ in the random variable Z will not greatly change the
standard normal distribution of Z. For this reason, it is plausible that the t distribution
with $n - 1$ degrees of freedom should be close to the standard normal distribution if
n is large. We shall return to this point more formally later in this section.


Example 8.4.3 Rain from Seeded Clouds. Return to Example 8.4.2. Under the assumption that the
observations $X_1, \ldots, X_n$ (log-rainfalls) are independent with common normal
distribution, the distribution of $U = n^{1/2}(\bar{X}_n - \mu)/\sigma'$ is the t distribution with $n - 1$ degrees
of freedom. With n = 26, the table of the t distribution tells us that the 0.9 quantile
of the t distribution with 25 degrees of freedom is 1.316, so $\Pr(U \le 1.316) = 0.9$. It
follows that

    $\Pr(\bar{X}_n \le \mu + 0.2581\sigma') = 0.9,$

because $1.316/(26)^{1/2} = 0.2581$. That is, the probability is 0.9 that $\bar{X}_n$ will be no more
than 0.2581 times $\sigma'$ above $\mu$. Of course, $\sigma'$ is a random variable as well as $\bar{X}_n$, so this
result is not as informative as we might have hoped. In Sections 8.5 and 8.6, we will
show how to make use of the t distribution to make some standard inferences about
the unknown mean $\mu$. ◁
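
The probability statement above can be checked by simulation. The sketch below is a minimal Monte Carlo experiment, assuming arbitrary illustrative values for $\mu$ and $\sigma$ (the text does not fix them) and hypothetical function names `simulate_u` and `fraction_below`. It draws many normal samples of size n = 26, computes U for each, and reports how often U is at most 1.316.

```python
import math
import random


def simulate_u(n: int, mu: float, sigma: float, rng: random.Random) -> float:
    """Draw one normal sample of size n and return U = n^(1/2) (xbar - mu) / sigma'."""
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    # sigma' divides the sum of squared deviations by n - 1, as in Eq. (8.4.3).
    sigma_prime = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    return math.sqrt(n) * (xbar - mu) / sigma_prime


def fraction_below(c: float, n: int, mu: float, sigma: float,
                   reps: int = 20_000, seed: int = 1) -> float:
    """Fraction of simulated values of U that are at most c."""
    rng = random.Random(seed)
    hits = sum(simulate_u(n, mu, sigma, rng) <= c for _ in range(reps))
    return hits / reps


if __name__ == "__main__":
    # mu and sigma are assumed values; the fraction should be near 0.9 for any choice.
    print(fraction_below(1.316, n=26, mu=5.0, sigma=1.5))
    print(fraction_below(1.316, n=26, mu=-2.0, sigma=10.0))
```

Changing `mu` and `sigma` should leave the reported fraction essentially unchanged, which illustrates that the distribution of U does not depend on either parameter.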


Relation to the Cauchy Distribution and to the Standard 
Normal Distribution 


It can be seen from Eq. (8.4.2) (and Fig. 8.4) that the p.d.f. g(x) is a symmetric,
bell-shaped function with its maximum value at x = 0. Thus, its general shape is similar
to that of the p.d.f. of a normal distribution with mean 0. However, as $x \to \infty$ or
$x \to -\infty$, the tails of the p.d.f. g(x) approach 0 much more slowly than do the tails
of the p.d.f. of a normal distribution. In fact, it can be seen from Eq. (8.4.2) that the t
distribution with one degree of freedom is the Cauchy distribution, which was defined
in Example 4.1.8. The p.d.f. of the Cauchy distribution was sketched in Fig. 4.3. It
was shown in Example 4.1.8 that the mean of the Cauchy distribution does not exist,
because the integral that specifies the value of the mean is not absolutely convergent.
It follows that, although the p.d.f. of the t distribution with one degree of freedom
is symmetric with respect to the point x = 0, the mean of this distribution does not
exist.

It can also be shown from Eq. (8.4.2) that, as $n \to \infty$, the p.d.f. g(x) converges to
the p.d.f. $\phi(x)$ of the standard normal distribution for every value of x ($-\infty < x < \infty$).
This follows from Theorem 5.3.3 and the following result:

    $\lim_{m \to \infty} \dfrac{\Gamma\left( m + \frac{1}{2} \right)}{\Gamma(m)\, m^{1/2}} = 1.$    (8.4.6)

(See Exercise 7 for a way to prove the above result.) Hence, when n is large, the t
distribution with n degrees of freedom can be approximated by the standard normal
distribution. Figure 8.4 shows the p.d.f. of the standard normal distribution together
with the p.d.f.'s of the t distributions with 1, 5, and 20 degrees of freedom so that the
reader can see how the t distributions get closer to normal as the degrees of freedom
increase.

Figure 8.4 p.d.f.'s of standard normal and t distributions. [Plot of density curves:
normal, Cauchy, t with 5 degrees of freedom, t with 20 degrees of freedom.]

A short table of p quantiles for the t distribution with m degrees of freedom for
various values of p and m is given at the end of this book. The probabilities in the
first line of the table, corresponding to m = 1, are those for the Cauchy distribution.
The probabilities in the bottom line of the table corresponding to $m = \infty$ are those
for the standard normal distribution. Most statistical packages include a function to
compute the c.d.f. and the quantile function of an arbitrary t distribution.
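
For readers without a statistical package at hand, the c.d.f. and quantile function can be approximated from Eq. (8.4.2) alone. The sketch below is one possible approach, not the method used to build the table in the book: it integrates the p.d.f. with Simpson's rule and inverts the c.d.f. by bisection. The names `t_cdf`, `t_quantile`, and `normal_quantile`, the step count, and the tolerance are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def t_pdf(x: float, m: float) -> float:
    """p.d.f. of the t distribution with m degrees of freedom, Eq. (8.4.2)."""
    log_const = (math.lgamma((m + 1) / 2) - math.lgamma(m / 2)
                 - 0.5 * math.log(m * math.pi))
    return math.exp(log_const - (m + 1) / 2 * math.log1p(x * x / m))


def t_cdf(x: float, m: float, steps: int = 2000) -> float:
    """c.d.f. at x: 1/2 plus or minus the Simpson's-rule area over [0, |x|]."""
    if x == 0:
        return 0.5
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, m) + t_pdf(a, m)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, m)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def t_quantile(p: float, m: float, tol: float = 1e-7) -> float:
    """p quantile (0 < p < 1) found by bisection on the c.d.f."""
    lo, hi = -1.0, 1.0
    while t_cdf(lo, m) > p:   # widen the bracket to the left if needed
        lo *= 2
    while t_cdf(hi, m) < p:   # widen the bracket to the right if needed
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if t_cdf(mid, m) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def normal_quantile(p: float) -> float:
    """p quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(p)


if __name__ == "__main__":
    print(t_quantile(0.9, 25))        # compare with the table entry 1.316
    print(t_quantile(0.975, 10**6), normal_quantile(0.975))
```

The fixed-step Simpson rule is adequate for moderate quantiles; for probabilities very close to 0 or 1 with few degrees of freedom, the bracket grows large and a finer or adaptive rule would be needed.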


Derivation of the p.d.f.


Suppose that the joint distribution of Y and Z is as specified in Definition 8.4.1. Then,
because Y and Z are independent, their joint p.d.f. is equal to the product $f_1(y) f_2(z)$,
where $f_1(y)$ is the p.d.f. of the $\chi^2$ distribution with m degrees of freedom and $f_2(z)$ is
the p.d.f. of the standard normal distribution. Let X be defined by Eq. (8.4.1) and, as
a convenient device, let W = Y. We shall determine first the joint p.d.f. of X and W.
From the definitions of X and W,

    $Z = X \left( \dfrac{W}{m} \right)^{1/2}$ and $Y = W$.    (8.4.7)

The Jacobian of the transformation (8.4.7) from X and W to Y and Z is $(W/m)^{1/2}$.
The joint p.d.f. f(x, w) of X and W can be obtained from the joint p.d.f. $f_1(y) f_2(z)$ by
replacing y and z by the expressions given in (8.4.7) and then multiplying the result
by $(w/m)^{1/2}$. It is then found that the value of f(x, w) is as follows, for $-\infty < x < \infty$
and $w > 0$:

    $f(x, w) = f_1(w)\, f_2\left( x \left( \dfrac{w}{m} \right)^{1/2} \right) \left( \dfrac{w}{m} \right)^{1/2}$
    $\phantom{f(x, w)} = c\, w^{(m+1)/2 - 1} \exp\left[ -\dfrac{1}{2} \left( 1 + \dfrac{x^2}{m} \right) w \right],$    (8.4.8)



where

    $c = \left[ 2^{(m+1)/2} (m\pi)^{1/2}\, \Gamma\left( \dfrac{m}{2} \right) \right]^{-1}.$

The marginal p.d.f. g(x) of X can be obtained from Eq. (8.4.8) by using the
relation

    $g(x) = \int_0^\infty f(x, w)\, dw = c \int_0^\infty w^{(m+1)/2 - 1} \exp[-w\, h(x)]\, dw,$

where $h(x) = [1 + x^2/m]/2$. It follows from Eq. (5.7.10) that

    $g(x) = c\, \dfrac{\Gamma((m+1)/2)}{h(x)^{(m+1)/2}}.$

Substituting the formula for c into this yields the function in (8.4.2).


Summary


Let $X_1, \ldots, X_n$ be a random sample from the normal distribution with mean $\mu$
and variance $\sigma^2$. Let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ and
$\sigma' = \left( \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \right)^{1/2}$. Then the
distribution of $n^{1/2}(\bar{X}_n - \mu)/\sigma'$ is the t distribution with $n - 1$ degrees of freedom.


Exercises 


1. Suppose that X has the t distribution with m degrees
of freedom ($m > 2$). Show that $\mathrm{Var}(X) = m/(m - 2)$. Hint:
To evaluate $E(X^2)$, restrict the integral to the positive half
of the real line and change the variable from x to

    $y = \dfrac{x^2/m}{1 + x^2/m}.$

Compare the integral with the p.d.f. of a beta distribution.
Alternatively, use Exercise 21 in Sec. 5.7.


2. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with unknown mean $\mu$ and unknown
standard deviation $\sigma$, and let $\hat\mu$ and $\hat\sigma$ denote the
M.L.E.'s of $\mu$ and $\sigma$. For the sample size n = 17, find a
value of k such that

    $\Pr(\hat\mu > \mu + k\hat\sigma) = 0.95.$

3. Suppose that the five random variables $X_1, \ldots, X_5$ are
i.i.d. and that each has the standard normal distribution.
Determine a constant c such that the random variable

    $\dfrac{c\,(X_1 + X_2)}{\left( X_3^2 + X_4^2 + X_5^2 \right)^{1/2}}$

will have a t distribution.


4. By using the table of the t distribution given in the back
of this book, determine the value of the integral

    $\displaystyle \int_{-\infty}^{2.5} \dfrac{dx}{(12 + x^2)^2}.$

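5. Suppose that the random variables $X_1$ and $X_2$ are independent
and that each has the normal distribution with
mean 0 and variance $\sigma^2$. Determine the value of

    $\Pr\left[ \dfrac{(X_1 + X_2)^2}{(X_1 - X_2)^2} < 4 \right].$

Hint:

    $(X_1 - X_2)^2 = 2 \left[ \left( X_1 - \dfrac{X_1 + X_2}{2} \right)^2 + \left( X_2 - \dfrac{X_1 + X_2}{2} \right)^2 \right].$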


6. In Example 8.2.3, suppose that we will observe n = 20
cheese chunks with lactic acid concentrations
$X_1, \ldots, X_{20}$. Find a number c so that

    $\Pr(\bar{X}_{20} \le \mu + c\sigma') = 0.95.$


7. Prove the limit formula Eq. (8.4.6). Hint: Use Theorem 5.7.4.

8. Let X have the standard normal distribution, and let
Y have the t distribution with five degrees of freedom.
Explain why c = 1.63 provides the largest value of the
difference $\Pr(-c < X < c) - \Pr(-c < Y < c)$. Hint: Start
by looking at Fig. 8.4.



8.5 Confidence Intervals 


Confidence intervals provide a method of adding more information to an estimator
$\hat\theta$ when we wish to estimate an unknown parameter $\theta$. We can find an interval
(A, B) that we think has high probability of containing $\theta$. The length of such an
interval gives us an idea of how closely we can estimate $\theta$.


Confidence Intervals for the Mean of a Normal Distribution 


Example 8.5.1 Rain from Seeded Clouds. In Example 8.3.2, the average of the n = 26 log-rainfalls
from the seeded clouds is $\bar{X}_n$. This may be a sensible estimator of $\mu$, the mean
log-rainfall from a seeded cloud, but it doesn't give any idea how much stock we
should place in the estimator. The standard deviation of $\bar{X}_n$ is $\sigma/(26)^{1/2}$, and we could
estimate $\sigma$ by an estimator like $\sigma'$ from Eq. (8.4.3). Is there a sensible way to combine
these two estimators into an inference that tells us both what we should estimate for
$\mu$ and how much confidence we should place in the estimator? ◁


Assume that $X_1, \ldots, X_n$ form a random sample from the normal distribution
with mean $\mu$ and variance $\sigma^2$. Construct the estimators $\bar{X}_n$ of $\mu$ and $\sigma'$ of $\sigma$. We shall
now show how to make use of the random variable

    $U = \dfrac{n^{1/2}(\bar{X}_n - \mu)}{\sigma'}$    (8.5.1)

from Eq. (8.4.4) to address the question at the end of Example 8.5.1. We know that U
has the t distribution with $n - 1$ degrees of freedom. Hence, we can calculate the c.d.f.
of U and/or quantiles of U using either statistical software or tables such as those
in the back of this book. In particular, we can compute $\Pr(-c < U < c)$ for every
$c > 0$. The inequalities $-c < U < c$ can be translated into inequalities involving $\mu$ by
making use of the formula for U in Eq. (8.5.1). Simple algebra shows that $-c < U < c$
is equivalent to

    $\bar{X}_n - \dfrac{c\sigma'}{n^{1/2}} < \mu < \bar{X}_n + \dfrac{c\sigma'}{n^{1/2}}.$    (8.5.2)

Whatever probability we can assign to the event $\{-c < U < c\}$ we can also assign to
the event that Eq. (8.5.2) holds. For example, if $\Pr(-c < U < c) = \gamma$, then

    $\Pr\left( \bar{X}_n - \dfrac{c\sigma'}{n^{1/2}} < \mu < \bar{X}_n + \dfrac{c\sigma'}{n^{1/2}} \right) = \gamma.$    (8.5.3)

One must be careful to understand the probability statement in Eq. (8.5.3) as being
a statement about the joint distribution of the random variables $\bar{X}_n$ and $\sigma'$ for fixed
values of $\mu$ and $\sigma$. That is, it is a statement about the sampling distribution of $\bar{X}_n$ and
$\sigma'$, and is conditional on $\mu$ and $\sigma$. In particular, it is not a statement about $\mu$ even if
we treat $\mu$ as a random variable.

The most popular version of the calculation above is to choose $\gamma$ and then figure
out what c must be in order to make (8.5.3) true. That is, what value of c makes
$\Pr(-c < U < c) = \gamma$? Let $T_{n-1}$ denote the c.d.f. of the t distribution with $n - 1$ degrees
of freedom. Then

    $\gamma = \Pr(-c < U < c) = T_{n-1}(c) - T_{n-1}(-c).$

Since the t distributions are symmetric around 0, $T_{n-1}(-c) = 1 - T_{n-1}(c)$, so
$\gamma = 2T_{n-1}(c) - 1$ or, equivalently, $c = T_{n-1}^{-1}([1 + \gamma]/2)$. That is, c must be the $(1 + \gamma)/2$
quantile of the t distribution with $n - 1$ degrees of freedom.


Example 8.5.2 Rain from Seeded Clouds. In Example 8.3.2, we have n = 26. If we want $\gamma = 0.95$ in
Eq. (8.5.3), then we need c to be the 1.95/2 = 0.975 quantile of the t distribution with
25 degrees of freedom. This can be found in the table of t distribution quantiles in the
back of the book to be c = 2.060. We can plug this value into Eq. (8.5.3) and combine
the constants $c/n^{1/2} = 2.060/26^{1/2} = 0.404$. Then Eq. (8.5.3) states that regardless of
the unknown values of $\mu$ and $\sigma$, the probability is 0.95 that the two random variables
$A = \bar{X}_n - 0.404\sigma'$ and $B = \bar{X}_n + 0.404\sigma'$ will lie on opposite sides of $\mu$. ◁


The interval (A, B), whose endpoints were computed at the end of Example 8.5.2, 
is called a confidence interval. 


Definition 8.5.1 Confidence Interval. Let $\mathbf{X} = (X_1, \ldots, X_n)$ be a random sample from a distribution
that depends on a parameter (or parameter vector) $\theta$. Let $g(\theta)$ be a real-valued
function of $\theta$. Let $A \le B$ be two statistics that have the property that for all values
of $\theta$,

    $\Pr(A < g(\theta) < B) \ge \gamma.$    (8.5.4)

Then the random interval (A, B) is called a coefficient $\gamma$ confidence interval for $g(\theta)$
or a $100\gamma$ percent confidence interval for $g(\theta)$. If the inequality "$\ge \gamma$" in Eq. (8.5.4)
is an equality for all $\theta$, the confidence interval is called exact. After the values of the
random variables $X_1, \ldots, X_n$ in the random sample have been observed, the values
of A = a and B = b are computed, and the interval (a, b) is called the observed value
of the confidence interval.


In Example 8.5.2, $\theta = (\mu, \sigma)$, and the interval (A, B) found in that example is an
exact 95% confidence interval for $g(\theta) = \mu$.

Based on the discussion preceding Definition 8.5.1, we have established the 
following. 


Theorem 8.5.1 Confidence Interval for the Mean of a Normal Distribution. Let $X_1, \ldots, X_n$ be a random
sample from the normal distribution with mean $\mu$ and variance $\sigma^2$. For each $0 < \gamma < 1$,
the interval (A, B) with the following endpoints is an exact coefficient $\gamma$ confidence
interval for $\mu$:

    $A = \bar{X}_n - T_{n-1}^{-1}\left( \dfrac{1 + \gamma}{2} \right) \dfrac{\sigma'}{n^{1/2}},$
    $B = \bar{X}_n + T_{n-1}^{-1}\left( \dfrac{1 + \gamma}{2} \right) \dfrac{\sigma'}{n^{1/2}}.$


Example 8.5.3 Rain from Seeded Clouds. In Example 8.5.2, the average of the 26 log-rainfalls from
the seeded clouds is $\bar{X}_n = 5.134$. The observed value of $\sigma'$ is 1.600. The observed
values of A and B are, respectively, a = 5.134 − 0.404 × 1.600 = 4.488 and b = 5.134 +
0.404 × 1.600 = 5.780. The observed value of the 95% confidence interval is then
(4.488, 5.780). For comparison, the mean unseeded level of 4 is a bit below the lower
endpoint of this interval. ◁


Figure 8.5 A sample of one hundred observed 95% confidence intervals based
on samples of size 26 from the normal distribution with mean $\mu = 5.1$ and standard
deviation $\sigma = 1.6$. In this figure, 94% of the intervals contain the value of $\mu$.


Interpretation of Confidence Intervals The interpretation of the confidence interval
(A, B) defined in Definition 8.5.1 is straightforward, so long as one remembers
that $\Pr(A < g(\theta) < B) = \gamma$ is a probability statement about the joint distribution of
the two random variables A and B given a particular value of $\theta$. Once we compute the
observed values a and b, the observed interval (a, b) is not so easy to interpret. For
example, some people would like to interpret the interval in Example 8.5.3 as meaning
that we are 95% confident that $\mu$ is between 4.488 and 5.780. Later in this section,
we shall show why such an interpretation is not safe in general. Before observing the
data, we can be 95% confident that the random interval (A, B) will contain $\mu$, but
after observing the data, the safest interpretation is that (a, b) is simply the observed
value of the random interval (A, B). One way to think of the random interval (A, B)
is to imagine that the sample that we observed is one of many possible samples that
we could have observed (or may yet observe in the future). Each such sample would
allow us to compute an observed interval. Prior to observing the samples, we would
expect 95% of the intervals to contain $\mu$. Even if we observed many such intervals,
we won't know which ones contain $\mu$ and which ones don't. Figure 8.5 contains a
plot of 100 observed values of confidence intervals, each computed from a sample of
size n = 26 from the normal distribution with mean $\mu = 5.1$ and standard deviation
$\sigma = 1.6$. In this example, 94 of the 100 intervals contain the value of $\mu$.
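
The repeated-sampling interpretation can be reproduced with a short simulation. The sketch below assumes the same setting as Figure 8.5 ($\mu = 5.1$, $\sigma = 1.6$, n = 26) and the table value c = 2.060 from Example 8.5.2. The function name `coverage` and the random seed are arbitrary, so the simulated count will generally differ from the 94 out of 100 shown in the figure.

```python
import math
import random


def coverage(mu: float = 5.1, sigma: float = 1.6, n: int = 26,
             c: float = 2.060, reps: int = 100, seed: int = 0) -> float:
    """Fraction of simulated 95% intervals (xbar -/+ c*sigma'/sqrt(n)) containing mu."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        # sigma' uses the divisor n - 1, as in Eq. (8.4.3).
        sigma_prime = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
        half_width = c * sigma_prime / math.sqrt(n)
        if xbar - half_width < mu < xbar + half_width:
            covered += 1
    return covered / reps


if __name__ == "__main__":
    print(coverage(reps=100))      # one batch of 100 intervals, as in Figure 8.5
    print(coverage(reps=20_000))   # long-run fraction, expected to be near 0.95
```

With only 100 intervals the observed fraction fluctuates by a few percentage points from run to run; the long-run fraction is what the confidence coefficient describes.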


Example 8.5.4 Acid Concentration in Cheese. In Example 8.2.3, we discussed a random sample of
10 lactic acid measurements from cheese. Suppose that we desire to compute a 90%
confidence interval for $\mu$, the unknown mean lactic acid concentration. The number c
that we need in Eq. (8.5.3) when n = 10 and $\gamma = 0.9$ is the (1 + 0.9)/2 = 0.95 quantile
of the t distribution with nine degrees of freedom, c = 1.833. According to Eq. (8.5.3),
the endpoints will be $\bar{X}_n$ plus and minus $1.833\sigma'/(10)^{1/2}$. Suppose that we observe
the following 10 lactic acid concentrations as reported by Moore and McCabe (1999,
p. D-1):


0.86, 1.53, 1.57, 1.81, 0.99, 1.09, 1.29, 1.78, 1.29, 1.58.

The average of these 10 values is $\bar{x}_n = 1.379$, and the value of $\sigma'$ is 0.3277. The
endpoints of the observed value of our 90% confidence interval are then
$1.379 - 1.833 \times 0.3277/(10)^{1/2} = 1.189$ and $1.379 + 1.833 \times 0.3277/(10)^{1/2} = 1.569$. ◁
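
The arithmetic in this example can be reproduced from the ten listed observations. The sketch below takes the quantile c = 1.833 from the text rather than computing it, and the helper name `mean_interval` is a hypothetical choice. It relies on `statistics.stdev`, which divides by n − 1 and therefore matches $\sigma'$ in Eq. (8.4.3).

```python
import math
import statistics

# The ten lactic acid concentrations listed in the example.
lactic = [0.86, 1.53, 1.57, 1.81, 0.99, 1.09, 1.29, 1.78, 1.29, 1.58]


def mean_interval(data, c):
    """Return (xbar, sigma', (a, b)) where (a, b) = xbar -/+ c * sigma' / sqrt(n)."""
    n = len(data)
    xbar = statistics.fmean(data)
    sigma_prime = statistics.stdev(data)  # sample standard deviation, divisor n - 1
    half_width = c * sigma_prime / math.sqrt(n)
    return xbar, sigma_prime, (xbar - half_width, xbar + half_width)


if __name__ == "__main__":
    xbar, sigma_prime, (a, b) = mean_interval(lactic, c=1.833)
    print(round(xbar, 3), round(sigma_prime, 4), round(a, 3), round(b, 3))
```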


Note: Alternative Definitions of Confidence Interval. Many authors define confi- 
dence intervals precisely as we have done here. Some others define the confidence 
interval to be what we called the observed value of the confidence interval, namely, 
(a, b), and they need another name for the random interval (A, B). Throughout this 
book, we shall stay with the definition we have given, but the reader who studies 
statistics further might encounter the other definition at a later date. Also, some 
authors define confidence intervals to be closed intervals rather than open intervals. 


One-Sided Confidence Intervals 


Example 8.5.5 Rain from Seeded Clouds. Suppose that we are interested only in obtaining a lower
bound on $\mu$, the mean log-rainfall of seeded clouds. In the spirit of confidence
intervals, we could then seek a random variable A such that $\Pr(A < \mu) = \gamma$. If we
let $B = \infty$ in Definition 8.5.1, we see that $(A, \infty)$ is then a coefficient $\gamma$ confidence
interval for $\mu$. ◁


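For a given confidence coefficient $\gamma$, it is possible to construct many different
confidence intervals for $\mu$. For example, let $\gamma_2 > \gamma_1$ be two numbers such that
$\gamma_2 - \gamma_1 = \gamma$, and let U be as in Eq. (8.5.1). Then

    $\Pr\left( T_{n-1}^{-1}(\gamma_1) < U < T_{n-1}^{-1}(\gamma_2) \right) = \gamma,$

and, because $-U$ has the same t distribution as U, the following statistics are the
endpoints of a coefficient $\gamma$ confidence interval for $\mu$:

    $A = \bar{X}_n + T_{n-1}^{-1}(\gamma_1) \dfrac{\sigma'}{n^{1/2}}$ and $B = \bar{X}_n + T_{n-1}^{-1}(\gamma_2) \dfrac{\sigma'}{n^{1/2}}.$

Among all such coefficient $\gamma$ confidence intervals, the symmetric interval with
$\gamma_1 = 1 - \gamma_2$ is the shortest one.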

Nevertheless, there are cases, such as Example 8.5.5, in which an asymmetric
confidence interval is useful. In general, it is a simple matter to extend Definition 8.5.1
to allow either $A = -\infty$ or $B = \infty$ so that the confidence interval either has the form
$(-\infty, B)$ or $(A, \infty)$.


Definition 8.5.2 One-Sided Confidence Intervals/Limits. Let $\mathbf{X} = (X_1, \ldots, X_n)$ be a random sample
from a distribution that depends on a parameter (or parameter vector) $\theta$. Let $g(\theta)$
be a real-valued function of $\theta$. Let A be a statistic that has the property that for all
values of $\theta$,

    $\Pr(A < g(\theta)) \ge \gamma.$    (8.5.5)

Then the random interval $(A, \infty)$ is called a one-sided coefficient $\gamma$ confidence interval
for $g(\theta)$ or a one-sided $100\gamma$ percent confidence interval for $g(\theta)$. Also, A is called a
coefficient $\gamma$ lower confidence limit for $g(\theta)$ or a $100\gamma$ percent lower confidence limit
for $g(\theta)$. Similarly, if B is a statistic such that

    $\Pr(g(\theta) < B) \ge \gamma,$    (8.5.6)

then $(-\infty, B)$ is a one-sided coefficient $\gamma$ confidence interval for $g(\theta)$ or a one-sided
$100\gamma$ percent confidence interval for $g(\theta)$ and B is a coefficient $\gamma$ upper confidence limit
for $g(\theta)$ or a $100\gamma$ percent upper confidence limit for $g(\theta)$. If the inequality "$\ge \gamma$" in
either Eq. (8.5.5) or Eq. (8.5.6) is equality for all $\theta$, the corresponding confidence
interval and confidence limit are called exact.


The following result follows in much the same way as Theorem 8.5.1. 


Theorem 8.5.2 One-Sided Confidence Intervals for the Mean of a Normal Distribution. Let $X_1, \ldots, X_n$
be a random sample from the normal distribution with mean $\mu$ and variance $\sigma^2$.
For each $0 < \gamma < 1$, the following statistics are, respectively, exact lower and upper
coefficient $\gamma$ confidence limits for $\mu$:

    $A = \bar{X}_n - T_{n-1}^{-1}(\gamma) \dfrac{\sigma'}{n^{1/2}},$
    $B = \bar{X}_n + T_{n-1}^{-1}(\gamma) \dfrac{\sigma'}{n^{1/2}}.$


Example 8.5.6 Rain from Seeded Clouds. In Example 8.5.5, suppose that we want a 90% lower
confidence limit for $\mu$. We find $T_{25}^{-1}(0.9) = 1.316$. Using the observed data from
Example 8.5.3, we compute the observed lower confidence limit as

    $a = 5.134 - 1.316 \times \dfrac{1.600}{26^{1/2}} = 4.721.$ ◁
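
The one-sided limits are simple enough to evaluate directly from the summary statistics. The sketch below uses hypothetical helper names `lower_confidence_limit` and `upper_confidence_limit`, and takes the quantile $T_{25}^{-1}(0.9) = 1.316$ from the text as an input rather than computing it.

```python
import math


def lower_confidence_limit(xbar: float, sigma_prime: float, n: int,
                           quantile: float) -> float:
    """A = xbar - T^{-1}_{n-1}(gamma) * sigma' / sqrt(n), with the quantile supplied."""
    return xbar - quantile * sigma_prime / math.sqrt(n)


def upper_confidence_limit(xbar: float, sigma_prime: float, n: int,
                           quantile: float) -> float:
    """B = xbar + T^{-1}_{n-1}(gamma) * sigma' / sqrt(n), with the quantile supplied."""
    return xbar + quantile * sigma_prime / math.sqrt(n)


if __name__ == "__main__":
    # Observed values from Example 8.5.3 and the 0.9 quantile for 25 degrees of freedom.
    print(lower_confidence_limit(5.134, 1.600, 26, 1.316))
    print(upper_confidence_limit(5.134, 1.600, 26, 1.316))
```

Note that the one-sided limit uses the $\gamma$ quantile itself, whereas the two-sided interval of Theorem 8.5.1 uses the $(1 + \gamma)/2$ quantile.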


Confidence Intervals for Other Parameters 


Example 8.5.7 Lifetimes of Electronic Components. Recall the company in Example 8.1.3 that is
estimating the failure rate $\theta$ of electronic components based on a sample of n = 3
observed lifetimes $X_1, X_2, X_3$. The statistic $T = \sum_{i=1}^{3} X_i$ was used in Examples 8.1.4
and 8.1.5 to make some inferences. We can use the distribution of T to construct
confidence intervals for $\theta$. Recall from Example 8.1.5 that $\theta T$ has the gamma distribution
with parameters 3 and 1 for all $\theta$. Let G stand for the c.d.f. of this gamma distribution.
Then $\Pr(\theta T < G^{-1}(\gamma)) = \gamma$ for all $\theta$. It follows that $\Pr(\theta < G^{-1}(\gamma)/T) = \gamma$ for all $\theta$,
and $G^{-1}(\gamma)/T$ is an exact coefficient $\gamma$ upper confidence limit for $\theta$. For example, if
the company would like to have a random variable B so that they can be 98% confident
that the failure rate $\theta$ is bounded above by B, they can find $G^{-1}(0.98) = 7.516$.
Then B = 7.516/T is the desired upper confidence limit. ◁
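
The quantile $G^{-1}(0.98)$ can be checked without tables, because the gamma distribution with parameters 3 and 1 has the closed-form c.d.f. $G(x) = 1 - e^{-x}(1 + x + x^2/2)$ for $x > 0$. The sketch below inverts this c.d.f. by bisection. The function names are illustrative assumptions, and the lifetimes used at the end are made-up numbers, not data from the example.

```python
import math


def gamma3_cdf(x: float) -> float:
    """c.d.f. of the gamma distribution with parameters 3 and 1."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-x) * (1.0 + x + 0.5 * x * x)


def gamma3_quantile(p: float, tol: float = 1e-10) -> float:
    """p quantile (0 < p < 1) of the gamma(3, 1) distribution, by bisection."""
    lo, hi = 0.0, 1.0
    while gamma3_cdf(hi) < p:   # expand until the quantile is bracketed
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gamma3_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def upper_limit_for_rate(total_lifetime: float, gamma: float) -> float:
    """Exact coefficient-gamma upper confidence limit G^{-1}(gamma) / T for theta."""
    return gamma3_quantile(gamma) / total_lifetime


if __name__ == "__main__":
    print(gamma3_quantile(0.98))            # compare with 7.516 in the text
    hypothetical_lifetimes = [1.0, 2.0, 3.0]  # illustrative values only
    print(upper_limit_for_rate(sum(hypothetical_lifetimes), 0.98))
```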


In Example 8.5.7, the random variable $\theta T$ has the property that its distribution
is the same for all $\theta$. The random variable U in Eq. (8.5.1) has the property that its
distribution is the same for all $\mu$ and $\sigma$. Such random variables greatly facilitate the
construction of confidence intervals.


Definition 8.5.3 Pivotal. Let $\mathbf{X} = (X_1, \ldots, X_n)$ be a random sample from a distribution that depends
on a parameter (or vector of parameters) $\theta$. Let $V(\mathbf{X}, \theta)$ be a random variable whose
distribution is the same for all $\theta$. Then V is called a pivotal quantity (or simply a
pivotal).


In order to be able to use a pivotal to construct a confidence interval for $g(\theta)$, one
needs to be able to "invert" the pivotal. That is, one needs a function $r(v, \mathbf{x})$ such
that

    $r(V(\mathbf{X}, \theta), \mathbf{X}) = g(\theta).$    (8.5.7)


If such a function exists, then one can use it to construct confidence intervals.

Theorem 8.5.3 Confidence Interval from a Pivotal. Let $\mathbf{X} = (X_1, \ldots, X_n)$ be a random sample from a
distribution that depends on a parameter (or vector of parameters) $\theta$. Suppose that
a pivotal V exists. Let G be the c.d.f. of V, and assume that G is continuous. Assume
that a function r exists as in Eq. (8.5.7), and assume that $r(v, \mathbf{x})$ is strictly increasing in
v for each $\mathbf{x}$. Let $0 < \gamma < 1$ and let $\gamma_2 > \gamma_1$ be such that $\gamma_2 - \gamma_1 = \gamma$. Then the following
statistics are the endpoints of an exact coefficient $\gamma$ confidence interval for $g(\theta)$:

    $A = r\left( G^{-1}(\gamma_1), \mathbf{X} \right),$
    $B = r\left( G^{-1}(\gamma_2), \mathbf{X} \right).$

If $r(v, \mathbf{x})$ is strictly decreasing in v for each $\mathbf{x}$, then switch the definitions of A and B.


Proof If $r(v, \mathbf{x})$ is strictly increasing in v for each $\mathbf{x}$, we have

    $V(\mathbf{X}, \theta) < c$ if and only if $g(\theta) < r(c, \mathbf{X})$.    (8.5.8)

Let $c = G^{-1}(\gamma_i)$ in Eq. (8.5.8) for each of i = 1, 2 to obtain

    $\Pr(g(\theta) < A) = \gamma_1, \quad \Pr(g(\theta) < B) = \gamma_2.$    (8.5.9)

Because V has a continuous distribution and r is strictly increasing,

    $\Pr(A = g(\theta)) = \Pr\left( V(\mathbf{X}, \theta) = G^{-1}(\gamma_1) \right) = 0.$

Similarly, $\Pr(B = g(\theta)) = 0$. The two equations in (8.5.9) combine to give
$\Pr(A < g(\theta) < B) = \gamma$. The proof when r is strictly decreasing is similar and is left to the
reader. ■


Example 8.5.8 Pivotal for Estimating the Variance of a Normal Distribution. Let $X_1, \ldots, X_n$ be a random
sample from the normal distribution with mean $\mu$ and variance $\sigma^2$. In Theorem 8.3.1,
we found that the random variable $V(\mathbf{X}, \theta) = \sum_{i=1}^{n} (X_i - \bar{X}_n)^2/\sigma^2$ has the
$\chi^2$ distribution with $n - 1$ degrees of freedom for all $\theta = (\mu, \sigma^2)$. This makes V a pivotal.
The reader can use this pivotal in Exercise 5 in this section to find a confidence
interval for $g(\theta) = \sigma^2$. ◁


Sometimes pivotals do not exist. This is common when the data have a discrete 
distribution. 


Example 8.5.9 A Clinical Trial. Consider the imipramine treatment group in the clinical trial in
Example 2.1.4. Let $\theta$ stand for the proportion of successes among a very large
population of imipramine patients. Suppose that the clinicians desire a random
variable A such that, for all $\theta$, $\Pr(A < \theta) \ge 0.9$. That is, they want to be 90% confident
that the success proportion is at least A. The observable data consist of the number X
of successes in a random sample of n = 40 patients. No pivotal exists in this example,
and confidence intervals are more difficult to construct. In Example 9.1.16, we shall
see a method that applies to this case. ◁


Even with discrete data, if the sample size is large enough to apply the central 
limit theorem, one can find approximate confidence intervals. 


Example 8.5.10 Approximate Confidence Interval for Poisson Mean. Suppose that $X_1, \ldots, X_n$ have the
Poisson distribution with unknown mean $\theta$. Suppose that n is large enough so that
$\bar{X}_n$ has approximately a normal distribution. In Example 6.3.8 on page 365, we found
that

    $\Pr\left( |2\bar{X}_n^{1/2} - 2\theta^{1/2}| < c \right) \approx 2\Phi(c n^{1/2}) - 1.$    (8.5.10)

After we observe $\bar{X}_n = x$, Eq. (8.5.10) says that

    $\left( -c + 2x^{1/2},\; c + 2x^{1/2} \right)$    (8.5.11)

is the observed value of an approximate confidence interval for $2\theta^{1/2}$ with coefficient
$2\Phi(c n^{1/2}) - 1$. For example, if c = 0.196 and n = 100, then $2\Phi(c n^{1/2}) - 1 = 0.95$. The
inverse of $g(\theta) = 2\theta^{1/2}$ is $g^{-1}(y) = y^2/4$, which is an increasing function of y for
$y \ge 0$. If both endpoints of (8.5.11) are nonnegative, then we know that $2\theta^{1/2}$ is in
the interval (8.5.11) if and only if $\theta$ is in the interval

    $\left( \tfrac{1}{4}\left[ -c + 2x^{1/2} \right]^2,\; \tfrac{1}{4}\left[ c + 2x^{1/2} \right]^2 \right).$    (8.5.12)

If $-c + 2x^{1/2} < 0$, the left endpoints of (8.5.11) and (8.5.12) should be replaced by
0. With this modification, (8.5.12) is the observed value of an approximate coefficient
$2\Phi(c n^{1/2}) - 1$ confidence interval for $\theta$. ◁
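
The interval in (8.5.12), including the replacement of a negative left endpoint by 0, can be sketched as follows. The function name `poisson_mean_interval` and the sample averages used at the end are illustrative assumptions, not values from the text; the standard normal c.d.f. $\Phi$ comes from Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def poisson_mean_interval(xbar: float, n: int, c: float):
    """Return (coefficient, (lower, upper)) for the interval in (8.5.12).

    coefficient = 2 * Phi(c * sqrt(n)) - 1 is the approximate confidence coefficient.
    """
    coefficient = 2 * NormalDist().cdf(c * math.sqrt(n)) - 1
    root = 2 * math.sqrt(xbar)
    # If -c + 2*sqrt(xbar) is negative, the left endpoint is replaced by 0.
    lower = 0.0 if root - c < 0 else (root - c) ** 2 / 4
    upper = (root + c) ** 2 / 4
    return coefficient, (lower, upper)


if __name__ == "__main__":
    # c = 0.196 and n = 100 are the values in the text; xbar = 4.0 is hypothetical.
    print(poisson_mean_interval(xbar=4.0, n=100, c=0.196))
    # A very small sample average triggers the left-endpoint replacement.
    print(poisson_mean_interval(xbar=0.005, n=100, c=0.196))
```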


Shortcoming of Confidence Intervals 


Interpretation of Confidence Intervals Let (A, B) be a coefficient $\gamma$ confidence
interval for a parameter $\theta$, and let (a, b) be the observed value of the interval. It
is important to understand that it is not correct to say that $\theta$ lies in the interval
(a, b) with probability $\gamma$. We shall explain this point further here. Before the values
of the statistics $A(X_1, \ldots, X_n)$ and $B(X_1, \ldots, X_n)$ are observed, these statistics are
random variables. It follows, therefore, from Definition 8.5.1 that $\theta$ will lie in the
random interval having endpoints $A(X_1, \ldots, X_n)$ and $B(X_1, \ldots, X_n)$ with probability
$\gamma$. After the specific values $A(X_1, \ldots, X_n) = a$ and $B(X_1, \ldots, X_n) = b$ have been
observed, it is not possible to assign a probability to the event that $\theta$ lies in the
specific interval (a, b) without regarding $\theta$ as a random variable, which itself has a
probability distribution. In order to calculate the probability that $\theta$ lies in the interval
(a, b), it is necessary first to assign a prior distribution to $\theta$ and then use the resulting
posterior distribution. Instead of assigning a prior distribution to the parameter $\theta$,
many statisticians prefer to state that there is confidence $\gamma$, rather than probability
$\gamma$, that $\theta$ lies in the interval (a, b). Because of this distinction between confidence
and probability, the meaning and the relevance of confidence intervals in statistical
practice is a somewhat controversial topic.


Information Can Be Ignored In accordance with the preceding explanation, the
interpretation of a confidence coefficient $\gamma$ for a confidence interval is as follows:
Before a sample is taken, there is probability $\gamma$ that the interval that will be constructed
from the sample will include the unknown value of $\theta$. After the sample values are
observed, however, there might be additional information about whether or not the
interval formed from these particular values actually does include $\theta$. How to adjust
the confidence coefficient $\gamma$ in the light of this information is another controversial
topic.

Figure 8.6 The p.d.f. of $\bar{X}_2$ in Example 8.5.11.


Uniforms on an Interval of Length One. Suppose that two observations $X_1$ and $X_2$
are taken at random from the uniform distribution on the interval
$[\theta - \frac{1}{2}, \theta + \frac{1}{2}]$, where the value of $\theta$ is unknown
($-\infty < \theta < \infty$). If we let $Y_1 = \min\{X_1, X_2\}$ and
$Y_2 = \max\{X_1, X_2\}$, then

\[
\begin{aligned}
\Pr(Y_1 < \theta < Y_2) &= \Pr(X_1 < \theta < X_2) + \Pr(X_2 < \theta < X_1) \\
&= \Pr(X_1 < \theta)\Pr(X_2 > \theta) + \Pr(X_2 < \theta)\Pr(X_1 > \theta) \\
&= (1/2)(1/2) + (1/2)(1/2) = 1/2.
\end{aligned}
\tag{8.5.13}
\]


It follows from Eq. (8.5.13) that $(Y_1, Y_2)$ is a confidence interval for $\theta$ with confidence
coefficient 1/2. However, the analysis can be carried further.

Since both observations $X_1$ and $X_2$ must be at least $\theta - (1/2)$, and both must be
at most $\theta + (1/2)$, we know with certainty that $Y_1 \ge \theta - (1/2)$ and $Y_2 \le \theta + (1/2)$. In
other words, we know with certainty that

\[
Y_2 - (1/2) \le \theta \le Y_1 + (1/2). \tag{8.5.14}
\]

Suppose now that $Y_1 = y_1$ and $Y_2 = y_2$ are observed such that $(y_2 - y_1) > 1/2$. Then
$y_1 < y_2 - (1/2)$, and it follows from Eq. (8.5.14) that $y_1 < \theta$. Moreover, because
$y_1 + (1/2) < y_2$, it also follows from Eq. (8.5.14) that $\theta < y_2$. Thus, if $(y_2 - y_1) > 1/2$,
then $y_1 < \theta < y_2$. In other words, if $(y_2 - y_1) > 1/2$, then we know with certainty that
the observed value $(y_1, y_2)$ of the confidence interval includes the unknown value of
$\theta$, even though the confidence coefficient of this interval is only 1/2.

Indeed, even when $(y_2 - y_1) \le 1/2$, the closer the value of $(y_2 - y_1)$ is to 1/2, the
more certain we feel that the interval $(y_1, y_2)$ includes $\theta$. Also, the closer the value
of $(y_2 - y_1)$ is to 0, the more certain we feel that the interval $(y_1, y_2)$ does not include
$\theta$. However, the confidence coefficient necessarily remains 1/2 and does not depend
on the observed values $y_1$ and $y_2$.
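The argument above can be checked numerically. The following is a minimal simulation sketch, not part of the original example: the function name `simulate_coverage`, the choice $\theta = 5$, the seed, and the number of trials are all illustrative assumptions. It estimates the overall coverage of $(Y_1, Y_2)$ and the coverage among samples with $y_2 - y_1 > 1/2$.

```python
import random

def simulate_coverage(theta=5.0, n_trials=200_000, seed=0):
    """Estimate how often (Y1, Y2) contains theta, overall and when Y2 - Y1 > 1/2.

    theta, n_trials, and seed are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    covered = 0        # samples whose interval contains theta
    wide = 0           # samples with y2 - y1 > 1/2
    wide_covered = 0   # wide samples whose interval contains theta
    for _ in range(n_trials):
        x1 = rng.uniform(theta - 0.5, theta + 0.5)
        x2 = rng.uniform(theta - 0.5, theta + 0.5)
        y1, y2 = min(x1, x2), max(x1, x2)
        hit = y1 < theta < y2
        covered += hit
        if y2 - y1 > 0.5:
            wide += 1
            wide_covered += hit
    return covered / n_trials, wide_covered / wide

overall, given_wide = simulate_coverage()
print(f"overall coverage:              {overall:.3f}")
print(f"coverage when y2 - y1 > 1/2:   {given_wide:.3f}")
```

The overall frequency should be near the confidence coefficient 1/2, while every sample with $y_2 - y_1 > 1/2$ covers $\theta$, as Eq. (8.5.14) requires.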

This example also helps to illustrate the statement of caution made at the end of
Sec. 8.1. In this problem, it might seem natural to estimate $\theta$ by
$\bar{X}_2 = 0.5(X_1 + X_2)$. Using the methods of Sec. 3.9, we can find the p.d.f. of $\bar{X}_2$:

\[
g(x) = \begin{cases}
4x - 4\theta + 2 & \text{if } \theta - \frac{1}{2} < x \le \theta, \\
4\theta - 4x + 2 & \text{if } \theta < x < \theta + \frac{1}{2}, \\
0 & \text{otherwise.}
\end{cases}
\]

Figure 8.6 shows the p.d.f. $g$, which is triangular. This makes it fairly simple to compute
the probability that $\bar{X}_2$ is close to $\theta$:

\[
\Pr(|\bar{X}_2 - \theta| < c) = 4c(1 - c),
\]

for $0 < c < 1/2$, and the probability is 1 for $c \ge 1/2$. For example, if $c = 0.3$,
$\Pr(|\bar{X}_2 - \theta| < 0.3) = 0.84$. However, the random variable $Z = Y_2 - Y_1$ contains useful
information that is not accounted for in this calculation. Indeed, the conditional distribution
of $\bar{X}_2$ given $Z = z$ is uniform on the interval
$[\theta - \frac{1}{2}(1 - z), \theta + \frac{1}{2}(1 - z)]$. We see that
the larger the observed value of $z$, the shorter the range of possible values of $\bar{X}_2$. In
particular, the conditional probability that $\bar{X}_2$ is close to $\theta$ given $Z = z$ is

\[
\Pr(|\bar{X}_2 - \theta| < c \mid Z = z) = \begin{cases}
\dfrac{2c}{1 - z} & \text{if } c \le (1 - z)/2, \\
1 & \text{if } c > (1 - z)/2.
\end{cases}
\tag{8.5.15}
\]

For example, if $z = 0.1$, then $\Pr(|\bar{X}_2 - \theta| < 0.3 \mid Z = 0.1) = 0.6667$, which is quite a bit
smaller than the marginal probability of 0.84. This illustrates why it is not always safe
to assume that our estimate is close to the parameter just because the sampling
distribution of the estimator had high probability of being close. There may be other
information available that suggests to us that the estimate is not as close as the
sampling distribution suggests, or that it is closer than the sampling distribution suggests.
(The reader should calculate $\Pr(|\bar{X}_2 - \theta| < 0.3 \mid Z = 0.9)$ for the other extreme.)
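As a small illustration of the two calculations in this example, the sketch below codes the marginal probability $4c(1 - c)$ and the conditional probability of Eq. (8.5.15) as functions. The function names are hypothetical and are introduced only for this illustration.

```python
def marginal_prob(c):
    """Pr(|Xbar_2 - theta| < c), from the triangular p.d.f. of Xbar_2."""
    if c <= 0:
        return 0.0
    return 4 * c * (1 - c) if c < 0.5 else 1.0

def conditional_prob(c, z):
    """Pr(|Xbar_2 - theta| < c | Z = z) as in Eq. (8.5.15), for 0 <= z < 1."""
    if c <= 0:
        return 0.0
    half_range = (1 - z) / 2   # Xbar_2 is uniform on theta +/- half_range given Z = z
    return 2 * c / (1 - z) if c <= half_range else 1.0

print(round(marginal_prob(0.3), 4))          # marginal probability for c = 0.3
print(round(conditional_prob(0.3, 0.1), 4))  # conditional probability given Z = 0.1
```

Calling `conditional_prob` with other values of `z` shows how strongly the observed range changes the assessment of how close $\bar{X}_2$ is to $\theta$.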


In the next section, we shall discuss Bayesian methods for analyzing a random
sample from a normal distribution for which both the mean $\mu$ and the variance $\sigma^2$
are unknown. We shall assign a joint prior distribution to $\mu$ and $\sigma^2$, and shall then
calculate the posterior probability that $\mu$ belongs to any given interval $(a, b)$. It can
be shown [see, e.g., DeGroot (1970)] that if the joint prior p.d.f. of $\mu$ and $\sigma^2$ is fairly
smooth and does not assign high probability to any particular small set of values of
$\mu$ and $\sigma^2$, and if the sample size $n$ is large, then the confidence coefficient assigned to
a particular confidence interval $(A, B)$ for the mean $\mu$ will be approximately equal
to the posterior probability that $\mu$ lies in the observed interval $(a, b)$. An example
of this approximate equality is included in the next section. Therefore, under these
conditions, the differences between the results obtained by the practical application
of methods based on confidence intervals and methods based on prior probabilities
will be small. Nevertheless, the interpretations of these methods will differ. As an aside,
a Bayesian analysis of Example 8.5.11 will necessarily take into account the extra
information contained in the random variable $Z$. See Exercise 10 for an example.


Summary

Let $X_1, \ldots, X_n$ be a random sample of independent random variables from the
normal distribution with mean $\mu$ and variance $\sigma^2$. Let the observed values be
$x_1, \ldots, x_n$. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ and
$(\sigma')^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2$. The interval
$(\bar{X}_n - c\sigma'/n^{1/2}, \bar{X}_n + c\sigma'/n^{1/2})$ is a coefficient $\gamma$ confidence
interval for $\mu$, where $c$ is the $(1 + \gamma)/2$ quantile of the $t$ distribution with
$n - 1$ degrees of freedom.

1. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with unknown mean $\mu$ and known
variance $\sigma^2$. Let $\Phi$ stand for the c.d.f. of the standard
normal distribution, and let $\Phi^{-1}$ be its inverse. Show that
the following interval is a coefficient $\gamma$ confidence interval
for $\mu$ if $\bar{x}_n$ is the observed average of the data values:

\[
\left( \bar{x}_n - \Phi^{-1}\!\left(\frac{1 + \gamma}{2}\right) \frac{\sigma}{n^{1/2}},\;
\bar{x}_n + \Phi^{-1}\!\left(\frac{1 + \gamma}{2}\right) \frac{\sigma}{n^{1/2}} \right).
\]


2. Suppose that a random sample of eight observations is
taken from the normal distribution with unknown mean $\mu$
and unknown variance $\sigma^2$, and that the observed values
are 3.1, 3.5, 2.6, 3.4, 3.8, 3.0, 2.9, and 2.2. Find the shortest
confidence interval for $\mu$ with each of the following three
confidence coefficients: (a) 0.90, (b) 0.95, and (c) 0.99.

3. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with unknown mean $\mu$ and unknown
variance $\sigma^2$, and let the random variable $L$ denote
the length of the shortest confidence interval for $\mu$ that
can be constructed from the observed values in the sample.
Find the value of $E(L^2)$ for the following values of the
sample size $n$ and the confidence coefficient $\gamma$:

a. $n = 5$, $\gamma = 0.95$
b. $n = 10$, $\gamma = 0.95$
c. $n = 30$, $\gamma = 0.95$
d. $n = 8$, $\gamma = 0.90$
e. $n = 8$, $\gamma = 0.95$
f. $n = 8$, $\gamma = 0.99$

4. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with unknown mean $\mu$ and known
variance $\sigma^2$. How large a random sample must be taken
in order that there will be a confidence interval for $\mu$ with
confidence coefficient 0.95 and length less than $0.01\sigma$?


5. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with unknown mean $\mu$ and unknown
variance $\sigma^2$. Describe a method for constructing a
confidence interval for $\sigma^2$ with a specified confidence
coefficient $\gamma$ ($0 < \gamma < 1$). Hint: Determine constants $c_1$ and
$c_2$ such that

\[
\Pr\left[ c_1 < \frac{\sum_{i=1}^n (X_i - \bar{X}_n)^2}{\sigma^2} < c_2 \right] = \gamma.
\]

6. Suppose that $X_1, \ldots, X_n$ form a random sample from
the exponential distribution with unknown mean $\mu$. Describe
a method for constructing a confidence interval for
$\mu$ with a specified confidence coefficient $\gamma$ ($0 < \gamma < 1$).
Hint: Determine constants $c_1$ and $c_2$ such that

\[
\Pr\left[ c_1 < \frac{1}{\mu} \sum_{i=1}^n X_i < c_2 \right] = \gamma.
\]

7. In the June 1986 issue of Consumer Reports, some data
on the calorie content of beef hot dogs is given. Here are
the numbers of calories in 20 different hot dog brands:

186, 181, 176, 149, 184, 190, 158, 139, 175, 148,
152, 111, 141, 153, 190, 157, 131, 149, 135, 132.

Assume that these numbers are the observed values from
a random sample of twenty independent normal random
variables with mean $\mu$ and variance $\sigma^2$, both unknown.
Find a 90% confidence interval for the mean number of
calories $\mu$.


8. At the end of Example 8.5.11, compute the probability
that $|\bar{X}_2 - \theta| < 0.3$ given $Z = 0.9$. Why is it so large?

9. In the situation of Example 8.5.11, suppose that we
observe $X_1 = 4.7$ and $X_2 = 5.3$.

a. Find the 50% confidence interval described in Example 8.5.11.

b. Find the interval of possible $\theta$ values that are consistent
with the observed data.

c. Is the 50% confidence interval larger or smaller than
the set of possible $\theta$ values?

d. Calculate the value of the random variable $Z = Y_2 - Y_1$
as described in Example 8.5.11.

e. Use Eq. (8.5.15) to compute the conditional probability
that $|\bar{X}_2 - \theta| < 0.1$ given $Z$ equal to the value
computed in part (d).

10. In the situation of Exercise 9, suppose that a prior
distribution is used for $\theta$ with p.d.f. $\xi(\theta) = 0.1 \exp(-0.1\theta)$ for
$\theta > 0$. (This is the exponential distribution with parameter
0.1.)

a. Prove that the posterior p.d.f. of $\theta$ given the data
observed in Exercise 9 is

\[
\xi(\theta \mid x) = \begin{cases}
4.122 \exp(-0.1\theta) & \text{if } 4.8 < \theta < 5.2, \\
0 & \text{otherwise.}
\end{cases}
\]

b. Calculate the posterior probability that $|\theta - \bar{x}_2| < 0.1$,
where $\bar{x}_2$ is the observed average of the data
values.

c. Calculate the posterior probability that $\theta$ is in the
confidence interval found in part (a) of Exercise 9.

d. Can you explain why the answer to part (b) is so close
to the answer to part (e) of Exercise 9? Hint: Compare
the posterior p.d.f. in part (a) to the function in
Eq. (8.5.15).


11. Suppose that $X_1, \ldots, X_n$ form a random sample from
the Bernoulli distribution with parameter $p$. Let $\bar{X}_n$ be
the sample average. Use the variance stabilizing transformation
found in Exercise 5 of Section 6.5 to construct an
approximate coefficient $\gamma$ confidence interval for $p$.

12. Complete the proof of Theorem 8.5.3 by dealing with
the case in which $r(v, x)$ is strictly decreasing in $v$ for each
$x$.

* 8.6 Bayesian Analysis of Samples from a Normal Distribution

When we are interested in constructing a prior distribution for the parameters $\mu$
and $\sigma^2$ of a normal distribution, it is more convenient to work with $\tau = 1/\sigma^2$, called
the precision. A conjugate family of prior distributions is introduced for $\mu$ and $\tau$,
and the posterior distribution is derived. Interval estimates of $\mu$ can be constructed
from the posterior and these are similar to confidence intervals in form, but they
are interpreted differently.


The Precision of a Normal Distribution 


Rain from Seeded Clouds. In Example 8.3.1, we mentioned that it was of interest
whether the mean log-rainfall $\mu$ from seeded clouds exceeded the mean log-rainfall
of unseeded clouds, namely, 4. Although we were able to find an estimator of $\mu$ and we
were able to construct a confidence interval for $\mu$, we have not yet directly addressed
the question of whether or not $\mu > 4$ or how likely it is that $\mu > 4$. If we construct a
joint prior distribution for both $\mu$ and $\sigma^2$, we can then find the posterior distribution
of $\mu$ and finally provide direct answers to these questions.

Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution
with unknown mean $\mu$ and unknown variance $\sigma^2$. In this section, we shall consider
the assignment of a joint prior distribution to the parameters $\mu$ and $\sigma^2$ and study
the posterior distribution that is then derived from the observed values in the sample.
Manipulating prior and posterior distributions for the parameters of a normal
distribution turns out to be simpler if we reparameterize from $\mu$ and $\sigma^2$ to $\mu$ and
$\tau = 1/\sigma^2$.


Precision of a Normal Distribution. The precision $\tau$ of a normal distribution is defined
as the reciprocal of the variance; that is, $\tau = 1/\sigma^2$.

If a random variable has the normal distribution with mean $\mu$ and precision $\tau$,
then its p.d.f. $f(x \mid \mu, \tau)$ is specified as follows, for $-\infty < x < \infty$:

\[
f(x \mid \mu, \tau) = \left( \frac{\tau}{2\pi} \right)^{1/2} \exp\left[ -\frac{1}{2} \tau (x - \mu)^2 \right].
\]

Similarly, if $X_1, \ldots, X_n$ form a random sample from the normal distribution
with mean $\mu$ and precision $\tau$, then their joint p.d.f. $f_n(x \mid \mu, \tau)$ is as follows, for
$-\infty < x_i < \infty$ ($i = 1, \ldots, n$):

\[
f_n(x \mid \mu, \tau) = \left( \frac{\tau}{2\pi} \right)^{n/2} \exp\left[ -\frac{1}{2} \tau \sum_{i=1}^n (x_i - \mu)^2 \right].
\]


A Conjugate Family of Prior Distributions 


We shall now describe a conjugate family of joint prior distributions for $\mu$ and $\tau$.
We shall specify the joint distribution of $\mu$ and $\tau$ by specifying both the conditional
distribution of $\mu$ given $\tau$ and the marginal distribution of $\tau$. In particular, we shall
assume that the conditional distribution of $\mu$ for each given value of $\tau$ is a normal
distribution for which the precision is proportional to the given value of $\tau$, and also
that the marginal distribution of $\tau$ is a gamma distribution. The family of all joint
distributions of this type is a conjugate family of joint prior distributions. If the joint
prior distribution of $\mu$ and $\tau$ belongs to this family, then for every possible set of
observed values in the random sample, the joint posterior distribution of $\mu$ and $\tau$
will also belong to the family. This result is established in Theorem 8.6.1. We shall
use the following notation in the theorem and the remainder of this section:

\[
\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i, \qquad s_n^2 = \sum_{i=1}^n (x_i - \bar{x}_n)^2.
\]


Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with
unknown mean $\mu$ and unknown precision $\tau$ ($-\infty < \mu < \infty$ and $\tau > 0$). Suppose also
that the joint prior distribution of $\mu$ and $\tau$ is as follows: The conditional distribution
of $\mu$ given $\tau$ is the normal distribution with mean $\mu_0$ and precision $\lambda_0\tau$
($-\infty < \mu_0 < \infty$ and $\lambda_0 > 0$), and the marginal distribution of $\tau$ is the gamma distribution with
parameters $\alpha_0$ and $\beta_0$ ($\alpha_0 > 0$ and $\beta_0 > 0$). Then the joint posterior distribution of $\mu$
and $\tau$, given that $X_i = x_i$ for $i = 1, \ldots, n$, is as follows: The conditional distribution
of $\mu$ given $\tau$ is the normal distribution with mean $\mu_1$ and precision $\lambda_1\tau$, where

\[
\mu_1 = \frac{\lambda_0 \mu_0 + n \bar{x}_n}{\lambda_0 + n} \quad \text{and} \quad \lambda_1 = \lambda_0 + n, \tag{8.6.1}
\]

and the marginal distribution of $\tau$ is the gamma distribution with parameters $\alpha_1$ and
$\beta_1$, where

\[
\alpha_1 = \alpha_0 + \frac{n}{2} \quad \text{and} \quad
\beta_1 = \beta_0 + \frac{1}{2} s_n^2 + \frac{n \lambda_0 (\bar{x}_n - \mu_0)^2}{2(\lambda_0 + n)}. \tag{8.6.2}
\]


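Proof The joint prior p.d.f. $\xi(\mu, \tau)$ of $\mu$ and $\tau$ can be found by multiplying the
conditional p.d.f. $\xi_1(\mu \mid \tau)$ of $\mu$ given $\tau$ by the marginal p.d.f. $\xi_2(\tau)$ of $\tau$. By the
conditions of the theorem, we have, for $-\infty < \mu < \infty$ and $\tau > 0$,

\[
\xi_1(\mu \mid \tau) \propto \tau^{1/2} \exp\left[ -\frac{1}{2} \lambda_0 \tau (\mu - \mu_0)^2 \right]
\]

and

\[
\xi_2(\tau) \propto \tau^{\alpha_0 - 1} e^{-\beta_0 \tau}.
\]

A constant factor involving neither $\mu$ nor $\tau$ has been dropped from the right side of
each of these relations.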
The joint posterior p.d.f. $\xi(\mu, \tau \mid x)$ for $\mu$ and $\tau$ satisfies the relation

\[
\begin{aligned}
\xi(\mu, \tau \mid x) &\propto f_n(x \mid \mu, \tau)\, \xi_1(\mu \mid \tau)\, \xi_2(\tau) \\
&\propto \tau^{\alpha_0 + (n+1)/2 - 1} \exp\left[ -\frac{\tau}{2} \left( \lambda_0 [\mu - \mu_0]^2 + \sum_{i=1}^n (x_i - \mu)^2 \right) - \beta_0 \tau \right].
\end{aligned}
\tag{8.6.3}
\]

By adding and subtracting $\bar{x}_n$ inside the $(x_i - \mu)^2$ terms, we can prove that

\[
\sum_{i=1}^n (x_i - \mu)^2 = s_n^2 + n(\bar{x}_n - \mu)^2. \tag{8.6.4}
\]

Next, combine the last term in Eq. (8.6.4) with the term $\lambda_0(\mu - \mu_0)^2$ in (8.6.3) by
completing the square (see Exercise 24 in Sec. 5.6) to get

\[
n(\bar{x}_n - \mu)^2 + \lambda_0(\mu - \mu_0)^2 = (\lambda_0 + n)(\mu - \mu_1)^2 + \frac{n \lambda_0 (\bar{x}_n - \mu_0)^2}{\lambda_0 + n}, \tag{8.6.5}
\]

where $\mu_1$ is defined in Eq. (8.6.1). Combining (8.6.4) with (8.6.5) yields

\[
\sum_{i=1}^n (x_i - \mu)^2 + \lambda_0(\mu - \mu_0)^2 = (\lambda_0 + n)(\mu - \mu_1)^2 + s_n^2 + \frac{n \lambda_0 (\bar{x}_n - \mu_0)^2}{\lambda_0 + n}. \tag{8.6.6}
\]

Using (8.6.2) and $\lambda_1 = \lambda_0 + n$ together with (8.6.6) allows us to write Eq. (8.6.3) in
the form

\[
\xi(\mu, \tau \mid x) \propto \left\{ \tau^{1/2} \exp\left[ -\frac{1}{2} \lambda_1 \tau (\mu - \mu_1)^2 \right] \right\} \left( \tau^{\alpha_1 - 1} e^{-\beta_1 \tau} \right), \tag{8.6.7}
\]

where $\lambda_1$, $\alpha_1$, and $\beta_1$ are defined by Eqs. (8.6.1) and (8.6.2).

When the expression inside the braces on the right side of Eq. (8.6.7) is regarded
as a function of $\mu$ for a fixed value of $\tau$, this expression can be recognized as
being (except for a factor that depends on neither $\mu$ nor $\tau$) the p.d.f. of the normal
distribution with mean $\mu_1$ and precision $\lambda_1\tau$. Since the variable $\mu$ does not appear
elsewhere on the right side of Eq. (8.6.7), it follows that this p.d.f. must be the
conditional posterior p.d.f. of $\mu$ given $\tau$. It now follows in turn that the expression
outside the braces on the right side of Eq. (8.6.7) must be proportional to the marginal
posterior p.d.f. of $\tau$. This expression can be recognized as being (except for a constant
factor) the p.d.f. of the gamma distribution with parameters $\alpha_1$ and $\beta_1$. Hence, the
joint posterior distribution of $\mu$ and $\tau$ is as specified in the theorem.
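The completing-the-square step in the proof, Eq. (8.6.6), can be spot-checked numerically. The sketch below is only a sanity check under arbitrary assumptions: the sample, the values $\mu_0 = 0.3$ and $\lambda_0 = 2.5$, and the function names are all made up for illustration.

```python
import random

def left_side(xs, mu, mu0, lam0):
    """sum (x_i - mu)^2 + lam0 * (mu - mu0)^2, the left side of Eq. (8.6.6)."""
    return sum((x - mu) ** 2 for x in xs) + lam0 * (mu - mu0) ** 2

def right_side(xs, mu, mu0, lam0):
    """(lam0 + n)(mu - mu1)^2 + s_n^2 + n*lam0*(xbar - mu0)^2 / (lam0 + n)."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs)
    mu1 = (lam0 * mu0 + n * xbar) / (lam0 + n)   # Eq. (8.6.1)
    return (lam0 + n) * (mu - mu1) ** 2 + s2 + n * lam0 * (xbar - mu0) ** 2 / (lam0 + n)

rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(7)]     # arbitrary illustrative sample
max_gap = max(
    abs(left_side(xs, mu, 0.3, 2.5) - right_side(xs, mu, 0.3, 2.5))
    for mu in (-2.0, -0.5, 0.0, 1.2, 4.0)
)
print(max_gap)  # should be zero up to floating-point rounding
```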


We shall give a name to the family of joint distributions described in Theorem 8.6.1.

Normal-Gamma Family of Distributions. Let $\mu$ and $\tau$ be random variables. Suppose
that the conditional distribution of $\mu$ given $\tau$ is the normal distribution with mean
$\mu_0$ and precision $\lambda_0\tau$. Suppose also that the marginal distribution of $\tau$ is the gamma
distribution with parameters $\alpha_0$ and $\beta_0$. Then we say that the joint distribution of $\mu$
and $\tau$ is the normal-gamma distribution with hyperparameters $\mu_0$, $\lambda_0$, $\alpha_0$, and $\beta_0$.

The prior distribution in Theorem 8.6.1 is the normal-gamma distribution with
hyperparameters $\mu_0$, $\lambda_0$, $\alpha_0$, and $\beta_0$. The posterior distribution derived in that theorem
is the normal-gamma distribution with hyperparameters $\mu_1$, $\lambda_1$, $\alpha_1$, and $\beta_1$. As in
Sec. 7.3, we shall refer to the hyperparameters of the prior distribution as prior
hyperparameters, and we shall refer to the hyperparameters of the posterior distribution
as posterior hyperparameters.

By choosing appropriate values of the prior hyperparameters, it is usually possible
in a particular problem to find a normal-gamma distribution that approximates
an experimenter's actual prior distribution of $\mu$ and $\tau$ sufficiently well. It should be
emphasized, however, that if the joint distribution of $\mu$ and $\tau$ is a normal-gamma
distribution, then $\mu$ and $\tau$ are not independent. Thus, it is not possible to use a
normal-gamma distribution as a joint prior distribution of $\mu$ and $\tau$ in a problem in which the
experimenter wishes $\mu$ and $\tau$ to be independent in the prior. Although this characteristic
of the family of normal-gamma distributions is a deficiency, it is not an important
deficiency, because of the following fact: Even if a joint prior distribution under which
$\mu$ and $\tau$ are independent is chosen from outside the conjugate family, it will be found
that after just a single value of $X$ has been observed, $\mu$ and $\tau$ will have a posterior
distribution under which they are dependent. In other words, it is not possible for $\mu$
and $\tau$ to remain independent in the light of even one observation from the underlying
normal distribution.


Acid Concentration in Cheese. Consider again the example of lactic acid concentration
in cheese as discussed in Example 8.5.4. Suppose that the concentrations are
independent normal random variables with mean $\mu$ and precision $\tau$. Suppose that
the prior opinion of the experimenters could be expressed as a normal-gamma
distribution with hyperparameters $\mu_0 = 1$, $\lambda_0 = 1$, $\alpha_0 = 0.5$, and $\beta_0 = 0.5$. We can use the
data on page 487 to find the posterior distribution of $\mu$ and $\tau$. In this case, $n = 10$,
$\bar{x}_n = 1.379$, and $s_n^2 = 0.9663$. Applying the formulas in Theorem 8.6.1, we get

\[
\mu_1 = \frac{1 \times 1 + 10 \times 1.379}{1 + 10} = 1.345, \qquad
\lambda_1 = 1 + 10 = 11, \qquad
\alpha_1 = 0.5 + \frac{10}{2} = 5.5,
\]

\[
\beta_1 = 0.5 + \frac{1}{2}\, 0.9663 + \frac{10 \times 1 \times (1.379 - 1)^2}{2(1 + 10)} = 1.0484.
\]

So, the posterior distribution of $\mu$ and $\tau$ is the normal-gamma distribution with these
four hyperparameters. In particular, we can now address the issue of variation in
lactic acid concentration more directly. For example, we can compute the posterior
probability that $\sigma = \tau^{-1/2}$ is larger than some value such as 0.3:

\[
\Pr(\sigma > 0.3 \mid x) = \Pr(\tau < 11.11 \mid x) = 0.984.
\]

This can be found using any computer program that calculates the c.d.f. of a gamma
distribution. Alternatively, we can use the relationship between the gamma and $\chi^2$
distributions that allows us to say that the posterior distribution of $U = 2 \times 1.0484 \times \tau$
is the $\chi^2$ distribution with $2 \times 5.5 = 11$ degrees of freedom. (See Exercise 1 in Sec. 5.7.)
Then $\Pr(\tau < 11.11 \mid x) = \Pr(U < 23.30 \mid x) \approx 0.982$ by interpolating in the table of the
$\chi^2$ distributions in the back of the book. If $\sigma > 0.3$ is considered a large standard
deviation, the cheese manufacturer might wish to look into better quality-control
measures.
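The calculations in this example can be reproduced with a short program. The following is a sketch only: the function names are hypothetical, and the gamma c.d.f. is computed with a standard power series for the incomplete gamma function (an implementation choice made here so that no statistical package is needed; it is not the method used in the text). The gamma distribution is parameterized by shape $\alpha$ and rate $\beta$, so that its mean is $\alpha/\beta$.

```python
import math

def posterior_hyperparameters(mu0, lam0, alpha0, beta0, n, xbar, s2):
    """Posterior normal-gamma hyperparameters from Eqs. (8.6.1) and (8.6.2)."""
    lam1 = lam0 + n
    mu1 = (lam0 * mu0 + n * xbar) / lam1
    alpha1 = alpha0 + n / 2
    beta1 = beta0 + s2 / 2 + n * lam0 * (xbar - mu0) ** 2 / (2 * lam1)
    return mu1, lam1, alpha1, beta1

def gamma_cdf(x, alpha, beta):
    """Pr(T <= x) for T gamma with shape alpha and rate beta (series expansion)."""
    if x <= 0:
        return 0.0
    z = beta * x
    term = 1.0 / alpha            # first term of sum_k z^k / (alpha (alpha+1) ... (alpha+k))
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (alpha + k)
        total += term
        k += 1
    return total * math.exp(alpha * math.log(z) - z - math.lgamma(alpha))

# Prior hyperparameters and summary statistics quoted in the example.
mu1, lam1, alpha1, beta1 = posterior_hyperparameters(1.0, 1.0, 0.5, 0.5, 10, 1.379, 0.9663)
prob_sigma_large = gamma_cdf(1 / 0.3 ** 2, alpha1, beta1)   # Pr(sigma > 0.3 | x)

print(round(mu1, 3), lam1, alpha1, round(beta1, 4))
print(round(prob_sigma_large, 3))
```

The printed hyperparameters should match those computed above, and the probability should agree with the value 0.984 reported in the example rather than with the interpolated table value.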


The Marginal Distribution of the Mean 


When the joint distribution of $\mu$ and $\tau$ is a normal-gamma distribution of the type
described in Theorem 8.6.1, then the conditional distribution of $\mu$ for a given value of
$\tau$ is a normal distribution and the marginal distribution of $\tau$ is a gamma distribution.
It is not clear from this specification, however, what the marginal distribution of $\mu$
will be. We shall now derive this marginal distribution.

Marginal Distribution of the Mean. Suppose that the prior distribution of $\mu$ and $\tau$ is
the normal-gamma distribution with hyperparameters $\mu_0$, $\lambda_0$, $\alpha_0$, and $\beta_0$. Then the
marginal distribution of $\mu$ is related to a $t$ distribution in the following way:

\[
\left( \frac{\lambda_0 \alpha_0}{\beta_0} \right)^{1/2} (\mu - \mu_0)
\]

has the $t$ distribution with $2\alpha_0$ degrees of freedom.


Proof Since the conditional distribution of $\mu$ given $\tau$ is the normal distribution
with mean $\mu_0$ and variance $(\lambda_0\tau)^{-1}$, we can use Theorem 5.6.4 to conclude that
the conditional distribution of $Z = (\lambda_0\tau)^{1/2}(\mu - \mu_0)$ given $\tau$ is the standard normal
distribution. We shall continue to let $\xi_2(\tau)$ be the marginal p.d.f. of $\tau$, and let $\xi_1(\mu \mid \tau)$
be the conditional p.d.f. of $\mu$ given $\tau$. Then the joint p.d.f. of $Z$ and $\tau$ is

\[
f(z, \tau) = (\lambda_0\tau)^{-1/2}\, \xi_1\!\left( (\lambda_0\tau)^{-1/2} z + \mu_0 \mid \tau \right) \xi_2(\tau) = \phi(z)\, \xi_2(\tau), \tag{8.6.8}
\]

where $\phi$ is the standard normal p.d.f. of Eq. (5.6.6). We see from Eq. (8.6.8) that
$Z$ and $\tau$ are independent with $Z$ having the standard normal distribution. Next, let
$Y = 2\beta_0\tau$. Using the result of Exercise 1 in Sec. 5.7, we find that the distribution of
$Y$ is the gamma distribution with parameters $\alpha_0$ and 1/2, which is also known as the
$\chi^2$ distribution with $2\alpha_0$ degrees of freedom. In summary, $Y$ and $Z$ are independent
with $Z$ having the standard normal distribution and $Y$ having the $\chi^2$ distribution with
$2\alpha_0$ degrees of freedom. It follows from the definition of the $t$ distributions in Sec. 8.4
that

\[
U = \frac{Z}{\left( \dfrac{Y}{2\alpha_0} \right)^{1/2}}
= \frac{(\lambda_0\tau)^{1/2} (\mu - \mu_0)}{\left( \dfrac{2\beta_0\tau}{2\alpha_0} \right)^{1/2}}
= \left( \frac{\lambda_0 \alpha_0}{\beta_0} \right)^{1/2} (\mu - \mu_0) \tag{8.6.9}
\]

has the $t$ distribution with $2\alpha_0$ degrees of freedom.


Theorem 8.6.2 can also be used to find the posterior distribution of $\mu$ after data
are observed. To do that, just replace $\mu_0$ by $\mu_1$, $\lambda_0$ by $\lambda_1$, $\alpha_0$ by $\alpha_1$, and $\beta_0$ by $\beta_1$
in the statement of the theorem. The reason for this is that the prior and posterior
distributions both have the same form, and the theorem depends only on that form.
This same reasoning applies to the discussion that follows, including Theorem 8.6.3.

An alternative way to describe the marginal distribution of $\mu$ starts by rewriting
(8.6.9) as

\[
\mu = \left( \frac{\beta_0}{\lambda_0 \alpha_0} \right)^{1/2} U + \mu_0. \tag{8.6.10}
\]

Now we see that the distribution of $\mu$ can be obtained from a $t$ distribution by
translating the $t$ distribution so that it is centered at $\mu_0$ rather than at 0, and also
changing the scale factor. This makes it straightforward to find the moments (if they
exist) of the distribution of $\mu$.


Suppose that $\mu$ and $\tau$ have the joint normal-gamma distribution with hyperparameters
$\mu_0$, $\lambda_0$, $\alpha_0$, and $\beta_0$. If $\alpha_0 > 1/2$, then $E(\mu) = \mu_0$. If $\alpha_0 > 1$, then

\[
\mathrm{Var}(\mu) = \frac{\beta_0}{\lambda_0(\alpha_0 - 1)}. \tag{8.6.11}
\]

Proof The mean and the variance of the marginal distribution of $\mu$ can easily be
obtained from the mean and the variance of the $t$ distributions that are given in
Sec. 8.4. Since $U$ in Eq. (8.6.9) has the $t$ distribution with $2\alpha_0$ degrees of freedom, it
follows from Section 8.4 that $E(U) = 0$ if $\alpha_0 > 1/2$ and that $\mathrm{Var}(U) = \alpha_0/(\alpha_0 - 1)$ if
$\alpha_0 > 1$. Now use Eq. (8.6.10) to see that if $\alpha_0 > 1/2$, then $E(\mu) = \mu_0$. Also, if $\alpha_0 > 1$,
then

\[
\mathrm{Var}(\mu) = \left( \frac{\beta_0}{\lambda_0 \alpha_0} \right) \mathrm{Var}(U).
\]

Eq. (8.6.11) now follows directly.
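The moments in Theorem 8.6.3 can also be checked by simulating from a normal-gamma distribution. The sketch below is illustrative only: the hyperparameter values, the seed, and the function name are assumptions made for this check. Note that Python's `random.gammavariate` takes a shape and a scale, so the rate $\beta_0$ used in this section enters as `1 / beta0`.

```python
import random

def sample_normal_gamma(mu0, lam0, alpha0, beta0, rng):
    """Draw (mu, tau): tau ~ gamma(alpha0, rate beta0), mu | tau ~ normal(mu0, precision lam0*tau)."""
    tau = rng.gammavariate(alpha0, 1.0 / beta0)        # second argument is a scale = 1/rate
    mu = rng.gauss(mu0, (lam0 * tau) ** -0.5)          # standard deviation = precision^(-1/2)
    return mu, tau

# Illustrative hyperparameters (alpha0 > 1, so the variance of mu exists).
mu0, lam0, alpha0, beta0 = 1.0, 2.0, 5.0, 4.0
rng = random.Random(42)
draws = [sample_normal_gamma(mu0, lam0, alpha0, beta0, rng)[0] for _ in range(200_000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((m - sample_mean) ** 2 for m in draws) / (len(draws) - 1)
theory_var = beta0 / (lam0 * (alpha0 - 1))             # Eq. (8.6.11)

print(f"sample mean {sample_mean:.3f} vs mu0 {mu0}")
print(f"sample var  {sample_var:.3f} vs formula {theory_var:.3f}")
```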
Furthermore, the probability that $\mu$ lies in any specified interval can, in principle,
be obtained from a table of the $t$ distribution or appropriate software. Most statistical
packages include functions that can compute the c.d.f. and the quantile function of
a $t$ distribution with arbitrary degrees of freedom, not just integers. Tables typically
deal solely with integer degrees of freedom. If necessary, one can interpolate between
adjacent degrees of freedom.

As we pointed out already, we can change the prior hyperparameters to posterior
hyperparameters in Theorems 8.6.2 and 8.6.3 and translate them into results
concerning the posterior marginal distribution of $\mu$. In particular, the posterior
distribution of the following random variable is the $t$ distribution with $2\alpha_1$ degrees of
freedom:

\[
\left( \frac{\lambda_1 \alpha_1}{\beta_1} \right)^{1/2} (\mu - \mu_1). \tag{8.6.12}
\]


A Numerical Example 


Nursing Homes in New Mexico. In 1988, the New Mexico Department of Health
and Social Services recorded information from many of its licensed nursing homes.
The data were analyzed by Smith, Piland, and Fisher (1992). In this example, we
shall consider the annual medical in-patient days $X$ (measured in hundreds) for a
sample of 18 nonrural nursing homes. Prior to observing the data, we shall model
the value of $X$ for each nursing home as a normal random variable with mean $\mu$ and
precision $\tau$. To choose a prior mean and variance for $\mu$ and $\tau$, we could speak with
experts in the field, but for simplicity, we shall just base these on some additional
information we have about the numbers of beds in these nursing homes. There are,
on average, 111 beds with a sample standard deviation of 43.5 beds. Suppose that
our prior opinion is that there is a 50 percent occupancy rate. Then we can naively
scale up the mean and standard deviation by a factor of $0.5 \times 365$ to obtain a prior
mean and standard deviation for the number of in-patient days in a year. In units of
hundreds of in-patient days per year, this gives us a mean of $0.5 \times 365 \times 1.11 \approx 200$
and a standard deviation of $0.5 \times 365 \times 0.435 \approx 6300^{1/2}$. To map these values into
prior hyperparameters, we shall split the variance of 6300 so that half of it is due to
variance between the nursing homes and half is the variance of $\mu$. That is, we shall set
$\mathrm{Var}(\mu) = 3150$ and $E(\tau) = 1/3150$. We choose $\alpha_0 = 2$ to reflect only a small amount of
prior information. Then, since $E(\tau) = \alpha_0/\beta_0$, we find that $\beta_0 = 6300$. Using $E(\mu) = \mu_0$
and (8.6.11), we get $\mu_0 = 200$ and $\lambda_0 = 2$.

Next, we shall determine an interval for $\mu$ centered at the point $\mu_0 = 200$ such
that the probability that $\mu$ lies in this interval is 0.95. Since the random variable $U$
defined by Eq. (8.6.9) has the $t$ distribution with $2\alpha_0$ degrees of freedom, it follows
that, for the numerical values just obtained, the random variable $0.025(\mu - 200)$ has
the $t$ distribution with four degrees of freedom. The table of the $t$ distribution gives
the 0.975 quantile of the $t$ distribution with four degrees of freedom as 2.776. So,

\[
\Pr[-2.776 < 0.025(\mu - 200) < 2.776] = 0.95. \tag{8.6.13}
\]

An equivalent statement is that

\[
\Pr(89 < \mu < 311) = 0.95. \tag{8.6.14}
\]

Thus, under the prior distribution assigned to $\mu$ and $\tau$, there is probability 0.95 that
$\mu$ lies in the interval (89, 311).
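The mapping from the prior opinions to hyperparameters, and the resulting prior interval, can be traced step by step in code. This is a sketch with illustrative variable names; the $t$ quantile 2.776 is taken from the text rather than computed.

```python
import math

# Prior opinions stated in the example.
mu0 = 200.0
var_mu = 3150.0            # desired prior variance of mu
mean_tau = 1 / 3150.0      # desired prior mean of tau
alpha0 = 2.0               # chosen to reflect little prior information

beta0 = alpha0 / mean_tau                     # from E(tau) = alpha0 / beta0
lam0 = beta0 / ((alpha0 - 1) * var_mu)        # from Eq. (8.6.11)

# Prior 0.95 interval for mu from Eq. (8.6.10): mu = scale * U + mu0.
scale = math.sqrt(beta0 / (lam0 * alpha0))
t_quantile = 2.776                            # 0.975 quantile, 4 degrees of freedom (from the text)
lower = mu0 - t_quantile * scale
upper = mu0 + t_quantile * scale

print(beta0, lam0)
print(round(lower, 1), round(upper, 1))
```

Because the text rounds the factor $(\lambda_0\alpha_0/\beta_0)^{1/2}$ to 0.025, the endpoints computed without rounding differ slightly from (8.6.14), by less than one unit on each side.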

Suppose now that the following is our sample of 18 observed numbers of medical 
in-patient days (in hundreds): 


128 281 291 238 155 148 154 232 316 96 146 151 100 213 208 157 48 217. 


Figure 8.7 Plots of prior and posterior p.d.f.'s of $\mu$ in Example 8.6.3. The posterior
probability interval (8.6.18) is indicated at the bottom of the graph. The corresponding
prior probability interval (8.6.14) would extend far beyond both sides of the plot.

For these observations, which we denote $x$, $\bar{x}_n = 182.17$ and $s_n^2 = 88678.5$. Then, it
follows from Theorem 8.6.1 that the joint posterior distribution of $\mu$ and $\tau$ is the
normal-gamma distribution with hyperparameters

\[
\mu_1 = 183.95, \quad \lambda_1 = 20, \quad \alpha_1 = 11, \quad \beta_1 = 50925.37. \tag{8.6.15}
\]

Hence, the values of the means and the variances of $\mu$ and $\tau$, as found from this joint
posterior distribution, are

\[
\begin{aligned}
E(\mu \mid x) &= \mu_1 = 183.95, & \mathrm{Var}(\mu \mid x) &= \frac{\beta_1}{\lambda_1(\alpha_1 - 1)} = 254.63, \\
E(\tau \mid x) &= \frac{\alpha_1}{\beta_1} = 2.16 \times 10^{-4}, & \mathrm{Var}(\tau \mid x) &= \frac{\alpha_1}{\beta_1^2} = 4.24 \times 10^{-9}.
\end{aligned}
\tag{8.6.16}
\]

It follows from Eq. (8.6.1) that the mean $\mu_1$ of the posterior distribution of $\mu$ is a
weighted average of $\mu_0$ and $\bar{x}_n$. In this numerical example, it is seen that $\mu_1$ is quite
close to $\bar{x}_n$.

Next, we shall determine the marginal posterior distribution of $\mu$. Let $U$ be
the random variable in Eq. (8.6.12), and use the values computed in (8.6.15). Then
$U = (0.0657)(\mu - 183.95)$, and the posterior distribution of $U$ is the $t$ distribution
with $2\alpha_1 = 22$ degrees of freedom. The 0.975 quantile of this $t$ distribution is 2.074,
so

\[
\Pr(-2.074 < U < 2.074 \mid x) = 0.95. \tag{8.6.17}
\]

An equivalent statement is that

\[
\Pr(152.38 < \mu < 215.52 \mid x) = 0.95. \tag{8.6.18}
\]

In other words, under the posterior distribution of $\mu$ and $\tau$, the probability that $\mu$
lies in the interval (152.38, 215.52) is 0.95.
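The posterior quantities in this example can be recomputed directly from the 18 observations. The sketch below uses illustrative variable names and takes the $t$ quantile 2.074 from the text. Since it carries $\bar{x}_n$ to full precision instead of rounding it to 182.17, the last digits of $\beta_1$ and of the interval endpoints may differ slightly from the values printed above.

```python
import math

data = [128, 281, 291, 238, 155, 148, 154, 232, 316,
        96, 146, 151, 100, 213, 208, 157, 48, 217]
n = len(data)
xbar = sum(data) / n
s2 = sum((x - xbar) ** 2 for x in data)        # s_n^2, the sum of squared deviations

# Prior hyperparameters chosen in the example.
mu0, lam0, alpha0, beta0 = 200.0, 2.0, 2.0, 6300.0

# Posterior hyperparameters, Eqs. (8.6.1) and (8.6.2).
lam1 = lam0 + n
mu1 = (lam0 * mu0 + n * xbar) / lam1
alpha1 = alpha0 + n / 2
beta1 = beta0 + s2 / 2 + n * lam0 * (xbar - mu0) ** 2 / (2 * lam1)

# Posterior moments, as in (8.6.16).
var_mu = beta1 / (lam1 * (alpha1 - 1))
mean_tau = alpha1 / beta1
var_tau = alpha1 / beta1 ** 2

# Posterior 0.95 interval for mu based on Eq. (8.6.12).
t_quantile = 2.074                              # 0.975 quantile, 22 degrees of freedom (from the text)
half_width = t_quantile * math.sqrt(beta1 / (lam1 * alpha1))
lower, upper = mu1 - half_width, mu1 + half_width

print(round(mu1, 2), lam1, alpha1, round(beta1, 2))
print(round(var_mu, 2), mean_tau, var_tau)
print(round(lower, 2), round(upper, 2))
```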

It should be noted that the interval in Eq. (8.6.18) determined from the posterior
distribution of $\mu$ is much shorter than the interval in Eq. (8.6.14) determined from
the prior distribution. This result reflects the fact that the posterior distribution of
$\mu$ is much more concentrated around its mean than was the prior distribution. The
variance of the prior distribution of $\mu$ was 3150, and the variance of the posterior
distribution is 254.63. Graphs of the prior and posterior p.d.f.'s of $\mu$ are in Fig. 8.7
together with the posterior interval (8.6.18).

Comparison with Confidence Intervals Continue using the nursing home data
from Example 8.6.3. We shall now construct a confidence interval for $\mu$ with
confidence coefficient 0.95 and compare this interval with the interval in Eq. (8.6.18) for
which the posterior probability is 0.95. Since the sample size $n$ in Example 8.6.3 is
18, the random variable $U$ defined by Eq. (8.4.4) on page 481 has the $t$ distribution
with 17 degrees of freedom. The 0.975 quantile of this $t$ distribution is 2.110. It now
follows from Theorem 8.5.1 that the endpoints of a confidence interval for $\mu$ with
confidence coefficient 0.95 will be

\[
A = \bar{X}_n - 2.110 \frac{\sigma'}{n^{1/2}}, \qquad
B = \bar{X}_n + 2.110 \frac{\sigma'}{n^{1/2}}.
\]

When the observed values of $\bar{x}_n = 182.17$ and $s_n^2 = 88678.5$ are used here, we
get $\sigma' = (88678.5/17)^{1/2} = 72.22$. The observed confidence interval for $\mu$ is then
(146.25, 218.09).

This interval is close to the interval (152.38, 215.52) in Eq. (8.6.18), for which
the posterior probability is 0.95. The similarity of the two intervals illustrates the
statement made at the end of Sec. 8.5. That is, in many problems involving the normal
distribution, the method of confidence intervals and the method of using posterior
probabilities yield similar results, even though the interpretations of the two methods
are quite different.
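For comparison, the observed confidence interval can be computed from the same data in a few lines. This is a sketch with illustrative variable names; the $t$ quantile 2.110 is the value quoted in the text for 17 degrees of freedom.

```python
import math

data = [128, 281, 291, 238, 155, 148, 154, 232, 316,
        96, 146, 151, 100, 213, 208, 157, 48, 217]
n = len(data)
xbar = sum(data) / n
s2 = sum((x - xbar) ** 2 for x in data)
sigma_prime = math.sqrt(s2 / (n - 1))          # sigma' = (s_n^2 / (n - 1))^(1/2)

t_quantile = 2.110                              # 0.975 quantile, 17 degrees of freedom (from the text)
half_width = t_quantile * sigma_prime / math.sqrt(n)
a, b = xbar - half_width, xbar + half_width

print(round(sigma_prime, 2))
print(round(a, 2), round(b, 2))
```

The endpoints are a little farther from the center than those of the posterior interval (8.6.18), but the two intervals overlap almost entirely.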


Improper Prior Distributions 


As we discussed at the end of Sec. 7.3 on page 402, it is often convenient to use
improper priors that are not real distributions, but do lead to posteriors that are
real distributions. These improper priors are chosen more for convenience than to
represent anyone's beliefs. When there is a sizeable amount of data, the posterior
distribution that results from use of an improper prior is often very close to one
that would result from a proper prior distribution. For the case that we have been
considering in this section, we can combine the improper prior that we introduced for
a location parameter like $\mu$ together with the improper prior for a scale parameter
like $\sigma = \tau^{-1/2}$ into the usual improper prior for $\mu$ and $\tau$. The typical improper prior
"p.d.f." for a location parameter was found (in Example 7.3.15) to be the constant
function $\xi_1(\mu) = 1$. The typical improper prior "p.d.f." for a scale parameter $\sigma$ is
$g(\sigma) = 1/\sigma$. Since $\sigma = \tau^{-1/2}$, we can apply the techniques of Sec. 3.8 to find the
improper "p.d.f." of $\tau = \sigma^{-2}$. The derivative of the inverse function is $-\frac{1}{2}\tau^{-3/2}$, so
the improper "p.d.f." of $\tau$ would be

\[
\left| -\frac{1}{2} \tau^{-3/2} \right| g\!\left( 1/\tau^{1/2} \right) = \frac{1}{2} \tau^{-1},
\]

for $\tau > 0$. Since this function has infinite integral, we shall drop the factor 1/2 and set
$\xi_2(\tau) = \tau^{-1}$. If we act as if $\mu$ and $\tau$ were independent, then the joint improper prior
"p.d.f." for $\mu$ and $\tau$ is

\[
\xi(\mu, \tau) = \frac{1}{\tau}, \quad \text{for } -\infty < \mu < \infty,\ \tau > 0.
\]


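If we were to pretend as if this function were a p.d.f., the posterior p.d.f. $\xi(\mu, \tau \mid x)$ would be proportional to

\[ \begin{aligned} \xi(\mu, \tau) f_n(x \mid \mu, \tau) &\propto \tau^{-1}\tau^{n/2} \exp\left(-\frac{\tau}{2}s_n^2 - \frac{n\tau}{2}(\mu - \overline{x}_n)^2\right) \\ &= \left\{\tau^{1/2} \exp\left[-\frac{n\tau}{2}(\mu - \overline{x}_n)^2\right]\right\} \tau^{(n-1)/2 - 1} \exp\left[-\frac{\tau s_n^2}{2}\right]. \end{aligned} \tag{8.6.19} \]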


When the expression inside the braces on the far right side of (8.6.19) is regarded as a function of $\mu$ for fixed value of $\tau$, this expression can be recognized as being (except for a factor that depends on neither $\mu$ nor $\tau$) the p.d.f. of the normal distribution with mean $\overline{x}_n$ and precision $n\tau$. Since the variable $\mu$ does not appear elsewhere, it follows that this p.d.f. must be the conditional posterior p.d.f. of $\mu$ given $\tau$. It now follows in turn that the expression outside the braces on the far right side of (8.6.19) must be proportional to the marginal posterior p.d.f. of $\tau$. This expression can be recognized as being (except for a constant factor) the p.d.f. of the gamma distribution with parameters $(n - 1)/2$ and $s_n^2/2$. This joint distribution would be in precisely the same form as the distribution in Theorem 8.6.1 if our prior distribution had been of the normal-gamma form with hyperparameters $\mu_0 = \beta_0 = \lambda_0 = 0$ and $\alpha_0 = -1/2$. That is, if we pretend as if $\mu_0 = \beta_0 = \lambda_0 = 0$ and $\alpha_0 = -1/2$, and then we apply Theorem 8.6.1, we get the posterior hyperparameters $\mu_1 = \overline{x}_n$, $\lambda_1 = n$, $\alpha_1 = (n - 1)/2$, and $\beta_1 = s_n^2/2$.

There is no probability distribution in the normal-gamma family with $\mu_0 = \beta_0 = \lambda_0 = 0$ and $\alpha_0 = -1/2$; however, if we pretend as if this were our prior, then we are said to be using the usual improper prior distribution. Notice that the posterior distribution of $\mu$ and $\tau$ is a real member of the normal-gamma family so long as $n \ge 2$.
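
The claim that the improper prior behaves like a normal-gamma prior with $\mu_0 = \beta_0 = \lambda_0 = 0$ and $\alpha_0 = -1/2$ can be checked mechanically. The sketch below assumes that Theorem 8.6.1 (not reproduced in this part of the text) gives the standard normal-gamma update written in `normal_gamma_update`; the function names and the numerical inputs are hypothetical and chosen only for illustration.

```python
def normal_gamma_update(mu0, lam0, alpha0, beta0, n, xbar, s2):
    """Posterior hyperparameters (mu1, lam1, alpha1, beta1) for normal data.

    Assumed to match Theorem 8.6.1; s2 is the sum of squared deviations
    of the observations from the sample mean xbar.
    """
    lam1 = lam0 + n
    mu1 = (lam0 * mu0 + n * xbar) / lam1
    alpha1 = alpha0 + n / 2
    beta1 = beta0 + s2 / 2 + n * lam0 * (xbar - mu0) ** 2 / (2 * lam1)
    return mu1, lam1, alpha1, beta1

def improper_prior_posterior(n, xbar, s2):
    """Posterior under the usual improper prior; proper only when n >= 2."""
    if n < 2:
        raise ValueError("need n >= 2 for a proper posterior")
    return normal_gamma_update(0.0, 0.0, -0.5, 0.0, n, xbar, s2)

# Made-up summaries: n = 10, sample mean 2.0, sum of squared deviations 9.0.
posterior = improper_prior_posterior(n=10, xbar=2.0, s2=9.0)
print(posterior)
```

With these inputs the result agrees with $\mu_1 = \overline{x}_n$, $\lambda_1 = n$, $\alpha_1 = (n - 1)/2$, and $\beta_1 = s_n^2/2$.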


An Improper Prior for Seeded Cloud Rainfall. Suppose that we use the usual improper prior for the parameters in Examples 8.3.2 and 8.5.3 with prior hyperparameters $\mu_0 = \beta_0 = \lambda_0 = 0$ and $\alpha_0 = -1/2$. The data summaries are $\overline{x}_n = 5.134$ and $s_n^2 = 63.96$. The posterior distribution will then be the normal-gamma distribution with hyperparameters $\mu_1 = \overline{x}_n = 5.134$, $\lambda_1 = n = 26$, $\alpha_1 = (n - 1)/2 = 12.5$, and $\beta_1 = s_n^2/2 = 31.98$. Also, the marginal posterior distribution of $\mu$ is given by (7.6.12). In particular,

\[ U = \left(\frac{26 \times 12.5}{31.98}\right)^{1/2} (\mu - 5.134) = 3.188(\mu - 5.134) \tag{8.6.20} \]

has the $t$ distribution with 25 degrees of freedom. Suppose that we want an interval $(a, b)$ such that the posterior probability of $a < \mu < b$ is 0.95. The 0.975 quantile of the $t$ distribution with 25 degrees of freedom is 2.060. So, we have that $\Pr(-2.060 < U < 2.060) = 0.95$. Combining this with (8.6.20), we get

\[ \Pr(5.134 - 2.060/3.188 < \mu < 5.134 + 2.060/3.188 \mid x) = 0.95. \]

The interval we need runs from $a = 5.134 - 2.060/3.188 = 4.488$ to $b = 5.134 + 2.060/3.188 = 5.780$. Notice that the interval (4.488, 5.780) is precisely the same as the 95% confidence interval for $\mu$ that was computed in Example 8.5.3.

Another calculation that we can do with this posterior distribution is to see how likely it is that $\mu > 4$, where 4 is the mean of log-rainfall for unseeded clouds:

\[ \Pr(\mu > 4 \mid x) = \Pr(U > 3.188(4 - 5.134) \mid x) = 1 - T_{25}(-3.615) = 0.9993, \]

where the final value is calculated using statistical software that includes the c.d.f.'s of all $t$ distributions. It appears quite likely, after observing the data, that the mean log-rainfall of seeded clouds is more than 4. ◀
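
The numbers in this example can be reproduced without special software. The sketch below is one possible approach, not the method used in the text: it integrates the $t$ density with Simpson's rule to approximate the c.d.f. The helper names `t_pdf` and `t_cdf` are illustrative, and the quantile 2.060 is copied from the text rather than computed.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """Approximate c.d.f. by Simpson's rule on [0, |x|]; steps must be even."""
    upper = abs(x)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

# Posterior hyperparameters from the example.
mu1, lam1, alpha1, beta1 = 5.134, 26, 12.5, 31.98
scale = math.sqrt(lam1 * alpha1 / beta1)  # about 3.188, as in Eq. (8.6.20)
df = 2 * alpha1                           # 25 degrees of freedom
t_quantile = 2.060                        # 0.975 quantile, quoted from the text

a = mu1 - t_quantile / scale
b = mu1 + t_quantile / scale
prob_mu_gt_4 = 1 - t_cdf(scale * (4 - mu1), df)
print(round(a, 3), round(b, 3), round(prob_mu_gt_4, 4))
```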


Note: Improper Priors Lead to Confidence Intervals. Example 8.6.4 illustrates one of the more interesting properties of the usual improper prior. If one uses the usual improper prior with normal data, then the posterior probability is $\gamma$ that $\mu$ is in the observed value of a coefficient $\gamma$ confidence interval. In general, if we apply (8.6.9) after using an improper prior, we find that the posterior distribution of

\[ U = \left(\frac{n(n - 1)}{s_n^2}\right)^{1/2} (\mu - \overline{x}_n) \tag{8.6.21} \]

is the $t$ distribution with $n - 1$ degrees of freedom. It follows that if $\Pr(-c < U < c) = \gamma$, then

\[ \Pr\left(\overline{x}_n - c\,\frac{\sigma'}{n^{1/2}} < \mu < \overline{x}_n + c\,\frac{\sigma'}{n^{1/2}} \,\middle|\, x\right) = \gamma. \tag{8.6.22} \]


The reader will notice the striking similarity between (8.6.22) and (8.5.3). The difference between the two is that (8.6.22) is a statement about the posterior distribution of $\mu$ after observing the data, while (8.5.3) is a statement about the conditional distribution of the random variables $\overline{X}_n$ and $\sigma'$ given $\mu$ and $\sigma$ before observing the data. That these two probabilities are the same for all possible data and all possible values of $\gamma$ follows from the fact that they are both equal to $\Pr(-c < U < c)$ where $U$ is defined either in Eq. (8.4.4) or Eq. (8.6.21). The sampling distribution (conditional on $\mu$ and $\tau$) of $U$ is the $t$ distribution with $n - 1$ degrees of freedom, as we found in Eq. (8.4.4). The posterior distribution from the improper prior (conditional on the data) of $U$ is also the $t$ distribution with $n - 1$ degrees of freedom.
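
The agreement between the two intervals comes down to the identity $(\beta_1/(\lambda_1\alpha_1))^{1/2} = \sigma'/n^{1/2}$ when $\lambda_1 = n$, $\alpha_1 = (n - 1)/2$, and $\beta_1 = s_n^2/2$. A minimal numerical check, with illustrative function names and the seeded-cloud summaries used as inputs, might look like this:

```python
import math

def confidence_interval(n, xbar, s2, c):
    """Sampling-theory interval: xbar -/+ c * sigma' / sqrt(n)."""
    sigma_prime = math.sqrt(s2 / (n - 1))
    half = c * sigma_prime / math.sqrt(n)
    return xbar - half, xbar + half

def improper_prior_posterior_interval(n, xbar, s2, c):
    """Posterior interval mu1 -/+ c * (beta1 / (lam1 * alpha1))^(1/2)."""
    mu1, lam1, alpha1, beta1 = xbar, n, (n - 1) / 2, s2 / 2
    half = c * math.sqrt(beta1 / (lam1 * alpha1))
    return mu1 - half, mu1 + half

ci = confidence_interval(n=26, xbar=5.134, s2=63.96, c=2.060)
pi = improper_prior_posterior_interval(n=26, xbar=5.134, s2=63.96, c=2.060)
print(ci, pi)
```

The two pairs of endpoints coincide up to floating-point rounding, whatever value of `c` is supplied.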

The same kind of thing happens when we try to estimate $\sigma^2 = 1/\tau$. The sampling distribution (conditional on $\mu$ and $\tau$) of $V = (n - 1)\sigma'^2\tau = (n - 1)\sigma'^2/\sigma^2$ is the $\chi^2$ distribution with $n - 1$ degrees of freedom, as we saw in Eq. (8.3.11). The posterior distribution from the improper prior (conditional on the data) of $V$ is also the $\chi^2$ distribution with $n - 1$ degrees of freedom (see Exercise 4). Therefore, a coefficient $\gamma$ confidence interval $(a, b)$ for $\sigma^2$ based on the sampling distribution of $V$ will satisfy $\Pr(a < \sigma^2 < b \mid x) = \gamma$ as a posterior probability statement given the data if we used an improper prior.

There are many situations in which the sampling distribution of a pivotal quantity 
like U above is the same as its posterior distribution when an improper prior is used. 
A very mathematical treatment of these situations can be found in Schervish (1995, 
chapter 6). The most common situations are those involving location parameters (like $\mu$) and/or scale parameters (like $\sigma$).


Summary 


We introduced a family of conjugate prior distributions for the parameters $\mu$ and $\tau = 1/\sigma^2$ of a normal distribution. The conditional distribution of $\mu$ given $\tau$ is normal with mean $\mu_0$ and precision $\lambda_0\tau$, and the marginal distribution of $\tau$ is the gamma distribution with parameters $\alpha_0$ and $\beta_0$. If $X_1 = x_1, \ldots, X_n = x_n$ is an observed sample of size $n$ from the normal distribution with mean $\mu$ and precision $\tau$, then the posterior distribution of $\mu$ given $\tau$ is the normal distribution with mean $\mu_1$ and precision $\lambda_1\tau$, and the posterior distribution of $\tau$ is the gamma distribution with parameters $\alpha_1$ and $\beta_1$, where the values of $\mu_1$, $\lambda_1$, $\alpha_1$, and $\beta_1$ are given in Eq. (8.6.1) and (8.6.2). The marginal posterior distribution of $\mu$ is given by saying that $(\lambda_1\alpha_1/\beta_1)^{1/2}(\mu - \mu_1)$ has the $t$ distribution with $2\alpha_1$ degrees of freedom. An interval containing probability $1 - \alpha$ of the posterior distribution of $\mu$ is

\[ \left(\mu_1 - T_{2\alpha_1}^{-1}(1 - \alpha/2)\left[\frac{\beta_1}{\alpha_1\lambda_1}\right]^{1/2},\ \mu_1 + T_{2\alpha_1}^{-1}(1 - \alpha/2)\left[\frac{\beta_1}{\alpha_1\lambda_1}\right]^{1/2}\right). \]

If we use the improper prior with prior hyperparameters $\alpha_0 = -1/2$ and $\mu_0 = \lambda_0 = \beta_0 = 0$, then the random variable $n^{1/2}(\overline{X}_n - \mu)/\sigma'$ has the $t$ distribution with $n - 1$ degrees of freedom both as its posterior distribution given the data and as its sampling distribution given $\mu$ and $\sigma$. Also, $(n - 1)\sigma'^2/\sigma^2$ has the $\chi^2$ distribution with $n - 1$ degrees of freedom both as its posterior distribution given the data and as its sampling distribution given $\mu$ and $\sigma$. Hence, if we use the improper prior, interval estimates of $\mu$ or $\sigma$ based on the posterior distribution will also be confidence intervals, and vice versa.


Exercises 


1. Suppose that a random variable $X$ has the normal distribution with mean $\mu$ and precision $\tau$. Show that the random variable $Y = aX + b$ ($a \ne 0$) has the normal distribution with mean $a\mu + b$ and precision $\tau/a^2$.


2. Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with unknown mean $\mu$ ($-\infty < \mu < \infty$) and known precision $\tau$. Suppose also that the prior distribution of $\mu$ is the normal distribution with mean $\mu_0$ and precision $\lambda_0$. Show that the posterior distribution of $\mu$, given that $X_i = x_i$ ($i = 1, \ldots, n$) is the normal distribution with mean

\[ \frac{\lambda_0\mu_0 + n\tau\overline{x}_n}{\lambda_0 + n\tau} \]

and precision $\lambda_0 + n\tau$.


3. Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with known mean $\mu$ and unknown precision $\tau$ ($\tau > 0$). Suppose also that the prior distribution of $\tau$ is the gamma distribution with parameters $\alpha_0$ and $\beta_0$ ($\alpha_0 > 0$ and $\beta_0 > 0$). Show that the posterior distribution of $\tau$ given that $X_i = x_i$ ($i = 1, \ldots, n$) is the gamma distribution with parameters $\alpha_0 + (n/2)$ and

\[ \beta_0 + \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^2. \]

4. Suppose that $X_1, \ldots, X_n$ are i.i.d. having the normal distribution with mean $\mu$ and precision $\tau$ given $(\mu, \tau)$. Let $(\mu, \tau)$ have the usual improper prior. Let $\sigma'^2 = s_n^2/(n - 1)$. Prove that the posterior distribution of $V = (n - 1)\sigma'^2\tau$ is the $\chi^2$ distribution with $n - 1$ degrees of freedom.


5. Suppose that two random variables $\mu$ and $\tau$ have the joint normal-gamma distribution such that $E(\mu) = -5$, $\mathrm{Var}(\mu) = 1$, $E(\tau) = 1/2$, and $\mathrm{Var}(\tau) = 1/8$. Find the prior hyperparameters $\mu_0$, $\lambda_0$, $\alpha_0$, and $\beta_0$ that specify the normal-gamma distribution.


6. Show that two random variables $\mu$ and $\tau$ cannot have a joint normal-gamma distribution such that $E(\mu) = 0$, $\mathrm{Var}(\mu) = 1$, $E(\tau) = 1/2$, and $\mathrm{Var}(\tau) = 1/4$.


7. Show that two random variables $\mu$ and $\tau$ cannot have the joint normal-gamma distribution such that $E(\mu) = 0$, $E(\tau) = 1$, and $\mathrm{Var}(\tau) = 4$.


8. Suppose that two random variables $\mu$ and $\tau$ have the joint normal-gamma distribution with hyperparameters $\mu_0 = 4$, $\lambda_0 = 0.5$, $\alpha_0 = 1$, and $\beta_0 = 8$. Find the values of (a) $\Pr(\mu > 0)$ and (b) $\Pr(0.736 < \mu < 15.680)$.


9. Using the prior and data in the numerical example on nursing homes in New Mexico in this section, find (a) the shortest possible interval such that the posterior probability that $\mu$ lies in the interval is 0.90, and (b) the shortest possible confidence interval for $\mu$ for which the confidence coefficient is 0.90.


10. Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with unknown mean $\mu$ and unknown precision $\tau$, and also that the joint prior distribution of $\mu$ and $\tau$ is the normal-gamma distribution satisfying the following conditions: $E(\mu) = 0$, $E(\tau) = 2$, $E(\tau^2) = 5$, and $\Pr(|\mu| < 1.412) = 0.5$. Determine the prior hyperparameters $\mu_0$, $\lambda_0$, $\alpha_0$, and $\beta_0$.


11. Consider again the conditions of Exercise 10. Suppose also that in a random sample of size $n = 10$, it is found that $\overline{x}_n = 1$ and $s_n^2 = 8$. Find the shortest possible interval such that the posterior probability that $\mu$ lies in the interval is 0.95.


12. Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with unknown mean $\mu$ and unknown precision $\tau$, and also that the joint prior distribution of $\mu$ and $\tau$ is the normal-gamma distribution satisfying the following conditions: $E(\tau) = 1$, $\mathrm{Var}(\tau) = 1/3$, $\Pr(\mu > 3) = 0.5$, and $\Pr(\mu > 0.12) = 0.9$. Determine the prior hyperparameters $\mu_0$, $\lambda_0$, $\alpha_0$, and $\beta_0$.


13. Consider again the conditions of Exercise 12. Suppose also that in a random sample of size $n = 8$, it is found that $\sum_{i=1}^{n} x_i = 16$ and $\sum_{i=1}^{n} x_i^2 = 48$. Find the shortest possible interval such that the posterior probability that $\mu$ lies in the interval is 0.99.

14. Continue the analysis in Example 8.6.2 on page 498. Compute an interval $(a, b)$ such that the posterior probability is 0.9 that $a < \mu < b$. Compare this interval with the 90% confidence interval from Example 8.5.4 on page 487.


15. We will draw a sample of size $n = 11$ from the normal distribution with mean $\mu$ and precision $\tau$. We will use a natural conjugate prior for the parameters $(\mu, \tau)$ from the normal-gamma family with hyperparameters $\alpha_0 = 2$, $\beta_0 = 1$, $\mu_0 = 3.5$, and $\lambda_0 = 2$. The sample yields an average of $\overline{x}_n = 7.2$ and $s_n^2 = 20.3$.


a. Find the posterior hyperparameters. 


b. Find an interval that contains 95% of the posterior 
distribution of jw. 


16. The study on acid concentration in cheese included 
a total of 30 lactic acid measurements, the 10 given in 
Example 8.5.4 on page 487 and the following additional 
20: 


1.68, 1.9, 1.06, 1.3, 1.52, 1.74, 1.16, 1.49, 1.63, 1.99, 
1.15, 1.33, 1.44, 2.01, 1.31, 1.46, 1.72, 1.25, 1.08, 1.25. 


a. Using the same prior as in Example 8.6.2 on page 498, compute the posterior distribution of $\mu$ and $\tau$ based on all 30 observations.


b. Use the posterior distribution found in Example 8.6.2 on page 498 as if it were the prior distribution before observing the 20 observations listed in this problem. Use these 20 new observations to find the posterior distribution of $\mu$ and $\tau$ and compare the result to the answer to part (a).


17. Consider the analysis performed in Example 8.6.2. 
This time, use the usual improper prior to compute the 
posterior distribution of the parameters. 


18. Treat the posterior distribution conditional on the first 
10 observations found in Exercise 17 as a prior and then 
observe the 20 additional observations in Exercise 16. 
Find the posterior distribution of the parameters after ob- 
serving all of the data and compare it to the distribution 
found in part (b) of Exercise 16. 


19. Consider the situation described in Exercise 7 of Sec. 8.5. Use a prior distribution from the normal-gamma family with values $\alpha_0 = 1$, $\beta_0 = 4$, $\mu_0 = 150$, and $\lambda_0 = 0.5$.


a. Find the posterior distribution of $\mu$ and $\tau = 1/\sigma^2$.


b. Find an interval $(a, b)$ such that the posterior probability is 0.90 that $a < \mu < b$.


20. Consider the calorie count data described in Example 7.3.10 on page 400. Now assume that each observation has the normal distribution with unknown mean $\mu$ and unknown precision $\tau$ given the parameter $(\mu, \tau)$. Use the normal-gamma conjugate prior distribution with prior hyperparameters $\mu_0 = 0$, $\lambda_0 = 1$, $\alpha_0 = 1$, and $\beta_0 = 60$. The value of $s_n^2$ is 2102.9.


a. Find the posterior distribution of $(\mu, \tau)$.
b. Compute $\Pr(\mu > 1 \mid x)$.


8.7 Unbiased Estimators 


Let $\delta$ be an estimator of a function $g$ of a parameter $\theta$. We say that $\delta$ is unbiased if $E_\theta[\delta(X)] = g(\theta)$ for all values of $\theta$. This section provides several examples of unbiased estimators.


Definition of an Unbiased Estimator 


Example 
8.7.1 


Lifetimes of Electronic Components. Consider the company in Example 8.1.3 that wants to estimate the failure rate $\theta$ of electronic components. Based on a sample $X_1, X_2, X_3$ of lifetimes, the M.L.E. of $\theta$ is $\hat{\theta} = 3/T$, where $T = X_1 + X_2 + X_3$. The company hopes that $\hat{\theta}$ will be close to $\theta$. The mean of a random variable, such as $\hat{\theta}$, is one measure of where we expect the random variable to be. The mean of $3/T$ is (according to Exercise 21 in Sec. 5.7) $3\theta/2$. If the mean tells us where we expect the estimator to be, we expect this estimator to be 50% larger than $\theta$. ◀


Let $X = (X_1, \ldots, X_n)$ be a random sample from a distribution that involves a parameter (or parameter vector) $\theta$ whose value is unknown. Suppose that we wish to estimate a function $g(\theta)$ of the parameter. In a problem of this type, it is desirable to use an estimator $\delta(X)$ that, with high probability, will be close to $g(\theta)$. In other words, it is desirable to use an estimator $\delta$ whose distribution changes with the value of $\theta$ in such a way that no matter what the true value of $\theta$ is, the probability distribution of $\delta$ is concentrated around $g(\theta)$.

For example, suppose that $X = (X_1, \ldots, X_n)$ form a random sample from a normal distribution for which the mean $\theta$ is unknown and the variance is 1. In this case, the M.L.E. of $\theta$ is the sample mean $\overline{X}_n$. The estimator $\overline{X}_n$ is a reasonably good estimator of $\theta$ because its distribution is the normal distribution with mean $\theta$ and variance $1/n$. This distribution is concentrated around the unknown value of $\theta$, no matter how large or how small $\theta$ is.

These considerations lead to the following definition. 


Definition
8.7.1


Unbiased Estimator/Bias. An estimator $\delta(X)$ is an unbiased estimator of a function $g(\theta)$ of the parameter $\theta$ if $E_\theta[\delta(X)] = g(\theta)$ for every possible value of $\theta$. An estimator that is not unbiased is called a biased estimator. The difference between the expectation of an estimator and $g(\theta)$ is called the bias of the estimator. That is, the bias of $\delta$ as an estimator of $g(\theta)$ is $E_\theta[\delta(X)] - g(\theta)$, and $\delta$ is unbiased if and only if the bias is 0 for all $\theta$.


In the case of a sample from a normal distribution with unknown mean $\theta$, $\overline{X}_n$ is an unbiased estimator of $\theta$ because $E_\theta(\overline{X}_n) = \theta$ for $-\infty < \theta < \infty$.


Example
8.7.2


Lifetimes of Electronic Components. In Example 8.7.1, the bias of $\hat{\theta} = 3/T$ as an estimator of $\theta$ is $3\theta/2 - \theta = \theta/2$. It is easy to see that an unbiased estimator of $\theta$ is $\delta(X) = 2/T$. ◀
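
A quick simulation can make the bias visible. The sketch below assumes, as the failure-rate setup of Example 8.1.3 suggests, that the three lifetimes are i.i.d. exponential with rate $\theta$; the function name, the seed, the number of replications, and the value $\theta = 2$ are arbitrary illustrative choices.

```python
import random

def simulate_means(theta, reps=200_000, seed=0):
    """Monte Carlo averages of the M.L.E. 3/T and of the estimator 2/T.

    Assumes X1, X2, X3 are i.i.d. exponential with rate theta and T is their sum.
    """
    rng = random.Random(seed)
    sum_mle = 0.0
    sum_unbiased = 0.0
    for _ in range(reps):
        t = sum(rng.expovariate(theta) for _ in range(3))
        sum_mle += 3 / t
        sum_unbiased += 2 / t
    return sum_mle / reps, sum_unbiased / reps

mean_mle, mean_unbiased = simulate_means(theta=2.0)
print(mean_mle, mean_unbiased)
```

Under these assumptions the first average should land near $3\theta/2 = 3$ and the second near $\theta = 2$, up to simulation error.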


If an estimator $\delta$ of some nonconstant function $g(\theta)$ of the parameter is unbiased, then the distribution of $\delta$ must indeed change with the value of $\theta$, since the mean of this distribution is $g(\theta)$. It should be emphasized, however, that this distribution might be either closely concentrated around $g(\theta)$ or widely spread out. For example, an estimator that is equally likely to underestimate $g(\theta)$ by 1,000,000 units or to overestimate $g(\theta)$ by 1,000,000 units would be an unbiased estimator, but it would never yield an estimate close to $g(\theta)$. Therefore, the mere fact that an estimator is unbiased does not necessarily imply that the estimator is good or even reasonable. However, if an unbiased estimator also has a small variance, it follows that the distribution of the estimator will necessarily be concentrated around its mean $g(\theta)$, and there will be high probability that the estimator will be close to $g(\theta)$.

For the reasons just mentioned, the study of unbiased estimators is largely devoted to the search for an unbiased estimator that has a small variance. However, if an estimator $\delta$ is unbiased, then its M.S.E. $E_\theta[(\delta - g(\theta))^2]$ is equal to its variance $\mathrm{Var}_\theta(\delta)$. Therefore, the search for an unbiased estimator with a small variance is equivalent to the search for an unbiased estimator with a small M.S.E. The following result is a simple corollary to Exercise 4 in Sec. 4.3.


Corollary
8.7.1


Let $\delta$ be an estimator with finite variance. Then the M.S.E. of $\delta$ as an estimator of $g(\theta)$ equals its variance plus the square of its bias.


Example
8.7.3


Lifetimes of Electronic Components. We can compare the two estimators $\hat{\theta}$ and $\delta(X)$ in Example 8.7.2 using M.S.E. According to Exercise 21 in Sec. 5.7, the variance of $1/T$ is $\theta^2/4$. So, the M.S.E. of $\delta(X)$ is $\theta^2$. For $\hat{\theta}$, the variance is $9\theta^2/4$ and the square of the bias is $\theta^2/4$, so the M.S.E. is $5\theta^2/2$, which is 2.5 times as large as the M.S.E. of $\delta(X)$. If M.S.E. were the sole concern, the estimator $\delta^*(X) = 1/T$ has variance and squared bias both equal to $\theta^2/4$, so the M.S.E. is $\theta^2/2$, half the M.S.E. of the unbiased estimator. Figure 8.8 plots the M.S.E. for each of these estimators together with the M.S.E. of the Bayes estimator $4/(2 + T)$ found in Example 8.1.3. Calculation of the M.S.E. of the Bayes estimator required simulation. Eventually (above $\theta = 3.1$), the M.S.E. of the Bayes estimator crosses above the M.S.E. of $1/T$, but it stays below the other two for all $\theta$. ◀


Figure 8.8 M.S.E. for each of the four estimators in Example 8.7.3. (Curves: Bayes, M.L.E., Unbiased, $1/T$; vertical axis: M.S.E.)
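
The three closed-form M.S.E. values in this example follow from Corollary 8.7.1 applied to an estimator of the form $c/T$, using only the moments quoted from Exercise 21 in Sec. 5.7, namely $E(1/T) = \theta/2$ and $\mathrm{Var}(1/T) = \theta^2/4$. The helper below is an illustrative sketch that works in units of $\theta^2$ with exact fractions; its name is not from the text.

```python
from fractions import Fraction

def mse_over_theta_squared(c):
    """M.S.E. of c/T divided by theta^2, i.e. variance plus squared bias.

    Uses E(1/T) = theta/2 and Var(1/T) = theta^2/4 as quoted in the text.
    """
    c = Fraction(c)
    variance = c * c * Fraction(1, 4)
    bias = c * Fraction(1, 2) - 1
    return variance + bias * bias

for name, c in [("1/T", 1), ("unbiased 2/T", 2), ("M.L.E. 3/T", 3)]:
    print(name, mse_over_theta_squared(c))
```

Since the expression simplifies to $c^2/2 - c + 1$, the choice $c = 1$ gives the smallest M.S.E. among estimators of the form $c/T$.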


Example
8.7.4


Unbiased Estimation of the Mean. Let $X = (X_1, \ldots, X_n)$ be a random sample from a distribution that depends on a parameter (or parameter vector) $\theta$. Assume that the mean and variance of the distribution are finite. Define $g(\theta) = E_\theta(X_1)$. The sample mean $\overline{X}_n$ is obviously an unbiased estimator of $g(\theta)$. Its M.S.E. is $\mathrm{Var}_\theta(X_1)/n$. In Example 8.7.1, $g(\theta) = 1/\theta$ and $\overline{X}_n = 1/\hat{\theta}$ is an unbiased estimator of the mean. ◀


Unbiased Estimation of the Variance 


Theorem
8.7.1


Sampling from a General Distribution. Let $X = (X_1, \ldots, X_n)$ be a random sample from a distribution that depends on a parameter (or parameter vector) $\theta$. Assume that the variance of the distribution is finite. Define $g(\theta) = \mathrm{Var}_\theta(X_1)$. The following statistic is an unbiased estimator of the variance $g(\theta)$:

\[ \hat{\sigma}_1^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(X_i - \overline{X}_n)^2. \]


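Proof Let $\mu = E_\theta(X_1)$, and let $\sigma^2$ stand for $g(\theta) = \mathrm{Var}_\theta(X_1)$. Since the sample mean is an unbiased estimator of $\mu$, it is more or less natural to consider first the sample variance $\hat{\sigma}_0^2 = (1/n)\sum_{i=1}^{n}(X_i - \overline{X}_n)^2$ and to attempt to determine if it is an unbiased estimator of the variance $\sigma^2$. We shall use the identity

\[ \sum_{i=1}^{n}(X_i - \mu)^2 = \sum_{i=1}^{n}(X_i - \overline{X}_n)^2 + n(\overline{X}_n - \mu)^2. \]

Then it follows that

\[ E(\hat{\sigma}_0^2) = E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \overline{X}_n)^2\right] = E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2\right] - E\left[(\overline{X}_n - \mu)^2\right]. \tag{8.7.1} \]

Since each observation $X_i$ has mean $\mu$ and variance $\sigma^2$, then $E[(X_i - \mu)^2] = \sigma^2$ for $i = 1, \ldots, n$. Therefore,

\[ E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2\right] = \frac{1}{n}\sum_{i=1}^{n}E\left[(X_i - \mu)^2\right] = \frac{1}{n}\,n\sigma^2 = \sigma^2. \tag{8.7.2} \]

Furthermore, the sample mean $\overline{X}_n$ has mean $\mu$ and variance $\sigma^2/n$. Therefore,

\[ E\left[(\overline{X}_n - \mu)^2\right] = \mathrm{Var}(\overline{X}_n) = \frac{\sigma^2}{n}. \tag{8.7.3} \]

It now follows from Eqs. (8.7.1), (8.7.2), and (8.7.3) that

\[ E(\hat{\sigma}_0^2) = \sigma^2 - \frac{\sigma^2}{n} = \frac{n - 1}{n}\,\sigma^2. \tag{8.7.4} \]

It can be seen from Eq. (8.7.4) that the sample variance $\hat{\sigma}_0^2$ is not an unbiased estimator of $\sigma^2$, because its expectation is $[(n - 1)/n]\sigma^2$, rather than $\sigma^2$. However, if $\hat{\sigma}_0^2$ is multiplied by the factor $n/(n - 1)$ to obtain the statistic $\hat{\sigma}_1^2$, then the expectation of $\hat{\sigma}_1^2$ will indeed be $\sigma^2$. Therefore, $\hat{\sigma}_1^2$ is an unbiased estimator of $\sigma^2$. ∎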
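
For a small discrete distribution, the two expectations in the proof can be checked exactly by enumerating every possible sample. The sketch below uses a hypothetical distribution (uniform on the values 0, 1, 2) and a sample size of 3; both choices, and the function name, are illustrative only.

```python
from fractions import Fraction
from itertools import product

def expected_variance_estimates(values, n):
    """Exact expectations of the divisor-n and divisor-(n-1) variance estimators.

    Assumes i.i.d. draws that are uniform on `values`; enumerates all samples.
    """
    prob = Fraction(1, len(values) ** n)
    e0 = Fraction(0)
    e1 = Fraction(0)
    for sample in product(values, repeat=n):
        mean = Fraction(sum(sample), n)
        ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
        e0 += prob * ss / n
        e1 += prob * ss / (n - 1)
    return e0, e1

values = [0, 1, 2]
mu = Fraction(sum(values), len(values))
true_var = sum((v - mu) ** 2 for v in values) / len(values)
e0, e1 = expected_variance_estimates(values, n=3)
print(true_var, e0, e1)
```

The divisor-$(n - 1)$ estimator has expectation equal to the true variance, while the divisor-$n$ estimator has expectation $[(n - 1)/n]\sigma^2$, in agreement with Eq. (8.7.4).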


In light of Theorem 8.7.1, many textbooks define the sample variance as $\hat{\sigma}_1^2$, rather than as $\hat{\sigma}_0^2$.


Note: Special Case of Normal Random Sample. The estimator $\hat{\sigma}_0^2$ is the same as the maximum likelihood estimator $\hat{\sigma}^2$ of $\sigma^2$ when $X_1, \ldots, X_n$ have the normal distribution with mean $\mu$ and variance $\sigma^2$. Also, $\hat{\sigma}_1^2$ is the same as the random variable $\sigma'^2$ that appears in confidence intervals for $\mu$. We have chosen to use different names for these estimators in this section because we are discussing general distributions for which $\sigma^2$ might be some function $g(\theta)$ whose M.L.E. is completely different from $\hat{\sigma}_0^2$. (See Exercise 1 for one such example.)


Sampling from a Specific Family of Distributions When it can be assumed that $X_1, \ldots, X_n$ form a random sample from a specific family of distributions, such as the family of Poisson distributions, it will generally be desirable to consider not only $\hat{\sigma}_1^2$ but also other unbiased estimators of the variance.


Example
8.7.5


Sample from a Poisson Distribution. Suppose that we observe a random sample from the Poisson distribution for which the mean $\theta$ is unknown. We have already seen that $\overline{X}_n$ will be an unbiased estimator of the mean $\theta$. Moreover, since the variance of a Poisson distribution is also equal to $\theta$, it follows that $\overline{X}_n$ is also an unbiased estimator of the variance. In this example, therefore, both $\overline{X}_n$ and $\hat{\sigma}_1^2$ are unbiased estimators of the unknown variance $\theta$. Furthermore, any combination of $\overline{X}_n$ and $\hat{\sigma}_1^2$ having the form $\alpha\overline{X}_n + (1 - \alpha)\hat{\sigma}_1^2$, where $\alpha$ is a given constant ($-\infty < \alpha < \infty$), will also be an unbiased estimator of $\theta$ because its expectation will be

\[ E[\alpha\overline{X}_n + (1 - \alpha)\hat{\sigma}_1^2] = \alpha E(\overline{X}_n) + (1 - \alpha)E(\hat{\sigma}_1^2) = \alpha\theta + (1 - \alpha)\theta = \theta. \tag{8.7.5} \]

Other unbiased estimators of $\theta$ can also be constructed. ◀
If an unbiased estimator is to be used, the problem is to determine which one of the possible unbiased estimators has the smallest variance or, equivalently, has the smallest M.S.E. We shall not derive the solution to this problem right now. However, it will be shown in Sec. 8.8 that in Example 8.7.5, for every possible value of $\theta$, the estimator $\overline{X}_n$ has the smallest variance among all unbiased estimators of $\theta$. This result is not surprising. We know from Example 7.7.2 that $\overline{X}_n$ is a sufficient statistic for $\theta$, and it was argued in Sec. 7.9 that we can restrict our attention to estimators that are functions of the sufficient statistic alone. (See also Exercise 13 at the end of this section.)


Example
8.7.6


Sampling from a Normal Distribution. Assume that $X = (X_1, \ldots, X_n)$ form a random sample from the normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$. We shall consider the problem of estimating $\sigma^2$. We know from Theorem 8.7.1 that the estimator $\hat{\sigma}_1^2$ is an unbiased estimator of $\sigma^2$. Moreover, we know from Example 7.5.6 that the sample variance $\hat{\sigma}_0^2$ is the M.L.E. of $\sigma^2$. We want to determine whether the M.S.E. $E[(\hat{\sigma}_i^2 - \sigma^2)^2]$ is smaller for the estimator $\hat{\sigma}_0^2$ or for the estimator $\hat{\sigma}_1^2$, and also whether or not there is some other estimator of $\sigma^2$ that has a smaller M.S.E. than both $\hat{\sigma}_0^2$ and $\hat{\sigma}_1^2$.

Both the estimator $\hat{\sigma}_0^2$ and the estimator $\hat{\sigma}_1^2$ have the following form:

\[ T_c = c\sum_{i=1}^{n}(X_i - \overline{X}_n)^2, \tag{8.7.6} \]

where $c = 1/n$ for $\hat{\sigma}_0^2$ and $c = 1/(n - 1)$ for $\hat{\sigma}_1^2$. We shall now determine the M.S.E. for an arbitrary estimator having the form in Eq. (8.7.6) and shall then determine the value of $c$ for which this M.S.E. is minimum. We shall demonstrate the striking property that the same value of $c$ minimizes the M.S.E. for all possible values of the parameters $\mu$ and $\sigma^2$. Therefore, among all estimators having the form in Eq. (8.7.6), there is a single one that has the smallest M.S.E. for all possible values of $\mu$ and $\sigma^2$.

It was shown in Sec. 8.3 that when $X_1, \ldots, X_n$ form a random sample from a normal distribution, the random variable $\sum_{i=1}^{n}(X_i - \overline{X}_n)^2/\sigma^2$ has the $\chi^2$ distribution with $n - 1$ degrees of freedom. By Theorem 8.2.1, the mean of this variable is $n - 1$, and the variance is $2(n - 1)$. Therefore, if $T_c$ is defined by Eq. (8.7.6), then

\[ E(T_c) = (n - 1)c\sigma^2 \quad \text{and} \quad \mathrm{Var}(T_c) = 2(n - 1)c^2\sigma^4. \tag{8.7.7} \]

Thus, by Corollary 8.7.1, the M.S.E. of $T_c$ can be found as follows:

\[ \begin{aligned} E[(T_c - \sigma^2)^2] &= [E(T_c) - \sigma^2]^2 + \mathrm{Var}(T_c) \\ &= [(n - 1)c - 1]^2\sigma^4 + 2(n - 1)c^2\sigma^4 \\ &= [(n^2 - 1)c^2 - 2(n - 1)c + 1]\sigma^4. \end{aligned} \tag{8.7.8} \]

The coefficient of $\sigma^4$ in Eq. (8.7.8) is simply a quadratic function of $c$. Hence, no matter what $\sigma^2$ equals, the minimizing value of $c$ is found by elementary differentiation to be $c = 1/(n + 1)$.

In summary, we have established the following fact: Among all estimators of $\sigma^2$ having the form in Eq. (8.7.6), the estimator that has the smallest M.S.E. for all possible values of $\mu$ and $\sigma^2$ is $T_{1/(n+1)} = [1/(n + 1)]\sum_{i=1}^{n}(X_i - \overline{X}_n)^2$. In particular, $T_{1/(n+1)}$ has a smaller M.S.E. than both the M.L.E. $\hat{\sigma}_0^2$ and the unbiased estimator $\hat{\sigma}_1^2$. Therefore, the estimators $\hat{\sigma}_0^2$ and $\hat{\sigma}_1^2$, as well as all other estimators having the form in Eq. (8.7.6) with $c \ne 1/(n + 1)$, are inadmissible. Furthermore, it was shown by C. Stein in 1964 that even the estimator $T_{1/(n+1)}$ is dominated by other estimators and that $T_{1/(n+1)}$ itself is therefore inadmissible.

The estimators $\hat{\sigma}_0^2$ and $\hat{\sigma}_1^2$ are compared in Exercise 6 at the end of this section. Of course, when the sample size $n$ is large, it makes little difference whether $n$, $n - 1$, or $n + 1$ is used as the divisor in the estimate of $\sigma^2$; all three estimators $\hat{\sigma}_0^2$, $\hat{\sigma}_1^2$, and $T_{1/(n+1)}$ will be approximately equal. ◀
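
The coefficient of $\sigma^4$ in Eq. (8.7.8) is easy to tabulate for the three divisors discussed here. The sketch below uses exact fractions; the function name and the choice $n = 10$ are illustrative assumptions rather than anything prescribed by the text.

```python
from fractions import Fraction

def mse_coefficient(c, n):
    """Coefficient of sigma^4 in Eq. (8.7.8) for T_c = c * sum (X_i - Xbar_n)^2."""
    c = Fraction(c)
    return (n * n - 1) * c * c - 2 * (n - 1) * c + 1

n = 10
coeffs = {
    "mle (divisor n)": mse_coefficient(Fraction(1, n), n),
    "unbiased (divisor n - 1)": mse_coefficient(Fraction(1, n - 1), n),
    "minimum M.S.E. (divisor n + 1)": mse_coefficient(Fraction(1, n + 1), n),
}
for name, value in coeffs.items():
    print(name, value)
```

Algebraically, the three coefficients are $(2n - 1)/n^2$, $2/(n - 1)$, and $2/(n + 1)$, so the divisor $n + 1$ always gives the smallest value and the unbiased estimator the largest of the three.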


Limitations of Unbiased Estimation 


The concept of unbiased estimation has played an important part in the historical 
development of statistics, and the feeling that an unbiased estimator should be pre- 
ferred to a biased estimator is prevalent in current statistical practice. Indeed, what 
scientist wishes to be biased or to be accused of being biased? The very terminology 
of the theory of unbiased estimation seems to make the use of unbiased estimators 
highly desirable. 

However, as explained in this section, the quality of an unbiased estimator must 
be evaluated in terms of its variance or its M.S.E. Examples 8.7.3 and 8.7.6 illustrate 
the following fact: In many problems, there exist biased estimators that have smaller 
M.S.E. than every unbiased estimator for every possible value of the parameter. 
Furthermore, it can be shown that a Bayes estimator, which makes use of all relevant 
prior information about the parameter and which minimizes the overall M.S.E., is 
unbiased only in trivial problems in which the parameter can be estimated perfectly. 

Some other limitations of the theory of unbiased estimation will now be de- 
scribed. 


Nonexistence of an Unbiased Estimator In many problems, there does not exist any unbiased estimator of the function of the parameter that must be estimated. For example, suppose that $X_1, \ldots, X_n$ form $n$ Bernoulli trials for which the parameter $p$ is unknown ($0 \le p \le 1$). Then the sample mean $\overline{X}_n$ will be an unbiased estimator of $p$, but it can be shown that there will be no unbiased estimator of $p^{1/2}$. (See Exercise 7.) Furthermore, if it is known in this example that $p$ must lie in the interval $\frac{1}{3} \le p \le \frac{2}{3}$, then there is no unbiased estimator of $p$ whose possible values are confined to that same interval.


Inappropriate Unbiased Estimators Consider an infinite sequence of Bernoulli 
trials for which the parameter p is unknown (0 < p < 1), and let X denote the number 
of failures that occur before the first success is obtained. Then X has the geometric 
distribution with parameter p whose p.f. is given by Eq. (5.5.3). If it is desired to 
estimate the value of p from the observation X, then it can be shown (see Exercise 8) 
that the only unbiased estimator of p yields the estimate 1 if X =0 and yields the 
estimate 0 if X > 0. This estimator seems inappropriate. For example, if the first 
success is obtained on the second trial, that is, if $X = 1$, then it is silly to estimate that the probability of success $p$ is 0. Similarly, if $X = 0$ (the first trial is a success), it seems silly to estimate $p$ to be as large as 1.

As another example of an inappropriate unbiased estimator, suppose that the random variable $X$ has the Poisson distribution with unknown mean $\lambda$ ($\lambda > 0$), and suppose also that it is desired to estimate the value of $e^{-2\lambda}$. It can be shown (see Exercise 9) that the only unbiased estimator of $e^{-2\lambda}$ yields the estimate 1 if $X$ is an even integer and the estimate $-1$ if $X$ is an odd integer. This estimator is inappropriate for two reasons. First, it yields the estimate 1 or $-1$ for a parameter $e^{-2\lambda}$, which must lie between 0 and 1. Second, the value of the estimate depends only on whether $X$ is odd or even, rather than on whether $X$ is large or small.
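
That the estimator $(-1)^X$ really has expectation $e^{-2\lambda}$ can be confirmed numerically by summing against the Poisson p.f. This only checks unbiasedness for one value of $\lambda$; it does not show uniqueness, which is the subject of Exercise 9. The function name, the truncation at 200 terms, and the value $\lambda = 1.3$ are arbitrary illustrative choices.

```python
import math

def expected_sign_estimator(lam, terms=200):
    """E[(-1)^X] for X ~ Poisson(lam), summing the p.f. term by term."""
    total = 0.0
    term = math.exp(-lam)  # Pr(X = 0)
    for x in range(terms):
        total += term if x % 2 == 0 else -term
        term *= lam / (x + 1)  # Pr(X = x + 1) from Pr(X = x)
    return total

lam = 1.3
print(expected_sign_estimator(lam), math.exp(-2 * lam))
```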


Ignoring Information One more criticism of the concept of unbiased estimation is that the principle of always using an unbiased estimator for a parameter $\theta$ (when such exists) sometimes ignores valuable information that is available. As an example, suppose that the average voltage $\theta$ in a certain electric circuit is unknown; this voltage is to be measured by a voltmeter for which the reading $X$ has the normal distribution with mean $\theta$ and known variance $\sigma^2$. Suppose also that the observed reading on the voltmeter is 2.5 volts. Since $X$ is an unbiased estimator of $\theta$ in this example, a scientist who wished to use an unbiased estimator would estimate the value of $\theta$ to be 2.5 volts.

However, suppose also that after the scientist reported the value 2.5 as his estimate of $\theta$, he discovered that the voltmeter actually truncates all readings at 3 volts, just as in Example 3.2.7 on page 106. That is, the reading of the voltmeter is accurate for any voltage less than 3 volts, but a voltage greater than 3 volts would be reported as 3 volts. Since the actual reading was 2.5 volts, this reading was unaffected by the truncation. Nevertheless, the observed reading would no longer be an unbiased estimator of $\theta$ because the distribution of the truncated reading $X$ is not a normal distribution with mean $\theta$. Therefore, if the scientist still wished to use an unbiased estimator, he would have to change his estimate of $\theta$ from 2.5 volts to a different value.

Ignoring the fact that the observed reading was accurate seems unacceptable.
Since the actual observed reading was only 2.5 volts, it is the same as what would
have been observed if there had been no truncation. Since the observed reading
is untruncated, it would seem that the fact that there might have been a truncated
reading is irrelevant to the estimation of $\theta$. However, since this possibility does
change the sample space of $X$ and its probability distribution, it will also change
the form of the unbiased estimator of $\theta$.
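
To see how much bias the truncation introduces, the mean of the capped reading $\min(X, 3)$ can be computed in closed form. The sketch below is a hedged illustration: the text does not give a value for $\sigma$, so `sigma=1.0` is an assumption, and the function names are hypothetical. It uses the standard identity $E[\min(X, c)] = \theta\Phi(a) - \sigma\phi(a) + c[1 - \Phi(a)]$ with $a = (c - \theta)/\sigma$.

```python
import math


def std_normal_pdf(z):
    """Standard normal p.d.f."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    """Standard normal c.d.f., expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mean_capped_reading(theta, sigma=1.0, cap=3.0):
    """E[min(X, cap)] when X has the normal distribution N(theta, sigma^2).

    sigma = 1.0 is an assumed value; the text only says sigma^2 is known.
    """
    a = (cap - theta) / sigma
    return (theta * std_normal_cdf(a)
            - sigma * std_normal_pdf(a)
            + cap * (1.0 - std_normal_cdf(a)))


if __name__ == "__main__":
    for theta in (1.0, 2.5, 3.0, 4.0):
        # The mean of the capped reading falls below theta in every case.
        print(theta, mean_capped_reading(theta))
```

Under these assumptions the capped reading has mean strictly less than $\theta$ for every $\theta$, so it is no longer unbiased, even though a particular reading such as 2.5 volts is unaffected by the cap.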


Summary 


An estimator $\delta(X)$ of $g(\theta)$ is unbiased if $E_\theta[\delta(X)] = g(\theta)$ for all possible values of $\theta$.
The bias of an estimator of $g(\theta)$ is $E_\theta[\delta(X)] - g(\theta)$. The M.S.E. of an estimator equals
its variance plus the square of its bias. The M.S.E. of an unbiased estimator equals
its variance.
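
The decomposition of the M.S.E. stated in this summary can be written out in one step by adding and subtracting $E_\theta[\delta(X)]$ inside the square. The cross term vanishes because $\delta(X) - E_\theta[\delta(X)]$ has mean 0.

```latex
\begin{aligned}
E_\theta\{[\delta(X) - g(\theta)]^2\}
 &= E_\theta\{[\delta(X) - E_\theta(\delta(X))]^2\}
  + 2\,[E_\theta(\delta(X)) - g(\theta)]\,E_\theta[\delta(X) - E_\theta(\delta(X))] \\
 &\qquad + [E_\theta(\delta(X)) - g(\theta)]^2 \\
 &= \mathrm{Var}_\theta[\delta(X)] + [E_\theta(\delta(X)) - g(\theta)]^2 .
\end{aligned}
```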


Exercises 


1. Let $X_1, \ldots, X_n$ be a random sample from the Poisson
distribution with mean $\theta$.

a. Express $\mathrm{Var}_\theta(X_i)$ as a function $\sigma^2 = g(\theta)$.

b. Find the M.L.E. of $g(\theta)$ and show that it is
unbiased.


2. Suppose that $X$ is a random variable whose distribution
is completely unknown, but it is known that all the
moments $E(X^k)$, for $k = 1, 2, \ldots$, are finite. Suppose also
that $X_1, \ldots, X_n$ form a random sample from this distribution.
Show that for $k = 1, 2, \ldots$, the $k$th sample moment
$(1/n) \sum_{i=1}^{n} X_i^k$ is an unbiased estimator of $E(X^k)$.


3. For the conditions of Exercise 2, find an unbiased estimator
of $[E(X)]^2$. Hint: $[E(X)]^2 = E(X^2) - \mathrm{Var}(X)$.


4. Suppose that a random variable $X$ has the geometric
distribution with unknown parameter $p$. (See Sec. 5.5.)
Find a statistic $\delta(X)$ that will be an unbiased estimator of
$1/p$.


5. Suppose that a random variable $X$ has the Poisson distribution
with unknown mean $\lambda$ ($\lambda > 0$). Find a statistic
$\delta(X)$ that will be an unbiased estimator of $e^{\lambda}$. Hint: If
$E[\delta(X)] = e^{\lambda}$, then

$$\sum_{x=0}^{\infty} \frac{\delta(x) e^{-\lambda} \lambda^x}{x!} = e^{\lambda}.$$

Multiply both sides of this equation by $e^{\lambda}$, expand the right
side in a power series in $\lambda$, and then equate the coefficients
of $\lambda^x$ on both sides of the equation for $x = 0, 1, 2, \ldots$.


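6. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with unknown mean $\mu$ and unknown
variance $\sigma^2$. Let $\hat{\sigma}_0^2$ and $\hat{\sigma}_1^2$ be the two estimators
of $\sigma^2$, which are defined as follows:

$$\hat{\sigma}_0^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \overline{X}_n)^2 \quad \text{and} \quad \hat{\sigma}_1^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline{X}_n)^2.$$

Show that the M.S.E. of $\hat{\sigma}_0^2$ is smaller than the M.S.E. of
$\hat{\sigma}_1^2$ for all possible values of $\mu$ and $\sigma^2$.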


7. Suppose that $X_1, \ldots, X_n$ form $n$ Bernoulli trials for
which the parameter $p$ is unknown ($0 < p < 1$). Show that
the expectation of every function $\delta(X_1, \ldots, X_n)$ is a polynomial
in $p$ whose degree does not exceed $n$.


8. Suppose that a random variable $X$ has the geometric
distribution with unknown parameter $p$ ($0 < p < 1$). Show
that the only unbiased estimator of $p$ is the estimator $\delta(X)$
such that $\delta(0) = 1$ and $\delta(X) = 0$ for $X > 0$.


9. Suppose that a random variable $X$ has the Poisson distribution
with unknown mean $\lambda$ ($\lambda > 0$). Show that the
only unbiased estimator of $e^{-2\lambda}$ is the estimator $\delta(X)$ such
that $\delta(X) = 1$ if $X$ is an even integer and $\delta(X) = -1$ if $X$ is
an odd integer.


10. Consider an infinite sequence of Bernoulli trials for
which the parameter $p$ is unknown ($0 < p < 1$), and suppose
that sampling is continued until exactly $k$ successes
have been obtained, where $k$ is a fixed integer ($k \geq 2$). Let
$N$ denote the total number of trials that are needed to obtain
the $k$ successes. Show that the estimator $(k - 1)/(N - 1)$
is an unbiased estimator of $p$.


11. Suppose that a certain drug is to be administered to
two different types of animals $A$ and $B$. It is known that
the mean response of animals of type $A$ is the same as
the mean response of animals of type $B$, but the common
value $\theta$ of this mean is unknown and must be estimated. It
is also known that the variance of the response of animals
of type $A$ is four times as large as the variance of the
response of animals of type $B$. Let $X_1, \ldots, X_m$ denote
the responses of a random sample of $m$ animals of type $A$,
and let $Y_1, \ldots, Y_n$ denote the responses of an independent
random sample of $n$ animals of type $B$. Finally, consider
the estimator $\hat{\theta} = \alpha \overline{X}_m + (1 - \alpha) \overline{Y}_n$.

a. For what values of $\alpha$, $m$, and $n$ is $\hat{\theta}$ an unbiased
estimator of $\theta$?

b. For fixed values of $m$ and $n$, what value of $\alpha$ yields an
unbiased estimator with minimum variance?


12. Suppose that a certain population of individuals is
composed of $k$ different strata ($k \geq 2$), and that for
$i = 1, \ldots, k$, the proportion of individuals in the total population
who belong to stratum $i$ is $p_i$, where $p_i > 0$ and
$\sum_{i=1}^{k} p_i = 1$. We are interested in estimating the mean
value $\mu$ of a certain characteristic among the total population.
Among the individuals in stratum $i$, this characteristic
has mean $\mu_i$ and variance $\sigma_i^2$, where the value of
$\mu_i$ is unknown and the value of $\sigma_i^2$ is known. Suppose that
a stratified sample is taken from the population as follows:
From each stratum $i$, a random sample of $n_i$ individuals is
taken, and the characteristic is measured for each of these
individuals. The samples from the $k$ strata are taken independently
of each other. Let $\overline{X}_i$ denote the average of the
$n_i$ measurements in the sample from stratum $i$.

a. Show that $\mu = \sum_{i=1}^{k} p_i \mu_i$, and show also that
$\hat{\mu} = \sum_{i=1}^{k} p_i \overline{X}_i$ is an unbiased estimator of $\mu$.

b. Let $n = \sum_{i=1}^{k} n_i$ denote the total number of observations
in the $k$ samples. For a fixed value of $n$, find the
values of $n_1, \ldots, n_k$ for which the variance of $\hat{\mu}$ will
be a minimum.


13. Suppose that $X_1, \ldots, X_n$ form a random sample from
a distribution for which the p.d.f. or the p.f. is $f(x|\theta)$,
where the value of the parameter $\theta$ is unknown. Let
$\mathbf{X} = (X_1, \ldots, X_n)$, and let $T$ be a statistic. Assume that $\delta(\mathbf{X})$
is an unbiased estimator of $\theta$ such that $E_\theta[\delta(\mathbf{X})|T]$ does
not depend on $\theta$. (If $T$ is a sufficient statistic, as defined in
Sec. 7.7, then this will be true for every estimator $\delta$. The
condition also holds in other examples.) Let $\delta_0(T)$ denote
the conditional mean of $\delta(\mathbf{X})$ given $T$.

a. Show that $\delta_0(T)$ is also an unbiased estimator of $\theta$.

b. Show that $\mathrm{Var}_\theta(\delta_0) \leq \mathrm{Var}_\theta(\delta)$ for every possible
value of $\theta$. Hint: Use the result of Exercise 11 in
Sec. 4.7.


14. Suppose that $X_1, \ldots, X_n$ form a random sample from
the uniform distribution on the interval $[0, \theta]$, where
the value of the parameter $\theta$ is unknown; and let
$Y_n = \max(X_1, \ldots, X_n)$. Show that $[(n + 1)/n]Y_n$ is an unbiased
estimator of $\theta$.


15. Suppose that a random variable $X$ can take only the
five values $x = 1, 2, 3, 4, 5$ with the following probabilities:

$$f(1|\theta) = \theta^3, \quad f(2|\theta) = \theta^2(1 - \theta),$$
$$f(3|\theta) = 2\theta(1 - \theta), \quad f(4|\theta) = \theta(1 - \theta)^2,$$
$$f(5|\theta) = (1 - \theta)^3.$$

Here, the value of the parameter $\theta$ is unknown ($0 < \theta < 1$).

a. Verify that the sum of the five given probabilities is
1 for every value of $\theta$.

b. Consider an estimator $\delta_c(X)$ that has the following
form:

$$\delta_c(1) = 1, \quad \delta_c(2) = 2 - 2c, \quad \delta_c(3) = c, \quad \delta_c(4) = 1 - 2c, \quad \delta_c(5) = 0.$$

Show that for each constant $c$, $\delta_c(X)$ is an unbiased
estimator of $\theta$.

c. Let $\theta_0$ be a number such that $0 < \theta_0 < 1$. Determine
a constant $c_0$ such that when $\theta = \theta_0$, the variance of
$\delta_{c_0}(X)$ is smaller than the variance of $\delta_c(X)$ for every
other value of $c$.


16. Reconsider the conditions of Exercise 3. Suppose that
$n = 2$, and we observe $X_1 = 2$ and $X_2 = -1$. Compute the
value of the unbiased estimator of $[E(X)]^2$ found in Exercise 3.
Describe a flaw that you have discovered in the
estimator.


* 8.8 Fisher Information 


This section introduces a method for measuring the amount of information that 
a sample of data contains about an unknown parameter. This measure has the 
intuitive properties that more data provide more information, and more precise 
data provide more information. The information measure can be used to find 
bounds on the variances of estimators, and it can be used to approximate the 
variances of estimators obtained from large samples. 


Definition and Properties of Fisher Information 


Example 8.8.1

Studying Customer Arrivals. A store owner is interested in learning about customer
arrivals. She models arrivals during the day as a Poisson process (see Definition 5.4.2)
with unknown rate $\theta$. She thinks of two different possible sampling plans to obtain
information about customer arrivals. One plan is to choose a fixed number, $n$, of
customers and to see how long, $X$, it takes until $n$ customers arrive. The other plan
is to observe for a fixed length of time, $t$, and count how many customers, $Y$, arrive
during time $t$. That is, the store owner can either observe a Poisson random variable,
$Y$, with mean $t\theta$ or observe a gamma random variable, $X$, with parameters $n$ and $\theta$.
Is there any way to address the question of which sampling plan is likely to be more
informative? ◄


The Fisher information is one property of a distribution that can be used to 
measure how much information one is likely to obtain from a random variable or 
a random sample. 


The Fisher Information in a Single Random Variable In this section, we shall
introduce a concept, called the Fisher information, that enters various aspects of
the theory of statistical inference, and we shall describe a few uses of this concept.

Consider a random variable $X$ for which the p.f. or the p.d.f. is $f(x|\theta)$. It is
assumed that $f(x|\theta)$ involves a parameter $\theta$ whose value is unknown but must lie in a
given open interval $\Omega$ of the real line. Furthermore, it is assumed that $X$ takes values
in a specified sample space $S$, and $f(x|\theta) > 0$ for each value of $x \in S$ and each value
of $\theta \in \Omega$. This assumption eliminates from consideration the uniform distribution on
the interval $[0, \theta]$, where the value of $\theta$ is unknown, because, for that distribution,
$f(x|\theta) > 0$ only when $x < \theta$ and $f(x|\theta) = 0$ when $x > \theta$. The assumption does not
eliminate any distribution where the set of values of $x$ for which $f(x|\theta) > 0$ is a fixed
set that does not depend on $\theta$.


Next, we define $\lambda(x|\theta)$ as follows:

$$\lambda(x|\theta) = \log f(x|\theta).$$

It is assumed that for each value of $x \in S$, the p.f. or p.d.f. $f(x|\theta)$ is a twice
differentiable function of $\theta$, and we let

$$\lambda'(x|\theta) = \frac{\partial}{\partial\theta} \lambda(x|\theta) \quad \text{and} \quad \lambda''(x|\theta) = \frac{\partial^2}{\partial\theta^2} \lambda(x|\theta).$$


Definition 8.8.1

Fisher Information in a Random Variable. Let $X$ be a random variable whose distribution
depends on a parameter $\theta$ that takes values in an open interval $\Omega$ of the real line.
Let the p.f. or p.d.f. of $X$ be $f(x|\theta)$. Assume that the set of $x$ such that $f(x|\theta) > 0$ is
the same for all $\theta$ and that $\lambda(x|\theta) = \log f(x|\theta)$ is twice differentiable as a function of
$\theta$. The Fisher information $I(\theta)$ in the random variable $X$ is defined as

$$I(\theta) = E_\theta\{[\lambda'(X|\theta)]^2\}. \tag{8.8.1}$$

Thus, if $f(x|\theta)$ is a p.d.f., then

$$I(\theta) = \int_S [\lambda'(x|\theta)]^2 f(x|\theta)\, dx. \tag{8.8.2}$$

If $f(x|\theta)$ is a p.f., the integral in Eq. (8.8.2) is replaced by a sum over the points in $S$.
In the discussion that follows, we shall assume for convenience that $f(x|\theta)$ is a p.d.f.
However, all the results hold also when $f(x|\theta)$ is a p.f.

An alternative method for calculating the Fisher information sometimes proves
more useful.


Theorem 8.8.1

Assume the conditions of Definition 8.8.1. Also, assume that two derivatives of
$\int_S f(x|\theta)\, dx$ with respect to $\theta$ can be calculated by reversing the order of integration
and differentiation. Then the Fisher information also equals

$$I(\theta) = -E_\theta[\lambda''(X|\theta)]. \tag{8.8.3}$$

Another expression for the Fisher information is

$$I(\theta) = \mathrm{Var}_\theta[\lambda'(X|\theta)]. \tag{8.8.4}$$
Proof We know that $\int_S f(x|\theta)\, dx = 1$ for every value of $\theta \in \Omega$. Therefore, if the
integral on the left side of this equation is differentiated with respect to $\theta$, the result
will be 0. We have assumed that we can reverse the order in which we perform the
integration with respect to $x$ and the differentiation with respect to $\theta$, and will still
obtain the value 0. In other words, we shall assume that we can take the derivative
inside the integral sign and obtain

$$\int_S f'(x|\theta)\, dx = 0 \quad \text{for } \theta \in \Omega. \tag{8.8.5}$$

Furthermore, we have assumed that we can take a second derivative with respect to
$\theta$ “inside the integral sign” and obtain

$$\int_S f''(x|\theta)\, dx = 0 \quad \text{for } \theta \in \Omega. \tag{8.8.6}$$

Since $\lambda'(x|\theta) = f'(x|\theta)/f(x|\theta)$, then

$$E_\theta[\lambda'(X|\theta)] = \int_S \lambda'(x|\theta) f(x|\theta)\, dx = \int_S f'(x|\theta)\, dx.$$

Hence, it follows from Eq. (8.8.5) that

$$E_\theta[\lambda'(X|\theta)] = 0. \tag{8.8.7}$$

Since the mean of $\lambda'(X|\theta)$ is 0, it follows from Eq. (8.8.1) that Eq. (8.8.4) holds.
Next, note that

$$\lambda''(x|\theta) = \frac{f(x|\theta) f''(x|\theta) - [f'(x|\theta)]^2}{[f(x|\theta)]^2} = \frac{f''(x|\theta)}{f(x|\theta)} - [\lambda'(x|\theta)]^2.$$

Therefore,

$$E_\theta[\lambda''(X|\theta)] = \int_S f''(x|\theta)\, dx - I(\theta). \tag{8.8.8}$$

It follows from Eqs. (8.8.8) and (8.8.6) that Eq. (8.8.3) holds. ■


In many problems, it is easier to determine the value of $I(\theta)$ from Eq. (8.8.3) than
from Eqs. (8.8.1) or (8.8.4).


Example 8.8.2

The Bernoulli Distributions. Suppose that $X$ has the Bernoulli distribution with
parameter $p$. We shall determine the Fisher information $I(p)$ in $X$.

In this example, the possible values of $X$ are the two values 0 and 1. For $x = 0$
or 1,

$$\lambda(x|p) = \log f(x|p) = x \log p + (1 - x) \log(1 - p).$$

Hence,

$$\lambda'(x|p) = \frac{x}{p} - \frac{1 - x}{1 - p}$$

and

$$\lambda''(x|p) = -\left[\frac{x}{p^2} + \frac{1 - x}{(1 - p)^2}\right].$$

Since $E(X) = p$, the Fisher information is

$$I(p) = -E[\lambda''(X|p)] = \frac{1}{p} + \frac{1}{1 - p} = \frac{1}{p(1 - p)}.$$

Recall from Eq. (4.3.3) that $\mathrm{Var}(X) = p(1 - p)$, so the more precise (smaller variance)
$X$ is, the more information it provides.

In this example, it can be readily verified that the assumptions made in the proof
of Theorem 8.8.1 are satisfied. Indeed, because $X$ can take only the two values 0
and 1, the integrals in Eqs. (8.8.5) and (8.8.6) reduce to summations over the two
values $x = 0$ and $x = 1$. Since it is always possible to take a derivative “inside a finite
summation” and to differentiate the sum term by term, Eqs. (8.8.5) and (8.8.6) must
be satisfied. ◄
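
As an informal check that Eqs. (8.8.1), (8.8.3), and (8.8.4) give the same value in the Bernoulli case, the following Python sketch evaluates all three expressions by summing over the two points $x = 0$ and $x = 1$. The function name is an illustrative assumption.

```python
def bernoulli_information_three_ways(p):
    """Return (E[score^2], Var[score], -E[second derivative]) for Bernoulli(p).

    "score" denotes the derivative of log f(x|p) with respect to p.
    """
    pf = {0: 1.0 - p, 1: p}
    score = {x: x / p - (1 - x) / (1.0 - p) for x in (0, 1)}
    second = {x: -(x / p**2 + (1 - x) / (1.0 - p) ** 2) for x in (0, 1)}

    mean_score = sum(score[x] * pf[x] for x in (0, 1))          # Eq. (8.8.7): should be 0
    e_score_sq = sum(score[x] ** 2 * pf[x] for x in (0, 1))     # Eq. (8.8.1)
    var_score = e_score_sq - mean_score**2                      # Eq. (8.8.4)
    neg_e_second = -sum(second[x] * pf[x] for x in (0, 1))      # Eq. (8.8.3)
    return e_score_sq, var_score, neg_e_second


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.8):
        print(p, bernoulli_information_three_ways(p), 1.0 / (p * (1.0 - p)))
```

Each of the three returned numbers should agree with $1/[p(1 - p)]$ up to rounding error.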


Example 8.8.3

The Normal Distributions. Suppose that $X$ has the normal distribution with unknown
mean $\mu$ and known variance $\sigma^2$. We shall determine the Fisher information $I(\mu)$ in $X$.
For $-\infty < x < \infty$,

$$\lambda(x|\mu) = -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}.$$

Hence,

$$\lambda'(x|\mu) = \frac{x - \mu}{\sigma^2} \quad \text{and} \quad \lambda''(x|\mu) = -\frac{1}{\sigma^2}.$$

It now follows from Eq. (8.8.3) that the Fisher information is

$$I(\mu) = \frac{1}{\sigma^2}.$$

Since $\mathrm{Var}(X) = \sigma^2$, we see again that the more precise (smaller variance) $X$ is, the
more information it provides.

In this example, it can be verified directly (see Exercise 1 at the end of this
section) that Eqs. (8.8.5) and (8.8.6) are satisfied. ◄


It should be emphasized that the concept of Fisher information cannot be applied
to a distribution, such as the uniform distribution on the interval $[0, \theta]$, for which the
necessary assumptions are not satisfied.


The Fisher Information in a Random Sample When we have a random sample 
from a distribution, the Fisher information is defined in an analogous manner. In- 
deed, Definition 8.8.2 subsumes Definition 8.8.1 as the special case in which n = 1. 


Definition 8.8.2

Fisher Information in a Random Sample. Suppose that $\mathbf{X} = (X_1, \ldots, X_n)$ form a random
sample from a distribution for which the p.f. or p.d.f. is $f(x|\theta)$, where the value
of the parameter $\theta$ must lie in an open interval $\Omega$ of the real line. Let $f_n(\mathbf{x}|\theta)$ denote
the joint p.f. or joint p.d.f. of $\mathbf{X}$. Define

$$\lambda_n(\mathbf{x}|\theta) = \log f_n(\mathbf{x}|\theta). \tag{8.8.9}$$

Assume that the set of $\mathbf{x}$ such that $f_n(\mathbf{x}|\theta) > 0$ is the same for all $\theta$ and that $\log f_n(\mathbf{x}|\theta)$
is twice differentiable with respect to $\theta$. The Fisher information $I_n(\theta)$ in the random
sample $\mathbf{X}$ is defined as

$$I_n(\theta) = E_\theta\{[\lambda_n'(\mathbf{X}|\theta)]^2\}.$$

For continuous distributions, the Fisher information $I_n(\theta)$ in the entire sample is
given by the following $n$-dimensional integral:

$$I_n(\theta) = \int_S \cdots \int_S [\lambda_n'(\mathbf{x}|\theta)]^2 f_n(\mathbf{x}|\theta)\, dx_1 \cdots dx_n.$$

For discrete distributions, replace the $n$-dimensional integral by an $n$-fold summation.
Furthermore, if we again assume that derivatives can be passed under the integrals,
then we may express $I_n(\theta)$ in either of the following two ways:

$$I_n(\theta) = \mathrm{Var}_\theta[\lambda_n'(\mathbf{X}|\theta)] \tag{8.8.10}$$

or

$$I_n(\theta) = -E_\theta[\lambda_n''(\mathbf{X}|\theta)]. \tag{8.8.11}$$

We shall now show that there is a simple relation between the Fisher information
$I_n(\theta)$ in the entire sample and the Fisher information $I(\theta)$ in a single observation $X_i$.


Theorem 8.8.2

Under the conditions of Definitions 8.8.1 and 8.8.2,

$$I_n(\theta) = nI(\theta). \tag{8.8.12}$$

In words, the Fisher information in a random sample of $n$ observations is simply $n$
times the Fisher information in a single observation.


Proof Since $f_n(\mathbf{x}|\theta) = f(x_1|\theta) \cdots f(x_n|\theta)$, it follows that

$$\lambda_n(\mathbf{x}|\theta) = \sum_{i=1}^{n} \lambda(x_i|\theta).$$

Hence,

$$\lambda_n''(\mathbf{x}|\theta) = \sum_{i=1}^{n} \lambda''(x_i|\theta). \tag{8.8.13}$$

Since each observation $X_i$ has the p.d.f. $f(x|\theta)$, the Fisher information in each $X_i$
is $I(\theta)$. It follows from Eqs. (8.8.3) and (8.8.11) that by taking expectations on both
sides of Eq. (8.8.13), we obtain Eq. (8.8.12). ■
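
The relation $I_n(\theta) = nI(\theta)$ can be illustrated by brute force for a small Bernoulli sample. The sketch below is an assumption-laden illustration, not part of the text: it uses the Bernoulli model of Example 8.8.2, and the function names are hypothetical. It computes $E\{[\lambda_n'(\mathbf{X}|p)]^2\}$ by summing over all $2^n$ possible samples and compares the result with $n/[p(1 - p)]$.

```python
from itertools import product


def single_information(p):
    """Fisher information in one Bernoulli(p) observation, 1 / [p(1 - p)]."""
    return 1.0 / (p * (1.0 - p))


def sample_information(p, n):
    """E{[lambda_n'(X|p)]^2} for n Bernoulli(p) trials, summed over all 2^n outcomes."""
    total = 0.0
    for xs in product((0, 1), repeat=n):
        successes = sum(xs)
        prob = p**successes * (1.0 - p) ** (n - successes)       # joint p.f.
        score = successes / p - (n - successes) / (1.0 - p)      # lambda_n'(x|p)
        total += score**2 * prob
    return total


if __name__ == "__main__":
    p, n = 0.3, 5
    print(sample_information(p, n), n * single_information(p))
```

The enumeration is only practical for small $n$, but it makes the additivity of the information concrete.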


Example 8.8.4

Studying Customer Arrivals. Return to the store owner in Example 8.8.1 who is trying
to choose between sampling a Poisson random variable, $Y$, with mean $t\theta$ or sampling
a gamma random variable, $X$, with parameters $n$ and $\theta$. The reader can compute the
Fisher information in each random variable in Exercises 3 and 19 in this section. We
shall label them $I_Y(\theta)$ and $I_X(\theta)$. They are

$$I_X(\theta) = \frac{n}{\theta^2} \quad \text{and} \quad I_Y(\theta) = \frac{t}{\theta}.$$

Which is larger will clearly depend on the particular values of $n$, $t$, and $\theta$. Both $n$ and
$t$ can be chosen by the store owner, but $\theta$ is unknown. In order for $I_X(\theta) = I_Y(\theta)$, it
is necessary and sufficient that $n = t\theta$. This relation actually makes intuitive sense.
For example, if the store owner chooses to observe $Y$, then the total number $N$ of
customers observed will be random and $N = Y$. The mean of $N$ is then $E(Y) = t\theta$.
Similarly, if the store owner chooses to observe $X$, then the length of time $T$ that it
takes to observe $n$ customers will be random. In fact, $T = X$, and the mean of $T\theta$
is $n$. So long as the store owner is comparing sampling plans that are expected to
observe the same numbers of customers or observe for the same length of time, the
two sampling plans should provide the same amount of information. ◄
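
The comparison of the two sampling plans is easy to tabulate for trial values of $\theta$. The following sketch simply codes the two formulas $I_X(\theta) = n/\theta^2$ and $I_Y(\theta) = t/\theta$; the function names and the numerical values of $n$, $t$, and $\theta$ are hypothetical choices made for illustration.

```python
def info_gamma_plan(n, theta):
    """Fisher information when waiting for n customers: I_X(theta) = n / theta^2."""
    return n / theta**2


def info_poisson_plan(t, theta):
    """Fisher information when observing for time t: I_Y(theta) = t / theta."""
    return t / theta


if __name__ == "__main__":
    t = 3.0
    for theta in (2.0, 4.0, 6.0):
        for n in (6, 12, 18):
            # The two values coincide exactly when n = t * theta.
            print(theta, n, info_gamma_plan(n, theta), info_poisson_plan(t, theta))
```

Because $\theta$ is unknown, a table of this kind can only suggest which plan is preferable over a plausible range of arrival rates.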


The Information Inequality 


Example 8.8.5

Studying Customer Arrivals. Another way that the store owner in Example 8.8.4 could
choose between the two sampling plans is to compare the estimators that she will
use to make inferential statements about customer arrivals. For example, she may
want to estimate $\theta$, the rate of customer arrivals. Alternatively, she may want to
estimate $1/\theta$, the mean time between customer arrivals. Each sampling plan lends
itself to estimation of both parameters. Indeed, there are unbiased estimators of both
parameters available from at least one of these sampling plans. ◄


As one application of the results that have been derived concerning Fisher
information, we shall show how the Fisher information can be used to determine
a lower bound for the variance of an arbitrary estimator of the parameter $\theta$ in a
given problem. The following result was independently developed by H. Cramér and
C. R. Rao during the 1940s.


Theorem 8.8.3

Cramér-Rao (Information) Inequality. Suppose that $\mathbf{X} = (X_1, \ldots, X_n)$ form a random
sample from a distribution for which the p.d.f. is $f(x|\theta)$. Suppose also that all the
assumptions which have been made about $f(x|\theta)$ thus far in this section continue to
hold. Let $T = r(\mathbf{X})$ be a statistic with finite variance. Let $m(\theta) = E_\theta(T)$. Assume that
$m(\theta)$ is a differentiable function of $\theta$. Then

$$\mathrm{Var}_\theta(T) \geq \frac{[m'(\theta)]^2}{nI(\theta)}. \tag{8.8.14}$$

There will be equality in (8.8.14) if and only if there exist functions $u(\theta)$ and $v(\theta)$ that
may depend on $\theta$ but do not depend on $\mathbf{X}$ and that satisfy the relation

$$T = u(\theta)\lambda_n'(\mathbf{X}|\theta) + v(\theta). \tag{8.8.15}$$
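Proof The inequality derives from applying Theorem 4.6.3 to the covariance between
$T$ and the random variable $\lambda_n'(\mathbf{X}|\theta)$, the derivative of the function defined in Eq. (8.8.9).
Since $\lambda_n'(\mathbf{x}|\theta) = f_n'(\mathbf{x}|\theta)/f_n(\mathbf{x}|\theta)$, it follows just as for a single observation that

$$E_\theta[\lambda_n'(\mathbf{X}|\theta)] = \int_S \cdots \int_S f_n'(\mathbf{x}|\theta)\, dx_1 \cdots dx_n = 0.$$

Therefore,

$$\begin{aligned}
\mathrm{Cov}_\theta[T, \lambda_n'(\mathbf{X}|\theta)] &= E_\theta[T \lambda_n'(\mathbf{X}|\theta)] \\
&= \int_S \cdots \int_S r(\mathbf{x}) \lambda_n'(\mathbf{x}|\theta) f_n(\mathbf{x}|\theta)\, dx_1 \cdots dx_n \\
&= \int_S \cdots \int_S r(\mathbf{x}) f_n'(\mathbf{x}|\theta)\, dx_1 \cdots dx_n.
\end{aligned} \tag{8.8.16}$$

Next, write

$$m(\theta) = \int_S \cdots \int_S r(\mathbf{x}) f_n(\mathbf{x}|\theta)\, dx_1 \cdots dx_n \quad \text{for } \theta \in \Omega. \tag{8.8.17}$$

Finally, suppose that when both sides of Eq. (8.8.17) are differentiated with respect
to $\theta$, the derivative can be taken “inside the integrals” on the right side. Then

$$m'(\theta) = \int_S \cdots \int_S r(\mathbf{x}) f_n'(\mathbf{x}|\theta)\, dx_1 \cdots dx_n \quad \text{for } \theta \in \Omega. \tag{8.8.18}$$

It follows from Eqs. (8.8.16) and (8.8.18) that

$$\mathrm{Cov}_\theta[T, \lambda_n'(\mathbf{X}|\theta)] = m'(\theta) \quad \text{for } \theta \in \Omega. \tag{8.8.19}$$

Theorem 4.6.3 says that

$$\{\mathrm{Cov}_\theta[T, \lambda_n'(\mathbf{X}|\theta)]\}^2 \leq \mathrm{Var}_\theta(T)\, \mathrm{Var}_\theta[\lambda_n'(\mathbf{X}|\theta)]. \tag{8.8.20}$$

Therefore, it follows from Eqs. (8.8.10), (8.8.12), (8.8.19), and (8.8.20) that Eq.
(8.8.14) holds.

Finally, notice that (8.8.14) is an equality if and only if (8.8.20) is an equality. This,
in turn, is an equality if and only if there exist nonzero constants $a$ and $b$ and a constant
$c$ such that $aT + b\lambda_n'(\mathbf{X}|\theta) = c$. This last claim follows from the similar statement in
Theorem 4.6.3. In all of the calculations concerned with Fisher information, we have
been treating $\theta$ as a constant; hence, the constants $a$, $b$, and $c$ just mentioned can
depend on $\theta$, but must not depend on $\mathbf{X}$. Solving for $T$ gives Eq. (8.8.15) with
$u(\theta) = -b/a$ and $v(\theta) = c/a$. ■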


The following simple corollary to Theorem 8.8.3 gives a lower bound on the
variance of an unbiased estimator of $\theta$.


Corollary 8.8.1

Cramér-Rao Lower Bound on the Variance of an Unbiased Estimator. Suppose that the
assumptions of Theorem 8.8.3 hold. Let $T$ be an unbiased estimator of $\theta$. Then

$$\mathrm{Var}_\theta(T) \geq \frac{1}{nI(\theta)}.$$

Proof Because $T$ is an unbiased estimator of $\theta$, $m(\theta) = \theta$ and $m'(\theta) = 1$ for every
value of $\theta \in \Omega$. Now apply Eq. (8.8.14). ■


In words, Corollary 8.8.1 says that the variance of an unbiased estimator of $\theta$ cannot
be smaller than the reciprocal of the Fisher information in the sample.


Example 8.8.6

Unbiased Estimation of the Parameter of an Exponential Distribution. Let $X_1, \ldots, X_n$
be a random sample of size $n > 2$ from the exponential distribution with parameter
$\beta$. That is, each $X_i$ has p.d.f. $f(x|\beta) = \beta \exp(-\beta x)$ for $x > 0$. Then

$$\lambda(x|\beta) = \log(\beta) - \beta x,$$
$$\lambda'(x|\beta) = \frac{1}{\beta} - x,$$
$$\lambda''(x|\beta) = -\frac{1}{\beta^2}.$$

It can be verified that the conditions required to establish (8.8.3) hold in this example.
Then the Fisher information in one observation is

$$I(\beta) = -E_\beta\left[-\frac{1}{\beta^2}\right] = \frac{1}{\beta^2}.$$

The information in the whole sample is then $I_n(\beta) = n/\beta^2$. Consider the estimator
$T = (n - 1)/\sum_{i=1}^{n} X_i$. Theorem 5.7.7 says that $\sum_{i=1}^{n} X_i$ has the gamma distribution
with parameters $n$ and $\beta$. In Exercise 21 in Sec. 5.7, you proved that the mean and
variance of $1/\sum_{i=1}^{n} X_i$ are $\beta/(n - 1)$ and $\beta^2/[(n - 1)^2(n - 2)]$, respectively. Thus, $T$ is
unbiased and its variance is $\beta^2/(n - 2)$. The variance is indeed larger than the lower
bound, $1/I_n(\beta) = \beta^2/n$. The reason the inequality is strict is that $T$ is not a linear
function of $\lambda_n'(\mathbf{X}|\beta)$. Indeed, $T$ is 1 over a linear function of $\lambda_n'(\mathbf{X}|\beta)$.

On the other hand, if we wish to estimate $m(\beta) = 1/\beta$, $U = \overline{X}_n$ is an unbiased
estimator with variance $1/(n\beta^2)$. The information inequality says that the lower
bound on the variance of an estimator of $1/\beta$ is

$$\frac{m'(\beta)^2}{n/\beta^2} = \frac{(-1/\beta^2)^2}{n/\beta^2} = \frac{1}{n\beta^2}.$$

In this case, we see that there is equality in (8.8.14). ◄
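
A small simulation can make the strict inequality for $T = (n - 1)/\sum X_i$ visible. The sketch below is illustrative only: the values $n = 10$ and $\beta = 2$, the seed, and the function names are assumptions, and the sum of the $n$ exponential observations is drawn directly from its gamma distribution (Python's `random.gammavariate` takes a shape and a scale, so the scale is $1/\beta$).

```python
import random


def simulate_T(n, beta, reps, seed=0):
    """Draw `reps` values of T = (n - 1) / sum(X_i) for exponential(beta) samples.

    The sum of n exponential(beta) variables has the gamma distribution with
    parameters n and beta, i.e. shape n and scale 1/beta.
    """
    rng = random.Random(seed)
    return [(n - 1) / rng.gammavariate(n, 1.0 / beta) for _ in range(reps)]


def mean_and_variance(values):
    """Sample mean and (unbiased) sample variance of a list of numbers."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v


if __name__ == "__main__":
    n, beta = 10, 2.0
    mean_T, var_T = mean_and_variance(simulate_T(n, beta, reps=100_000))
    print("mean of T:", mean_T, "target beta:", beta)
    print("variance of T:", var_T, "theory:", beta**2 / (n - 2))
    print("Cramer-Rao lower bound:", beta**2 / n)
```

With these assumed values the theoretical variance is $\beta^2/(n - 2) = 0.5$, while the lower bound is $\beta^2/n = 0.4$; the simulated variance should land near the former.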


Example 8.8.7

Studying Customer Arrivals. Return to the store owner in Example 8.8.5 who wants
to compare the estimators of $\theta$ and $1/\theta$ that she could compute from either the
Poisson random variable $Y$ or the gamma random variable $X$. The case of unbiased
estimators based on $X$ was already handled in Example 8.8.6, where our $X$ has the
same distribution as $\sum_{i=1}^{n} X_i$ in that example when $\theta = \beta$. Hence, $X/n$ is an unbiased
estimator of $1/\theta$ whose variance equals the Cramér-Rao lower bound, and $(n - 1)/X$
is an unbiased estimator of $\theta$ whose variance is strictly larger than the lower bound.
Since $E_\theta(Y) = t\theta$, we see that $Y/t$ is an unbiased estimator of $\theta$ whose variance is also
known to be $\theta/t$, which is the Cramér-Rao lower bound. Unfortunately, there is no
unbiased estimator of $1/\theta$ based on $Y$ alone. The estimator $\delta(Y) = t/(Y + 1)$ satisfies

$$E_\theta[\delta(Y)] = \frac{1}{\theta}\left[1 - e^{-t\theta}\right].$$

If $t$ is large and $\theta$ is not too small, the bias will be small, but it is impossible to find an
unbiased estimator. The reason is that the mean of every function of $Y$ is $\exp(-t\theta)$
times a power series in $\theta$. Every such function is differentiable in a neighborhood of
$\theta = 0$. The function $1/\theta$ is not differentiable at $\theta = 0$. ◄
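
The formula for the mean of $\delta(Y) = t/(Y + 1)$ can be confirmed by summing the Poisson series numerically. In the sketch below the function names, the truncation point `terms=400`, and the trial values of $t$ and $\theta$ are illustrative assumptions.

```python
import math


def poisson_pmf(y, mean):
    """Poisson p.f. at y for the given mean > 0, computed on the log scale."""
    return math.exp(-mean + y * math.log(mean) - math.lgamma(y + 1))


def mean_of_delta(t, theta, terms=400):
    """Approximate E[t / (Y + 1)] when Y is Poisson with mean t * theta.

    The series is truncated after `terms` terms, which is an assumption that
    is adequate when t * theta is not too large.
    """
    mu = t * theta
    return sum(t / (y + 1) * poisson_pmf(y, mu) for y in range(terms))


def closed_form(t, theta):
    """The closed form (1 / theta) * [1 - exp(-t * theta)] from the example."""
    return (1.0 - math.exp(-t * theta)) / theta


if __name__ == "__main__":
    for t, theta in ((1.0, 0.5), (5.0, 2.0), (20.0, 1.5)):
        print(t, theta, mean_of_delta(t, theta), closed_form(t, theta), 1.0 / theta)
```

The closed form is always slightly below $1/\theta$, and the gap $e^{-t\theta}/\theta$ shrinks quickly as $t\theta$ grows.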


Efficient Estimators 


Example 8.8.8

Variance of a Poisson Distribution. In Example 8.7.5, we presented a collection of
different unbiased estimators of the variance of a Poisson distribution based
on a random sample $\mathbf{X} = (X_1, \ldots, X_n)$ from that distribution. After that example,
we made the claim that one of the estimators has the smallest variance among the
entire collection. The information inequality gives us a way to address comparisons
of such collections of estimators without necessarily listing them all or computing
their variances. ◄


An estimator whose variance equals the Cramér-Rao lower bound makes the
most efficient use of the data $\mathbf{X}$ in some sense.


Definition 8.8.3

Efficient Estimator. It is said that an estimator $T$ is an efficient estimator of its
expectation $m(\theta)$ if there is equality in (8.8.14) for every value of $\theta \in \Omega$.


One difficulty with Definition 8.8.3 is that, in a given problem, there may be no
estimator of a particular function $m(\theta)$ whose variance actually attains the Cramér-Rao
lower bound. For example, if the random variable $X$ has the normal distribution
for which the mean is 0 and the standard deviation $\sigma$ is unknown ($\sigma > 0$), then it
can be shown that the variance of every unbiased estimator of $\sigma$ based on the single
observation $X$ is strictly greater than $1/I(\sigma)$ for every value of $\sigma > 0$ (see Exercise 9).
In Example 8.8.6, no efficient estimator of $\beta$ exists.

On the other hand, in many standard estimation problems there do exist efficient
estimators. Of course, the estimator that is identically equal to a constant is an efficient
estimator of that constant, since the variance of this estimator is 0. However, as
we shall now show, there are often efficient estimators of more interesting functions
of $\theta$ as well.

According to Theorem 8.8.3, there will be equality in the information inequality
(8.8.14) if and only if the estimator $T$ is a linear function of $\lambda_n'(\mathbf{X}|\theta)$. It is possible
that the only efficient estimators in a given problem will be constants. The reason is
as follows: Because $T$ is an estimator, it cannot involve the parameter $\theta$. Therefore,
in order for $T$ to be efficient, it must be possible to find functions $u(\theta)$ and $v(\theta)$ such
that the parameter $\theta$ will actually be canceled from the right side of Eq. (8.8.15), and
the value of $T$ will depend only on the observations $\mathbf{X}$ and not on $\theta$.


Example 8.8.9

Sampling from a Poisson Distribution. Suppose that $X_1, \ldots, X_n$ form a random sample
from the Poisson distribution with unknown mean $\theta$ ($\theta > 0$). We shall show that $\overline{X}_n$
is an efficient estimator of $\theta$.
The joint p.f. of $X_1, \ldots, X_n$ can be written in the form

$$f_n(\mathbf{x}|\theta) = \frac{e^{-n\theta} \theta^{n\overline{x}_n}}{\prod_{i=1}^{n} (x_i!)}.$$

Therefore,

$$\lambda_n(\mathbf{X}|\theta) = -n\theta + n\overline{X}_n \log\theta - \sum_{i=1}^{n} \log(X_i!)$$

and

$$\lambda_n'(\mathbf{X}|\theta) = -n + \frac{n\overline{X}_n}{\theta}. \tag{8.8.21}$$

If we now let $u(\theta) = \theta/n$ and $v(\theta) = \theta$, then it is found from Eq. (8.8.21) that

$$\overline{X}_n = u(\theta)\lambda_n'(\mathbf{X}|\theta) + v(\theta).$$

Since the statistic $\overline{X}_n$ has been represented as a linear function of $\lambda_n'(\mathbf{X}|\theta)$, it
follows that $\overline{X}_n$ is an efficient estimator of its expectation $\theta$. In other words, the
variance of $\overline{X}_n$ will attain the lower bound given by the information inequality, which
in this example is $\theta/n$ (see Exercise 3). This fact can also be verified directly. ◄
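
The direct verification mentioned at the end of the example can be sketched as follows, using only the facts that a Poisson observation with mean $\theta$ has $E_\theta(X_1) = \mathrm{Var}_\theta(X_1) = \theta$ and that the observations are independent.

```latex
\begin{aligned}
\mathrm{Var}_\theta(\overline{X}_n) &= \frac{\mathrm{Var}_\theta(X_1)}{n} = \frac{\theta}{n}, \\
\lambda''(x|\theta) &= \frac{\partial^2}{\partial\theta^2}\left[-\theta + x\log\theta - \log(x!)\right] = -\frac{x}{\theta^2}, \\
I(\theta) &= -E_\theta[\lambda''(X_1|\theta)] = \frac{E_\theta(X_1)}{\theta^2} = \frac{1}{\theta}, \\
\frac{1}{nI(\theta)} &= \frac{\theta}{n} = \mathrm{Var}_\theta(\overline{X}_n).
\end{aligned}
```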


Unbiased Estimators with Minimum Variance Suppose that in a given problem
a particular estimator $T$ is an efficient estimator of its expectation $m(\theta)$, and let $T_1$
denote any other unbiased estimator of $m(\theta)$. Then for every value of $\theta \in \Omega$, $\mathrm{Var}_\theta(T)$
will be equal to the lower bound provided by the information inequality, and $\mathrm{Var}_\theta(T_1)$
will be at least as large as that lower bound. Hence, $\mathrm{Var}_\theta(T) \leq \mathrm{Var}_\theta(T_1)$ for $\theta \in \Omega$. In
other words, if $T$ is an efficient estimator of $m(\theta)$, then among all unbiased estimators
of $m(\theta)$, $T$ will have the smallest variance for every possible value of $\theta$.


Example 8.8.10

Variance of a Poisson Distribution. In Example 8.8.9, we saw that $\overline{X}_n$ is an efficient
estimator of the mean $\theta$ of a Poisson distribution. Therefore, for every value of $\theta > 0$,
$\overline{X}_n$ has the smallest variance among all unbiased estimators of $\theta$. Since $\theta$ is also the
variance of the Poisson distribution with mean $\theta$, we know that $\overline{X}_n$ has the smallest
variance among all unbiased estimators of the variance. This establishes the claim
that was made without proof after Example 8.7.5. In particular, the estimator $\hat{\sigma}_1^2$
in Example 8.7.5 is not a linear function of $\lambda_n'(\mathbf{X}|\theta)$, and hence its variance must
be strictly larger than the Cramér-Rao lower bound. Similarly, the other estimators in
Eq. (8.7.5) must each have variance larger than the Cramér-Rao lower bound. ◄


Properties of Maximum Likelihood Estimators for Large Samples 


Suppose that $X_1, \ldots, X_n$ form a random sample from a distribution for which the
p.d.f. or the p.f. is $f(x|\theta)$, and suppose also that $f(x|\theta)$ satisfies conditions similar to
those which were needed to derive the information inequality. For each sample size
$n$, let $\hat{\theta}_n$ denote the M.L.E. of $\theta$. We shall show that if $n$ is large, then the distribution
of $\hat{\theta}_n$ is approximately the normal distribution with mean $\theta$ and variance $1/[nI(\theta)]$.


Theorem 8.8.4

Asymptotic Distribution of an Efficient Estimator. Suppose that the assumptions of
Theorem 8.8.3 hold. Let $T$ be an efficient estimator of its mean $m(\theta)$. Assume that $m'(\theta)$ is
never 0. Then the asymptotic distribution of

$$\frac{[nI(\theta)]^{1/2}}{m'(\theta)} [T - m(\theta)]$$

is the standard normal distribution.


Theorem 
8.8.5 
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Proof Consider first the random variable 4/ (X|0). Since 2,,(X|@) = )77_, A(X; 14), 
then 


ai (X80) = 2 NOG). 
i=l 


Furthermore, since the n random variables X,,..., X,, are i.1.d., the n random vari- 
ables 4’(X1|0), ..., A’(X,,|0) will also be 1.i.d. We know from Eqs. (8.8.7) and (8.8.4) 
that the mean of each of these variables is 0, and the variance of each is / (6). Hence, 
it follows from the central limit theorem of Lindeberg and Lévy (Theorem 6.3.1) that 
the asymptotic distribution of the random variable 4! (X|@)/[nI (0)]!/? is the standard 
normal distribution. 

Since T is an efficient estimator of m(@), we have 


[m'(0)F 
nI(0) 
Furthermore, there must exist functions u(@) and v(@) that satisfy Eq. (8.8.15). Be- 


cause the random variable 4/ (X|@) has mean 0 and variance n/ (6), it follows from 
Eq. (8.8.15) that 


Eg(T)=m(6) and Varg(T) = 


(8.8.22) 


E,(T)=v(0) and Var,(T) =[u(6)Pn1 (6). 


When these values for the mean and the variance of T are compared with the values 
in Eq. (8.8.22), we find that v(@) = m(@) and |u(@)| = |m’(@)|/[n1 (6) ]. To be specific, 
we shall assume that u(0) = m’(0)/[nI(@)], although the same conclusions would be 
obtained if u(6) = —m'(@)/[nI(6)]. 

Next, substitute the values u(@) = m'(0)/[nI(0)]and v(0) = m(6) into Eq. (8.8.15) 
to obtain 


m'(6) ., 
= ——)' (X|0 0). 
nI(6) n(X|0) + m(0) 
Rearranging this equation slightly yields 
1/2 nr (X|0 
PO = aS (8.8.23) 
m'(0) [n1(0)]*/2 


We have already shown that the asymptotic distribution of the random variable 
on the right side of Eq. (8.8.23) is the standard normal distribution. Therefore, the 
asymptotic distribution of the random variable on the left side of Eq. (8.8.23) is also 
the standard normal distribution.


Asymptotic Distribution of an M.L.E. It follows from Theorem 8.8.4 that if the
M.L.E. $\hat{\theta}_n$ is an efficient estimator of $\theta$ for each value of $n$, then the asymptotic
distribution of $[nI(\theta)]^{1/2}(\hat{\theta}_n - \theta)$ is the standard normal distribution. However, it can
be shown that even in an arbitrary problem in which $\hat{\theta}_n$ is not an efficient estimator,
$[nI(\theta)]^{1/2}(\hat{\theta}_n - \theta)$ has this same asymptotic distribution under certain conditions.
Without presenting all the required conditions in full detail, we can state the following
result. The proof of this result can be found in Schervish (1995, chapter 7).


Asymptotic Distribution of M.L.E. Suppose that in an arbitrary problem the M.L.E. $\hat{\theta}_n$
is determined by solving the equation $\lambda_n'(\mathbf{x}|\theta) = 0$, and in addition both the second
and third derivatives $\lambda_n''(\mathbf{x}|\theta)$ and $\lambda_n'''(\mathbf{x}|\theta)$ exist and satisfy certain regularity
conditions. Then the asymptotic distribution of $[nI(\theta)]^{1/2}(\hat{\theta}_n - \theta)$ is the standard normal
distribution.

In practical terms, Theorem 8.8.5 states that in most problems in which the sample
size $n$ is large, and the M.L.E. $\hat{\theta}_n$ is found by differentiating the likelihood function
$f_n(\mathbf{x}|\theta)$ or its logarithm, the distribution of $[nI(\theta)]^{1/2}(\hat{\theta}_n - \theta)$ will be approximately
the standard normal distribution. Equivalently, the distribution of $\hat{\theta}_n$ will be approximately
the normal distribution with mean $\theta$ and variance $1/[nI(\theta)]$. Under these
conditions, it is said that $\hat{\theta}_n$ is an asymptotically efficient estimator.


Estimating the Standard Deviation of a Normal Distribution. Suppose that $X_1, \ldots, X_n$
form a random sample from the normal distribution with known mean 0 and unknown
standard deviation $\sigma$ ($\sigma > 0$). It can be shown that the M.L.E. of $\sigma$ is

$$\hat{\sigma} = \left[\frac{1}{n}\sum_{i=1}^n X_i^2\right]^{1/2}.$$

Also, it can be shown (see Exercise 4) that the Fisher information in a single observation
is $I(\sigma) = 2/\sigma^2$. Therefore, if the sample size $n$ is large, the distribution of $\hat{\sigma}$ will
be approximately the normal distribution with mean $\sigma$ and variance $\sigma^2/(2n)$.
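
The approximation in this example can be checked by simulation. The following is a minimal sketch, not part of the original example: it assumes the M.L.E. $\hat{\sigma} = [(1/n)\sum X_i^2]^{1/2}$, and the values of `sigma`, `n`, the number of replications, and the seed are illustrative choices. It compares the empirical variance of the simulated M.L.E.'s with the asymptotic variance $\sigma^2/(2n)$.

```python
import math
import random
import statistics


def sigma_mle(xs):
    """M.L.E. of sigma for a normal sample with known mean 0."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))


def simulate_mle(sigma, n, reps, seed=0):
    """Return the M.L.E. computed from each of `reps` simulated samples of size n."""
    rng = random.Random(seed)
    return [
        sigma_mle([rng.gauss(0.0, sigma) for _ in range(n)])
        for _ in range(reps)
    ]


sigma, n = 2.0, 200            # illustrative values, not taken from the text
estimates = simulate_mle(sigma, n, reps=4000)
empirical_mean = statistics.fmean(estimates)
empirical_var = statistics.pvariance(estimates)
asymptotic_var = sigma ** 2 / (2 * n)   # 1 / [n I(sigma)] with I(sigma) = 2 / sigma^2

print(f"mean of estimates:   {empirical_mean:.4f}")
print(f"empirical variance:  {empirical_var:.5f}")
print(f"asymptotic variance: {asymptotic_var:.5f}")
```

With a large number of replications, the two variances should agree to within simulation error, and the agreement should improve as $n$ grows.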


For cases in which it is difficult to compute the M.L.E., there is a result similar 
to Theorem 8.8.5. The proof of Theorem 8.8.6 can also be found as a special case of 
theorem 7.75 in Schervish (1995). 


Efficient Estimation. Assume the same smoothness conditions on the likelihood function
as in Theorem 8.8.5. Assume that $\tilde{\theta}_n$ is a sequence of estimators of $\theta$ such that
$\sqrt{n}(\tilde{\theta}_n - \theta)$ converges in distribution to some distribution (it doesn't matter what
distribution). Use $\tilde{\theta}_n$ as the starting value, and perform one step of Newton's method
(Definition 7.6.2) toward finding the M.L.E. of $\theta$. Let the result of this one step be
called $\theta_n^*$. Then the asymptotic distribution of $[nI(\theta)]^{1/2}(\theta_n^* - \theta)$ is the standard normal
distribution.


A typical choice of $\tilde{\theta}_n$ in Theorem 8.8.6 is a method of moments estimator (Definition 7.6.3).
Example 7.6.6 illustrates such an application of Theorem 8.8.6 when
sampling from a gamma distribution.
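
The one-step construction in Theorem 8.8.6 can be sketched in code. The example below is an illustration under stated assumptions and is not the gamma example cited above: it uses the Cauchy location family with scale 1, for which the M.L.E. has no closed form, the Fisher information in one observation is $1/2$, and the sample median is used as the starting estimator. The function names, sample size, and seed are hypothetical choices.

```python
import math
import random
import statistics


def score(theta, xs):
    """First derivative in theta of the Cauchy location log-likelihood."""
    return sum(2 * (x - theta) / (1 + (x - theta) ** 2) for x in xs)


def score_derivative(theta, xs):
    """Second derivative in theta of the Cauchy location log-likelihood."""
    return sum(
        2 * ((x - theta) ** 2 - 1) / (1 + (x - theta) ** 2) ** 2 for x in xs
    )


def one_step_estimate(xs):
    """One Newton step toward the M.L.E., started at the sample median."""
    start = statistics.median(xs)
    return start - score(start, xs) / score_derivative(start, xs)


def simulate(theta, n, reps, seed=1):
    """Standardized one-step estimates [n I(theta)]^(1/2) (theta* - theta)."""
    rng = random.Random(seed)
    info = 0.5                      # Fisher information of the Cauchy location family
    out = []
    for _ in range(reps):
        # Inverse-c.d.f. draw from the Cauchy distribution centered at theta.
        xs = [theta + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
        out.append(math.sqrt(n * info) * (one_step_estimate(xs) - theta))
    return out


z = simulate(theta=3.0, n=200, reps=2000)
print(statistics.fmean(z), statistics.pvariance(z))
```

If the theorem applies, the standardized values in `z` should have mean near 0 and variance near 1, even though only a single Newton step was taken from the median.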


The Bayesian Point of View Another general property of the M.L.E. $\hat{\theta}_n$ pertains
to making inferences about a parameter $\theta$ from the Bayesian point of view. Suppose
that the prior distribution of $\theta$ is represented by a positive and differentiable p.d.f.
over the interval $\Omega$, and the sample size $n$ is large. Then under conditions similar to
the regularity conditions that are needed to assure the asymptotic normality of the
distribution of $\hat{\theta}_n$, it can be shown that the posterior distribution of $\theta$, after the values
of $X_1, \ldots, X_n$ have been observed, will be approximately the normal distribution
with mean $\hat{\theta}_n$ and variance $1/[nI(\hat{\theta}_n)]$.


The Posterior Distribution of the Standard Deviation. Suppose again that $X_1, \ldots, X_n$
form a random sample from the normal distribution with known mean 0 and unknown
standard deviation $\sigma$. Suppose also that the prior p.d.f. of $\sigma$ is a positive and
differentiable function for $\sigma > 0$, and the sample size $n$ is large. Since $I(\sigma) = 2/\sigma^2$, it
follows that the posterior distribution of $\sigma$ will be approximately the normal distribution
with mean $\hat{\sigma}$ and variance $\hat{\sigma}^2/(2n)$, where $\hat{\sigma}$ is the M.L.E. of $\sigma$ calculated from
the observed values in the sample. Figure 8.9 illustrates this approximation based on a
sample of $n = 40$ i.i.d. simulated normal random variables with mean 0 and variance 1.
In this sample, the M.L.E. was $\hat{\sigma} = 1.061$. Figure 8.9 shows the actual posterior p.d.f.
based on an improper prior with “p.d.f.” $1/\sigma$ together with the approximate normal
posterior p.d.f. with mean 1.061 and variance $1.061^2/80 = 0.0141$.

Figure 8.9 Posterior p.d.f. of $\sigma$ and approximation based on Fisher information in
Example 8.8.12. (Vertical axis: density; curves: posterior and approximation.)
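
The comparison shown in Figure 8.9 can be reproduced numerically. The following is a rough sketch under stated assumptions: with the improper prior $1/\sigma$ and a sample with known mean 0, the posterior p.d.f. is taken to be proportional to $\sigma^{-(n+1)}\exp[-\sum x_i^2/(2\sigma^2)]$, and $\sum x_i^2$ is recovered from the reported M.L.E. as $n\hat{\sigma}^2$. The grid limits and the number of grid points are arbitrary choices, not values from the text.

```python
import math

n, sigma_hat = 40, 1.061
sum_sq = n * sigma_hat ** 2          # sum of x_i^2 implied by the reported M.L.E.


def log_unnormalized_posterior(s):
    """Log of sigma^-(n+1) * exp(-sum_sq / (2 sigma^2)): likelihood times prior 1/sigma."""
    return -(n + 1) * math.log(s) - sum_sq / (2 * s * s)


# Midpoint-rule normalization on a grid that holds essentially all of the mass.
lo, hi, steps = 0.5, 2.5, 4000
h = (hi - lo) / steps
grid = [lo + (i + 0.5) * h for i in range(steps)]
log_max = max(log_unnormalized_posterior(s) for s in grid)
weights = [math.exp(log_unnormalized_posterior(s) - log_max) for s in grid]
total = sum(weights) * h
posterior = [w / total for w in weights]

# Normal approximation with mean sigma_hat and variance sigma_hat^2 / (2n).
approx_var = sigma_hat ** 2 / (2 * n)


def normal_pdf(s):
    return math.exp(-(s - sigma_hat) ** 2 / (2 * approx_var)) / math.sqrt(
        2 * math.pi * approx_var
    )


approx = [normal_pdf(s) for s in grid]
posterior_mode = grid[posterior.index(max(posterior))]
max_gap = max(abs(p - q) for p, q in zip(posterior, approx))

print(f"approximate variance: {approx_var:.4f}")
print(f"posterior mode:       {posterior_mode:.3f}")
print(f"largest pointwise gap between the two densities: {max_gap:.3f}")
```

The posterior mode falls slightly below $\hat{\sigma}$, which is one visible way in which the exact posterior differs from its normal approximation at this sample size.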


Fisher Information for Multiple Parameters 


Example 
8.8.13 


Definition 
8.8.4 


Example 
8.8.14 


Sample from a Normal Distribution. Let $\mathbf{X} = (X_1, \ldots, X_n)$ be a random sample from
the normal distribution with mean $\mu$ and variance $\sigma^2$. Is there an analog to Fisher
information for the vector parameter $\theta = (\mu, \sigma^2)$?


In the spirit of Definition 8.8.1 and Theorem 8.8.1, we define Fisher information 
in terms of derivatives of the logarithm of the likelihood function. We shall define 
the Fisher information in a random sample of size n with the understanding that the 
Fisher information in a single random variable corresponds to a sample size of n = 1. 


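Fisher Information for a Vector Parameter. Suppose that $\mathbf{X} = (X_1, \ldots, X_n)$ form a
random sample from a distribution for which the p.d.f. is $f(x|\theta)$, where the value
of the parameter $\theta = (\theta_1, \ldots, \theta_k)$ must lie in an open subset $\Omega$ of a $k$-dimensional
real space. Let $f_n(\mathbf{x}|\theta)$ denote the joint p.d.f. or joint p.f. of $\mathbf{X}$. Define

$$\lambda_n(\mathbf{x}|\theta) = \log f_n(\mathbf{x}|\theta).$$

Assume that the set of $\mathbf{x}$ such that $f_n(\mathbf{x}|\theta) > 0$ is the same for all $\theta$ and that $\log f_n(\mathbf{x}|\theta)$
is twice differentiable with respect to $\theta$. The Fisher information matrix $I_n(\theta)$ in the
random sample $\mathbf{X}$ is defined as the $k \times k$ matrix with $(i, j)$ element equal to

$$I_{n,i,j}(\theta) = \mathrm{Cov}_\theta\left[\frac{\partial}{\partial\theta_i}\lambda_n(\mathbf{X}|\theta),\ \frac{\partial}{\partial\theta_j}\lambda_n(\mathbf{X}|\theta)\right].$$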


Sample from a Normal Distribution. In Example 8.8.13, let $\theta_1 = \mu$ and $\theta_2 = \sigma^2$. As in
Eq. (7.5.3), we obtain

$$\lambda_n(\mathbf{X}|\theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\theta_2) - \frac{1}{2\theta_2}\sum_{i=1}^n (X_i - \theta_1)^2.$$

The first partial derivatives are

$$\frac{\partial}{\partial\theta_1}\lambda_n(\mathbf{X}|\theta) = \frac{1}{\theta_2}\sum_{i=1}^n (X_i - \theta_1), \qquad (8.8.24)$$

$$\frac{\partial}{\partial\theta_2}\lambda_n(\mathbf{X}|\theta) = -\frac{n}{2\theta_2} + \frac{1}{2\theta_2^2}\sum_{i=1}^n (X_i - \theta_1)^2. \qquad (8.8.25)$$

Since the means of the two random variables above are both 0, their covariances
are the means of the products. The distribution of $\sum_{i=1}^n (X_i - \theta_1)$ is the normal
distribution with mean 0 and variance $n\theta_2$. The distribution of $\sum_{i=1}^n (X_i - \theta_1)^2/\theta_2$ is
the $\chi^2$ distribution with $n$ degrees of freedom, which has variance $2n$. So the variance of
(8.8.24) is $n/\theta_2$, and the variance of (8.8.25) is $n/(2\theta_2^2)$. The mean of the product of
(8.8.24) and (8.8.25) is 0 because the third central moment of a normal distribution is 0.
This makes

$$I_n(\theta) = \begin{pmatrix} \dfrac{n}{\theta_2} & 0 \\ 0 & \dfrac{n}{2\theta_2^2} \end{pmatrix}.$$
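
The Fisher information matrix in this example can be checked by Monte Carlo, since its entries are the covariances of the two partial derivatives of the log-likelihood. The sketch below is only an illustration: the parameter values, sample size, number of replications, and seed are assumptions, and the two derivatives are coded as $(1/\theta_2)\sum(X_i - \theta_1)$ and $-n/(2\theta_2) + \sum(X_i - \theta_1)^2/(2\theta_2^2)$.

```python
import math
import random
import statistics


def score_vector(xs, theta1, theta2):
    """Partial derivatives of the normal log-likelihood in (mean, variance)."""
    n = len(xs)
    s1 = sum(x - theta1 for x in xs)
    s2 = sum((x - theta1) ** 2 for x in xs)
    d1 = s1 / theta2                                   # derivative in theta1
    d2 = -n / (2 * theta2) + s2 / (2 * theta2 ** 2)    # derivative in theta2
    return d1, d2


def covariance(us, vs):
    """Plain sample covariance (divisor equal to the number of pairs)."""
    mu, mv = statistics.fmean(us), statistics.fmean(vs)
    return statistics.fmean([(u - mu) * (v - mv) for u, v in zip(us, vs)])


theta1, theta2, n = 1.0, 4.0, 25        # illustrative values, not from the text
rng = random.Random(2)
d1s, d2s = [], []
for _ in range(20000):
    xs = [rng.gauss(theta1, math.sqrt(theta2)) for _ in range(n)]
    d1, d2 = score_vector(xs, theta1, theta2)
    d1s.append(d1)
    d2s.append(d2)

estimated = [
    [covariance(d1s, d1s), covariance(d1s, d2s)],
    [covariance(d2s, d1s), covariance(d2s, d2s)],
]
exact = [[n / theta2, 0.0], [0.0, n / (2 * theta2 ** 2)]]
print(estimated)
print(exact)
```

The estimated matrix should be close to the diagonal matrix with entries $n/\theta_2$ and $n/(2\theta_2^2)$, up to simulation error.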


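The results for one-dimensional parameters all have versions for $k$-dimensional
parameters. For example, in Eq. (8.8.3), $\lambda''(X|\theta)$ is replaced by the $k \times k$ matrix of
second partial derivatives. In the Cramér-Rao inequality, we need the inverse of
the matrix $I_n(\theta)$, and $m'(\theta)$ must be replaced by the vector of partial derivatives.
Specifically, if $T$ is a statistic with finite variance and mean $m(\theta)$, then

$$\mathrm{Var}_\theta(T) \ge \left(\frac{\partial}{\partial\theta_1}m(\theta), \ldots, \frac{\partial}{\partial\theta_k}m(\theta)\right) I_n(\theta)^{-1} \begin{pmatrix} \frac{\partial}{\partial\theta_1}m(\theta) \\ \vdots \\ \frac{\partial}{\partial\theta_k}m(\theta) \end{pmatrix}. \qquad (8.8.26)$$

Also, the inequality in (8.8.26) is equality if and only if $T$ is a linear function of the
vector

$$\left(\frac{\partial}{\partial\theta_1}\lambda_n(\mathbf{X}|\theta), \ldots, \frac{\partial}{\partial\theta_k}\lambda_n(\mathbf{X}|\theta)\right). \qquad (8.8.27)$$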


Sample from a Normal Distribution. In Example 8.8.14, the coordinates of the vector
in (8.8.27) are linear functions of the two random variables $\sum_{i=1}^n X_i$ and $\sum_{i=1}^n X_i^2$.
So, the only statistics whose variances equal the lower bound in (8.8.26) are of the
form $T = a\sum_{i=1}^n X_i + b\sum_{i=1}^n X_i^2 + c$. The mean of such a statistic $T$ is

$$E_\theta(T) = an\theta_1 + bn(\theta_1^2 + \theta_2) + c. \qquad (8.8.28)$$

In particular, it is impossible to obtain $\theta_2$ as a special case of (8.8.28). There is no
efficient unbiased estimator of $\theta_2 = \sigma^2$. It can be proven that $(\sigma')^2$, which was defined
in Eq. (8.4.3), is an unbiased estimator that has minimum variance among all unbiased
estimators. The proof of this fact is beyond the scope of this text. The variance of $(\sigma')^2$
is $2\theta_2^2/(n - 1)$, while the Cramér-Rao lower bound is $2\theta_2^2/n$.

Multinomial Distributions. Let $\mathbf{X} = (X_1, \ldots, X_k)$ have the multinomial distribution
with parameters $n$ and $\mathbf{p} = (p_1, \ldots, p_k)$ as defined in Definition 5.9.1. Finding the
Fisher information in this example involves a subtle point. The parameter vector $\mathbf{p}$
takes values in the set

$$\{\mathbf{p} : p_1 + \cdots + p_k = 1, \text{ all } p_i \ge 0\}.$$

No subset of this set is open. Hence, no matter what set we choose for the parameter
space, Definition 8.8.4 does not apply to this parameter. However, there is an
equivalent parameter $\mathbf{p}^* = (p_1, \ldots, p_{k-1})$ that takes values in the set

$$\{\mathbf{p}^* : p_1 + \cdots + p_{k-1} \le 1, \text{ all } p_i \ge 0\},$$

which has nonempty interior. With this version of the parameter, and assuming that
the parameter space is the interior of the set above, it is straightforward to calculate
the Fisher information, as in Exercise 20.




Summary 


Fisher information attempts to measure the amount of information about a parame- 
ter that a random variable or sample contains. Fisher information from independent 
random variables adds together to form the Fisher information in the sample. The 
information inequality (Cramér-Rao lower bound) provides lower bounds on the 
variances of all estimators. An estimator is efficient if its variance equals the lower 
bound. The asymptotic distribution of a maximum likelihood estimator of $\theta$ is (under
regularity conditions) normal with mean $\theta$ and variance equal to 1 over the Fisher
information in the sample. Also, for large sample sizes, the posterior distribution of
$\theta$ is approximately normal with mean equal to the M.L.E. and variance equal to 1
over the Fisher information in the sample evaluated at the M.L.E.


Exercises 


1. Suppose that a random variable $X$ has a normal distribution
for which the mean $\mu$ is unknown ($-\infty < \mu < \infty$)
and the variance $\sigma^2$ is known. Let $f(x|\mu)$ denote the p.d.f.
of $X$, and let $f'(x|\mu)$ and $f''(x|\mu)$ denote the first and second
partial derivatives with respect to $\mu$. Show that

$$\int_{-\infty}^{\infty} f'(x|\mu)\,dx = 0 \quad\text{and}\quad \int_{-\infty}^{\infty} f''(x|\mu)\,dx = 0.$$


2. Suppose that X has the geometric distribution with 
parameter p. (See Sec. 5.5.) Find the Fisher information 
I(p) in X. 


3. Suppose that a random variable $X$ has the Poisson distribution
with unknown mean $\theta > 0$. Find the Fisher information $I(\theta)$ in $X$.


4. Suppose that a random variable $X$ has the normal distribution
with mean 0 and unknown standard deviation
$\sigma > 0$. Find the Fisher information $I(\sigma)$ in $X$.


5. Suppose that a random variable $X$ has the normal distribution
with mean 0 and unknown variance $\sigma^2 > 0$. Find
the Fisher information $I(\sigma^2)$ in $X$. Note that in this exercise
the variance $\sigma^2$ is regarded as the parameter, whereas
in Exercise 4 the standard deviation $\sigma$ is regarded as the
parameter.


6. Suppose that $X$ is a random variable for which the p.d.f.
or the p.f. is $f(x|\theta)$, where the value of the parameter $\theta$
is unknown but must lie in an open interval $\Omega$. Let $I_0(\theta)$
denote the Fisher information in $X$. Suppose now that the
parameter $\theta$ is replaced by a new parameter $\mu$, where
$\theta = \psi(\mu)$, and $\psi$ is a differentiable function. Let $I_1(\mu)$
denote the Fisher information in $X$ when the parameter
is regarded as $\mu$. Show that

$$I_1(\mu) = [\psi'(\mu)]^2 I_0[\psi(\mu)].$$


7. Suppose that $X_1, \ldots, X_n$ form a random sample from
the Bernoulli distribution with unknown parameter $p$.
Show that $\bar{X}_n$ is an efficient estimator of $p$.


8. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with unknown mean $\mu$ and known
variance $\sigma^2 > 0$. Show that $\bar{X}_n$ is an efficient estimator
of $\mu$.


9. Suppose that a single observation $X$ is taken from the
normal distribution with mean 0 and unknown standard
deviation $\sigma > 0$. Find an unbiased estimator of $\sigma$, determine
its variance, and show that this variance is greater
than $1/I(\sigma)$ for every value of $\sigma > 0$. Note that the value
of $I(\sigma)$ was found in Exercise 4.


10. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with mean 0 and unknown standard
deviation $\sigma > 0$. Find the lower bound specified by
the information inequality for the variance of any unbiased
estimator of $\log \sigma$.


11. Suppose that $X_1, \ldots, X_n$ form a random sample from
an exponential family for which the p.d.f. or the p.f. $f(x|\theta)$
is as specified in Exercise 23 of Sec. 7.3. Suppose also that
the unknown value of $\theta$ must belong to an open interval $\Omega$
of the real line. Show that the estimator $T = \sum_{i=1}^n d(X_i)$
is an efficient estimator. Hint: Show that $T$ can be represented
in the form given in Eq. (8.8.15).


12. Suppose that $X_1, \ldots, X_n$ form a random sample from
a normal distribution for which the mean is known and
the variance is unknown. Construct an efficient estimator
that is not identically equal to a constant, and determine
the expectation and the variance of this estimator.


13. Determine what is wrong with the following argument:
Suppose that the random variable $X$ has the uniform
distribution on the interval $[0, \theta]$, where the value of $\theta$ is
unknown ($\theta > 0$). Then $f(x|\theta) = 1/\theta$, $\lambda(x|\theta) = -\log\theta$ and
$\lambda'(x|\theta) = -(1/\theta)$. Therefore,

$$I(\theta) = E_\theta\{[\lambda'(X|\theta)]^2\} = \frac{1}{\theta^2}.$$

Since $2X$ is an unbiased estimator of $\theta$, the information
inequality states that

$$\mathrm{Var}(2X) \ge \frac{1}{I(\theta)} = \theta^2.$$

But

$$\mathrm{Var}(2X) = 4\,\mathrm{Var}(X) = 4 \cdot \frac{\theta^2}{12} = \frac{\theta^2}{3} < \theta^2.$$

Hence, the information inequality is not correct.


14. Suppose that $X_1, \ldots, X_n$ form a random sample from
the gamma distribution with parameters $\alpha$ and $\beta$, where
$\alpha$ is unknown and $\beta$ is known. Show that if $n$ is large, the
distribution of the M.L.E. of $\alpha$ will be approximately a
normal distribution with mean $\alpha$ and variance

$$\frac{[\Gamma(\alpha)]^2}{n\{\Gamma(\alpha)\Gamma''(\alpha) - [\Gamma'(\alpha)]^2\}}.$$


15. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with unknown mean $\mu$ and known
variance $\sigma^2$, and the prior p.d.f. of $\mu$ is a positive and
differentiable function over the entire real line. Show that if
$n$ is large, the posterior distribution of $\mu$ given that $X_i = x_i$
($i = 1, \ldots, n$) will be approximately a normal distribution
with mean $\bar{x}_n$ and variance $\sigma^2/n$.


16. Suppose that $X_1, \ldots, X_n$ form a random sample from
the Bernoulli distribution with unknown parameter $p$, and
the prior p.d.f. of $p$ is a positive and differentiable function
over the interval $0 < p < 1$. Suppose, furthermore, that $n$
is large, the observed values of $X_1, \ldots, X_n$ are $x_1, \ldots, x_n$,
and $0 < \bar{x}_n < 1$. Show that the posterior distribution of $p$
will be approximately a normal distribution with mean $\bar{x}_n$
and variance $\bar{x}_n(1 - \bar{x}_n)/n$.


17. Let $X$ have the binomial distribution with parameters
$n$ and $p$. Assume that $n$ is known. Show that the Fisher
information in $X$ is $I(p) = n/[p(1 - p)]$.


18. Let $X$ have the negative binomial distribution with
parameters $r$ and $p$. Assume that $r$ is known. Show that
the Fisher information in $X$ is $I(p) = r/[p^2(1 - p)]$.


19. Let $X$ have the gamma distribution with parameters $n$
and $\theta$ with $\theta$ unknown. Show that the Fisher information
in $X$ is $I(\theta) = n/\theta^2$.


20. Find the Fisher information matrix about p* in Exam- 
ple 8.8.16. 


8.9 Supplementary Exercises 


1. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with known mean 0 and unknown
variance $\sigma^2$. Show that $\sum_{i=1}^n X_i^2/n$ is the unbiased estimator
of $\sigma^2$ that has the smallest possible variance for all
possible values of $\sigma^2$.


2. Prove that if $X$ has the $t$ distribution with one degree
of freedom, then $1/X$ also has the $t$ distribution with one
degree of freedom.


3. Suppose that $U$ and $V$ are independent random variables,
and that each has the standard normal distribution.
Show that $U/V$, $U/|V|$, and $|U|/V$ each has the $t$ distribution
with one degree of freedom.


4. Suppose that $X_1$ and $X_2$ are independent random variables,
and that each has the normal distribution with mean
0 and variance $\sigma^2$. Show that $(X_1 + X_2)/(X_1 - X_2)$ has the
$t$ distribution with one degree of freedom.


5. Suppose that $X_1, \ldots, X_n$ form a random sample from
the exponential distribution with parameter $\beta$. Show that
$2\beta\sum_{i=1}^n X_i$ has the $\chi^2$ distribution with $2n$ degrees of
freedom.


6. Suppose that $X_1, \ldots, X_n$ form a random sample from
an unknown probability distribution $P$ on the real line.
Let $A$ be a given subset of the real line, and let $\theta = P(A)$.
Construct an unbiased estimator of $\theta$, and specify its variance.


7. Suppose that $X_1, \ldots, X_m$ form a random sample from
the normal distribution with mean $\mu_1$ and variance $\sigma^2$, and
$Y_1, \ldots, Y_n$ form an independent random sample from the
normal distribution with mean $\mu_2$ and variance $2\sigma^2$. Let
$S_X^2 = \sum_{i=1}^m (X_i - \bar{X}_m)^2$ and $S_Y^2 = \sum_{i=1}^n (Y_i - \bar{Y}_n)^2$.

a. For what pairs of values of $\alpha$ and $\beta$ is $\alpha S_X^2 + \beta S_Y^2$ an
unbiased estimator of $\sigma^2$?

b. Determine the values of $\alpha$ and $\beta$ for which $\alpha S_X^2 + \beta S_Y^2$
will be an unbiased estimator with minimum variance.


8. Suppose that $X_1, \ldots, X_{n+1}$ form a random sample
from a normal distribution, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ and
$T_n = \left[\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2\right]^{1/2}$. Determine the value of a
constant $k$ such that the random variable $k(X_{n+1} - \bar{X}_n)/T_n$
will have a $t$ distribution.


9. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with mean $\mu$ and variance $\sigma^2$, and
$Y$ is an independent random variable having the normal
distribution with mean 0 and variance $4\sigma^2$. Determine a
function of $X_1, \ldots, X_n$ and $Y$ that does not involve $\mu$ or $\sigma^2$
but has the $t$ distribution with $n - 1$ degrees of freedom.

10. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with mean $\mu$ and variance $\sigma^2$,
where both $\mu$ and $\sigma^2$ are unknown. A confidence interval
for $\mu$ is to be constructed with confidence coefficient 0.90.
Determine the smallest value of $n$ such that the expected
squared length of this interval will be less than $\sigma^2/2$.


11. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with unknown mean $\mu$ and unknown
variance $\sigma^2$. Construct a lower confidence limit
$L(X_1, \ldots, X_n)$ for $\mu$ such that

$$\Pr[\mu > L(X_1, \ldots, X_n)] = 0.99.$$


12. Consider again the conditions of Exercise 11. Construct
an upper confidence limit $U(X_1, \ldots, X_n)$ for $\sigma^2$
such that

$$\Pr[\sigma^2 < U(X_1, \ldots, X_n)] = 0.99.$$


13. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with unknown mean $\theta$ and known
variance $\sigma^2$. Suppose also that the prior distribution of $\theta$
is normal with mean $\mu$ and variance $\nu^2$.

a. Determine the shortest interval $I$ such that
$\Pr(\theta \in I \mid x_1, \ldots, x_n) = 0.95$, where the probability is calculated
with respect to the posterior distribution of $\theta$,
as indicated.

b. Show that as $\nu^2 \to \infty$, the interval $I$ converges to
an interval $I^*$ that is a confidence interval for $\theta$ with
confidence coefficient 0.95.


14. Suppose that $X_1, \ldots, X_n$ form a random sample from
the Poisson distribution with unknown mean $\theta$, and let
$Y = \sum_{i=1}^n X_i$.

a. Determine the value of a constant $c$ such that the
estimator $e^{-cY}$ is an unbiased estimator of $e^{-\theta}$.

b. Use the information inequality to obtain a lower
bound for the variance of the unbiased estimator
found in part (a).


15. Suppose that $X_1, \ldots, X_n$ form a random sample from
a distribution for which the p.d.f. is as follows:

$$f(x|\theta) = \begin{cases} \theta x^{\theta - 1} & \text{for } 0 < x < 1, \\ 0 & \text{otherwise,} \end{cases}$$

where the value of $\theta$ is unknown ($\theta > 0$). Determine the
asymptotic distribution of the M.L.E. of $\theta$. (Note: The
M.L.E. was found in Exercise 9 of Sec. 7.5.)


16. Suppose that a random variable $X$ has the exponential
distribution with mean $\theta$, which is unknown ($\theta > 0$). Find
the Fisher information $I(\theta)$ in $X$.


17. Suppose that $X_1, \ldots, X_n$ form a random sample from
the Bernoulli distribution with unknown parameter $p$.
Show that the variance of every unbiased estimator of
$(1 - p)^2$ must be at least $4p(1 - p)^3/n$.


18. Suppose that $X_1, \ldots, X_n$ form a random sample from
the exponential distribution with unknown parameter $\beta$.
Construct an efficient estimator that is not identically
equal to a constant, and determine the expectation and
the variance of this estimator.


19. Suppose that $X_1, \ldots, X_n$ form a random sample from
the exponential distribution with unknown parameter $\beta$.
Show that if $n$ is large, the distribution of the M.L.E. of $\beta$
will be approximately a normal distribution with mean $\beta$
and variance $\beta^2/n$.


20. Consider again the conditions of Exercise 19, and let
$\hat{\beta}_n$ denote the M.L.E. of $\beta$.

a. Use the delta method to determine the asymptotic
distribution of $1/\hat{\beta}_n$.

b. Show that $1/\hat{\beta}_n = \bar{X}_n$, and use the central limit theorem
to determine the asymptotic distribution of $1/\hat{\beta}_n$.


21. Let $X_1, \ldots, X_n$ be a random sample from the Poisson
distribution with mean $\theta$. Let $Y = \sum_{i=1}^n X_i$.

a. Prove that there is no unbiased estimator of $1/\theta$.
(Hint: Write the equation that is equivalent to
$E_\theta(r(\mathbf{X})) = 1/\theta$. Simplify it, and then use what you
know from calculus of infinite series to show that no
function $r$ can satisfy the equation.)

b. Suppose that we wish to estimate $1/\theta$. Consider
$r(Y) = n/(Y + 1)$ as an estimator of $1/\theta$. Find the bias
of $r(Y)$, and show that the bias goes to 0 as $n \to \infty$.

c. Use the delta method to find the asymptotic (as $n \to \infty$)
distribution of $n/(Y + 1)$.


22. Let $X_1, \ldots, X_n$ be conditionally i.i.d. with the uniform
distribution on the interval $[0, \theta]$. Let $Y_n = \max\{X_1, \ldots, X_n\}$.

a. Find the p.d.f. and the quantile function of $Y_n/\theta$.

b. $Y_n$ is often used as an estimator of $\theta$ even though it
has bias. Compute the bias of $Y_n$ as an estimator of $\theta$.

c. Prove that $Y_n/\theta$ is a pivotal.

d. Find a confidence interval for $\theta$ with coefficient $\gamma$.


9.1 Problems of Testing Hypotheses


In Example 8.3.1 on page 473, we were interested in whether or not the mean
log-rainfall $\mu$ from seeded clouds was greater than some constant, specifically
4. Hypothesis testing problems are similar in nature to the decision problem of
Example 8.3.1. In general, hypothesis testing concerns trying to decide whether
a parameter $\theta$ lies in one subset of the parameter space or in its complement.
When $\theta$ is one-dimensional, at least one of the two subsets will typically be an
interval, possibly degenerate. In this section, we introduce the notation and some
common methodology associated with hypothesis testing. We also demonstrate an
equivalence between hypothesis tests and confidence intervals.


The Null and Alternative Hypotheses 


Rain from Seeded Clouds. In Example 8.3.1, we modeled the log-rainfalls from 26
seeded clouds as normal random variables with unknown mean $\mu$ and unknown
variance $\sigma^2$. Let $\theta = (\mu, \sigma^2)$ denote the parameter vector. We are interested in
whether or not $\mu > 4$. To word this in terms of the parameter vector, we are interested
in whether or not $\theta$ lies in the set $\{(\mu, \sigma^2) : \mu > 4\}$. In Example 8.6.4, we calculated
the probability that $\mu > 4$ as part of a Bayesian analysis. If one does not wish to do
a Bayesian analysis, one must address the question of whether or not $\mu > 4$ by other
means, such as those introduced in this chapter.


Consider a statistical problem involving a parameter $\theta$ whose value is unknown
but must lie in a certain parameter space $\Omega$. Suppose now that $\Omega$ can be partitioned
into two disjoint subsets $\Omega_0$ and $\Omega_1$, and the statistician is interested in whether $\theta$ lies
in $\Omega_0$ or in $\Omega_1$.

We shall let $H_0$ denote the hypothesis that $\theta \in \Omega_0$ and let $H_1$ denote the hypothesis
that $\theta \in \Omega_1$. Since the subsets $\Omega_0$ and $\Omega_1$ are disjoint and $\Omega_0 \cup \Omega_1 = \Omega$, exactly one
of the hypotheses $H_0$ and $H_1$ must be true. The statistician must decide which of the
hypotheses $H_0$ or $H_1$ appears to be true. A problem of this type, in which there are
only two possible decisions, is called a problem of testing hypotheses. If the statistician
makes the wrong decision, he might suffer a certain loss or pay a certain cost. In many
problems, he will have an opportunity to observe some data before he has to make his
decision, and the observed values will provide him with information about the value
of $\theta$. A procedure for deciding which hypothesis to choose is called a test procedure
or simply a test.

In our discussion up to this point, we have treated the hypotheses $H_0$ and $H_1$
on an equal basis. In most problems, however, the two hypotheses are treated quite
differently.


Null and Alternative Hypotheses/Reject. The hypothesis $H_0$ is called the null hypothesis
and the hypothesis $H_1$ is called the alternative hypothesis. When performing a test, if
we decide that $\theta$ lies in $\Omega_1$, we are said to reject $H_0$. If we decide that $\theta$ lies in $\Omega_0$, we
are said not to reject $H_0$.


The terminology referring to the decisions in Definition 9.1.1 is asymmetric with 
regard to the null and alternative hypotheses. We shall return to this point later in 
the section. 


Egyptian Skulls. Manly (1986, p. 4) reports measurements of various dimensions of
human skulls found in Egypt from various time periods. These data are attributed to
Thomson and Randall-Maciver (1905). One time period is approximately 4000 B.C.
We might model the observed breadth measurements (in mm) of the skulls as normal
random variables with unknown mean $\mu$ and variance 26. Interest might lie in how $\mu$
compares to the breadth of a modern-day skull, about 140 mm. The parameter space
$\Omega$ could be the positive numbers, and we could let $\Omega_0$ be the interval $[140, \infty)$ while
$\Omega_1 = (0, 140)$. In this case, we would write the null and alternative hypotheses as

$$H_0: \mu \ge 140,$$
$$H_1: \mu < 140.$$

More realistically, we would assume that both the mean and variance of breadth
measurements were unknown. That is, each measurement is a normal random variable
with mean $\mu$ and variance $\sigma^2$. In this case, the parameter would be two-dimensional,
for example, $\theta = (\mu, \sigma^2)$. The parameter space $\Omega$ would then be pairs of real numbers.
In this case, $\Omega_0 = [140, \infty) \times (0, \infty)$ and $\Omega_1 = (0, 140) \times (0, \infty)$, since the hypotheses
only concern the first coordinate $\mu$. The hypotheses to be tested are the same as
above, but now $\mu$ is only one coordinate of a two-dimensional parameter vector. We
will address problems of this type in Sec. 9.5.


How did we decide that the null hypothesis should be $H_0: \mu \ge 140$ in Example 9.1.2
rather than $\mu \le 140$? Would we be led to the same conclusion either way?
We can address these issues after we introduce the possible errors that can arise in
hypothesis testing (Definition 9.1.7).


Simple and Composite Hypotheses 


Suppose that $X_1, \ldots, X_n$ form a random sample from a distribution for which the
p.d.f. or the p.f. is $f(x|\theta)$, where the value of the parameter $\theta$ must lie in the parameter
space $\Omega$; $\Omega_0$ and $\Omega_1$ are disjoint sets with $\Omega_0 \cup \Omega_1 = \Omega$; and it is desired to test the
following hypotheses:

$$H_0: \theta \in \Omega_0,$$
$$H_1: \theta \in \Omega_1.$$

For $i = 0$ or $i = 1$, the set $\Omega_i$ may contain just a single value of $\theta$ or it might be a
larger set.

Simple and Composite Hypotheses. If $\Omega_i$ contains just a single value of $\theta$, then $H_i$ is
a simple hypothesis. If the set $\Omega_i$ contains more than one value of $\theta$, then $H_i$ is a
composite hypothesis.


Under a simple hypothesis, the distribution of the observations is completely specified.
Under a composite hypothesis, it is specified only that the distribution of the
observations belongs to a certain class. For example, a simple null hypothesis $H_0$ must
have the form

$$H_0: \theta = \theta_0. \qquad (9.1.1)$$


One-Sided and Two-Sided Hypotheses. Let $\theta$ be a one-dimensional parameter. One-sided
null hypotheses are of the form $H_0: \theta \le \theta_0$ or $H_0: \theta \ge \theta_0$, with the corresponding
one-sided alternative hypotheses being $H_1: \theta > \theta_0$ or $H_1: \theta < \theta_0$. When the null
hypothesis is simple, such as (9.1.1), the alternative hypothesis is usually two-sided,
$H_1: \theta \ne \theta_0$.


The hypotheses in Example 9.1.2 are one-sided. In Example 9.1.3 (coming up
shortly), the alternative hypothesis is two-sided. One-sided and two-sided hypotheses
will be discussed in more detail in Sections 9.3 and 9.4.


The Critical Region and Test Statistics 


Testing Hypotheses about the Mean of a Normal Distribution with Known Variance. Suppose
that $\mathbf{X} = (X_1, \ldots, X_n)$ is a random sample from the normal distribution with
unknown mean $\mu$ and known variance $\sigma^2$. We wish to test the hypotheses

$$H_0: \mu = \mu_0,$$
$$H_1: \mu \ne \mu_0.$$

It might seem reasonable to reject $H_0$ if $\bar{X}_n$ is far from $\mu_0$. For example, we could
choose a number $c$ and reject $H_0$ if the distance from $\bar{X}_n$ to $\mu_0$ is more than $c$. One
way to express this is by dividing the set $S$ of all possible data vectors $\mathbf{x} = (x_1, \ldots, x_n)$
(the sample space) into the two sets

$$S_0 = \{\mathbf{x} : -c < \bar{x}_n - \mu_0 < c\}, \quad\text{and}\quad S_1 = S_0^C. \qquad (9.1.2)$$

We then reject $H_0$ if $\mathbf{X} \in S_1$, and we don't reject $H_0$ if $\mathbf{X} \in S_0$. A simpler way to express
the procedure is to define the statistic $T = |\bar{X}_n - \mu_0|$, and reject $H_0$ if $T \ge c$.
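
The test in this example is simple enough to write out directly. The following is a minimal sketch, assuming that the boundary case $T = c$ leads to rejection; the function name, the data values, and the choices of $\mu_0$ and $c$ are hypothetical and serve only to show the mechanics.

```python
import statistics


def reject_h0(xs, mu0, c):
    """Return True when T = |sample mean - mu0| is at least c."""
    t = abs(statistics.fmean(xs) - mu0)
    return t >= c


# Hypothetical observations, used only to show the mechanics of the rule.
xs = [4.1, 5.3, 4.8, 5.9, 5.2]

print(reject_h0(xs, mu0=5.0, c=0.5))   # sample mean is close to 5.0
print(reject_h0(xs, mu0=4.0, c=0.5))   # sample mean is far from 4.0
```

Note that the rule only requires the value of the statistic $T$. The critical region $S_1$ never has to be listed explicitly.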


In general, consider a problem in which we wish to test the following hypotheses:

$$H_0: \theta \in \Omega_0, \quad\text{and}\quad H_1: \theta \in \Omega_1. \qquad (9.1.3)$$

Suppose that before the statistician has to decide which hypothesis to choose, she
can observe a random sample $\mathbf{X} = (X_1, \ldots, X_n)$ drawn from a distribution that
involves the unknown parameter $\theta$. We shall let $S$ denote the sample space of the
$n$-dimensional random vector $\mathbf{X}$. In other words, $S$ is the set of all possible values of
the random sample.

In a problem of this type, the statistician can specify a test procedure by partitioning
the sample space $S$ into two subsets. One subset $S_1$ contains the values of
$\mathbf{X}$ for which she will reject $H_0$, and the other subset $S_0$ contains the values of $\mathbf{X}$ for
which she will not reject $H_0$.


Critical Region. The set $S_1$ defined above is called the critical region of the test.

In summary, a test procedure is determined by specifying the critical region of the
test. The complement of the critical region must then contain all the outcomes for
which $H_0$ will not be rejected.

In most hypothesis-testing problems, the critical region is defined in terms of a
statistic, $T = r(\mathbf{X})$.


Test Statistic/Rejection Region. Let $\mathbf{X}$ be a random sample from a distribution that
depends on a parameter $\theta$. Let $T = r(\mathbf{X})$ be a statistic, and let $R$ be a subset of the
real line. Suppose that a test procedure for the hypotheses (9.1.3) is of the form “reject
$H_0$ if $T \in R$.” Then we call $T$ a test statistic, and we call $R$ the rejection region of the
test.


When a test is defined in terms of a test statistic $T$ and rejection region $R$, as in
Definition 9.1.5, the set $S_1 = \{\mathbf{x} : r(\mathbf{x}) \in R\}$ is the critical region from Definition 9.1.4.

Typically, the rejection region for a test based on a test statistic $T$ will be some
fixed interval or the outside of some fixed interval. For example, if the test rejects $H_0$
when $T \ge c$, the rejection region is the interval $[c, \infty)$. Once a test statistic is being
used, it is simpler to express everything in terms of the test statistic rather than try
to compute the critical region from Definition 9.1.4. All of the tests in the rest of this
book will be based on test statistics. Indeed, most of the tests can be written in the
form “reject $H_0$ if $T \ge c$.” (Example 9.1.7 is one of the rare exceptions.)
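
The relation between a test statistic, its rejection region, and the critical region can be made concrete with a small sketch. Everything named below (`make_test`, the sample values, $\mu_0$, and $c$) is a hypothetical illustration: a test is represented as a function on data vectors, built from a statistic $r$ and a membership check for $R$, so that the critical region is exactly the set of data vectors on which the function returns `True`.

```python
from statistics import fmean


def make_test(statistic, in_rejection_region):
    """Build the rule "reject H0 if statistic(x) lies in the rejection region"."""
    def reject(xs):
        return in_rejection_region(statistic(xs))
    return reject


mu0, c = 10.0, 1.5      # hypothetical values chosen for illustration
reject = make_test(
    statistic=lambda xs: abs(fmean(xs) - mu0),   # T = r(x)
    in_rejection_region=lambda t: t >= c,        # R = [c, infinity)
)

# A few hypothetical data vectors; those on which `reject` is True lie in S_1.
samples = [[9.8, 10.4, 10.1], [12.2, 11.9, 12.5], [8.0, 8.3, 8.6]]
critical_region_members = [xs for xs in samples if reject(xs)]
print(critical_region_members)
```

Working with the pair of functions avoids ever describing the critical region as a subset of $n$-dimensional space, which is the simplification described above.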

In Example 9.1.3, the test statistic is $T = |\bar{X}_n - \mu_0|$, and the rejection region
is the interval $[c, \infty)$. One can choose a test statistic using intuitive criteria, as in
Example 9.1.3, or based on theoretical considerations. Some theoretical arguments
are given in Sections 9.2-9.4 for choosing certain test statistics in a variety of problems
involving a single parameter. Although these theoretical results provide optimal tests
in the situations in which they apply, many practical problems do not satisfy the
conditions required to apply these results.


Example 9.1.4 Rain from Seeded Clouds. We can formulate the problem described in
Example 9.1.1 as that of testing the hypotheses $H_0: \mu \le 4$ versus
$H_1: \mu > 4$. We could use the same test statistic as in Example 9.1.3.
Alternatively, we could use the statistic $U = n^{1/2}(\bar{X}_n - 4)/\sigma'$, which
looks a lot like the random variable from Eq. (8.5.1) on which confidence intervals
were based. It makes sense, in this case, to reject $H_0$ if $U$ is large, since that
would correspond to $\bar{X}_n$ being large compared to 4. ◄


Note: Dividing Both Parameter Space and Sample Space. In the various definitions
given so far, the reader needs to keep straight two different divisions. First, we
divided the parameter space $\Omega$ into two disjoint subsets, $\Omega_0$ and
$\Omega_1$. Next, we divided the sample space $S$ into two disjoint subsets $S_0$ and
$S_1$. These divisions are related to each other, but they are not the same. For one
thing, the parameter space and the sample space usually are of different dimensions,
so $\Omega_0$ will necessarily be different from $S_0$. The relation between the two
divisions is the following: If the random sample $X$ lies in the critical region
$S_1$, then we reject the null hypothesis $H_0: \theta \in \Omega_0$. If $X \in S_0$,
we don't reject $H_0$. We eventually learn which set, $S_0$ or $S_1$, contains $X$.
We rarely learn which set, $\Omega_0$ or $\Omega_1$, contains $\theta$.


The Power Function and Types of Error 


Let $\delta$ stand for a test procedure of the form discussed earlier in this section,
either based on a critical region or based on a test statistic. The interesting
probabilistic properties of $\delta$ can be summarized by computing, for each value of
$\theta \in \Omega$, either the probability $\pi(\theta \mid \delta)$ that the test
$\delta$ will reject $H_0$ or the probability $1 - \pi(\theta \mid \delta)$ that it
does not reject $H_0$.


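Definition 9.1.6 Power Function. Let $\delta$ be a test procedure. The function
$\pi(\theta \mid \delta)$ is called the power function of the test $\delta$. If $S_1$
denotes the critical region of $\delta$, then the power function
$\pi(\theta \mid \delta)$ is determined by the relation

$$\pi(\theta \mid \delta) = \Pr(X \in S_1 \mid \theta) \quad \text{for } \theta \in \Omega. \tag{9.1.4}$$

If $\delta$ is described in terms of a test statistic $T$ and rejection region $R$, the
power function is

$$\pi(\theta \mid \delta) = \Pr(T \in R \mid \theta) \quad \text{for } \theta \in \Omega. \tag{9.1.5}$$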


Since the power function $\pi(\theta \mid \delta)$ specifies, for each possible value
of the parameter $\theta$, the probability that $\delta$ will reject $H_0$, it follows
that the ideal power function would be one for which $\pi(\theta \mid \delta) = 0$ for
every value of $\theta \in \Omega_0$ and $\pi(\theta \mid \delta) = 1$ for every value
of $\theta \in \Omega_1$. If the power function of a test $\delta$ actually had these
values, then regardless of the actual value of $\theta$, $\delta$ would lead to the
correct decision with probability 1. In a practical problem, however, there would
seldom exist any test procedure having this ideal power function.


Example 9.1.5 Testing Hypotheses about the Mean of a Normal Distribution with Known
Variance. In Example 9.1.3, the test $\delta$ is based on the test statistic
$T = |\bar{X}_n - \mu_0|$ with rejection region $R = [c, \infty)$. The distribution of
$\bar{X}_n$ is the normal distribution with mean $\mu$ and variance $\sigma^2/n$. The
parameter is $\mu$ because we have assumed that $\sigma^2$ is known. The power function
can be computed from this distribution. Let $\Phi$ denote the standard normal c.d.f.
Then

$$\Pr(T \in R \mid \mu) = \Pr(\bar{X}_n \ge \mu_0 + c \mid \mu) + \Pr(\bar{X}_n \le \mu_0 - c \mid \mu)$$
$$= 1 - \Phi\left(n^{1/2}\,\frac{\mu_0 + c - \mu}{\sigma}\right) + \Phi\left(n^{1/2}\,\frac{\mu_0 - c - \mu}{\sigma}\right).$$

The final expression above is the power function $\pi(\mu \mid \delta)$. Figure 9.1
plots the power functions of three different tests with $c = 1, 2, 3$ in the specific
example in which $\mu_0 = 4$, $n = 15$, and $\sigma^2 = 9$. ◄
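The power function of this two-sided test is easy to evaluate numerically. The following is only a minimal sketch under stated assumptions: the function name `power_two_sided` and the three values of the mean used in the loop are illustrative choices, not part of the text, and the standard normal c.d.f. is taken from Python's standard library.

```python
from statistics import NormalDist


def power_two_sided(mu, mu0, c, sigma, n):
    """Pr(|Xbar_n - mu0| >= c | mu) for a normal sample with known sigma.

    Hypothetical helper: implements the two-term expression
    1 - Phi(n^(1/2)(mu0 + c - mu)/sigma) + Phi(n^(1/2)(mu0 - c - mu)/sigma).
    """
    phi = NormalDist().cdf
    root_n = n ** 0.5
    upper = 1.0 - phi(root_n * (mu0 + c - mu) / sigma)  # Pr(Xbar_n >= mu0 + c | mu)
    lower = phi(root_n * (mu0 - c - mu) / sigma)         # Pr(Xbar_n <= mu0 - c | mu)
    return upper + lower


# Setting of the example: mu0 = 4, n = 15, sigma^2 = 9 (so sigma = 3).
for c in (1, 2, 3):
    values = [power_two_sided(mu, 4.0, c, 3.0, 15) for mu in (2.0, 4.0, 6.0)]
    print(c, [round(v, 4) for v in values])
```

For each fixed value of the mean, the power decreases as c grows, and each curve is symmetric about the hypothesized mean.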


Since the possibility of error exists in virtually every testing problem, we should
consider what kinds of errors we might make. For each value of $\theta \in \Omega_0$,
the decision to reject $H_0$ is an incorrect decision. Similarly, for each value of
$\theta \in \Omega_1$, the decision not to reject $H_0$ is an incorrect decision.


Definition 9.1.7 Type I/II Error. An erroneous decision to reject a true null
hypothesis is a type I error, or an error of the first kind. An erroneous decision not
to reject a false null hypothesis is called a type II error, or an error of the second
kind.


In terms of the power function, if $\theta \in \Omega_0$, $\pi(\theta \mid \delta)$ is
the probability that the statistician will make a type I error. Similarly, if
$\theta \in \Omega_1$, $1 - \pi(\theta \mid \delta)$ is the probability of making a
type II error. Of course, either $\theta \in \Omega_0$ or $\theta \in \Omega_1$, but
not both. Hence, only one type of error is possible conditional on $\theta$, but we
never know which it is.

If we have our choice between several tests, we would like to choose a test $\delta$
that has small probability of error. That is, we would like the power function
$\pi(\theta \mid \delta)$ to be low for values of $\theta \in \Omega_0$, and we would
like $\pi(\theta \mid \delta)$ to be high for $\theta \in \Omega_1$. Generally, these
two goals work against each other. That is, if we choose $\delta$ to make
$\pi(\theta \mid \delta)$ small for $\theta \in \Omega_0$, we will usually find that
$\pi(\theta \mid \delta)$ is small for $\theta \in \Omega_1$ as well. For example, the
test procedure $\delta_0$ that never rejects $H_0$, regardless of what data are
observed, will have $\pi(\theta \mid \delta_0) = 0$ for all $\theta \in \Omega_0$.
However, for this procedure $\pi(\theta \mid \delta_0) = 0$ for all
$\theta \in \Omega_1$ as well. Similarly, the test $\delta_1$ that always rejects $H_0$
will have $\pi(\theta \mid \delta_1) = 1$ for all $\theta \in \Omega_1$, but it will
also have $\pi(\theta \mid \delta_1) = 1$ for all $\theta \in \Omega_0$. Hence, there
is a need to strike an appropriate balance between the two goals of low power in
$\Omega_0$ and high power in $\Omega_1$.

The most popular method for striking a balance between the two goals is to
choose a number $\alpha_0$ between 0 and 1 and require that

$$\pi(\theta \mid \delta) \le \alpha_0, \quad \text{for all } \theta \in \Omega_0. \tag{9.1.6}$$


Then, among all tests that satisfy (9.1.6), the statistician seeks a test whose power
function is as high as can be obtained for $\theta \in \Omega_1$. This method is
discussed in Sections 9.2 and 9.3. Another method of balancing the probabilities of
type I and type II errors is to minimize a linear combination of the different
probabilities of error. We shall discuss this method in Sec. 9.2 and again in Sec. 9.8.


Note: Choosing Null and Alternative Hypotheses. If one chooses to balance type 
I and type II error probabilities by requiring (9.1.6), then one has introduced an 
asymmetry in the treatment of the null and alternative hypotheses. In most testing 
problems, such asymmetry can be quite natural. Generally, one of the two errors 
(type I or type II) is more costly or less palatable in some sense. It would make sense 
to put tighter controls on the probability of the more serious error. For this reason, 
one generally arranges the null and alternative hypotheses so that type I error is the 
error most to be avoided. For cases in which neither hypothesis is naturally the null, 
switching the names of null and alternative hypotheses can have a variety of different 
effects on the results of testing procedures. (See Exercise 21 in this section.) 


Example 9.1.6 Egyptian Skulls. In Example 9.1.2, suppose that the experimenters have a
theory saying that skull breadths should increase (albeit slightly) over long periods
of time. If $\mu$ is the mean breadth of skulls from 4000 B.C. and 140 is the mean
breadth of modern-day skulls, the theory would say $\mu < 140$. The experimenters
could mistakenly claim that the data support their theory ($\mu < 140$) when, in fact,
$\mu \ge 140$, or they might mistakenly claim that the data fail to support their
theory ($\mu \ge 140$) when, in fact, $\mu < 140$. In scientific studies, it is common
to treat the false confirmation of one's own theory as a more serious error than
falsely failing to confirm one's own theory. This would mean type I error should be to
say that $\mu < 140$ (confirm the theory, reject $H_0$) when, in fact, $\mu \ge 140$
(theory is false, $H_0$ is true). Traditionally, one includes the endpoints of interval
hypotheses in the null, so we would formulate the hypotheses to be tested as

$$H_0: \mu \ge 140,$$
$$H_1: \mu < 140,$$

as we did in Example 9.1.2. ◄


The quantities in Eq. (9.1.6) play a fundamental role in hypothesis testing and 
have special names. 


Definition 9.1.8 Level/Size. A test that satisfies (9.1.6) is called a level
$\alpha_0$ test, and we say that the test has level of significance $\alpha_0$. In
addition, the size $\alpha(\delta)$ of a test $\delta$ is defined as follows:

$$\alpha(\delta) = \sup_{\theta \in \Omega_0} \pi(\theta \mid \delta). \tag{9.1.7}$$


The following results are immediate consequences of Definition 9.1.8. 


A test $\delta$ is a level $\alpha_0$ test if and only if its size is at most
$\alpha_0$ (i.e., $\alpha(\delta) \le \alpha_0$). If the null hypothesis is simple,
that is, $H_0: \theta = \theta_0$, then the size of $\delta$ will be
$\alpha(\delta) = \pi(\theta_0 \mid \delta)$.


Example 9.1.7 Testing Hypotheses about a Uniform Distribution. Suppose that a random
sample $X_1, \ldots, X_n$ is taken from the uniform distribution on the interval
$[0, \theta]$, where the value of $\theta$ is unknown ($\theta > 0$); and suppose also
that it is desired to test the following hypotheses:

$$H_0: 3 \le \theta \le 4,$$
$$H_1: \theta < 3 \text{ or } \theta > 4. \tag{9.1.8}$$

We know from Example 7.5.7 that the M.L.E. of $\theta$ is
$Y_n = \max\{X_1, \ldots, X_n\}$. Although $Y_n$ must be less than $\theta$, there is a
high probability that $Y_n$ will be close to $\theta$ if the sample size $n$ is fairly
large. For illustrative purposes, suppose that the test $\delta$ does not reject $H_0$
if $2.9 < Y_n < 4$, and $\delta$ rejects $H_0$ if $Y_n$ does not lie in this interval.
Thus, the critical region of the test $\delta$ contains all the values of
$X_1, \ldots, X_n$ for which either $Y_n \le 2.9$ or $Y_n \ge 4$. In terms of the test
statistic $Y_n$, the rejection region is the union of two intervals
$(-\infty, 2.9] \cup [4, \infty)$.

The power function of $\delta$ is specified by the relation

$$\pi(\theta \mid \delta) = \Pr(Y_n \le 2.9 \mid \theta) + \Pr(Y_n \ge 4 \mid \theta).$$

If $\theta \le 2.9$, then $\Pr(Y_n \le 2.9 \mid \theta) = 1$ and
$\Pr(Y_n \ge 4 \mid \theta) = 0$. Therefore, $\pi(\theta \mid \delta) = 1$ if
$\theta \le 2.9$. If $2.9 < \theta \le 4$, then
$\Pr(Y_n \le 2.9 \mid \theta) = (2.9/\theta)^n$ and $\Pr(Y_n \ge 4 \mid \theta) = 0$.
In this case, $\pi(\theta \mid \delta) = (2.9/\theta)^n$. Finally, if $\theta > 4$,
then $\Pr(Y_n \le 2.9 \mid \theta) = (2.9/\theta)^n$ and
$\Pr(Y_n \ge 4 \mid \theta) = 1 - (4/\theta)^n$. In this case,
$\pi(\theta \mid \delta) = (2.9/\theta)^n + 1 - (4/\theta)^n$. The power function
$\pi(\theta \mid \delta)$ is sketched in Fig. 9.2.

By Eq. (9.1.7), the size of $\delta$ is
$\alpha(\delta) = \sup_{3 \le \theta \le 4} \pi(\theta \mid \delta)$. It can be seen
from Fig. 9.2 and the calculations just given that
$\alpha(\delta) = \pi(3 \mid \delta) = (29/30)^n$. In particular, if the sample size is
$n = 68$, then the size of $\delta$ is $(29/30)^{68} = 0.0997$. So $\delta$ is a level
$\alpha_0$ test for every level of significance $\alpha_0 \ge 0.0997$. ◄
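The piecewise power function of the uniform example, and the size obtained from it, can be checked with a few lines of code. This is a sketch only: the function name `power_uniform`, its default cutoffs, and the grid over the null interval are assumptions made for illustration.

```python
def power_uniform(theta, n, lo=2.9, hi=4.0):
    """Pr(Y_n <= lo or Y_n >= hi | theta), where Y_n is the maximum of n
    independent Uniform[0, theta] observations (hypothetical helper)."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    # Pr(Y_n <= y | theta) = (y / theta)^n when y <= theta, and 1 otherwise.
    p_low = min(1.0, lo / theta) ** n
    p_high = 1.0 - min(1.0, hi / theta) ** n
    return p_low + p_high


n = 68
# Approximate the supremum over the null set 3 <= theta <= 4 on a fine grid.
grid = [3 + i / 1000 for i in range(1001)]
size = max(power_uniform(theta, n) for theta in grid)
print(round(size, 4))
```

Because the power function is decreasing on the null interval, the grid maximum is attained at the left endpoint, in agreement with the closed-form size.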


Making a Test Have a Specific Significance Level 


Suppose that we wish to test the hypotheses

$$H_0: \theta \in \Omega_0,$$
$$H_1: \theta \in \Omega_1.$$

Let $T$ be a test statistic, and suppose that our test will reject the null hypothesis
if $T \ge c$, for some constant $c$. Suppose also that we desire our test to have the
level of significance $\alpha_0$. The power function of our test is
$\pi(\theta \mid \delta) = \Pr(T \ge c \mid \theta)$, and we want

$$\sup_{\theta \in \Omega_0} \Pr(T \ge c \mid \theta) \le \alpha_0. \tag{9.1.9}$$

It is clear that the power function, and hence the left side of (9.1.9), are
nonincreasing functions of $c$. Hence, (9.1.9) will be satisfied for large values of
$c$, but not for small values. If we want the power function to be as large as possible
for $\theta \in \Omega_1$, we should make $c$ as small as we can while still satisfying
(9.1.9). If $T$ has a continuous distribution, then it is usually simple to find an
appropriate $c$.

Figure 9.2 The power function $\pi(\theta \mid \delta)$ in Example 9.1.7.

Figure 9.3 The p.d.f. of $Y = \bar{X}_n - \mu_0$ given $\mu = \mu_0$ for Example 9.1.8.
The shaded areas represent the probability that $|Y| \ge c$.


Example 9.1.8 Testing Hypotheses about the Mean of a Normal Distribution with Known
Variance. In Example 9.1.5, our test is to reject $H_0: \mu = \mu_0$ if
$|\bar{X}_n - \mu_0| \ge c$. Since the null hypothesis is simple, the left side of
(9.1.9) reduces to the probability (assuming that $\mu = \mu_0$) that
$|\bar{X}_n - \mu_0| \ge c$. Since $Y = \bar{X}_n - \mu_0$ has the normal distribution
with mean 0 and variance $\sigma^2/n$ when $\mu = \mu_0$, we can find a value $c$ that
makes the size exactly $\alpha_0$ for each $\alpha_0$. Figure 9.3 shows the p.d.f. of
$Y$ and the size of the test indicated as the shaded area under the p.d.f. Since the
normal p.d.f. is symmetric around the mean (0 in this case), the two shaded areas must
be the same, namely, $\alpha_0/2$. This means that $c$ must be the $1 - \alpha_0/2$
quantile of the distribution of $Y$. This quantile is
$c = \Phi^{-1}(1 - \alpha_0/2)\,\sigma n^{-1/2}$.

When testing hypotheses about the mean of a normal distribution, it is traditional
to rewrite this test in terms of the statistic

$$Z = n^{1/2}\,\frac{\bar{X}_n - \mu_0}{\sigma}. \tag{9.1.10}$$

Then the test rejects $H_0$ if $|Z| \ge \Phi^{-1}(1 - \alpha_0/2)$. ◄
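The cutoff for the two-sided normal test and the equivalent rule based on the standardized statistic can be computed directly. The sketch below is illustrative only: the function names and the sample means passed to them are hypothetical, while the parameter values 10, 5, 1, and 0.05 are the ones quoted in the caption of Figure 9.4.

```python
from statistics import NormalDist


def critical_value(alpha0, sigma, n):
    """c = Phi^{-1}(1 - alpha0/2) * sigma * n^(-1/2), the cutoff for |Xbar_n - mu0|."""
    return NormalDist().inv_cdf(1 - alpha0 / 2) * sigma / n ** 0.5


def z_test_rejects(xbar, mu0, sigma, n, alpha0):
    """Two-sided test: reject H0 when |Z| >= Phi^{-1}(1 - alpha0/2)."""
    z = n ** 0.5 * (xbar - mu0) / sigma
    return abs(z) >= NormalDist().inv_cdf(1 - alpha0 / 2)


c = critical_value(0.05, 1.0, 10)
for xbar in (5.5, 5.7):  # hypothetical observed sample means
    # The two formulations of the test always agree.
    print(xbar, abs(xbar - 5.0) >= c, z_test_rejects(xbar, 5.0, 1.0, 10, 0.05))
```

Working with the standardized statistic has the practical advantage that the cutoff depends only on the chosen level, not on the standard deviation or the sample size.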


Figure 9.4 Power functions of two tests. The plot on the left is the power function of
the test from Example 9.1.8 with $n = 10$, $\mu_0 = 5$, $\sigma = 1$, and
$\alpha_0 = 0.05$. The plot on the right is the power function of the test from
Example 9.1.9 with $n = 10$, $p_0 = 0.3$, and $\alpha_0 = 0.1$.

Example 9.1.9 Testing Hypotheses about a Bernoulli Parameter. Suppose that
$X_1, \ldots, X_n$ form a random sample from the Bernoulli distribution with parameter
$p$. Suppose that we wish to test the hypotheses

$$H_0: p \le p_0,$$
$$H_1: p > p_0. \tag{9.1.11}$$

Let $Y = \sum_{i=1}^n X_i$, which has the binomial distribution with parameters $n$ and
$p$. The larger $p$ is, the larger we expect $Y$ to be. So, suppose that we choose to
reject $H_0$ if $Y \ge c$, for some constant $c$. Suppose also that we want the size of
the test to be as close to $\alpha_0$ as possible without exceeding $\alpha_0$. It is
easy to check that $\Pr(Y \ge c \mid p)$ is an increasing function of $p$; hence, the
size of the test will be $\Pr(Y \ge c \mid p = p_0)$. So, $c$ should be the smallest
number such that $\Pr(Y \ge c \mid p = p_0) \le \alpha_0$. For example, if $n = 10$,
$p_0 = 0.3$, and $\alpha_0 = 0.1$, we can use the table of binomial probabilities in
the back of this book to determine $c$. We can compute
$\sum_{y=6}^{10} \Pr(Y = y \mid p = 0.3) = 0.0473$ and
$\sum_{y=5}^{10} \Pr(Y = y \mid p = 0.3) = 0.1503$. In order to keep the size of the
test at most 0.1, we must choose $c > 5$. Every value of $c$ in the interval $(5, 6]$
produces the same test, since $Y$ takes only integer values. ◄
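In place of a table of binomial probabilities, the tail sums and the cutoff for the Bernoulli test can be computed directly. This is a minimal sketch; the function names are hypothetical, and the search is restricted to integer cutoffs, which is enough because the test statistic takes only integer values.

```python
from math import comb


def upper_tail(c, n, p):
    """Pr(Y >= c | p) for Y ~ Binomial(n, p) and an integer cutoff c."""
    return sum(comb(n, y) * p ** y * (1 - p) ** (n - y) for y in range(max(c, 0), n + 1))


def smallest_cutoff(n, p0, alpha0):
    """Smallest integer c with Pr(Y >= c | p = p0) <= alpha0.

    c = n + 1 always qualifies, because the upper tail is then empty.
    """
    for c in range(n + 2):
        if upper_tail(c, n, p0) <= alpha0:
            return c


print(round(upper_tail(6, 10, 0.3), 4), round(upper_tail(5, 10, 0.3), 4))
print(smallest_cutoff(10, 0.3, 0.1))
```

With ten trials, a null value of 0.3, and level 0.1, the search returns 6, the right endpoint of the interval of equivalent cutoffs.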


Whenever we choose a test procedure, we should also examine the power function. If one
has made a good choice, then the power function should generally be larger for
$\theta \in \Omega_1$ than for $\theta \in \Omega_0$. Also, the power function should
increase as $\theta$ moves away from $\Omega_0$. For example, Fig. 9.4 shows plots of
the power functions for two of the examples in this section. In both cases, the power
function increases as the parameter moves away from $\Omega_0$.


The p-value 


Example 9.1.10 Testing Hypotheses about the Mean of a Normal Distribution with Known
Variance. In Example 9.1.8, suppose that we choose to test the null hypothesis at level
$\alpha_0 = 0.05$. We would then compute the test statistic in Eq. (9.1.10) and reject
$H_0$ if $|Z| \ge \Phi^{-1}(1 - 0.05/2) = 1.96$. For example, suppose that $Z = 2.78$
is observed. Then we would reject $H_0$. Suppose that we were to report the result by
saying that we rejected $H_0$ at level 0.05. What would another statistician, who felt
it more appropriate to test the null hypothesis at a different level, be able to do
with this report? ◄


The result of a test of hypotheses might appear to be a rather inefficient use of
our data. For instance, in Example 9.1.10, we decided to reject $H_0$ at level
$\alpha_0 = 0.05$ if the statistic $Z$ in Eq. (9.1.10) is at least 1.96. This means
that whether we observe $Z = 1.97$ or $Z = 6.97$, we shall report the same result,
namely, that we rejected $H_0$ at level 0.05. The report of the test result does not
carry any sense of how close we were to making the other decision. Furthermore, if
another statistician chooses to use a size 0.01 test, then she would not reject $H_0$
with $Z = 1.97$, but she would reject $H_0$ with $Z = 6.97$. What would she do with
$Z = 2.78$?

For these reasons, an experimenter does not typically choose a value of $\alpha_0$ in
advance of the experiment and then simply report whether or not $H_0$ was rejected
at level $\alpha_0$. In many fields of application, it has become standard practice to
report, in addition to the observed value of the appropriate test statistic such as
$Z$, all the values of $\alpha_0$ for which the level $\alpha_0$ test would lead to the
rejection of $H_0$.


Example 9.1.11 Testing Hypotheses about the Mean of a Normal Distribution with Known
Variance. As the observed value of $Z$ in Example 9.1.8 is 2.78, the hypothesis $H_0$
would be rejected for every level of significance $\alpha_0$ such that
$2.78 \ge \Phi^{-1}(1 - \alpha_0/2)$. Using the table of the normal distribution given
at the end of this book, this inequality translates to $\alpha_0 \ge 0.0054$. The value
0.0054 is called the p-value for the observed data and the tested hypotheses. Since
$0.01 > 0.0054$, the statistician who wanted to test the hypotheses at level 0.01
would also reject $H_0$. ◄
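The translation from the observed statistic to the p-value can be reproduced without a normal table. The sketch below assumes the two-sided normal test discussed above; the function name `two_sided_p_value` is a hypothetical label chosen for illustration.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Smallest alpha0 with |z| >= Phi^{-1}(1 - alpha0/2), i.e. 2 * (1 - Phi(|z|))."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


p = two_sided_p_value(2.78)
print(p)
# Rejecting at level alpha0 is the same as having p-value <= alpha0.
for alpha0 in (0.05, 0.01, 0.001):
    cutoff = NormalDist().inv_cdf(1 - alpha0 / 2)
    print(alpha0, 2.78 >= cutoff, p <= alpha0)
```

The two Boolean columns printed in the loop agree for each level, which illustrates why reporting the p-value lets every reader carry out the test at a level of their own choosing.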


Definition 9.1.9 p-value. In general, the p-value is the smallest level $\alpha_0$ such
that we would reject the null hypothesis at level $\alpha_0$ with the observed data.


An experimenter who rejects a null hypothesis if and only if the p-value is at most
$\alpha_0$ is using a test with level of significance $\alpha_0$. Similarly, an
experimenter who wants a level $\alpha_0$ test will reject the null hypothesis if and
only if the p-value is at most $\alpha_0$. For this reason, the p-value is sometimes
called the observed level of significance.

An experimenter in Example 9.1.10 would typically report that the observed
value of $Z$ was 2.78 and that the corresponding p-value was 0.0054. It is then said
that the observed value of $Z$ is just significant at the level of significance 0.0054.
One advantage to the experimenter of reporting experimental results in this manner is
that he does not need to select beforehand an arbitrary level of significance
$\alpha_0$ at which to carry out the test. Also, when a reader of the experimenter's
report learns that the observed value of $Z$ was just significant at the level of
significance 0.0054, she immediately knows that $H_0$ would be rejected for every
larger value of $\alpha_0$ and would not be rejected for any smaller value.


Calculating p-values If all of our tests are of the form "reject the null hypothesis
when $T \ge c$" for a single test statistic $T$, there is a straightforward way to
compute p-values. For each $t$, let $\delta_t$ be the test that rejects $H_0$ if
$T \ge t$. Then the p-value when $T = t$ is observed is the size of the test
$\delta_t$. (See Exercise 18.) That is, the p-value equals

$$\sup_{\theta \in \Omega_0} \pi(\theta \mid \delta_t) = \sup_{\theta \in \Omega_0} \Pr(T \ge t \mid \theta). \tag{9.1.12}$$

Typically, $\pi(\theta \mid \delta_t)$ is maximized at some $\theta_0$ on the boundary
between $\Omega_0$ and $\Omega_1$. Because the p-value is calculated as a probability
in the upper tail of the distribution of $T$, it is sometimes called a tail area.


Example 9.1.12 Testing Hypotheses about a Bernoulli Parameter. For testing the
hypotheses (9.1.11) in Example 9.1.9, we used a test that rejects $H_0$ if $Y \ge c$.
The p-value, when $Y = y$ is observed, will be $\sup_{p \le p_0} \Pr(Y \ge y \mid p)$.
In this example, it is easy to see that $\Pr(Y \ge y \mid p)$ increases as a function
of $p$. Hence, the p-value is $\Pr(Y \ge y \mid p = p_0)$. For example, let
$p_0 = 0.3$ and $n = 10$. If $Y = 6$ is observed, then
$\Pr(Y \ge 6 \mid p = 0.3) = 0.0473$, as we calculated in Example 9.1.9. ◄


The calculation of the p-value is more complicated when the test cannot be put
into the form "reject $H_0$ if $T \ge c$." In this text, we shall calculate p-values
only for tests that do have this form.


Equivalence of Tests and Confidence Sets 


Example 9.1.13 Rain from Seeded Clouds. In Examples 8.5.5 and 8.5.6, we found a
coefficient $\gamma$ one-sided (lower limit) confidence interval for $\mu$, the mean
log-rainfall from seeded clouds. For $\gamma = 0.9$, the observed interval is
$(4.727, \infty)$. One of the controversial interpretations of this interval is that we
have confidence 0.9 (whatever that means) that $\mu > 4.727$. Although this statement
is deliberately ambiguous and difficult to interpret, it sounds as if it could help us
address the problem of testing the hypotheses $H_0: \mu \le 4$ versus $H_1: \mu > 4$.
Does the fact that 4 is not in the observed coefficient 0.9 confidence interval tell us
anything about whether or not we should reject $H_0$ at some significance level or
other? ◄


We shall now illustrate how confidence intervals (see Sec. 8.5) can be used as an
alternative method to report the results of a test of hypotheses. In particular, we
shall show that a coefficient $\gamma$ confidence set (a generalization of confidence
interval to be defined shortly) can be thought of as a set of null hypotheses that
would not be rejected at significance level $1 - \gamma$.


Theorem 9.1.1 Defining Confidence Sets from Tests. Let $X = (X_1, \ldots, X_n)$ be a
random sample from a distribution that depends on a parameter $\theta$. Let
$g(\theta)$ be a function, and suppose that for each possible value $g_0$ of
$g(\theta)$, there is a level $\alpha_0$ test $\delta_{g_0}$ of the hypotheses

$$H_{0,g_0}: g(\theta) = g_0, \qquad H_{1,g_0}: g(\theta) \ne g_0. \tag{9.1.13}$$

For each possible value $x$ of $X$, define

$$\omega(x) = \{g_0 : \delta_{g_0} \text{ does not reject } H_{0,g_0} \text{ if } X = x \text{ is observed}\}. \tag{9.1.14}$$

Let $\gamma = 1 - \alpha_0$. Then, the random set $\omega(X)$ satisfies

$$\Pr[g(\theta_0) \in \omega(X) \mid \theta = \theta_0] \ge \gamma \tag{9.1.15}$$

for all $\theta_0 \in \Omega$.


Proof Let $\theta_0$ be an arbitrary element of $\Omega$, and define
$g_0 = g(\theta_0)$. Because $\delta_{g_0}$ is a level $\alpha_0$ test, we know that

$$\Pr[\delta_{g_0} \text{ does not reject } H_{0,g_0} \mid \theta = \theta_0] \ge 1 - \alpha_0 = \gamma. \tag{9.1.16}$$

For each $x$, we know that $g(\theta_0) \in \omega(x)$ if and only if the test
$\delta_{g_0}$ does not reject $H_{0,g_0}$ when $X = x$ is observed. It follows that
the left-hand side of Eq. (9.1.15) is the same as the left-hand side of Eq. (9.1.16). ∎


Definition 9.1.10 Confidence Set. If a random set $\omega(X)$ satisfies (9.1.15) for
every $\theta_0 \in \Omega$, we call it a coefficient $\gamma$ confidence set for
$g(\theta)$. If the inequality in (9.1.15) is equality for all $\theta_0$, then we call
the confidence set exact.


A confidence set is a generalization of the concept of a confidence interval introduced
in Sec. 8.5. What Theorem 9.1.1 shows is that a collection of level $\alpha_0$ tests of
the hypotheses (9.1.13) can be used to construct a coefficient
$\gamma = 1 - \alpha_0$ confidence set for $g(\theta)$. The reverse construction is
also possible.


Theorem 9.1.2 Defining Tests from Confidence Sets. Let $X = (X_1, \ldots, X_n)$ be a
random sample from a distribution that depends on a parameter $\theta$. Let
$g(\theta)$ be a function of $\theta$, and let $\omega(X)$ be a coefficient $\gamma$
confidence set for $g(\theta)$. For each possible value $g_0$ of $g(\theta)$, construct
the following test $\delta_{g_0}$ of the hypotheses in Eq. (9.1.13): $\delta_{g_0}$
does not reject $H_{0,g_0}$ if and only if $g_0 \in \omega(X)$. Then $\delta_{g_0}$ is
a level $\alpha_0 = 1 - \gamma$ test of the hypotheses in Eq. (9.1.13).


Proof Because $\omega(X)$ is a coefficient $\gamma$ confidence set for $g(\theta)$, it
satisfies Eq. (9.1.15) for all $\theta_0 \in \Omega$. As in the proof of Theorem 9.1.1,
the left-hand sides of Eqs. (9.1.15) and (9.1.16) are the same, which makes
$\delta_{g_0}$ a level $\alpha_0$ test. ∎


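Example 9.1.14 A Confidence Interval for the Mean of a Normal Distribution. Consider
the test found in Example 9.1.8 for the hypotheses (9.1.2). Let
$\alpha_0 = 1 - \gamma$. The size $\alpha_0$ test $\delta_{\mu_0}$ is to reject $H_0$
if $|\bar{X}_n - \mu_0| \ge \Phi^{-1}(1 - \alpha_0/2)\,\sigma n^{-1/2}$. If
$\bar{X}_n = \bar{x}_n$ is observed, the set of $\mu_0$ such that we would not reject
$H_0$ is the set of $\mu_0$ such that

$$|\bar{x}_n - \mu_0| < \Phi^{-1}\left(1 - \frac{\alpha_0}{2}\right)\sigma n^{-1/2}.$$

This inequality easily translates to

$$\bar{x}_n - \Phi^{-1}\left(1 - \frac{\alpha_0}{2}\right)\sigma n^{-1/2} < \mu_0 < \bar{x}_n + \Phi^{-1}\left(1 - \frac{\alpha_0}{2}\right)\sigma n^{-1/2}.$$

The coefficient $\gamma$ confidence interval becomes

$$(A, B) = \left(\bar{X}_n - \Phi^{-1}\left(1 - \frac{\alpha_0}{2}\right)\sigma n^{-1/2},\ \bar{X}_n + \Phi^{-1}\left(1 - \frac{\alpha_0}{2}\right)\sigma n^{-1/2}\right).$$

It is easy to check that $\Pr(A < \mu_0 < B \mid \mu = \mu_0) = \gamma$ for all
$\mu_0$. This confidence interval is exact. ◄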
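The correspondence between the two-sided normal test and this interval can be checked numerically: a hypothesized mean lies inside the interval exactly when the test with that null value does not reject. The sketch below uses hypothetical data values (sample mean 5.3, standard deviation 1, sample size 10) and hypothetical function names.

```python
from statistics import NormalDist


def z_interval(xbar, sigma, n, gamma):
    """Coefficient-gamma interval obtained by inverting the level 1 - gamma z test."""
    alpha0 = 1 - gamma
    half_width = NormalDist().inv_cdf(1 - alpha0 / 2) * sigma / n ** 0.5
    return xbar - half_width, xbar + half_width


def not_rejected(mu0, xbar, sigma, n, alpha0):
    """True when the two-sided z test of H0: mu = mu0 does not reject."""
    cutoff = NormalDist().inv_cdf(1 - alpha0 / 2) * sigma / n ** 0.5
    return abs(xbar - mu0) < cutoff


a, b = z_interval(5.3, 1.0, 10, 0.95)  # illustrative data, not from the text
for mu0 in (4.5, 4.7, 5.0, 5.9, 6.0):
    print(mu0, a < mu0 < b, not_rejected(mu0, 5.3, 1.0, 10, 0.05))
```

The two Boolean columns coincide for every candidate value, which is the content of Theorems 9.1.1 and 9.1.2 in this special case.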


Example 9.1.15 Constructing a Test from a Confidence Interval. In Sec. 8.5, we learned
how to construct a confidence interval for the unknown mean of a normal distribution
when the variance was also unknown. Let $X_1, \ldots, X_n$ be a random sample from a
normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$. In this
case, the parameter is $\theta = (\mu, \sigma^2)$, and we are interested in
$g(\theta) = \mu$. In Sec. 8.5, we used the statistics

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i, \qquad \sigma' = \left(\frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2\right)^{1/2}. \tag{9.1.17}$$

The coefficient $\gamma$ confidence interval for $g(\theta)$ is the interval

$$\left(\bar{X}_n - T_{n-1}^{-1}\left(\frac{1+\gamma}{2}\right)\frac{\sigma'}{n^{1/2}},\ \bar{X}_n + T_{n-1}^{-1}\left(\frac{1+\gamma}{2}\right)\frac{\sigma'}{n^{1/2}}\right), \tag{9.1.18}$$

where $T_{n-1}^{-1}(\cdot)$ is the quantile function of the $t$ distribution with
$n - 1$ degrees of freedom. For each $\mu_0$, we can use this interval to find a level
$\alpha_0 = 1 - \gamma$ test of the hypotheses

$$H_0: \mu = \mu_0,$$
$$H_1: \mu \ne \mu_0.$$

The test will reject $H_0$ if $\mu_0$ is not in the interval (9.1.18). A little algebra
shows that $\mu_0$ is not in the interval (9.1.18) if and only if

$$\left| n^{1/2}\,\frac{\bar{X}_n - \mu_0}{\sigma'} \right| \ge T_{n-1}^{-1}\left(\frac{1+\gamma}{2}\right).$$

This test is identical to the $t$ test that we shall study in more detail in Sec. 9.5. ◄


One-Sided Confidence Intervals and Tests Theorems 9.1.1 and 9.1.2 establish the
equivalence between confidence sets and tests of hypotheses of the form (9.1.13). It
is often necessary to test other forms of hypotheses, and it would be nice to have
versions of Theorems 9.1.1 and 9.1.2 to deal with these cases. Example 9.1.13 is one
such case in which the hypotheses are of the form

$$H_{0,g_0}: g(\theta) \le g_0, \qquad H_{1,g_0}: g(\theta) > g_0. \tag{9.1.19}$$

Theorem 9.1.1 extends immediately to such cases. We leave the proof of Theorem 9.1.3
to the reader.


Theorem 9.1.3 One-Sided Confidence Intervals from One-Sided Tests. Let
$X = (X_1, \ldots, X_n)$ be a random sample from a distribution that depends on a
parameter $\theta$. Let $g(\theta)$ be a real-valued function, and suppose that for
each possible value $g_0$ of $g(\theta)$, there is a level $\alpha_0$ test
$\delta_{g_0}$ of the hypotheses (9.1.19). For each possible value $x$ of $X$, define
$\omega(x)$ by Eq. (9.1.14). Let $\gamma = 1 - \alpha_0$. Then the random set
$\omega(X)$ satisfies Eq. (9.1.15) for all $\theta_0 \in \Omega$. ∎


Example 9.1.16 One-Sided Confidence Interval for a Bernoulli Parameter. In
Example 9.1.9, we showed how to construct a level $\alpha_0$ test of the one-sided
hypotheses (9.1.11). Let $Y = \sum_{i=1}^n X_i$. The test rejects $H_0$ if
$Y \ge c(p_0)$, where $c(p_0)$ is the smallest number $c$ such that
$\Pr(Y \ge c \mid p = p_0) \le \alpha_0$. After observing the data $X$, we can check,
for each $p_0$, whether or not we reject $H_0$. That is, for each $p_0$ we check
whether or not $Y \ge c(p_0)$. All those $p_0$ for which $Y < c(p_0)$ (i.e., we don't
reject $H_0$) will form an interval $\omega(X)$. This interval will satisfy
$\Pr(p_0 \in \omega(X) \mid p = p_0) \ge 1 - \alpha_0$ for all $p_0$. For example,
suppose that $n = 10$, $\alpha_0 = 0.1$, and $Y = 6$ is observed. In order not to
reject $H_0: p \le p_0$ at level 0.1, we must have a rejection region that does not
contain 6. This will happen if and only if $\Pr(Y \ge 6 \mid p = p_0) > 0.1$. By
trying various values of $p_0$, we find that this inequality holds for all
$p_0 > 0.3542$. So, if $Y = 6$ is observed, our coefficient 0.9 confidence interval is
$(0.3542, 1)$. Notice that 0.3 is not in the interval, so we would reject
$H_0: p \le 0.3$ with a level 0.1 test as we did in Example 9.1.9. For other observed
values $Y = y$, the confidence intervals will all be of the form $(q(y), 1)$, where
$q(y)$ can be computed as outlined in Exercise 17. For $n = 10$ and $\alpha_0 = 0.1$,
the values of $q(y)$ are

y       0    1        2        3        4        5        6        7        8        9        10
q(y)    0    0.0104   0.0545   0.1158   0.1875   0.2673   0.3542   0.4482   0.5503   0.6631   0.7943

This confidence interval is not exact. ◄
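The phrase "by trying various values" can be made systematic with a root search. The sketch below uses plain bisection, which is an assumption made here for illustration and not necessarily the method outlined in Exercise 17; the function names are hypothetical. It relies on the upper-tail probability being increasing in the Bernoulli parameter.

```python
from math import comb


def upper_tail(y, n, p):
    """Pr(Y >= y | p) for Y ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(y, n + 1))


def lower_limit(y, n, alpha0, tol=1e-10):
    """Lower endpoint q(y): the value p0 at which Pr(Y >= y | p0) equals alpha0.

    Every p0 above q(y) gives a tail probability greater than alpha0,
    so H0: p <= p0 is not rejected when Y = y is observed.
    """
    if y == 0:
        return 0.0  # Pr(Y >= 0 | p0) = 1 > alpha0 for every p0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if upper_tail(y, n, mid) > alpha0:
            hi = mid  # mid is not rejected, so the endpoint lies at or below mid
        else:
            lo = mid
    return (lo + hi) / 2


print([round(lower_limit(y, 10, 0.1), 4) for y in range(11)])
```

When all ten trials are successes, the endpoint has the closed form given by the tenth root of 0.1, which provides a quick check on the search.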


Unfortunately, Theorem 9.1.2 does not immediately extend to one-sided hypotheses for
the following reason. The size of a one-sided test for hypotheses of the form (9.1.19)
depends on all of the values of $\theta$ such that $g(\theta) \le g_0$, not just on
those for which $g(\theta) = g_0$. In particular, the size of the test $\delta_{g_0}$
defined in Theorem 9.1.2 is

$$\sup_{\{\theta : g(\theta) \le g_0\}} \Pr[g_0 \notin \omega(X) \mid \theta]. \tag{9.1.20}$$

The confidence coefficient, on the other hand, is

$$1 - \sup_{\{\theta : g(\theta) = g_0\}} \Pr[g_0 \notin \omega(X) \mid \theta].$$

If we could prove that the supremum in Eq. (9.1.20) occurred at a $\theta$ for which
$g(\theta) = g_0$, then the size of the test would be 1 minus the confidence
coefficient. Most of the cases with which we shall deal in this book will have the
property that the supremum in Eq. (9.1.20) does indeed occur at a $\theta$ for which
$g(\theta) = g_0$. Example 9.1.16 is one such case. Example 9.1.13 is another. The
following example is the general version of what we need in Example 9.1.13.


Example 9.1.17 One-Sided Tests and Confidence Intervals for a Normal Mean with Unknown
Variance. Let $X_1, \ldots, X_n$ be a random sample from a normal distribution with
unknown mean $\mu$ and unknown variance $\sigma^2$. Here $\theta = (\mu, \sigma^2)$.
Let $g(\theta) = \mu$. In Theorem 8.5.1, we found that

$$\left(\bar{X}_n - T_{n-1}^{-1}(\gamma)\,\frac{\sigma'}{n^{1/2}},\ \infty\right) \tag{9.1.21}$$

is a one-sided coefficient $\gamma$ confidence interval for $g(\theta)$. Now, suppose
that we use this interval to test hypotheses. We shall reject the null hypothesis that
$\mu = \mu_0$ if $\mu_0$ is not in the interval (9.1.21). It is easy to see that
$\mu_0$ is not in the interval (9.1.21) if and only if
$\bar{X}_n \ge \mu_0 + \sigma' n^{-1/2}\,T_{n-1}^{-1}(\gamma)$. Such a test would seem
to make sense for testing the hypotheses

$$H_0: \mu \le \mu_0, \qquad H_1: \mu > \mu_0. \tag{9.1.22}$$

In particular, in Example 9.1.13, the fact that 4 is not in the observed confidence
interval means that the test constructed above (with $\mu_0 = 4$ and $\gamma = 0.9$)
would reject $H_0: \mu \le 4$ at level $\alpha_0 = 0.1$. ◄


The test constructed in Example 9.1.17 is another $t$ test that we shall study in
Sec. 9.5. In particular, we will show in Sec. 9.5 that this $t$ test is a level
$1 - \gamma$ test. In Exercise 19, you can find the one-sided confidence interval that
corresponds to testing the reverse hypotheses.


Likelihood Ratio Tests 


A very popular form of hypothesis test is the likelihood ratio test. We shall give a
partial theoretical justification for likelihood ratio tests in Sec. 9.2. Such tests are
based on the likelihood function $f_n(x \mid \theta)$. (See Definition 7.2.3 on
page 390.) The likelihood function tends to be highest near the true value of
$\theta$. Indeed, this is why maximum likelihood estimation works well in so many
cases. Now, suppose that we wish to test the hypotheses

$$H_0: \theta \in \Omega_0,$$
$$H_1: \theta \in \Omega_1. \tag{9.1.23}$$

In order to compare these two hypotheses, we might wish to see whether the likelihood
function is higher on $\Omega_0$ or on $\Omega_1$, and if not, how much smaller the
likelihood function is on $\Omega_0$. When we computed M.L.E.'s, we maximized the
likelihood function over the entire parameter space $\Omega$. In particular, we
calculated $\sup_{\theta \in \Omega} f_n(x \mid \theta)$. If we restrict attention to
$H_0$, then we can compute the largest value of the likelihood among those parameter
values in $\Omega_0$: $\sup_{\theta \in \Omega_0} f_n(x \mid \theta)$. The ratio of
these two suprema can then be used for testing the hypotheses (9.1.23).


Definition 9.1.11 Likelihood Ratio Test. The statistic

$$\Lambda(x) = \frac{\sup_{\theta \in \Omega_0} f_n(x \mid \theta)}{\sup_{\theta \in \Omega} f_n(x \mid \theta)} \tag{9.1.24}$$

is called the likelihood ratio statistic. A likelihood ratio test of hypotheses (9.1.23)
is to reject $H_0$ if $\Lambda(x) \le k$ for some constant $k$.


In words, a likelihood ratio test rejects $H_0$ if the likelihood function on
$\Omega_0$ is sufficiently small compared to the likelihood function on all of
$\Omega$. Generally, $k$ is chosen so that the test has a desired level $\alpha_0$, if
that is possible.


Example 9.1.18 Likelihood Ratio Test of Two-Sided Hypotheses about a Bernoulli Parameter.
Suppose that we shall observe $Y$, the number of successes in $n$ independent Bernoulli trials
with unknown parameter $\theta$. Consider the hypotheses $H_0: \theta = \theta_0$ versus
$H_1: \theta \ne \theta_0$. After the value $Y = y$ has been observed, the likelihood function is

\[
f(y \mid \theta) = \binom{n}{y} \theta^y (1 - \theta)^{n-y}.
\]

In this case, $\Omega_0 = \{\theta_0\}$ and $\Omega = [0, 1]$. The likelihood ratio statistic is

\[
\Lambda(y) = \frac{\theta_0^y (1 - \theta_0)^{n-y}}{\sup_{\theta \in [0,1]} \theta^y (1 - \theta)^{n-y}}. \tag{9.1.25}
\]

The supremum in the denominator of Eq. (9.1.25) can be found as in Example 7.5.4.
The maximum occurs where $\theta$ equals the M.L.E., $\hat{\theta} = y/n$. So,

\[
\Lambda(y) = \left(\frac{n\theta_0}{y}\right)^{y} \left(\frac{n(1 - \theta_0)}{n - y}\right)^{n-y}.
\]

It is not difficult to see that $\Lambda(y)$ is small for $y$ near 0 and near $n$ and largest near
$y = n\theta_0$. As a specific example, suppose that $n = 10$ and $\theta_0 = 0.3$. Table 9.1 shows
the 11 possible values of $\Lambda(y)$ for $y = 0, \ldots, 10$. If we desired a test with level of
significance $\alpha_0$, we would order the values of $y$ according to values of $\Lambda(y)$ from
smallest to largest and choose $k$ so that the sum of the probabilities
$\Pr(Y = y \mid \theta = 0.3)$ corresponding to those values of $y$ with $\Lambda(y) \le k$ was at
most $\alpha_0$. For example, if $\alpha_0 = 0.05$, we see from Table 9.1 that we can add up the
probabilities corresponding to $y = 10, 9, 8, 7, 0$ to get 0.039. But if we include $y = 6$,
corresponding to the next smallest value of $\Lambda(y)$, the sum jumps to 0.076, which is too
large. The set $y \in \{10, 9, 8, 7, 0\}$ corresponds to $\Lambda(y) \le k$ for every $k$ in the
half-open interval $[0.028, 0.147)$. The size of the test that rejects $H_0$ when
$y \in \{10, 9, 8, 7, 0\}$ is 0.039. ◄
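The construction in Example 9.1.18 can be checked numerically. The following is a minimal sketch, assuming only the formulas stated in the example; the function names are ours and purely illustrative. It orders the values of $y$ by $\Lambda(y)$ and keeps adding them to the rejection set while the total null probability stays at or below $\alpha_0$. Ties in $\Lambda(y)$ do not occur for $n = 10$ and $\theta_0 = 0.3$, so the sketch does not treat them specially.

```python
from math import comb


def likelihood_ratio(y, n, theta0):
    """Lambda(y) from Eq. (9.1.25), with the supremum attained at y/n."""
    theta_hat = y / n
    numerator = theta0**y * (1 - theta0) ** (n - y)
    # Python evaluates 0.0**0 as 1.0, which is the convention needed at y = 0 and y = n.
    denominator = theta_hat**y * (1 - theta_hat) ** (n - y)
    return numerator / denominator


def binomial_pf(y, n, theta):
    """Pr(Y = y | theta) for the binomial distribution."""
    return comb(n, y) * theta**y * (1 - theta) ** (n - y)


def lrt_rejection_set(n, theta0, alpha0):
    """Add values of y in increasing order of Lambda(y) while the size stays <= alpha0."""
    ordered = sorted(range(n + 1), key=lambda y: likelihood_ratio(y, n, theta0))
    rejected, size = [], 0.0
    for y in ordered:
        p = binomial_pf(y, n, theta0)
        if size + p > alpha0:
            break
        rejected.append(y)
        size += p
    return sorted(rejected), size


rejected, size = lrt_rejection_set(n=10, theta0=0.3, alpha0=0.05)
print(rejected, round(size, 3))
```

With $n = 10$, $\theta_0 = 0.3$, and $\alpha_0 = 0.05$, the sketch should recover the rejection set and the size 0.039 reported in the example.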


Table 9.1 Values of the likelihood ratio statistic in Example 9.1.18

$y$                               0      1      2      3      4      5      6      7      8      9                   10
$\Lambda(y)$                      0.028  0.312  0.773  1.000  0.797  0.418  0.147  0.034  0.005  $3 \times 10^{-4}$  $6 \times 10^{-6}$
$\Pr(Y = y \mid \theta = 0.3)$    0.028  0.121  0.233  0.267  0.200  0.103  0.037  0.009  0.001  $1 \times 10^{-4}$  $6 \times 10^{-6}$


Likelihood ratio tests are most popular in problems involving large sample sizes. The
following result, whose precise statement and proof are beyond the scope of this text,
shows how to use them in such cases.


Theorem 9.1.4 Large-Sample Likelihood Ratio Tests. Let $\Omega$ be an open subset of
$p$-dimensional space, and suppose that $H_0$ specifies that $k$ coordinates of $\theta$ are
equal to $k$ specific values. Assume that $H_0$ is true and that the likelihood function
satisfies the conditions needed to prove that the M.L.E. is asymptotically normal and
asymptotically efficient. (See page 523.) Then, as $n \to \infty$,
$-2 \log \Lambda(\boldsymbol{X})$ converges in distribution to the $\chi^2$ distribution with
$k$ degrees of freedom. ■


Example 9.1.19 Likelihood Ratio Test of Two-Sided Hypotheses about a Bernoulli Parameter.
We shall apply the idea in Theorem 9.1.4 to the case at the end of Example 9.1.18. Set
$\Omega = (0, 1)$ so that $p = 1$ and $k = 1$. To get an approximate level $\alpha_0$ test, we
would reject $H_0$ if $-2 \log \Lambda(y)$ is greater than the $1 - \alpha_0$ quantile of the
$\chi^2$ distribution with one degree of freedom. With $\alpha_0 = 0.05$, this quantile is 3.841.
By taking logarithms of the numbers in the $\Lambda(y)$ row of Table 9.1, one sees that
$-2 \log \Lambda(y) > 3.841$ for $y \in \{10, 9, 8, 7, 0\}$. Rejecting $H_0$ when
$-2 \log \Lambda(y) > 3.841$ is then the same test as we constructed in Example 9.1.18. ◄
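The comparison in Example 9.1.19 can be reproduced with a short computation. This is a sketch under the assumptions of the example ($n = 10$, $\theta_0 = 0.3$), using the quantile 3.841 quoted in the text rather than a library call; the helper name `neg2_log_lambda` is ours.

```python
from math import log

CHI2_1DF_Q95 = 3.841  # 0.95 quantile of the chi-square distribution with 1 d.f., as quoted in the text


def neg2_log_lambda(y, n, theta0):
    """-2 log Lambda(y) for the binomial likelihood ratio statistic of Example 9.1.18."""
    theta_hat = y / n
    log_lik_null = y * log(theta0) + (n - y) * log(1 - theta0)
    # Terms with a zero count are skipped, which implements the convention 0 * log(0) = 0.
    log_lik_max = sum(
        count * log(prob)
        for count, prob in ((y, theta_hat), (n - y, 1 - theta_hat))
        if count > 0
    )
    return -2.0 * (log_lik_null - log_lik_max)


rejected = [y for y in range(11) if neg2_log_lambda(y, 10, 0.3) > CHI2_1DF_Q95]
print(rejected)
```

Note that $y = 6$ gives $-2 \log \Lambda(6) \approx 3.8408$, which falls just below the cutoff, so the agreement between the two tests in this example is a close call rather than a general rule.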


Theorem 9.1.4 can also be applied if the null hypothesis specifies that a collection
of $k$ functions of $\theta$ are equal to $k$ specific values. For example, suppose that the
parameter is $\theta = (\mu, \sigma^2)$, and we wish to test $H_0: (\mu - 2)/\sigma = 1$ versus
$H_1: (\mu - 2)/\sigma \ne 1$. We could first transform to the equivalent parameter
$\theta' = ([\mu - 2]/\sigma, \sigma)$ and then apply Theorem 9.1.4. Because of the invariance
property of M.L.E.'s (Theorem 7.6.1, which extends to multidimensional parameters), one does
not actually need to perform the transformation in order to compute $\Lambda$. One merely needs
to maximize the likelihood function over the two sets $\Omega_0$ and $\Omega$ and take the ratio.

On a final note, one must be careful not to apply Theorem 9.1.4 to problems of
one-sided hypothesis testing. In such cases, $\Lambda(\boldsymbol{X})$ usually has a distribution
that is neither discrete nor continuous, and $-2 \log \Lambda(\boldsymbol{X})$ doesn't converge to
a $\chi^2$ distribution. Also, Theorem 9.1.4 fails to apply when the parameter space $\Omega$ is
a closed set and the null hypothesis is that $\theta$ takes a value on the boundary of $\Omega$.




Hypothesis-Testing Terminology 


We noted after Definition 9.1.1 that there is asymmetry in the terminology with
regard to choosing between hypotheses. Both choices are stated relative to $H_0$,
namely, to reject $H_0$ or not to reject $H_0$. When hypothesis testing was first being
developed, there was controversy over whether alternative hypotheses should even
be formulated. Focus centered on null hypotheses and whether or not to reject them.
The operational meaning of “do not reject $H_0$” has never been articulated clearly. In
particular, it does not mean that we should accept $H_0$ as true in any sense. Nor does
it mean that we are necessarily more confident that $H_0$ is true than that it is false. For
that matter, “reject $H_0$” does not mean that we are more confident that $H_0$ is false
than that it is true.

Part of the problem is that hypothesis testing is set up as if it were a statistical 
decision problem, but neither a loss function nor a utility function is involved. Hence, 
we are not weighing the relative likelihoods of various hypotheses against the costs 
or benefits of making various decisions. In Sec. 9.8, we shall illustrate one method for 
treating the hypothesis-testing problem as a statistical decision problem. Many, but 
not all, of the popular testing procedures will turn out to have interpretations in the 
framework of decision problems. In the remainder of this chapter, we shall continue 
to develop the theory of hypothesis testing as it is generally practiced. 

There are two other points of terminology that should be clarified here. The first
concerns the terms “critical region” and “rejection region.” Readers of other books
might encounter either of the terms “critical region” or “rejection region” referring
to either the set $S_1$ in Definition 9.1.4 or the set $R$ in Definition 9.1.5. Those books
generally define only one of the two terms. We choose to give the two sets $S_1$ and
$R$ different names because they are mathematically different objects. One, $S_1$, is a
subset of the set of possible data vectors, while the other, $R$, is a subset of the set of
possible values of a test statistic. Each has its use in different parts of the development
of hypothesis testing. In most practical problems, tests are more easily expressed in
terms of test statistics and rejection regions. For proving some theorems in Sec. 9.2,
it is more convenient to define tests in terms of critical regions.

The final point of terminology concerns the terms “level of significance” and
“size,” as well as the term “level $\alpha_0$ test.” Some authors define level of significance
(or significance level) for a test using a phrase such as “the probability of type I error”
or “the probability that the data lie in the critical region when the null hypothesis is
true.” If the null hypothesis is simple, these phrases are easily understood, and they
match what we defined as the size of the test in such cases. On the other hand, if
the null hypothesis is composite, such phrases are ill-defined. For each $\theta \in \Omega_0$,
there will usually be a different probability that the test rejects $H_0$. Which, if any, is the
level of significance? We have defined the size of a test to be the supremum of all
of these probabilities. We have said that the test “has level of significance $\alpha_0$” if the
size is less than or equal to $\alpha_0$. This means that a test has one size but many levels
of significance. Every number from the size up to 1 is a level of significance. There
is a sound reason for distinguishing the concepts of size and level of significance. In
Example 9.1.9, the investigator wants to constrain the probability of type I error to
be less than 0.1. The test statistic $Y$ has a discrete distribution, and we saw that no test
with size 0.1 is available. In that example, the investigator needed to choose a test
whose size was 0.0473. This test still has level of significance 0.1 and is a level 0.1 test,
despite having a different size. There are other more complicated situations in which
one can construct a test $\delta$ that satisfies Eq. (9.1.6), that is, it has level of significance
$\alpha_0$, but for which it is not possible (without sophisticated numerical methods) to compute
the actual size. An investigator who insists on using a particular level of significance
$\alpha_0$ can use such a test, and call it a level $\alpha_0$ test, without being able to compute its
size exactly. The most common example of this latter situation is one in which we
wish to test hypotheses concerning two parameters simultaneously. For example, let
$\theta = (\theta_1, \theta_2)$, and suppose that we wish to test the hypotheses

\[
H_0: \theta_1 = 0 \text{ and } \theta_2 = 1 \quad \text{versus} \quad H_1: \theta_1 \ne 0 \text{ or } \theta_2 \ne 1 \text{ or both.} \tag{9.1.26}
\]

The following result gives a way to construct a level $\alpha_0$ test of $H_0$.


Theorem 9.1.5 For $i = 1, \ldots, n$, let $H_{0,i}$ be a null hypothesis, and let $\delta_i$ be a
level $\alpha_{0,i}$ test of $H_{0,i}$. Define the combined null hypothesis $H_0$ that all of
$H_{0,1}, \ldots, H_{0,n}$ are simultaneously true. Let $\delta$ be the test that rejects $H_0$ if
at least one of $\delta_1, \ldots, \delta_n$ rejects its corresponding null hypothesis. Then
$\delta$ is a level $\sum_{i=1}^{n} \alpha_{0,i}$ test of $H_0$.


Proof For $i = 1, \ldots, n$, let $A_i$ be the event that $\delta_i$ rejects $H_{0,i}$. Apply
Theorem 1.5.8. ■
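The one-line proof can be spelled out as follows. This is a sketch that assumes Theorem 1.5.8 is the Bonferroni inequality, $\Pr\left(\bigcup_i A_i\right) \le \sum_i \Pr(A_i)$; the statement of that theorem is not reproduced in this section. For every value of $\theta$ at which the combined null hypothesis $H_0$ is true, every $H_{0,i}$ is also true, so each $\Pr(A_i \mid \theta)$ is at most $\alpha_{0,i}$. Hence

\[
\Pr(\delta \text{ rejects } H_0 \mid \theta)
= \Pr\left(\bigcup_{i=1}^{n} A_i \,\Big|\, \theta\right)
\le \sum_{i=1}^{n} \Pr(A_i \mid \theta)
\le \sum_{i=1}^{n} \alpha_{0,i}.
\]

Taking the supremum over all such $\theta$ shows that the size of $\delta$ is at most $\sum_{i=1}^{n} \alpha_{0,i}$.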


To test $H_0$ in (9.1.26), find two tests $\delta_1$ and $\delta_2$ such that $\delta_1$ is a test
with size $\alpha_0/2$ for testing $\theta_1 = 0$ versus $\theta_1 \ne 0$ and $\delta_2$ is a test with
size $\alpha_0/2$ for testing $\theta_2 = 1$ versus $\theta_2 \ne 1$. Let $\delta$ be the test that
rejects $H_0$ if either $\delta_1$ rejects $\theta_1 = 0$ or $\delta_2$ rejects $\theta_2 = 1$ or both.
Theorem 9.1.5 says that $\delta$ is a level $\alpha_0$ test of $H_0$ versus $H_1$, but its exact size
requires us to be able to calculate the probability that both $\delta_1$ and $\delta_2$ simultaneously
reject their corresponding null hypotheses. Such a calculation is often intractable.
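To see how the bound from Theorem 9.1.5 can differ from the exact size, consider the special case in which the two component tests happen to be independent under $H_0$. Independence is our own simplifying assumption for illustration and is not part of the theorem; it is precisely when the tests are dependent that the joint rejection probability is hard to obtain. The function names below are hypothetical.

```python
def bonferroni_bound(levels):
    """Upper bound on the size of the combined test given by Theorem 9.1.5."""
    return sum(levels)


def size_if_independent(sizes):
    """Exact size of the combined test, ASSUMING the component tests are independent
    under the null hypothesis and each has exactly the stated size."""
    prob_no_rejection = 1.0
    for s in sizes:
        prob_no_rejection *= 1.0 - s
    return 1.0 - prob_no_rejection


alpha0 = 0.05
component_sizes = [alpha0 / 2, alpha0 / 2]
print(bonferroni_bound(component_sizes), size_if_independent(component_sizes))
```

Under this assumption the exact size is $1 - (1 - \alpha_0/2)^2 = \alpha_0 - \alpha_0^2/4$, slightly below the level $\alpha_0$ guaranteed by the theorem.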
Finally, our definition of level of significance matches nicely with the use of
p-values, as pointed out immediately after Definition 9.1.9.


Summary 


Hypothesis testing is the problem of deciding whether $\theta$ lies in a particular subset
$\Omega_0$ of the parameter space or in its complement $\Omega_1$. The statement that
$\theta \in \Omega_0$ is called the null hypothesis and is denoted by $H_0$. The alternative
hypothesis is the statement $H_1: \theta \in \Omega_1$. If $S$ is the set of all possible data
values (vectors) that we might observe, a subset $S_1 \subset S$ is called the critical region of
a test of $H_0$ versus $H_1$ if we choose to reject $H_0$ whenever the observed data
$\boldsymbol{X}$ are in $S_1$ and not reject $H_0$ whenever $\boldsymbol{X} \notin S_1$.
The power function of this test $\delta$ is
$\pi(\theta \mid \delta) = \Pr(\boldsymbol{X} \in S_1 \mid \theta)$. The size of the test $\delta$ is
$\sup_{\theta \in \Omega_0} \pi(\theta \mid \delta)$. A test is said to be a level $\alpha_0$ test if
its size is at most $\alpha_0$. The null hypothesis $H_0$ is simple if $\Omega_0$ is a set with only
one point; otherwise, $H_0$ is composite. Similarly, $H_1$ is simple if $\Omega_1$ has a single
point, and $H_1$ is composite otherwise. A type I error is rejecting $H_0$ when it is true. A
type II error is not rejecting $H_0$ when it is false.

Hypothesis tests are typically constructed by using a test statistic $T$. The null
hypothesis is rejected if $T$ lies in some interval or if $T$ lies outside of some interval.
The interval is chosen to make the test have a desired significance level. The p-value
is a more informative way to report the results of a test. The p-value can be
computed easily whenever our test has the form “reject $H_0$ if $T \ge c$” for some statistic
$T$. The p-value when $T = t$ is observed equals
$\sup_{\theta \in \Omega_0} \Pr(T \ge t \mid \theta)$. We also showed how a confidence set can be
considered as a way of reporting the results of a test of hypotheses. A coefficient
$1 - \alpha_0$ confidence set for $\theta$ is the set of all $\theta_0 \in \Omega$ such that we would
not reject $H_0: \theta = \theta_0$ using a level $\alpha_0$ test. These confidence sets are intervals
when we test hypotheses about a one-dimensional parameter or a one-dimensional
function of the parameter.


Exercises 


1. Let $X$ have the exponential distribution with parameter $\beta$. Suppose that we wish to
test the hypotheses $H_0: \beta \ge 1$ versus $H_1: \beta < 1$. Consider the test procedure
$\delta$ that rejects $H_0$ if $X \ge 1$.

a. Determine the power function of the test.

b. Compute the size of the test.


2. Suppose that $X_1, \ldots, X_n$ form a random sample from the uniform distribution on the
interval $[0, \theta]$, and that the following hypotheses are to be tested:

$H_0: \theta \ge 2$,
$H_1: \theta < 2$.

Let $Y_n = \max\{X_1, \ldots, X_n\}$, and consider a test procedure such that the critical region
contains all the outcomes for which $Y_n \le 1.5$.

a. Determine the power function of the test.

b. Determine the size of the test.


3. Suppose that the proportion $p$ of defective items in a large population of items is
unknown, and that it is desired to test the following hypotheses:

$H_0: p = 0.2$,
$H_1: p \ne 0.2$.

Suppose also that a random sample of 20 items is drawn from the population. Let $Y$ denote the
number of defective items in the sample, and consider a test procedure $\delta$ such that the
critical region contains all the outcomes for which either $Y \ge 7$ or $Y \le 1$.

a. Determine the value of the power function $\pi(p \mid \delta)$ at the points
$p = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9$, and 1; sketch the power function.

b. Determine the size of the test.


4. Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with
unknown mean $\mu$ and known variance 1. Suppose also that $\mu_0$ is a certain specified number,
and that the following hypotheses are to be tested:

$H_0: \mu = \mu_0$,
$H_1: \mu \ne \mu_0$.

Finally, suppose that the sample size $n$ is 25, and consider a test procedure such that $H_0$
is to be rejected if $|\overline{X}_n - \mu_0| \ge c$. Determine the value of $c$ such that the
size of the test will be 0.05.


5. Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with
unknown mean $\mu$ and unknown variance $\sigma^2$. Classify each of the following hypotheses
as either simple or composite:

a. $H_0: \mu = 0$ and $\sigma = 1$

b. $H_0: \mu > 3$ and $\sigma < 1$

c. $H_0: \mu = -2$ and $\sigma^2 < 5$

d. $H_0: \mu = 0$


6. Suppose that a single observation $X$ is to be taken from the uniform distribution on the
interval $\left[\theta - \frac{1}{2}, \theta + \frac{1}{2}\right]$, and suppose that the following
hypotheses are to be tested:

$H_0: \theta \le 3$,
$H_1: \theta \ge 4$.

Construct a test procedure $\delta$ for which the power function has the following values:
$\pi(\theta \mid \delta) = 0$ for $\theta \le 3$ and $\pi(\theta \mid \delta) = 1$ for
$\theta \ge 4$.


7. Return to the situation described in Example 9.1.7. Consider a different test $\delta^*$
that rejects $H_0$ if $Y_n \le 2.9$ or $Y_n \ge 4.5$. Let $\delta$ be the test described in
Example 9.1.7.

a. Prove that $\pi(\theta \mid \delta^*) = \pi(\theta \mid \delta)$ for all $\theta \le 4$.

b. Prove that $\pi(\theta \mid \delta^*) < \pi(\theta \mid \delta)$ for all $\theta > 4$.

c. Which of the two tests seems better for testing the hypotheses (9.1.8)?


8. Assume that $X_1, \ldots, X_n$ are i.i.d. with the normal distribution that has mean $\mu$
and variance 1. Suppose that we wish to test the hypotheses

$H_0: \mu \le \mu_0$,
$H_1: \mu > \mu_0$.

Consider the test that rejects $H_0$ if $Z \ge c$, where $Z$ is defined in Eq. (9.1.10).

a. Show that $\Pr(Z \ge c \mid \mu)$ is an increasing function of $\mu$.

b. Find $c$ to make the test have size $\alpha_0$.


9. Assume that $X_1, \ldots, X_n$ are i.i.d. with the normal distribution that has mean $\mu$
and variance 1. Suppose that we wish to test the hypotheses

$H_0: \mu \ge \mu_0$,
$H_1: \mu < \mu_0$.

Find a test statistic $T$ such that, for every $c$, the test $\delta_c$ that rejects $H_0$ when
$T \ge c$ has power function $\pi(\mu \mid \delta_c)$ that is decreasing in $\mu$.


10. In Exercise 8, assume that $Z = z$ is observed. Find a formula for the p-value.


11. Assume that $X_1, \ldots, X_{10}$ are i.i.d. having the Bernoulli distribution with
parameter $p$. Suppose that we wish to test the hypotheses

$H_0: p = 0.4$,
$H_1: p \ne 0.4$.

Let $Y = \sum_{i=1}^{10} X_i$.

a. Find $c_1$ and $c_2$ such that
$\Pr(Y \le c_1 \mid p = 0.4) + \Pr(Y \ge c_2 \mid p = 0.4)$
is as close as possible to 0.1 without being larger than 0.1.

b. Let $\delta$ be the test that rejects $H_0$ if either $Y \le c_1$ or $Y \ge c_2$. What is
the size of the test $\delta$?

c. Draw a graph of the power function of $\delta$.


12. Consider a single observation $X$ from a Cauchy distribution centered at $\theta$. That is,
the p.d.f. of $X$ is

\[
f(x \mid \theta) = \frac{1}{\pi[1 + (x - \theta)^2]}, \quad \text{for } -\infty < x < \infty.
\]

Suppose that we wish to test the hypotheses

$H_0: \theta \le \theta_0$,
$H_1: \theta > \theta_0$.

Let $\delta_c$ be the test that rejects $H_0$ if $X \ge c$.

a. Show that $\pi(\theta \mid \delta_c)$ is an increasing function of $\theta$.

b. Find $c$ to make $\delta_c$ have size 0.05.

c. If $X = x$ is observed, find a formula for the p-value.


13. Let $X$ have the Poisson distribution with mean $\theta$. Suppose that we wish to test the
hypotheses

$H_0: \theta \le 1.0$,
$H_1: \theta > 1.0$.

Let $\delta_c$ be the test that rejects $H_0$ if $X \ge c$. Find $c$ to make the size of
$\delta_c$ as close as possible to 0.1 without being larger than 0.1.


14. Let $X_1, \ldots, X_n$ be i.i.d. with the exponential distribution with parameter $\theta$.
Suppose that we wish to test the hypotheses

$H_0: \theta \ge \theta_0$,
$H_1: \theta < \theta_0$.

Let $X = \sum_{i=1}^{n} X_i$. Let $\delta_c$ be the test that rejects $H_0$ if $X \ge c$.

a. Show that $\pi(\theta \mid \delta_c)$ is a decreasing function of $\theta$.

b. Find $c$ in order to make $\delta_c$ have size $\alpha_0$.

c. Let $\theta_0 = 2$, $n = 1$, and $\alpha_0 = 0.1$. Find the precise form of the test
$\delta_c$ and sketch its power function.


15. Let $X$ have the uniform distribution on the interval $[0, \theta]$, and suppose that we
wish to test the hypotheses

$H_0: \theta \le 1$,
$H_1: \theta > 1$.

We shall consider test procedures of the form “reject $H_0$ if $X \ge c$.” For each possible
value $x$ of $X$, find the p-value if $X = x$ is observed.


16. Consider the confidence interval found in Exercise 5 in Sec. 8.5. Find the collection of
hypothesis tests that are equivalent to this interval. That is, for each $c > 0$, find a test
$\delta_c$ of the null hypothesis $H_{0,c}: \sigma^2 = c$ versus some alternative such that
$\delta_c$ rejects $H_{0,c}$ if and only if $c$ is not in the interval. Write the test in terms
of a test statistic $T = r(\boldsymbol{X})$ being in or out of some nonrandom interval that
depends on $c$.


17. Let $X_1, \ldots, X_n$ be i.i.d. with a Bernoulli distribution that has parameter $p$. Let
$Y = \sum_{i=1}^{n} X_i$. We wish to find a coefficient $\gamma$ confidence interval for $p$ of
the form $(q(y), 1)$. Prove that, if $Y = y$ is observed, then $q(y)$ should be chosen to be the
smallest value $p_0$ such that $\Pr(Y \ge y \mid p = p_0) \ge 1 - \gamma$.


18. Consider the situation described immediately before Eq. (9.1.12). Prove that the
expression (9.1.12) equals the smallest $\alpha_0$ such that we would reject $H_0$ at level of
significance $\alpha_0$.


19. Return to the situation described in Example 9.1.17. Suppose that we wish to test the
hypotheses

\[
H_0: \mu \ge \mu_0, \qquad H_1: \mu < \mu_0 \tag{9.1.27}
\]

at level $\alpha_0$. It makes sense to reject $H_0$ if $\overline{X}_n$ is small. Construct a
one-sided coefficient $1 - \alpha_0$ confidence interval for $\mu$ such that we can reject $H_0$
if $\mu_0$ is not in the interval. Make sure that the test formed in this way rejects $H_0$ if
$\overline{X}_n$ is small.


20. Prove Theorem 9.1.3.


21. Return to the situations described in Example 9.1.17 and Exercise 19. We wish to compare
what might happen if we switch the null and alternative hypotheses. That is, we want to compare
the results of testing the hypotheses in (9.1.22) at level $\alpha_0$ to the results of testing
the hypotheses in (9.1.27) at level $\alpha_0$.

a. Let $\alpha_0 < 0.5$. Prove that there are no possible data sets such that we would reject
both of the null hypotheses simultaneously. That is, for every possible $\overline{x}_n$ and
$\sigma'$, we must fail to reject at least one of the two null hypotheses.

b. Let $\alpha_0 < 0.5$. Prove that there are data sets that would lead to failing to reject
both null hypotheses. Also prove that there are data sets that would lead to rejecting each of
the null hypotheses while failing to reject the other.

c. Let $\alpha_0 > 0.5$. Prove that there are data sets that would lead to rejecting both null
hypotheses.


* 9.2 Testing Simple Hypotheses


The simplest hypothesis-testing situation is that in which there are only two possible
values of the parameter. In such cases, it is possible to identify a collection of test
procedures that have certain optimal properties.


Introduction


Example 9.2.1 Service Times in a Queue. In Example 3.7.5, we modeled the service times
$\boldsymbol{X} = (X_1, \ldots, X_n)$ of $n$ customers in a queue as having the joint distribution
with joint p.d.f.

\[
f_1(\boldsymbol{x}) =
\begin{cases}
\dfrac{2(n!)}{\left(2 + \sum_{i=1}^{n} x_i\right)^{n+1}} & \text{for all } x_i > 0, \\
0 & \text{otherwise.}
\end{cases} \tag{9.2.1}
\]

Suppose that a service manager is not sure how well this joint distribution describes
the service times. As an alternative, she proposes to model the service times as a
random sample of exponential random variables with parameter 1/2. This model says
that the joint p.d.f. is

\[
f_0(\boldsymbol{x}) =
\begin{cases}
\dfrac{1}{2^n} \exp\left(-\dfrac{1}{2} \sum_{i=1}^{n} x_i\right) & \text{for all } x_i > 0, \\
0 & \text{otherwise.}
\end{cases} \tag{9.2.2}
\]

For illustration, Fig. 9.5 shows both of these p.d.f.'s for the case of $n = 1$. If the manager
observes several service times, how can she test which of the two distributions appears
to describe the data? ◄

Figure 9.5 Graphs of the two competing p.d.f.'s in Example 9.2.1 with $n = 1$.


In this section, we shall consider problems of testing hypotheses in which a vector
of observations comes from one of two possible joint distributions, and the statistician
must decide from which distribution the vector actually came. In many problems,
each of the two joint distributions is actually the distribution of a random sample
from a univariate distribution. However, nothing that we present in this section will
depend on whether or not the observations form a random sample. In Example 9.2.1,
one of the joint distributions is that of a random sample, but the other is not. In
problems of this type, the parameter space $\Omega$ contains exactly two points, and both
the null hypothesis and the alternative hypothesis are simple.

Specifically, we shall assume that the random vector $\boldsymbol{X} = (X_1, \ldots, X_n)$ comes
from a distribution for which the joint p.d.f., p.f., or p.f./p.d.f. is either
$f_0(\boldsymbol{x})$ or $f_1(\boldsymbol{x})$. To correspond with notation earlier and later in
the book, we can introduce a parameter space $\Omega = \{\theta_0, \theta_1\}$ and let
$\theta = \theta_i$ stand for the case in which the data have p.d.f., p.f., or p.f./p.d.f.
$f_i(\boldsymbol{x})$ for $i = 0, 1$. We are then interested in testing the following simple
hypotheses:

\[
H_0: \theta = \theta_0, \qquad H_1: \theta = \theta_1. \tag{9.2.3}
\]

In this case, $\Omega_0 = \{\theta_0\}$ and $\Omega_1 = \{\theta_1\}$ are both singleton sets.
For the special case in which $\boldsymbol{X}$ is a random sample from a distribution with
univariate p.d.f. or p.f. $f(x \mid \theta)$, we then have, for $i = 0$ or $i = 1$,

\[
f_i(\boldsymbol{x}) = f(x_1 \mid \theta_i) f(x_2 \mid \theta_i) \cdots f(x_n \mid \theta_i).
\]


The Two Types of Errors


When a test of the hypotheses (9.2.3) is being carried out, we have special notation
for the probabilities of type I and type II errors. For each test procedure $\delta$, we shall
let $\alpha(\delta)$ denote the probability of an error of type I and shall let $\beta(\delta)$
denote the probability of an error of type II. Thus,

\[
\alpha(\delta) = \Pr(\text{Rejecting } H_0 \mid \theta = \theta_0), \qquad
\beta(\delta) = \Pr(\text{Not Rejecting } H_0 \mid \theta = \theta_1).
\]


Example 9.2.2 Service Times in a Queue. The manager in Example 9.2.1 looks at the two p.d.f.'s
in Fig. 9.5 and decides that $f_1$ gives higher probability to large service times than
does $f_0$. So she decides to reject $H_0: \theta = \theta_0$ if the service times are large.
Specifically, suppose that she observes $n = 1$ service time, $X_1$. The test $\delta$ that she
chooses rejects $H_0$ if $X_1 \ge 4$. The two error probabilities can be calculated from the two
different possible distributions of $X_1$. Given $\theta = \theta_0$, $X_1$ has the exponential
distribution with parameter 0.5. The c.d.f. of this distribution is
$F_0(x) = 1 - \exp(-0.5x)$ for $x \ge 0$. The type I error probability is the probability that
$X_1 \ge 4$, which equals $\alpha(\delta) = 0.135$. Given $\theta = \theta_1$, the distribution of
$X_1$ has the p.d.f. $2/(2 + x_1)^2$ for $x_1 \ge 0$. The c.d.f. is then
$F_1(x) = 1 - 2/(2 + x)$, for $x \ge 0$. The type II error probability is
$\beta(\delta) = \Pr(X_1 < 4) = F_1(4) = 0.667$. ◄
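The two error probabilities in Example 9.2.2 follow directly from the two c.d.f.'s. The sketch below evaluates them for an arbitrary cutoff $c$ in a test of the form "reject $H_0$ if $X_1 \ge c$"; the function names are ours, chosen only for illustration.

```python
from math import exp


def cdf_null(x):
    """F_0: c.d.f. of the exponential distribution with parameter 0.5."""
    return 1.0 - exp(-0.5 * x) if x > 0 else 0.0


def cdf_alternative(x):
    """F_1: c.d.f. corresponding to the p.d.f. 2 / (2 + x)^2 for x >= 0."""
    return 1.0 - 2.0 / (2.0 + x) if x > 0 else 0.0


def error_probabilities(c):
    """Return (alpha, beta) for the test that rejects H_0 when X_1 >= c."""
    alpha = 1.0 - cdf_null(c)      # Pr(X_1 >= c | theta = theta_0)
    beta = cdf_alternative(c)      # Pr(X_1 <  c | theta = theta_1)
    return alpha, beta


alpha, beta = error_probabilities(4.0)
print(round(alpha, 3), round(beta, 3))
```

For $c = 4$ this should reproduce the values 0.135 and 0.667 given in the example.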


It is desirable to find a test procedure for which the probabilities $\alpha(\delta)$ and
$\beta(\delta)$ of the two types of error will be small. For a given sample size, it is
typically not possible to find a test procedure for which both $\alpha(\delta)$ and
$\beta(\delta)$ will be arbitrarily small. Therefore, we shall now show how to construct a
procedure for which the value of a specific linear combination of $\alpha$ and $\beta$ will be
minimized.


Optimal Tests


Minimizing a Linear Combination Suppose that $a$ and $b$ are specified positive
constants, and it is desired to find a procedure $\delta$ for which
$a\alpha(\delta) + b\beta(\delta)$ will be a minimum. Theorem 9.2.1 shows that a procedure that
is optimal in this sense has a very simple form. In Sec. 9.8, we shall give a rationale for
choosing a test to minimize a linear combination of the error probabilities.


Theorem 9.2.1 Let $\delta^*$ denote a test procedure such that the hypothesis $H_0$ is not
rejected if $a f_0(\boldsymbol{x}) > b f_1(\boldsymbol{x})$ and the hypothesis $H_0$ is rejected
if $a f_0(\boldsymbol{x}) < b f_1(\boldsymbol{x})$. The null hypothesis $H_0$ can be either
rejected or not if $a f_0(\boldsymbol{x}) = b f_1(\boldsymbol{x})$. Then for every other test
procedure $\delta$,

\[
a\alpha(\delta^*) + b\beta(\delta^*) \le a\alpha(\delta) + b\beta(\delta). \tag{9.2.4}
\]


Proof For convenience, we shall present the proof for a problem in which the random
sample $X_1, \ldots, X_n$ is drawn from a discrete distribution. In this case,
$f_i(\boldsymbol{x})$ represents the joint p.f. of the observations in the sample when $H_i$ is
true ($i = 0, 1$). If the sample comes from a continuous distribution, in which case
$f_i(\boldsymbol{x})$ is a joint p.d.f., then each of the sums that will appear in this proof
should be replaced by an $n$-dimensional integral.

If we let $S_1$ denote the critical region of an arbitrary test procedure $\delta$, then $S_1$
contains every sample outcome $\boldsymbol{x}$ for which $\delta$ specifies that $H_0$ should be
rejected, and $S_0 = S_1^c$ contains every outcome $\boldsymbol{x}$ for which $H_0$ should not
be rejected. Therefore,

\[
\begin{aligned}
a\alpha(\delta) + b\beta(\delta)
&= a \sum_{\boldsymbol{x} \in S_1} f_0(\boldsymbol{x}) + b \sum_{\boldsymbol{x} \in S_0} f_1(\boldsymbol{x}) \\
&= a \sum_{\boldsymbol{x} \in S_1} f_0(\boldsymbol{x}) + b \left[1 - \sum_{\boldsymbol{x} \in S_1} f_1(\boldsymbol{x})\right] \\
&= b + \sum_{\boldsymbol{x} \in S_1} \left[a f_0(\boldsymbol{x}) - b f_1(\boldsymbol{x})\right].
\end{aligned} \tag{9.2.5}
\]

It follows from Eq. (9.2.5) that the value of the linear combination
$a\alpha(\delta) + b\beta(\delta)$ will be a minimum if the critical region $S_1$ is chosen so
that the value of the final summation in Eq. (9.2.5) is a minimum. Furthermore, the value of
this summation will be a minimum if the summation includes every point $\boldsymbol{x}$ for
which $a f_0(\boldsymbol{x}) - b f_1(\boldsymbol{x}) < 0$ and includes no point
$\boldsymbol{x}$ for which $a f_0(\boldsymbol{x}) - b f_1(\boldsymbol{x}) > 0$. In other words,
$a\alpha(\delta) + b\beta(\delta)$ will be a minimum if the critical region $S_1$ is chosen to
include every point $\boldsymbol{x}$ such that
$a f_0(\boldsymbol{x}) < b f_1(\boldsymbol{x})$ and exclude every point $\boldsymbol{x}$ such
that this inequality is reversed. If $a f_0(\boldsymbol{x}) = b f_1(\boldsymbol{x})$ for some
point $\boldsymbol{x}$, then it is irrelevant whether or not $\boldsymbol{x}$ is included in
$S_1$, because the corresponding term would contribute zero to the final summation in
Eq. (9.2.5). The critical region described above corresponds to the test procedure
$\delta^*$ defined in the statement of the theorem. ■
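The argument in the proof can be checked by brute force on a small discrete problem. In the sketch below, the two p.f.'s on a five-point sample space and the constants $a$ and $b$ are hypothetical numbers chosen only for illustration. The code evaluates $a\alpha(\delta) + b\beta(\delta)$ for every possible critical region and compares the minimizer with the region $\{\boldsymbol{x}: a f_0(\boldsymbol{x}) < b f_1(\boldsymbol{x})\}$ of Theorem 9.2.1.

```python
from itertools import combinations

# Hypothetical p.f.'s on the sample space {0, 1, 2, 3, 4} (illustrative numbers only).
f0 = [0.40, 0.30, 0.15, 0.10, 0.05]
f1 = [0.05, 0.10, 0.20, 0.30, 0.35]
a, b = 1.0, 2.0
points = range(len(f0))


def linear_combination(critical_region):
    """a*alpha + b*beta for the test that rejects H_0 exactly on critical_region."""
    alpha = sum(f0[x] for x in critical_region)
    beta = sum(f1[x] for x in points if x not in critical_region)
    return a * alpha + b * beta


# Enumerate all 2^5 possible critical regions.
all_regions = [set(c) for r in range(len(f0) + 1) for c in combinations(points, r)]
best_region = min(all_regions, key=linear_combination)

# Critical region of the test delta* described in Theorem 9.2.1.
star_region = {x for x in points if a * f0[x] < b * f1[x]}

print(best_region, star_region, round(linear_combination(star_region), 2))
```

Because no point in this example satisfies $a f_0(\boldsymbol{x}) = b f_1(\boldsymbol{x})$, the minimizing region is unique and coincides with the region of $\delta^*$.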


The ratio $f_1(\boldsymbol{x})/f_0(\boldsymbol{x})$ is sometimes called the likelihood ratio of
the sample. It is related to, but not the same as, the likelihood ratio statistic from
Definition 9.1.11. In the present context, the likelihood ratio statistic
$\Lambda(\boldsymbol{x})$ would equal
$f_0(\boldsymbol{x})/\max\{f_0(\boldsymbol{x}), f_1(\boldsymbol{x})\}$. In particular, the
likelihood ratio $f_1(\boldsymbol{x})/f_0(\boldsymbol{x})$ is large when
$\Lambda(\boldsymbol{x})$ is small, and vice versa. In fact,

\[
\Lambda(\boldsymbol{x}) =
\begin{cases}
\left(\dfrac{f_1(\boldsymbol{x})}{f_0(\boldsymbol{x})}\right)^{-1} & \text{if } f_0(\boldsymbol{x}) \le f_1(\boldsymbol{x}), \\
1 & \text{otherwise.}
\end{cases}
\]

The important point to remember about this confusing choice of names is the following:
The theoretical justification for tests based on the likelihood ratio defined here
(provided in Theorems 9.2.1 and 9.2.2) is the rationale for expecting the likelihood
ratio tests of Definition 9.1.11 to be sensible.

When $a, b > 0$, Theorem 9.2.1 can be reworded as follows.


Corollary 9.2.1 Assume the conditions of Theorem 9.2.1, and assume that $a > 0$ and $b > 0$.
Then the test $\delta$ for which the value of $a\alpha(\delta) + b\beta(\delta)$ is a minimum
rejects $H_0$ when the likelihood ratio exceeds $a/b$ and does not reject $H_0$ when the
likelihood ratio is less than $a/b$. ■


Example 9.2.3 Service Times in a Queue. Instead of rejecting $H_0$ if $X_1 \ge 4$ in
Example 9.2.2, the manager could apply Theorem 9.2.1. She must choose two numbers $a$ and $b$
to balance the two types of error. Suppose that she chooses them to be equal to each other.
Then the test will be to reject $H_0$ if $f_1(x_1)/f_0(x_1) > 1$. That is, if

\[
\frac{4}{(2 + x_1)^2} \exp\left(\frac{x_1}{2}\right) > 1. \tag{9.2.6}
\]

At $x_1 = 0$ the left side of Eq. (9.2.6) equals 1, and it decreases until $x_1 = 2$ and then
increases ever after. Hence, Eq. (9.2.6) holds for all values of $x_1 > c$ where $c$ is the
unique strictly positive value where the left side of Eq. (9.2.6) equals 1. By numerical
approximation, we find that this value is $x_1 = 5.025725$. The type I and type II error
probabilities for the test $\delta^*$ that rejects $H_0$ if $X_1 > 5.025725$ are

\[
\alpha(\delta^*) = 1 - F_0(5.025725) = \exp(-2.513) = 0.081,
\]

\[
\beta(\delta^*) = F_1(5.025725) = 1 - \frac{2}{7.026} = 0.715.
\]

The sum of these error probabilities is 0.796. By comparison, the sum of the two error
probabilities in Example 9.2.2 is 0.802, slightly higher. ◄
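The numerical approximation mentioned in Example 9.2.3 can be carried out in many ways. The sketch below uses bisection, which is our choice of method and not necessarily the one used to obtain the value in the text. It relies on the fact, stated in the example, that the left side of Eq. (9.2.6) is increasing for $x_1 > 2$; the bracket $[2, 50]$ is an assumption that happens to contain the root.

```python
from math import exp


def likelihood_ratio(x):
    """Left side of Eq. (9.2.6): f_1(x) / f_0(x) for a single service time x."""
    return 4.0 / (2.0 + x) ** 2 * exp(x / 2.0)


def solve_cutoff(k=1.0, lo=2.0, hi=50.0, tol=1e-10):
    """Find c > 2 with likelihood_ratio(c) = k by bisection.

    Assumes likelihood_ratio(lo) < k < likelihood_ratio(hi), which holds for k = 1.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if likelihood_ratio(mid) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


c = solve_cutoff()
alpha = exp(-0.5 * c)            # 1 - F_0(c)
beta = 1.0 - 2.0 / (2.0 + c)     # F_1(c)
print(round(c, 6), round(alpha, 3), round(beta, 3), round(alpha + beta, 3))
```

The computed cutoff and error probabilities should agree with the values 5.025725, 0.081, and 0.715 reported in the example.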


Minimizing the Probability of an Error of Type II Next, suppose that the probability
$\alpha(\delta)$ of an error of type I is not permitted to be greater than a specified level of
significance, and it is desired to find a procedure $\delta$ for which $\beta(\delta)$ will be a
minimum. In this problem, we can apply the following result, which is closely related to
Theorem 9.2.1 and is known as the Neyman-Pearson lemma in honor of the statisticians J.
Neyman and E. S. Pearson, who developed these ideas in 1933.


Theorem 9.2.2 Neyman-Pearson lemma. Suppose that $\delta'$ is a test procedure that has the
following form for some constant $k > 0$: The hypothesis $H_0$ is not rejected if
$f_1(\boldsymbol{x}) < k f_0(\boldsymbol{x})$ and the hypothesis $H_0$ is rejected if
$f_1(\boldsymbol{x}) > k f_0(\boldsymbol{x})$. The null hypothesis $H_0$ can be either
rejected or not if $f_1(\boldsymbol{x}) = k f_0(\boldsymbol{x})$. If $\delta$ is another test
procedure such that $\alpha(\delta) \le \alpha(\delta')$, then it follows that
$\beta(\delta) \ge \beta(\delta')$. Furthermore, if $\alpha(\delta) < \alpha(\delta')$, then
$\beta(\delta) > \beta(\delta')$.


Proof From the description of the procedure $\delta'$ and from Theorem 9.2.1, it follows
that for every test procedure $\delta$,

\[
k\alpha(\delta') + \beta(\delta') \le k\alpha(\delta) + \beta(\delta). \tag{9.2.7}
\]

If $\alpha(\delta) \le \alpha(\delta')$, then it follows from the relation (9.2.7) that
$\beta(\delta) \ge \beta(\delta')$. Also, if $\alpha(\delta) < \alpha(\delta')$, then it follows
that $\beta(\delta) > \beta(\delta')$. ■


To illustrate the use of the Neyman-Pearson lemma, we shall suppose that a
statistician wishes to use a test procedure for which $\alpha(\delta) = \alpha_0$ and
$\beta(\delta)$ is a minimum. According to the lemma, she should try to find a value of $k$ for
which $\alpha(\delta') = \alpha_0$. The procedure $\delta'$ will then have the minimum possible
value of $\beta(\delta)$. If the distribution from which the sample is taken is continuous, then
it is usually (but not always) possible to find a value of $k$ such that $\alpha(\delta')$ is
equal to a specified value such as $\alpha_0$. However, if the distribution from which the
sample is taken is discrete, then it is typically not possible to choose $k$ so that
$\alpha(\delta')$ is equal to a specified value. These remarks are considered further in the
following examples and in the exercises at the end of this section.


Example 9.2.4 Service Times in a Queue. In Example 9.2.3, the distribution of $X_1$ is
continuous, and we can find a value $k$ such that the test $\delta'$ that results from
Theorem 9.2.2 has $\alpha(\delta') = 0.07$, say. The test $\delta^*$ in Example 9.2.3 has
$\alpha(\delta^*) > 0.07$ and $k = 1$. We will need a larger value of $k$ in order to get the
type I error probability down to 0.07. As we noted in Example 9.2.3, the left side of
Eq. (9.2.6) is increasing for $x_1 > 2$, and hence the set of $x_1$ values such that

\[
\frac{4}{(2 + x_1)^2} \exp\left(\frac{x_1}{2}\right) > k \tag{9.2.8}
\]

will be an interval of the form $(c, \infty)$ where $c$ is the unique value that makes the
left side of Eq. (9.2.8) equal to $k$. The resulting test will then have the form “reject
$H_0$ if $X_1 \ge c$.” At this point, we don't care any more about $k$ because we just need
to choose $c$ to make sure that $\Pr(X_1 \ge c \mid \theta = \theta_0) = 0.07$. That is, we need
$1 - F_0(c) = 0.07$. Recall that $F_0(c) = 1 - \exp(-0.5c)$, so $c = -2 \log(0.07) = 5.318$. We
can then compute $\beta(\delta') = F_1(5.318) = 0.727$. This test is very close to $\delta^*$
from Example 9.2.3. ◄
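The calculation in Example 9.2.4 needs no numerical search because $F_0$ can be inverted in closed form. The following sketch computes the cutoff $c$ for a given type I error probability and, for reference, the constant $k$ in Eq. (9.2.8) that corresponds to it. The value of $k$ is not given in the text; it is computed here only to confirm that it exceeds 1, as the example says it must.

```python
from math import exp, log


def cutoff_for_type_one_error(alpha0):
    """Solve 1 - F_0(c) = exp(-0.5 c) = alpha0 for c."""
    return -2.0 * log(alpha0)


c = cutoff_for_type_one_error(0.07)
beta = 1.0 - 2.0 / (2.0 + c)                 # F_1(c), the type II error probability
k = 4.0 / (2.0 + c) ** 2 * exp(c / 2.0)      # left side of Eq. (9.2.8) evaluated at c
print(round(c, 3), round(beta, 3), k > 1.0)
```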


Example 9.2.5

Random Sample from a Normal Distribution. Suppose that $X = (X_1, \ldots, X_n)$ is a random
sample from the normal distribution with unknown mean $\theta$ and known variance
1, and the following hypotheses are to be tested:

\[ H_0: \theta = 0, \qquad H_1: \theta = 1. \tag{9.2.9} \]

We shall begin by determining a test procedure for which $\beta(\delta)$ will be a minimum
among all test procedures for which $\alpha(\delta) \le 0.05$.

When $H_0$ is true, the variables $X_1, \ldots, X_n$ form a random sample from the standard
normal distribution. When $H_1$ is true, these variables form a random sample
from the normal distribution for which both the mean and the variance are 1. Therefore,

\[ f_0(x) = \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{1}{2} \sum_{i=1}^{n} x_i^2\right) \tag{9.2.10} \]

and

\[ f_1(x) = \frac{1}{(2\pi)^{n/2}} \exp\left[-\frac{1}{2} \sum_{i=1}^{n} (x_i - 1)^2\right]. \tag{9.2.11} \]

After some algebraic simplification, the likelihood ratio $f_1(x)/f_0(x)$ can be written
in the form

\[ \frac{f_1(x)}{f_0(x)} = \exp\left[n\left(\bar{x}_n - \frac{1}{2}\right)\right]. \tag{9.2.12} \]
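The algebraic simplification behind Eq. (9.2.12) can be spelled out as follows. This is a worked step added for convenience; it uses only the two normal joint p.d.f.'s of the sample (mean 0 and mean 1, both with variance 1), whose normalizing constants cancel in the ratio:

```latex
\[
\frac{f_1(x)}{f_0(x)}
  = \exp\left[-\frac{1}{2}\sum_{i=1}^{n}(x_i - 1)^2 + \frac{1}{2}\sum_{i=1}^{n} x_i^2\right]
  = \exp\left[\sum_{i=1}^{n} x_i - \frac{n}{2}\right]
  = \exp\left[n\left(\bar{x}_n - \frac{1}{2}\right)\right].
\]
```

Taking logarithms, the ratio exceeds $k$ exactly when $\bar{x}_n > (1/2) + (1/n)\log k$.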


It now follows from Eq. (9.2.12) that rejecting the hypothesis $H_0$ when the likelihood
ratio is greater than a specified positive constant $k$ is equivalent to rejecting $H_0$ when
the sample mean $\bar{x}_n$ is greater than $(1/2) + (1/n) \log k$.

Let $k' = (1/2) + (1/n) \log k$, and suppose that we can find a value of $k'$ such that

\[ \Pr(\bar{X}_n > k' \mid \theta = 0) = 0.05. \tag{9.2.13} \]

Then the procedure $\delta'$, which rejects $H_0$ when $\bar{X}_n > k'$, will satisfy $\alpha(\delta') = 0.05$.
Furthermore, by the Neyman-Pearson lemma, $\delta'$ will be an optimal procedure in the
sense of minimizing the value of $\beta(\delta)$ among all procedures for which $\alpha(\delta) \le 0.05$.


It is easy to see that the value of $k'$ that satisfies Eq. (9.2.13) must be the 0.95
quantile of the distribution of $\bar{X}_n$ given $\theta = 0$. When $\theta = 0$, the distribution of $\bar{X}_n$ is
the normal distribution with mean 0 and variance $1/n$. Therefore, its 0.95 quantile
is $0 + \Phi^{-1}(0.95)\, n^{-1/2}$, where $\Phi^{-1}$ is the standard normal quantile function. From a
table of the standard normal distribution, it is found that the 0.95 quantile of the
standard normal distribution is 1.645, so $k' = 1.645\, n^{-1/2}$.

In summary, among all test procedures for which $\alpha(\delta) \le 0.05$, the procedure that
rejects $H_0$ when $\bar{X}_n > 1.645\, n^{-1/2}$ has the smallest probability of type II error.

Next, we shall determine the probability $\beta(\delta')$ of an error of type II for this
procedure $\delta'$. Since $\beta(\delta')$ is the probability of not rejecting $H_0$ when $H_1$ is true,

\[ \beta(\delta') = \Pr(\bar{X}_n < 1.645\, n^{-1/2} \mid \theta = 1). \tag{9.2.14} \]

When $\theta = 1$, the distribution of $\bar{X}_n$ is the normal distribution with mean 1 and variance
$1/n$. The probability in Eq. (9.2.14) can then be written as

\[ \beta(\delta') = \Phi\left(\frac{1.645\, n^{-1/2} - 1}{n^{-1/2}}\right) = \Phi(1.645 - n^{1/2}). \tag{9.2.15} \]

For instance, when $n = 9$, it is found from a table of the standard normal distribution
that

\[ \beta(\delta') = \Phi(-1.355) = 1 - \Phi(1.355) = 0.0877. \]


Finally, for this same random sample and the same hypotheses (9.2.9), we shall
determine the test procedure $\delta_0$ for which the value of $2\alpha(\delta) + \beta(\delta)$ is a minimum,
and we shall calculate the value of $2\alpha(\delta_0) + \beta(\delta_0)$ when $n = 9$.

It follows from Theorem 9.2.1 that the procedure $\delta_0$ for which $2\alpha(\delta) + \beta(\delta)$ is a
minimum rejects $H_0$ when the likelihood ratio is greater than 2. By Eq. (9.2.12), this
procedure is equivalent to rejecting $H_0$ when $\bar{X}_n > (1/2) + (1/n) \log 2$. Thus, when
$n = 9$, the optimal procedure $\delta_0$ rejects $H_0$ when $\bar{X}_n > 0.577$. For this procedure we
then have

\[ \alpha(\delta_0) = \Pr(\bar{X}_n > 0.577 \mid \theta = 0) \tag{9.2.16} \]

and

\[ \beta(\delta_0) = \Pr(\bar{X}_n < 0.577 \mid \theta = 1). \tag{9.2.17} \]


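Since $\bar{X}_n$ has the normal distribution with mean $\theta$ and variance $1/n$, we have

\[ \alpha(\delta_0) = 1 - \Phi\left(\frac{0.577}{1/3}\right) = 1 - \Phi(1.731) = 0.0417 \]

and

\[ \beta(\delta_0) = \Phi\left(\frac{0.577 - 1}{1/3}\right) = \Phi(-1.269) = 0.1022. \]

The minimum value of $2\alpha(\delta) + \beta(\delta)$ is therefore

\[ 2\alpha(\delta_0) + \beta(\delta_0) = 2(0.0417) + (0.1022) = 0.1856. \]

◭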
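The numbers in the normal-sample example can be reproduced without tables. The sketch below is an illustration only (the function names are hypothetical); it uses `statistics.NormalDist` from the Python standard library for $\Phi$ and $\Phi^{-1}$, and it assumes the setting of the example: variance 1, null mean 0, and alternative mean 1.

```python
from math import log, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def fixed_size_test(n, alpha0=0.05):
    """Cutoff k' and type II error of the test 'reject H0 if the sample mean > k''."""
    k_prime = Z.inv_cdf(1 - alpha0) / sqrt(n)   # 0.95 quantile of N(0, 1/n)
    beta = Z.cdf((k_prime - 1) * sqrt(n))       # Pr(mean < k' | theta = 1)
    return k_prime, beta

def linear_combination_test(n, a=2.0, b=1.0):
    """Test minimizing a*alpha + b*beta: reject H0 if likelihood ratio > a/b."""
    cutoff = 0.5 + log(a / b) / n               # from Eq. (9.2.12)
    alpha = 1 - Z.cdf(cutoff * sqrt(n))         # Pr(mean > cutoff | theta = 0)
    beta = Z.cdf((cutoff - 1) * sqrt(n))        # Pr(mean < cutoff | theta = 1)
    return cutoff, alpha, beta

k_prime, beta_fixed = fixed_size_test(9)
cutoff, alpha0_, beta0_ = linear_combination_test(9)
print(k_prime, beta_fixed, cutoff, 2 * alpha0_ + beta0_)
```

Small differences from the printed values in the last digit are expected, because the text rounds intermediate quantities such as 1.645 and 0.577.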
Example 9.2.6

Sampling from a Bernoulli Distribution. Suppose that $X_1, \ldots, X_n$ form a random sample
from the Bernoulli distribution with unknown parameter $p$, and the following
hypotheses are to be tested:

\[ H_0: p = 0.2, \qquad H_1: p = 0.4. \tag{9.2.18} \]
It is desired to find a test procedure for which $\alpha(\delta) = 0.05$ and $\beta(\delta)$ is a minimum.
In this example, each observed value $x_i$ must be either 0 or 1. If we let $y = \sum_{i=1}^{n} x_i$,
then the joint p.f. of $X_1, \ldots, X_n$ when $p = 0.2$ is

\[ f_0(x) = (0.2)^y (0.8)^{n-y} \tag{9.2.19} \]

and the joint p.f. when $p = 0.4$ is

\[ f_1(x) = (0.4)^y (0.6)^{n-y}. \tag{9.2.20} \]

Hence, the likelihood ratio is

\[ \frac{f_1(x)}{f_0(x)} = \left(\frac{3}{4}\right)^n \left(\frac{8}{3}\right)^y. \tag{9.2.21} \]

It follows that rejecting $H_0$ when the likelihood ratio is greater than a specified
positive constant $k$ is equivalent to rejecting $H_0$ when $y$ is greater than $k'$, where

\[ k' = \frac{\log k + n \log(4/3)}{\log(8/3)}. \tag{9.2.22} \]

To find a test procedure for which $\alpha(\delta) = 0.05$ and $\beta(\delta)$ is a minimum, we use the
Neyman-Pearson lemma. If we let $Y = \sum_{i=1}^{n} X_i$, we should try to find a value of $k'$
such that

\[ \Pr(Y > k' \mid p = 0.2) = 0.05. \tag{9.2.23} \]


When the hypothesis $H_0$ is true, the random variable $Y$ has the binomial distribution
with parameters $n$ and $p = 0.2$. However, because of the discreteness of this
distribution, it generally will not be possible to find a value of $k'$ for which Eq. (9.2.23)
is satisfied. For example, suppose that $n = 10$. Then it is found from a table of the
binomial distribution that $\Pr(Y > 4 \mid p = 0.2) = 0.0328$ and also $\Pr(Y > 3 \mid p = 0.2) =
0.1209$. Therefore, there is no critical region of the desired form for which $\alpha(\delta) = 0.05$.
If it is desired to use a level 0.05 test $\delta$ based on the likelihood ratio as specified by
the Neyman-Pearson lemma, then one must reject $H_0$ when $Y > 4$, and $\alpha(\delta) = 0.0328$.

◭


Randomized Tests 


It has been emphasized by some statisticians that $\alpha(\delta)$ can be made exactly 0.05 in
Example 9.2.6 if a randomized test procedure is used. Such a procedure is described
as follows: When the rejection region of the test procedure contains all values of $y$
greater than 4, we found in Example 9.2.6 that the size of the test is $\alpha(\delta) = 0.0328$.
Also, when the point $y = 4$ is added to this rejection region, the value of $\alpha(\delta)$ jumps to
0.1209. Suppose, however, that instead of choosing between including the point $y = 4$
in the rejection region and excluding that point, we use an auxiliary randomization
to decide whether or not to reject $H_0$ when $y = 4$. For example, we may toss a coin or
spin a wheel to arrive at this decision. Then, by choosing appropriate probabilities
to be used in this randomization, we can make $\alpha(\delta)$ exactly 0.05.

Specifically, consider the following test procedure: The hypothesis $H_0$ is rejected
if $y > 4$, and $H_0$ is not rejected if $y < 4$. However, if $y = 4$, then an auxiliary randomization
is carried out in which $H_0$ will be rejected with probability 0.195, and $H_0$ will
not be rejected with probability 0.805. The size $\alpha(\delta)$ of this test will then be

\[ \alpha(\delta) = \Pr(Y > 4 \mid p = 0.2) + (0.195) \Pr(Y = 4 \mid p = 0.2)
   = 0.0328 + (0.195)(0.0881) = 0.05. \tag{9.2.24} \]
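The randomization probability 0.195 is simply the value that fills the gap between the attainable size 0.0328 and the target 0.05. The following Python sketch (an illustration with hypothetical function names, using only `math.comb` from the standard library) recomputes the binomial probabilities for $n = 10$ and $p = 0.2$ and solves for that probability.

```python
from math import comb

def binom_pmf(y, n, p):
    """Pr(Y = y) for a binomial(n, p) random variable."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

def upper_tail(c, n, p):
    """Pr(Y > c) for a binomial(n, p) random variable."""
    return sum(binom_pmf(y, n, p) for y in range(c + 1, n + 1))

def randomization_prob(alpha0, c, n, p):
    """Probability of rejecting H0 when Y == c so that the size equals alpha0."""
    return (alpha0 - upper_tail(c, n, p)) / binom_pmf(c, n, p)

n, p0 = 10, 0.2
tail4 = upper_tail(4, n, p0)        # size of the test "reject if Y > 4"
tail3 = upper_tail(3, n, p0)        # size of the test "reject if Y > 3"
gamma = randomization_prob(0.05, 4, n, p0)
size = tail4 + gamma * binom_pmf(4, n, p0)
print(tail4, tail3, gamma, size)
```

The same function applies to any boundary point $c$, provided the target size lies between $\Pr(Y > c)$ and $\Pr(Y \ge c)$.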


Randomized tests do not seem to have any place in practical applications of
statistics. It does not seem reasonable for a statistician to decide whether or not
to reject a null hypothesis by tossing a coin or performing some other type of
randomization for the sole purpose of obtaining a value of $\alpha(\delta)$ that is equal to some
arbitrarily specified value such as 0.05. The main consideration for the statistician is
to use a nonrandomized test procedure $\delta'$ having the form specified in the Neyman-Pearson
lemma.


The proofs of Theorems 9.2.1 and 9.2.2 can be extended to find optimal tests
among all tests regardless of whether they are randomized or nonrandomized. The
optimal test in the extension of Theorem 9.2.2 has the same form as $\delta^*$ except that
randomization is allowed whenever $f_1(x) = k f_0(x)$. The only real need for randomized
tests, in this book, will be the simplification that they provide for one step in the
proof of Theorem 9.3.1.

Furthermore, rather than fixing a specific size $\alpha(\delta)$ and trying to minimize $\beta(\delta)$,
it might be more reasonable for the statistician to minimize a linear combination of
the form $a\alpha(\delta) + b\beta(\delta)$. As we have seen in Theorem 9.2.1, such a minimization can
always be achieved without recourse to an auxiliary randomization. In Sec. 9.9, we
shall present another argument that indicates why it might be more reasonable to
minimize a linear combination of the form $a\alpha(\delta) + b\beta(\delta)$ than to specify a value of
$\alpha(\delta)$ and then minimize $\beta(\delta)$.


Summary

For the special case in which there are only two possible values, $\theta_0$ and $\theta_1$, for
the parameter, we found a collection of procedures for testing $H_0: \theta = \theta_0$ versus
$H_1: \theta = \theta_1$ that contains the optimal test procedure for each of the following criteria:

• Choose the test $\delta$ with the smallest value of $a\alpha(\delta) + b\beta(\delta)$.
• Among all tests $\delta$ with $\alpha(\delta) \le \alpha_0$, choose the test with the smallest value of $\beta(\delta)$.

Here, $\alpha(\delta) = \Pr(\text{Reject } H_0 \mid \theta = \theta_0)$ and $\beta(\delta) = \Pr(\text{Don't Reject } H_0 \mid \theta = \theta_1)$ are,
respectively, the probabilities of type I and type II errors. The tests all have the
following form for some positive constant $k$: reject $H_0$ if $f_0(x) < k f_1(x)$, don't reject $H_0$
if $f_0(x) > k f_1(x)$, and do either if $f_0(x) = k f_1(x)$.


Exercises 


1. Let $f_0(x)$ be the p.f. of the Bernoulli distribution with
parameter 0.3, and let $f_1(x)$ be the p.f. of the Bernoulli
distribution with parameter 0.6. Suppose that a single observation
$X$ is taken from a distribution for which the p.f.
$f(x)$ is either $f_0(x)$ or $f_1(x)$, and the following simple hypotheses
are to be tested:

\[ H_0: f(x) = f_0(x), \qquad H_1: f(x) = f_1(x). \]

Find the test procedure $\delta$ for which the value of $\alpha(\delta) + \beta(\delta)$
is a minimum.

2. Consider two p.d.f.'s $f_0(x)$ and $f_1(x)$ that are defined as
follows:

\[ f_0(x) = \begin{cases} 1 & \text{for } 0 \le x \le 1, \\ 0 & \text{otherwise,} \end{cases} \]

and

\[ f_1(x) = \begin{cases} 2x & \text{for } 0 \le x \le 1, \\ 0 & \text{otherwise.} \end{cases} \]

Suppose that a single observation $X$ is taken from a distribution
for which the p.d.f. $f(x)$ is either $f_0(x)$ or $f_1(x)$,
and the following simple hypotheses are to be tested:

\[ H_0: f(x) = f_0(x), \qquad H_1: f(x) = f_1(x). \]

a. Describe a test procedure for which the value of
$\alpha(\delta) + 2\beta(\delta)$ is a minimum.

b. Determine the minimum value of $\alpha(\delta) + 2\beta(\delta)$ attained
by that procedure.

3. Consider again the conditions of Exercise 2, but suppose
now that it is desired to find a test procedure for
which the value of $3\alpha(\delta) + \beta(\delta)$ is a minimum.

a. Describe the procedure.

b. Determine the minimum value of $3\alpha(\delta) + \beta(\delta)$ attained
by the procedure.

4. Consider again the conditions of Exercise 2, but suppose
now that it is desired to find a test procedure for
which $\alpha(\delta) \le 0.1$ and $\beta(\delta)$ is a minimum.

a. Describe the procedure.

b. Determine the minimum value of $\beta(\delta)$ attained by
the procedure.


5. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with unknown mean $\theta$ and known
variance 1, and the following hypotheses are to be
tested:

\[ H_0: \theta = 3.5, \qquad H_1: \theta = 5.0. \]

a. Among all test procedures for which $\beta(\delta) \le 0.05$, describe
a procedure for which $\alpha(\delta)$ is a minimum.

b. For $n = 4$, find the minimum value of $\alpha(\delta)$ attained
by the procedure described in part (a).

6. Suppose that $X_1, \ldots, X_n$ form a random sample from
the Bernoulli distribution with unknown parameter $p$. Let
$p_0$ and $p_1$ be specified values such that $0 < p_1 < p_0 < 1$,
and suppose that it is desired to test the following simple
hypotheses:

\[ H_0: p = p_0, \qquad H_1: p = p_1. \]

a. Show that a test procedure for which $\alpha(\delta) + \beta(\delta)$ is a
minimum rejects $H_0$ when $\bar{X}_n < c$.

b. Find the value of the constant $c$.

7. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with known mean $\mu$ and unknown
variance $\sigma^2$, and the following simple hypotheses are to be
tested:

\[ H_0: \sigma^2 = 2, \qquad H_1: \sigma^2 = 3. \]

a. Show that among all test procedures for which $\alpha(\delta) \le
0.05$, the value of $\beta(\delta)$ is minimized by a procedure
that rejects $H_0$ when $\sum_{i=1}^{n} (X_i - \mu)^2 > c$.

b. For $n = 8$, find the value of the constant $c$ that appears
in part (a).


8. Suppose that a single observation $X$ is taken from the
uniform distribution on the interval $[0, \theta]$, where the value
of $\theta$ is unknown, and the following simple hypotheses are
to be tested:

\[ H_0: \theta = 1, \qquad H_1: \theta = 2. \]

a. Show that there exists a test procedure for which
$\alpha(\delta) = 0$ and $\beta(\delta) < 1$.

b. Among all test procedures for which $\alpha(\delta) = 0$, find
the one for which $\beta(\delta)$ is a minimum.

9. Suppose that a random sample $X_1, \ldots, X_n$ is drawn
from the uniform distribution on the interval $[0, \theta]$, and
consider again the problem of testing the simple hypotheses
described in Exercise 8. Find the minimum value of
$\beta(\delta)$ that can be attained among all test procedures for
which $\alpha(\delta) = 0$.

10. Suppose that $X_1, \ldots, X_n$ form a random sample from
the Poisson distribution with unknown mean $\lambda$. Let $\lambda_0$ and
$\lambda_1$ be specified values such that $\lambda_1 > \lambda_0 > 0$, and suppose
that it is desired to test the following simple hypotheses:

\[ H_0: \lambda = \lambda_0, \qquad H_1: \lambda = \lambda_1. \]

a. Show that the value of $\alpha(\delta) + \beta(\delta)$ is minimized by a
test procedure which rejects $H_0$ when $\bar{X}_n > c$.

b. Find the value of $c$.

c. For $\lambda_0 = 1/4$, $\lambda_1 = 1/2$, and $n = 20$, determine the
minimum value of $\alpha(\delta) + \beta(\delta)$ that can be attained.

11. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with unknown mean $\mu$ and known
standard deviation 2, and the following simple hypotheses
are to be tested:

\[ H_0: \mu = -1, \qquad H_1: \mu = 1. \]

Determine the minimum value of $\alpha(\delta) + \beta(\delta)$ that can be
attained for each of the following values of the sample
size $n$:

a. $n = 1$   b. $n = 4$   c. $n = 16$   d. $n = 36$


12. Let $X_1, \ldots, X_n$ be a random sample from the exponential
distribution with unknown parameter $\theta$. Let $0 <
\theta_0 < \theta_1$ be two possible values of the parameter. Suppose
that we wish to test the following hypotheses:

\[ H_0: \theta = \theta_0, \qquad H_1: \theta = \theta_1. \]

For each $\alpha_0 \in (0, 1)$, show that among all tests $\delta$ satisfying
$\alpha(\delta) \le \alpha_0$, the test with the smallest probability of type II
error will reject $H_0$ if $\sum_{i=1}^{n} X_i < c$, where $c$ is the $\alpha_0$ quantile
of the gamma distribution with parameters $n$ and $\theta_0$.
13. Consider the series of examples in this section concerning
service times in a queue. Suppose that the manager
observes two service times $X_1$ and $X_2$. It is easy to
see that both $f_1(x)$ in (9.2.1) and $f_0(x)$ in (9.2.2) depend
on the observed data only through the value $t = x_1 + x_2$
of the statistic $T = X_1 + X_2$. Hence, the tests from Theorems
9.2.1 and 9.2.2 both depend only on the value of $T$.

a. Using Theorem 9.2.1, determine the test procedure
that minimizes the sum of the probabilities of type I
and type II errors.

b. Suppose that $X_1 = 4$ and $X_2 = 3$ are observed. Perform
the test in part (a) to see whether $H_0$ is rejected.

c. Prove that the distribution of $T$, given that $H_0$ is true,
is the gamma distribution with parameters 2 and 1/2.

d. Using Theorem 9.2.2, determine the test procedure
with level at most 0.01 that has minimum probability
of type II error. Hint: It looks like you need to solve
a system of nonlinear equations, but for a level 0.01
test, the equations collapse to a single simple equation.

e. Suppose that $X_1 = 4$ and $X_2 = 3$ are observed. Perform
the test in part (d) to see whether $H_0$ is rejected.


9.3 Uniformly Most Powerful Tests

When the null and/or alternative hypothesis is composite, we can still find a class of
tests that has optimal properties in certain circumstances. In particular, the null and
alternative hypotheses must be of the form $H_0: \theta \le \theta_0$ and $H_1: \theta > \theta_0$, or $H_0: \theta \ge \theta_0$
and $H_1: \theta < \theta_0$. In addition, the family of distributions of the data must have a
property called "monotone likelihood ratio," which is defined in this section.


Definition of a Uniformly Most Powerful Test


Example 9.3.1

Service Times in a Queue. In Example 9.2.1, a manager was interested in testing
which of two joint distributions described the service times in a queue that she was
managing. Suppose, now, that instead of considering only two joint distributions,
the manager wishes to consider all of the joint distributions that can be described by
saying that the service times form a random sample from the exponential distribution
with parameter $\theta$ conditional on $\theta$. That is, for each possible rate $\theta > 0$, the manager
is willing to consider the possibility that the service times are i.i.d. exponential
random variables with parameter $\theta$. In particular, the manager is interested in testing
$H_0: \theta \le 1/2$ versus $H_1: \theta > 1/2$. For each $\theta' > 1/2$, the manager could use the methods
of Sec. 9.2 to test the hypotheses $H_0': \theta = 1/2$ versus $H_1': \theta = \theta'$. She could obtain the
level $\alpha_0$ test with the smallest possible type II error probability when $\theta = \theta'$. But can
she find a single level $\alpha_0$ test that has the smallest possible type II error probability
simultaneously for all $\theta > 1/2$? And will that test have probability of type I error at
most $\alpha_0$ for all $\theta \le 1/2$? ◭


Consider a problem of testing hypotheses in which the random variables $X =
(X_1, \ldots, X_n)$ form a random sample from a distribution for which either the p.d.f. or
the p.f. is $f(x \mid \theta)$. We suppose that the value of the parameter $\theta$ is unknown but must
lie in a specified parameter space $\Omega$ that is a subset of the real line. As usual, we shall
suppose that $\Omega_0$ and $\Omega_1$ are disjoint subsets of $\Omega$, and the hypotheses to be tested are

\[ H_0: \theta \in \Omega_0, \qquad H_1: \theta \in \Omega_1. \tag{9.3.1} \]

We shall assume that the subset $\Omega_1$ contains at least two distinct values of $\theta$, in which
case the alternative hypothesis $H_1$ is composite. The null hypothesis $H_0$ may be either
simple or composite. Example 9.3.1 is of the type just described with $\Omega_0 = (0, 1/2]$
and $\Omega_1 = (1/2, \infty)$.


We shall also suppose that it is desired to test the hypotheses (9.3.1) at a specified
level of significance $\alpha_0$, where $\alpha_0$ is a given number in the interval $0 < \alpha_0 < 1$. In other
words, we shall consider only procedures in which $\Pr(\text{Rejecting } H_0 \mid \theta) \le \alpha_0$ for every
value of $\theta \in \Omega_0$. If $\pi(\theta \mid \delta)$ denotes the power function of a given test procedure $\delta$, this
requirement can be written simply as

\[ \pi(\theta \mid \delta) \le \alpha_0 \quad \text{for } \theta \in \Omega_0. \tag{9.3.2} \]

Equivalently, if $\alpha(\delta)$ denotes the size of a test procedure $\delta$, as defined by Eq. (9.1.7),
then the requirement (9.3.2) can also be expressed by the relation

\[ \alpha(\delta) \le \alpha_0. \tag{9.3.3} \]

Finally, among all test procedures that satisfy the requirement (9.3.3), we want to
find one that has the smallest possible probability of type II error for every $\theta \in \Omega_1$.
In terms of the power function, we want the value of $\pi(\theta \mid \delta)$ to be as large as possible
for every value of $\theta \in \Omega_1$.

It may not be possible to satisfy this last criterion. If $\theta_1$ and $\theta_2$ are two different
values of $\theta$ in $\Omega_1$, then the test procedure for which the value of $\pi(\theta_1 \mid \delta)$ is a maximum
might be different from the test procedure for which the value of $\pi(\theta_2 \mid \delta)$ is a maximum.
In other words, there might be no single test procedure $\delta$ that maximizes the
power function $\pi(\theta \mid \delta)$ simultaneously for every value of $\theta$ in $\Omega_1$. In some problems,
however, there will exist a test procedure that satisfies this criterion. Such a procedure,
when it exists, is called a uniformly most powerful test, or, more briefly, a UMP
test. The formal definition of a UMP test is as follows.


Definition 9.3.1

Uniformly Most Powerful (UMP) Test. A test procedure $\delta^*$ is a uniformly most powerful
(UMP) test of the hypotheses (9.3.1) at the level of significance $\alpha_0$ if $\alpha(\delta^*) \le \alpha_0$ and,
for every other test procedure $\delta$ such that $\alpha(\delta) \le \alpha_0$, it is true that

\[ \pi(\theta \mid \delta) \le \pi(\theta \mid \delta^*) \quad \text{for every value of } \theta \in \Omega_1. \tag{9.3.4} \]

In this section, we shall show that a UMP test exists in many problems in which the
random sample comes from one of the standard families of distributions that we have
been considering in this book.


Monotone Likelihood Ratio 


Example 9.3.2

Service Times in a Queue. Suppose that the manager in Example 9.3.1 observes a
random sample $X = (X_1, \ldots, X_n)$ of service times and tries to find the level $\alpha_0$ test
of $H_0': \theta = 1/2$ versus $H_1': \theta = \theta'$ that has the largest power at $\theta = \theta' > 1/2$. According
to Exercise 12 in Sec. 9.2, the test will reject $H_0'$ if $\sum_{i=1}^{n} X_i$ is less than the $\alpha_0$ quantile of
the gamma distribution with parameters $n$ and 1/2. This test is the same test regardless
of which $\theta' > 1/2$ the manager considers. Hence, the test is UMP at the level of
significance $\alpha_0$ for testing $H_0': \theta = 1/2$ versus $H_1: \theta > 1/2$. ◭


The family of exponential distributions in Example 9.3.2 has a special property 
called monotone likelihood ratio that allows the manager to find a UMP test. 


Definition 9.3.2

Monotone Likelihood Ratio. Let $f_n(x \mid \theta)$ denote the joint p.d.f. or the joint p.f. of the
observations $X = (X_1, \ldots, X_n)$. Let $T = r(X)$ be a statistic. It is said that the joint
distribution of $X$ has a monotone likelihood ratio (MLR) in the statistic $T$ if the
following property is satisfied: For every two values $\theta_1 \in \Omega$ and $\theta_2 \in \Omega$, with $\theta_1 < \theta_2$,
the ratio $f_n(x \mid \theta_2)/f_n(x \mid \theta_1)$ depends on the vector $x$ only through the function $r(x)$,
and this ratio is a monotone function of $r(x)$ over the range of possible values of $r(x)$.
Specifically, if the ratio is increasing, we say that the distribution of $X$ has increasing
MLR, and if the ratio is decreasing, we say that the distribution has decreasing MLR.


Example 9.3.3

Sampling from a Bernoulli Distribution. Suppose that $X_1, \ldots, X_n$ form a random sample
from the Bernoulli distribution with unknown parameter $p$ ($0 < p < 1$). If we let
$y = \sum_{i=1}^{n} x_i$, then the joint p.f. $f_n(x \mid p)$ is as follows:

\[ f_n(x \mid p) = p^y (1 - p)^{n-y}. \]

Therefore, for every two values $p_1$ and $p_2$ such that $0 < p_1 < p_2 < 1$,

\[ \frac{f_n(x \mid p_2)}{f_n(x \mid p_1)} = \left[\frac{p_2 (1 - p_1)}{p_1 (1 - p_2)}\right]^y \left(\frac{1 - p_2}{1 - p_1}\right)^n. \tag{9.3.5} \]

It can be seen from Eq. (9.3.5) that the ratio $f_n(x \mid p_2)/f_n(x \mid p_1)$ depends on the vector $x$
only through the value of $y$, and this ratio is an increasing function of $y$. Therefore,
$f_n(x \mid p)$ has increasing monotone likelihood ratio in the statistic $Y = \sum_{i=1}^{n} X_i$. ◭


Example 9.3.4

Sampling from an Exponential Distribution. Let $X = (X_1, \ldots, X_n)$ be a random sample
from the exponential distribution with unknown parameter $\theta > 0$. The joint p.d.f. is

\[ f_n(x \mid \theta) = \begin{cases} \theta^n \exp\left(-\theta \sum_{i=1}^{n} x_i\right) & \text{for all } x_i > 0, \\ 0 & \text{otherwise.} \end{cases} \]

For $0 < \theta_1 < \theta_2$, we have

\[ \frac{f_n(x \mid \theta_2)}{f_n(x \mid \theta_1)} = \left(\frac{\theta_2}{\theta_1}\right)^n \exp\left([\theta_1 - \theta_2] \sum_{i=1}^{n} x_i\right), \tag{9.3.6} \]

if all $x_i > 0$. If we let $r(x) = \sum_{i=1}^{n} x_i$, then we see that the ratio in Eq. (9.3.6) depends
on $x$ only through $r(x)$ and is a decreasing function of $r(x)$. Hence, the joint distribution
of a random sample of exponential random variables has decreasing MLR in
$T = \sum_{i=1}^{n} X_i$. ◭


In Example 9.3.4, we could have defined the statistic $T' = -\sum_{i=1}^{n} X_i$ or $T' =
1/\sum_{i=1}^{n} X_i$, and then the distribution would have had increasing MLR in $T'$. This
can be done in general in Definition 9.3.2. For this reason, when we prove theorems
that assume that a distribution has MLR, we shall state and prove the theorems
for increasing MLR only. When a distribution has decreasing MLR, the reader can
transform the statistic by a strictly decreasing function and then transform the result
back to the original statistic, if desired.


Example 9.3.5

Sampling from a Normal Distribution. Suppose that $X_1, \ldots, X_n$ form a random sample
from the normal distribution with unknown mean $\mu$ ($-\infty < \mu < \infty$) and known
variance $\sigma^2$. The joint p.d.f. $f_n(x \mid \mu)$ is as follows:

\[ f_n(x \mid \mu) = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right]. \]

Therefore, for every two values $\mu_1$ and $\mu_2$ such that $\mu_1 < \mu_2$,

\[ \frac{f_n(x \mid \mu_2)}{f_n(x \mid \mu_1)} = \exp\left\{\frac{n(\mu_2 - \mu_1)}{\sigma^2} \left[\bar{x}_n - \frac{1}{2}(\mu_2 + \mu_1)\right]\right\}. \tag{9.3.7} \]

It can be seen from Eq. (9.3.7) that the ratio $f_n(x \mid \mu_2)/f_n(x \mid \mu_1)$ depends on the
vector $x$ only through the value of $\bar{x}_n$, and this ratio is an increasing function of $\bar{x}_n$.
Therefore, $f_n(x \mid \mu)$ has increasing monotone likelihood ratio in the statistic $\bar{X}_n$. ◭
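The step from the normal joint p.d.f. to the ratio in Eq. (9.3.7) can be written out explicitly. The following derivation is a worked step added for convenience; it uses only the joint p.d.f. with mean $\mu$ and known variance $\sigma^2$, whose normalizing constant cancels in the ratio:

```latex
\[
\log \frac{f_n(x \mid \mu_2)}{f_n(x \mid \mu_1)}
  = -\frac{1}{2\sigma^2}\left[\sum_{i=1}^{n}(x_i - \mu_2)^2 - \sum_{i=1}^{n}(x_i - \mu_1)^2\right]
  = \frac{1}{\sigma^2}\left[(\mu_2 - \mu_1)\, n \bar{x}_n - \frac{n}{2}\left(\mu_2^2 - \mu_1^2\right)\right]
  = \frac{n(\mu_2 - \mu_1)}{\sigma^2}\left[\bar{x}_n - \frac{1}{2}(\mu_2 + \mu_1)\right].
\]
```

Because $\mu_2 - \mu_1 > 0$, the coefficient of $\bar{x}_n$ is positive, which is why the ratio is increasing in $\bar{x}_n$.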


One-Sided Alternatives 


In Example 9.3.2, we found a UMP level $\alpha_0$ test for a simple null hypothesis $H_0': \theta =
1/2$ against a one-sided alternative $H_1: \theta > 1/2$. It is more common in such problems
to test hypotheses of the form

\[ H_0: \theta \le \theta_0, \qquad H_1: \theta > \theta_0. \tag{9.3.8} \]

That is, both the null and alternative hypotheses are one-sided. Because the one-sided
null hypothesis is larger than the simple null $H_0': \theta = \theta_0$, it is not necessarily
the case that a level $\alpha_0$ test of $H_0'$ will be a level $\alpha_0$ test of $H_0$. However, if the joint
distribution of the observations has MLR, we will be able to show that there will exist
UMP level $\alpha_0$ tests of the hypotheses (9.3.8). Furthermore (see Exercise 12), there
will exist UMP tests of the hypotheses obtained by reversing the inequalities in both
$H_0$ and $H_1$ in (9.3.8).


Theorem 9.3.1

Suppose that the joint distribution of $X$ has increasing monotone likelihood ratio in
the statistic $T = r(X)$. Let $c$ and $\alpha_0$ be constants such that

\[ \Pr(T \ge c \mid \theta = \theta_0) = \alpha_0. \tag{9.3.9} \]

Then the test procedure $\delta^*$ that rejects $H_0$ if $T \ge c$ is a UMP test of the hypotheses
(9.3.8) at the level of significance $\alpha_0$. Also, $\pi(\theta \mid \delta^*)$ is a monotone increasing function
of $\theta$.


Proof Let $\theta' < \theta''$ be arbitrary values of $\theta$. Let $\alpha_0' = \pi(\theta' \mid \delta^*)$. It follows from the
Neyman-Pearson lemma that among all procedures $\delta$ for which

\[ \pi(\theta' \mid \delta) \le \alpha_0', \tag{9.3.10} \]

the value of $\pi(\theta'' \mid \delta)$ will be maximized ($1 - \pi(\theta'' \mid \delta)$ minimized) by a procedure that
rejects $H_0$ when $f_n(x \mid \theta'')/f_n(x \mid \theta') \ge k$. The constant $k$ is to be chosen so that

\[ \pi(\theta' \mid \delta) = \alpha_0'. \tag{9.3.11} \]

Because the distribution of $X$ has increasing MLR, the likelihood ratio $f_n(x \mid \theta'')/
f_n(x \mid \theta')$ is an increasing function of $r(x)$. Therefore, a procedure that rejects $H_0$
when the likelihood ratio is at least equal to $k$ will be equivalent to a procedure
that rejects $H_0$ when $r(x)$ is at least equal to some other number $c$. The value of $c$
is to be chosen so that (9.3.11) holds. The test $\delta^*$ satisfies Eq. (9.3.11) and has the
correct form; hence, it maximizes the power function at $\theta = \theta''$ among all tests that
satisfy Eq. (9.3.10). Another test $\delta$ that satisfies Eq. (9.3.10) is the following: Flip a
coin that has probability of heads equal to $\alpha_0'$, and reject $H_0$ if the coin lands heads.
This test has $\pi(\theta \mid \delta) = \alpha_0'$ for all $\theta$, including $\theta'$ and $\theta''$. Because $\delta^*$ maximizes the power
function at $\theta''$, we have

\[ \pi(\theta'' \mid \delta^*) \ge \pi(\theta'' \mid \delta) = \alpha_0' = \pi(\theta' \mid \delta^*). \tag{9.3.12} \]

Hence, we have proven the claim that $\pi(\theta \mid \delta^*)$ is a monotone increasing function of $\theta$.
Next, consider the special case of what we have just proven with $\theta' = \theta_0$. Then
$\alpha_0' = \alpha_0$, and we have proven that, for every $\theta'' > \theta_0$, $\delta^*$ maximizes $\pi(\theta'' \mid \delta)$ among all
tests $\delta$ that satisfy

\[ \pi(\theta_0 \mid \delta) \le \alpha_0. \tag{9.3.13} \]

Every level $\alpha_0$ test $\delta$ satisfies Eq. (9.3.13). Hence, $\delta^*$ has power at $\theta''$ at least as high
as the power of every level $\alpha_0$ test. All that remains to complete the proof is to show
that $\delta^*$ is itself a level $\alpha_0$ test.

We have already shown that the power function $\pi(\theta \mid \delta^*)$ is monotone increasing.
Hence, $\pi(\theta \mid \delta^*) \le \alpha_0$ for all $\theta \le \theta_0$, and $\delta^*$ is a level $\alpha_0$ test. ∎


Example 9.3.6

Service Times in a Queue. The manager in Example 9.3.2 might be interested in the
hypotheses $H_0: \theta \le 1/2$ versus $H_1: \theta > 1/2$. The distribution in that example has
decreasing MLR in the statistic $T = \sum_{i=1}^{n} X_i$, and hence it has increasing MLR in $-T$.
Theorem 9.3.1 says that a UMP level $\alpha_0$ test is to reject $H_0$ when $-T$ is greater than the
$1 - \alpha_0$ quantile of the distribution of $-T$ given $\theta = 1/2$. This is the same as rejecting
$H_0$ when $T$ is less than the $\alpha_0$ quantile of the distribution of $T$. The distribution of
$T$ given $\theta = 1/2$ is the gamma distribution with parameters $n$ and 1/2, which is also
the $\chi^2$ distribution with $2n$ degrees of freedom. For example, if $n = 10$ and $\alpha_0 = 0.1$,
the quantile is 12.44, which can be found in the table in the back of the book or from
computer software. ◭
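As one way of obtaining the quantile 12.44 "from computer software," the sketch below inverts the gamma c.d.f. by bisection using only the Python standard library. It is an illustration, not a prescribed method: the function names are hypothetical, and the closed-form c.d.f. used here is valid only for an integer first parameter $n$, for which the gamma c.d.f. equals one minus a finite sum of Poisson probabilities.

```python
from math import exp

def gamma_cdf_int_shape(t, n, rate):
    """c.d.f. at t of the gamma distribution with integer shape n and rate `rate`."""
    if t <= 0:
        return 0.0
    term, total = 1.0, 1.0          # terms (rate*t)^j / j! for j = 0, ..., n-1
    for j in range(1, n):
        term *= rate * t / j
        total += term
    return 1.0 - exp(-rate * t) * total

def gamma_quantile(prob, n, rate, lo=0.0, hi=200.0, tol=1e-10):
    """Invert the c.d.f. by bisection; `hi` must exceed the desired quantile."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gamma_cdf_int_shape(mid, n, rate) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

c = gamma_quantile(0.1, n=10, rate=0.5)   # reject H0 when T = sum of X_i is below c
print(c)
```

A statistics library with a gamma or chi-square quantile function would give the same cutoff directly; bisection is used here only to keep the example self-contained.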


Example 9.3.7

Testing Hypotheses about the Proportion of Defective Items. Suppose that the proportion
$p$ of defective items in a large manufactured lot is unknown, 20 items are to be
selected at random from the lot and inspected, and the following hypotheses are to
be tested:

\[ H_0: p \le 0.1, \qquad H_1: p > 0.1. \tag{9.3.14} \]

We shall show first that there exist UMP tests of the hypotheses (9.3.14). We shall
then determine the form of these tests and discuss the different levels of significance
that can be attained with nonrandomized tests.

Let $X_1, \ldots, X_{20}$ denote the 20 random variables in the sample. Then $X_1, \ldots, X_{20}$
form a random sample of size 20 from the Bernoulli distribution with parameter $p$,
and it is known from Example 9.3.3 that the joint p.f. of $X_1, \ldots, X_{20}$ has increasing
monotone likelihood ratio in the statistic $Y = \sum_{i=1}^{20} X_i$. Therefore, by Theorem 9.3.1,
a test procedure that rejects $H_0$ when $Y \ge c$ will be a UMP test of the hypotheses
(9.3.14).

For each specific choice of the constant $c$, the size of the UMP test will be
$\alpha_0 = \Pr(Y \ge c \mid p = 0.1)$. When $p = 0.1$, the random variable $Y$ has the binomial distribution
with parameters $n = 20$ and $p = 0.1$. Because $Y$ has a discrete distribution
and assumes only a finite number of different possible values, it follows that
there are only a finite number of different possible values for $\alpha_0$. To illustrate this
remark, it is found from a table of the binomial distribution that if $c = 7$, then
$\alpha_0 = \Pr(Y \ge 7 \mid p = 0.1) = 0.0024$, and if $c = 6$, then $\alpha_0 = \Pr(Y \ge 6 \mid p = 0.1) = 0.0113$.
Therefore, if an experimenter wants the size of the test to be approximately 0.01,
she could choose either $c = 7$ and $\alpha_0 = 0.0024$ or $c = 6$ and $\alpha_0 = 0.0113$. The test with
$c = 7$ is a level 0.01 test while the test with $c = 6$ is not, because the size of the former
test is less than 0.01 while the size of the latter test is greater than 0.01.

If the experimenter wants the size of the test to be exactly 0.01, then she can use
a randomized test procedure of the type described in Sec. 9.2. ◭
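The finite set of attainable sizes in the defective-items example can be listed directly. The Python sketch below is illustrative only (the function name is hypothetical); it computes $\Pr(Y \ge c \mid p = 0.1)$ for the binomial distribution with $n = 20$ and then the randomization probability at $Y = 6$ that would give size exactly 0.01.

```python
from math import comb

def size_of_test(c, n=20, p=0.1):
    """Pr(Y >= c) for Y ~ binomial(n, p): the size of the test 'reject if Y >= c'."""
    return sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in range(c, n + 1))

# Every size that a nonrandomized test of the form "reject if Y >= c" can attain.
sizes = {c: size_of_test(c) for c in range(0, 22)}

# Randomized test: always reject if Y >= 7, reject with probability gamma if Y == 6.
prob_y_equals_6 = sizes[6] - sizes[7]
gamma = (0.01 - sizes[7]) / prob_y_equals_6
print(sizes[6], sizes[7], gamma)
```

Because the sizes decrease as $c$ increases, the largest nonrandomized level 0.01 test is the one with the smallest $c$ whose size does not exceed 0.01, here $c = 7$.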


Example 9.3.8

Testing Hypotheses about the Mean of a Normal Distribution. Let $X_1, \ldots, X_n$ form a
random sample from the normal distribution with mean $\mu$ and variance $\sigma^2$. Assume
that $\sigma^2$ is known. Let $\mu_0$ be a specified number, and suppose that the following
hypotheses are to be tested:

\[ H_0: \mu \le \mu_0, \qquad H_1: \mu > \mu_0. \tag{9.3.15} \]

We shall show first that, for every specified level of significance $\alpha_0$ ($0 < \alpha_0 < 1$), there
is a UMP test of the hypotheses (9.3.15) with size equal to $\alpha_0$. We shall then determine
the power function of the UMP test.

It is known from Example 9.3.5 that the joint p.d.f. of $X_1, \ldots, X_n$ has an increasing
monotone likelihood ratio in the statistic $\bar{X}_n$. Therefore, by Theorem 9.3.1, a test
procedure $\delta_1$ that rejects $H_0$ when $\bar{X}_n \ge c$ is a UMP test of the hypotheses (9.3.15).
The size of this test is $\alpha_0 = \Pr(\bar{X}_n \ge c \mid \mu = \mu_0)$.

Since $\bar{X}_n$ has a continuous distribution, $c$ is the $1 - \alpha_0$ quantile of the distribution
of $\bar{X}_n$ given $\mu = \mu_0$. That is, $c$ is the $1 - \alpha_0$ quantile of the normal distribution with
mean $\mu_0$ and variance $\sigma^2/n$. As we learned in Chapter 5, this quantile is

\[ c = \mu_0 + \Phi^{-1}(1 - \alpha_0)\, \sigma n^{-1/2}, \tag{9.3.16} \]

where $\Phi^{-1}$ is the quantile function of the standard normal distribution. For simplicity,
we shall let $z_{\alpha_0} = \Phi^{-1}(1 - \alpha_0)$ for the rest of this example.

We shall now determine the power function $\pi(\mu \mid \delta_1)$ of this UMP test. By definition,

\[ \pi(\mu \mid \delta_1) = \Pr(\text{Rejecting } H_0 \mid \mu) = \Pr(\bar{X}_n \ge \mu_0 + z_{\alpha_0} \sigma n^{-1/2} \mid \mu). \tag{9.3.17} \]

For every value of $\mu$, the random variable $Z' = n^{1/2}(\bar{X}_n - \mu)/\sigma$ will have the standard
normal distribution. Therefore, if $\Phi$ denotes the c.d.f. of the standard normal
distribution, then

\[ \pi(\mu \mid \delta_1) = \Pr\left(Z' \ge z_{\alpha_0} + \frac{n^{1/2}(\mu_0 - \mu)}{\sigma}\right)
   = 1 - \Phi\left(z_{\alpha_0} + \frac{n^{1/2}(\mu_0 - \mu)}{\sigma}\right)
   = \Phi\left(\frac{n^{1/2}(\mu - \mu_0)}{\sigma} - z_{\alpha_0}\right). \tag{9.3.18} \]

The power function $\pi(\mu \mid \delta_1)$ is sketched in Fig. 9.6. ◭

Figure 9.6 The power function $\pi(\mu \mid \delta_1)$ for the UMP test of the hypotheses (9.3.15).
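The critical value (9.3.16) and the power function (9.3.18) are easy to evaluate numerically. The sketch below is an illustration with hypothetical function names and arbitrarily chosen example values ($\mu_0 = 0$, $\sigma = 1$, $n = 9$, $\alpha_0 = 0.05$, none of which come from the text); it uses `statistics.NormalDist` from the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def critical_value(mu0, sigma, n, alpha0):
    """c of Eq. (9.3.16): the 1 - alpha0 quantile of N(mu0, sigma^2 / n)."""
    return mu0 + Z.inv_cdf(1 - alpha0) * sigma / sqrt(n)

def power(mu, mu0, sigma, n, alpha0):
    """Power function of Eq. (9.3.18) for the test 'reject H0 if sample mean >= c'."""
    z = Z.inv_cdf(1 - alpha0)
    return Z.cdf(sqrt(n) * (mu - mu0) / sigma - z)

# Illustrative values only (assumptions, not taken from the text).
mu0, sigma, n, alpha0 = 0.0, 1.0, 9, 0.05
c = critical_value(mu0, sigma, n, alpha0)
for mu in (-0.5, 0.0, 0.5, 1.0):
    print(mu, power(mu, mu0, sigma, n, alpha0))
```

At $\mu = \mu_0$ the power equals $\alpha_0$, and it increases with $\mu$, in agreement with the monotonicity asserted in Theorem 9.3.1.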


In each of the pairs of hypotheses (9.3.8), (9.3.14), and (9.3.15), the alternative
hypothesis $H_1$ is called a one-sided alternative because the set of possible values of
the parameter under $H_1$ lies entirely on one side of the set of possible values under
the null hypothesis $H_0$. In particular, for the hypotheses (9.3.8), (9.3.14), or (9.3.15),
every possible value of the parameter under $H_1$ is larger than every possible value
under $H_0$.


Figure 9.7 The power function $\pi(\mu \mid \delta_2)$ for the UMP test of the hypotheses (9.3.19).

Example 9.3.9

One-Sided Alternatives in the Other Direction. Suppose now that instead of testing the hypotheses (9.3.15) in Example 9.3.8, we are interested in testing the following hypotheses:

$$H_0: \mu \ge \mu_0, \qquad H_1: \mu < \mu_0. \tag{9.3.19}$$

In this case, the hypothesis $H_1$ is again a one-sided alternative, and it can be shown (see Exercise 12) that there exists a UMP test of the hypotheses (9.3.19) at every specified level of significance $\alpha_0$ $(0 < \alpha_0 < 1)$. By analogy with Eq. (9.3.16), the UMP test $\delta_2$ will reject $H_0$ when $\bar{X}_n \le c$, where

$$c = \mu_0 - \Phi^{-1}(1 - \alpha_0)\,\sigma n^{-1/2}. \tag{9.3.20}$$


The power function $\pi(\mu \mid \delta_2)$ of the test $\delta_2$ will be

$$\pi(\mu \mid \delta_2) = \Pr(\bar{X}_n \le c \mid \mu) = \Phi\left(\frac{n^{1/2}(\mu_0 - \mu)}{\sigma} - \Phi^{-1}(1 - \alpha_0)\right). \tag{9.3.21}$$

This function is sketched in Fig. 9.7. Indeed, Exercise 12 extends Theorem 9.3.1 to one-sided hypotheses of the form (9.3.19) in every monotone likelihood ratio family. In Sec. 9.8, we shall show that for all one-sided cases with monotone likelihood ratio, the tests of the form given in Theorem 9.3.1 and Exercise 12 are also optimal when one focuses on the posterior distribution of $\theta$ rather than on the power function. ◄


Two-Sided Alternatives 


Suppose, finally, that instead of testing either the hypotheses (9.3.15) in Example 9.3.8 or the hypotheses (9.3.19), we are interested in testing the following hypotheses:

$$H_0: \mu = \mu_0, \qquad H_1: \mu \ne \mu_0. \tag{9.3.22}$$

In this case, $H_0$ is a simple hypothesis and $H_1$ is a two-sided alternative. Since $H_0$ is a simple hypothesis, the size of every test procedure $\delta$ will simply be equal to the value $\pi(\mu_0 \mid \delta)$ of the power function at the point $\mu = \mu_0$.

Indeed, for each $\alpha_0$ $(0 < \alpha_0 < 1)$, there is no UMP test of the hypotheses (9.3.22) at level of significance $\alpha_0$. For every value of $\mu$ such that $\mu > \mu_0$, the value of $\pi(\mu \mid \delta)$ will be maximized by the test procedure $\delta_1$ in Example 9.3.8, whereas for every value of $\mu$ such that $\mu < \mu_0$, the value of $\pi(\mu \mid \delta)$ will be maximized by the test procedure $\delta_2$ in Example 9.3.9. It can be shown (see Exercise 19) that $\delta_1$ is essentially the unique test that maximizes $\pi(\mu \mid \delta)$ for $\mu > \mu_0$. Since $\delta_1$ does not maximize $\pi(\mu \mid \delta)$ for $\mu < \mu_0$, no test could maximize $\pi(\mu \mid \delta)$ simultaneously for $\mu > \mu_0$ and $\mu < \mu_0$. In the next section, we shall discuss the selection of an appropriate test procedure in this problem.

Summary


A uniformly most powerful (UMP) level $\alpha_0$ test is a level $\alpha_0$ test whose power function on the alternative hypothesis is always at least as high as the power function of every level $\alpha_0$ test. If the family of distributions for the data has a monotone likelihood ratio in a statistic $T$, and if the null and alternative hypotheses are both one-sided, then there exists a UMP level $\alpha_0$ test. In these cases, the UMP level $\alpha_0$ test is either of the form “reject $H_0$ if $T \ge c$” or “reject $H_0$ if $T \le c$.”


Exercises 


1. Suppose that $X_1, \ldots, X_n$ form a random sample from the Poisson distribution with unknown mean $\lambda$ $(\lambda > 0)$. Show that the joint p.f. of $X_1, \ldots, X_n$ has a monotone likelihood ratio in the statistic $\sum_{i=1}^n X_i$.

2. Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with known mean $\mu$ and unknown variance $\sigma^2$ $(\sigma^2 > 0)$. Show that the joint p.d.f. of $X_1, \ldots, X_n$ has a monotone likelihood ratio in the statistic $\sum_{i=1}^n (X_i - \mu)^2$.

3. Suppose that $X_1, \ldots, X_n$ form a random sample from the gamma distribution with parameters $\alpha$ and $\beta$. Assume that $\alpha$ is unknown $(\alpha > 0)$ and that $\beta$ is known. Show that the joint p.d.f. of $X_1, \ldots, X_n$ has a monotone likelihood ratio in the statistic $\prod_{i=1}^n X_i$.

4. Suppose that $X_1, \ldots, X_n$ form a random sample from the gamma distribution with parameters $\alpha$ and $\beta$. Assume that $\alpha$ is known and that $\beta$ is unknown $(\beta > 0)$. Show that the joint p.d.f. of $X_1, \ldots, X_n$ has a monotone likelihood ratio in the statistic $-\bar{X}_n$.

5. Suppose that $X_1, \ldots, X_n$ form a random sample from a distribution that belongs to an exponential family, as defined in Exercise 23 of Sec. 7.3, and the p.d.f. or the p.f. of this distribution is $f(x \mid \theta)$, as given in that exercise. Suppose also that $c(\theta)$ is a strictly increasing function of $\theta$. Show that the joint p.d.f. or the joint p.f. of $X_1, \ldots, X_n$ has a monotone likelihood ratio in the statistic $\sum_{i=1}^n d(X_i)$.


6. Suppose that $X_1, \ldots, X_n$ form a random sample from the uniform distribution on the interval $[0, \theta]$. Show that the joint p.d.f. of $X_1, \ldots, X_n$ has a monotone likelihood ratio in the statistic $\max\{X_1, \ldots, X_n\}$.

7. Suppose that $X_1, \ldots, X_n$ form a random sample from a distribution involving a parameter $\theta$ whose value is unknown, and suppose that it is desired to test the following hypotheses:

$$H_0: \theta \le \theta_0, \qquad H_1: \theta > \theta_0.$$

Suppose also that the test procedure to be used ignores the observed values in the sample and, instead, depends only on an auxiliary randomization in which an unbalanced coin is tossed so that a head will be obtained with probability 0.05, and a tail will be obtained with probability 0.95. If a head is obtained, then $H_0$ is rejected, and if a tail is obtained, then $H_0$ is not rejected. Describe the power function of this randomized test procedure.

8. Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with known mean 0 and unknown variance $\sigma^2$, and suppose that it is desired to test the following hypotheses:

$$H_0: \sigma^2 \le 2, \qquad H_1: \sigma^2 > 2.$$

Show that there exists a UMP test of these hypotheses at every level of significance $\alpha_0$ $(0 < \alpha_0 < 1)$.


9. Show that the UMP test in Exercise 8 rejects $H_0$ when $\sum_{i=1}^n X_i^2 \ge c$, and determine the value of $c$ when $n = 10$ and $\alpha_0 = 0.05$.

10. Suppose that $X_1, \ldots, X_n$ form a random sample from the Bernoulli distribution with unknown parameter $p$, and suppose that it is desired to test the following hypotheses:

$$H_0: p \le \tfrac{1}{2}, \qquad H_1: p > \tfrac{1}{2}.$$

Show that if the sample size is $n = 20$, then there exists a nonrandomized UMP test of these hypotheses at the level of significance $\alpha_0 = 0.0577$ and at the level of significance $\alpha_0 = 0.0207$.

11. Suppose that $X_1, \ldots, X_n$ form a random sample from the Poisson distribution with unknown mean $\lambda$, and suppose that it is desired to test the following hypotheses:

$$H_0: \lambda \le 1, \qquad H_1: \lambda > 1.$$

Show that if the sample size is $n = 10$, then there exists a nonrandomized UMP test of these hypotheses at the level of significance $\alpha_0 = 0.0143$.


12. Suppose that $X_1, \ldots, X_n$ form a random sample from a distribution that involves a parameter $\theta$ whose value is unknown, and the joint p.d.f. or the joint p.f. $f_n(\mathbf{x} \mid \theta)$ has a monotone likelihood ratio in the statistic $T = r(\mathbf{X})$. Let $\theta_0$ be a specified value of $\theta$, and suppose that the following hypotheses are to be tested:

$$H_0: \theta \ge \theta_0, \qquad H_1: \theta < \theta_0.$$

Let $c$ be a constant such that $\Pr(T \le c \mid \theta = \theta_0) = \alpha_0$. Show that the test procedure which rejects $H_0$ if $T \le c$ is a UMP test at the level of significance $\alpha_0$.

13. Suppose that four observations are taken at random from the normal distribution with unknown mean $\mu$ and known variance 1. Suppose also that the following hypotheses are to be tested:

$$H_0: \mu \ge 10, \qquad H_1: \mu < 10.$$

a. Determine a UMP test at the level of significance $\alpha_0 = 0.1$.
b. Determine the power of this test when $\mu = 9$.
c. Determine the probability of not rejecting $H_0$ if $\mu = 11$.

14. Suppose that $X_1, \ldots, X_n$ form a random sample from the Poisson distribution with unknown mean $\lambda$, and suppose that it is desired to test the following hypotheses:

$$H_0: \lambda \ge 1, \qquad H_1: \lambda < 1.$$

Suppose also that the sample size is $n = 10$. At what levels of significance $\alpha_0$ in the interval $0 < \alpha_0 < 0.03$ do there exist nonrandomized UMP tests?


15. Suppose that $X_1, \ldots, X_n$ form a random sample from the exponential distribution with unknown parameter $\beta$, and suppose that it is desired to test the following hypotheses:

$$H_0: \beta \ge \tfrac{1}{2}, \qquad H_1: \beta < \tfrac{1}{2}.$$

Show that at every level of significance $\alpha_0$ $(0 < \alpha_0 < 1)$, there exists a UMP test that specifies rejecting $H_0$ when $\bar{X}_n \ge c$, for some constant $c$.

16. Consider again the conditions of Exercise 15, and suppose that the sample size is $n = 10$. Determine the value of the constant $c$ that defines the UMP test at the level of significance $\alpha_0 = 0.05$. Hint: Use the table of the $\chi^2$ distribution.

17. Consider a single observation $X$ from the Cauchy distribution with unknown location parameter $\theta$. That is, the p.d.f. of $X$ is

$$f(x \mid \theta) = \frac{1}{\pi[1 + (x - \theta)^2]} \quad \text{for } -\infty < x < \infty.$$

Suppose that it is desired to test the following hypotheses:

$$H_0: \theta = 0, \qquad H_1: \theta > 0.$$

Show that, for every $\alpha_0$ $(0 < \alpha_0 < 1)$, there does not exist a UMP test of these hypotheses at level of significance $\alpha_0$.


18. Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with unknown mean $\mu$ and known variance 1. Suppose also that the following hypotheses are to be tested:

$$H_0: \mu \le 0, \qquad H_1: \mu > 0.$$

Let $\delta^*$ denote the UMP test of these hypotheses at the level of significance $\alpha_0 = 0.025$, and let $\pi(\mu \mid \delta^*)$ denote the power function of $\delta^*$.

a. Determine the smallest value of the sample size $n$ for which $\pi(\mu \mid \delta^*) \ge 0.9$ for $\mu \ge 0.5$.

b. Determine the smallest value of $n$ for which $\pi(\mu \mid \delta^*) \le 0.001$ for $\mu \le -0.1$.

19. Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with unknown mean $\mu$ and known variance $\sigma^2$. In this problem, you will prove the missing steps from the proof that there is no UMP level $\alpha_0$ test for the hypotheses in (9.3.22). Let $\delta_1$ be the test procedure with level $\alpha_0$ defined in Example 9.3.8.

a. Let $A$ be a set of possible values for the random vector $\mathbf{X} = (X_1, \ldots, X_n)$. Let $\mu_1 \ne \mu_0$. Prove that $\Pr(\mathbf{X} \in A \mid \mu = \mu_0) > 0$ if and only if $\Pr(\mathbf{X} \in A \mid \mu = \mu_1) > 0$.

b. Let $\delta$ be a size $\alpha_0$ test for the hypotheses in (9.3.22) that differs from $\delta_1$ in the following sense: There is a set $A$ for which $\delta$ rejects its null hypothesis when $\mathbf{X} \in A$, $\delta_1$ does not reject its null hypothesis when $\mathbf{X} \in A$, and $\Pr(\mathbf{X} \in A \mid \mu = \mu_0) > 0$. Prove that $\pi(\mu \mid \delta) < \pi(\mu \mid \delta_1)$ for all $\mu > \mu_0$.


*9.4 Two-Sided Alternatives 


When testing a simple null hypothesis against a two-sided alternative (as at the end of Sec. 9.3), the choice of a test procedure requires a bit more care than in the one-sided case. This section discusses some of the issues and describes the most common choices.

General Form of the Procedure

Example 9.4.1

Egyptian Skulls. In Example 9.1.2, we considered how to compare measurements of skulls found in Egypt to modern measurements. For example, the average breadth of a modern-day skull is about 140 mm. Suppose that we model the breadths of skulls from 4000 B.C. as normal random variables with unknown mean $\mu$ and known variance of 26. Unlike Example 9.1.6, suppose now that the researchers have no theory suggesting that skull breadths should increase over time. Instead, they are merely interested in whether breadths changed at all. How would they choose a test of the hypotheses $H_0: \mu = 140$ versus $H_1: \mu \ne 140$? ◄


In this section, we shall suppose that $\mathbf{X} = (X_1, \ldots, X_n)$ is a random sample from a normal distribution for which the mean $\mu$ is unknown and the variance $\sigma^2$ is known, and that it is desired to test the following hypotheses:

$$H_0: \mu = \mu_0, \qquad H_1: \mu \ne \mu_0. \tag{9.4.1}$$

In most practical problems, we would assume that both $\mu$ and $\sigma^2$ were unknown. We shall address that case in Sec. 9.5.

It was claimed at the end of Sec. 9.3 that there is no UMP test of the hypotheses (9.4.1) at any specified level of significance $\alpha_0$ $(0 < \alpha_0 < 1)$. Neither the test procedure $\delta_1$ nor the procedure $\delta_2$ defined in Examples 9.3.8 and 9.3.9 is appropriate for testing the hypotheses (9.4.1), because each of those procedures has high power on only one side of the two-sided alternative $H_1$ and low power on the other side. However, the properties of the procedures $\delta_1$ and $\delta_2$ given in Sec. 9.3 and the fact that the sample mean $\bar{X}_n$ is the M.L.E. of $\mu$ suggest that a reasonable test of the hypotheses (9.4.1) would be to reject $H_0$ if $\bar{X}_n$ is far from $\mu_0$. In other words, it seems reasonable to use a test procedure $\delta$ that rejects $H_0$ if either $\bar{X}_n \le c_1$ or $\bar{X}_n \ge c_2$, where $c_1$ and $c_2$ are two suitably chosen constants, presumably with $c_1 < \mu_0$ and $c_2 > \mu_0$.

If the size of the test is to be $\alpha_0$, then the values of $c_1$ and $c_2$ must be chosen so as to satisfy the following relation:

$$\Pr(\bar{X}_n \le c_1 \mid \mu = \mu_0) + \Pr(\bar{X}_n \ge c_2 \mid \mu = \mu_0) = \alpha_0. \tag{9.4.2}$$

There are an infinite number of pairs of values of $c_1$ and $c_2$ that satisfy Eq. (9.4.2). When $\mu = \mu_0$, the random variable $n^{1/2}(\bar{X}_n - \mu_0)/\sigma$ has the standard normal distribution. If, as usual, we let $\Phi$ denote the c.d.f. of the standard normal distribution, then it follows that Eq. (9.4.2) is equivalent to the following relation:

$$\Phi\left[\frac{n^{1/2}(c_1 - \mu_0)}{\sigma}\right] + 1 - \Phi\left[\frac{n^{1/2}(c_2 - \mu_0)}{\sigma}\right] = \alpha_0. \tag{9.4.3}$$

Corresponding to every pair of positive numbers $\alpha_1$ and $\alpha_2$ such that $\alpha_1 + \alpha_2 = \alpha_0$, there exists a pair of numbers $c_1$ and $c_2$ such that $\Phi[n^{1/2}(c_1 - \mu_0)/\sigma] = \alpha_1$ and $1 - \Phi[n^{1/2}(c_2 - \mu_0)/\sigma] = \alpha_2$. Every such pair of values of $c_1$ and $c_2$ will satisfy Eqs. (9.4.2) and (9.4.3).

For example, suppose that $\alpha_0 = 0.05$. Then, choosing $\alpha_1 = 0.025$ and $\alpha_2 = 0.025$ yields a test procedure $\delta_3$, which is defined by the values $c_1 = \mu_0 - 1.96\sigma n^{-1/2}$ and $c_2 = \mu_0 + 1.96\sigma n^{-1/2}$. Also, choosing $\alpha_1 = 0.01$ and $\alpha_2 = 0.04$ yields a test procedure $\delta_4$, which is defined by the values $c_1 = \mu_0 - 2.33\sigma n^{-1/2}$ and $c_2 = \mu_0 + 1.75\sigma n^{-1/2}$. The power functions $\pi(\mu \mid \delta_3)$ and $\pi(\mu \mid \delta_4)$ of these test procedures $\delta_3$ and $\delta_4$ are sketched in Fig. 9.8, along with the power functions $\pi(\mu \mid \delta_1)$ and $\pi(\mu \mid \delta_2)$, which had previously been sketched in Figs. 9.6 and 9.7.

Figure 9.8 The power functions of four test procedures.

As the values of $c_1$ and $c_2$ in Eq. (9.4.2) or Eq. (9.4.3) are decreased, the power function $\pi(\mu \mid \delta)$ will become smaller for $\mu < \mu_0$ and larger for $\mu > \mu_0$. For $\alpha_0 = 0.05$, the limiting case is obtained by choosing $c_1 = -\infty$ and $c_2 = \mu_0 + 1.645\sigma n^{-1/2}$. The test procedure defined by these values is just $\delta_1$. Similarly, as the values of $c_1$ and $c_2$ in Eq. (9.4.2) or Eq. (9.4.3) are increased, the power function $\pi(\mu \mid \delta)$ will become larger for $\mu < \mu_0$ and smaller for $\mu > \mu_0$. For $\alpha_0 = 0.05$, the limiting case is obtained by choosing $c_2 = \infty$ and $c_1 = \mu_0 - 1.645\sigma n^{-1/2}$. The test procedure defined by these values is just $\delta_2$. Something between these two extreme limiting cases seems appropriate for hypotheses (9.4.1).
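The way a split $(\alpha_1, \alpha_2)$ determines the cutoffs, and hence the shape of the power function, can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the values $\mu_0 = 0$, $\sigma = 1$, $n = 1$ are assumptions chosen so that the cutoffs equal the standard normal multipliers quoted above.

```python
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def two_sided_cutoffs(mu0, sigma, n, alpha1, alpha2):
    """Return (c1, c2) with lower-tail probability alpha1 and upper-tail
    probability alpha2 under mu = mu0, so that Eq. (9.4.3) holds."""
    se = sigma / n ** 0.5
    c1 = mu0 + PHI.inv_cdf(alpha1) * se
    c2 = mu0 + PHI.inv_cdf(1 - alpha2) * se
    return c1, c2


def power(mu, c1, c2, sigma, n):
    """Probability of rejecting H0 (sample mean <= c1 or >= c2) given mu."""
    se = sigma / n ** 0.5
    return PHI.cdf((c1 - mu) / se) + 1 - PHI.cdf((c2 - mu) / se)


# Assumed values: mu0 = 0, sigma = 1, n = 1, so cutoffs are plain z multipliers.
delta3 = two_sided_cutoffs(0.0, 1.0, 1, 0.025, 0.025)  # symmetric split
delta4 = two_sided_cutoffs(0.0, 1.0, 1, 0.01, 0.04)    # asymmetric split
print("delta_3 cutoffs:", delta3)
print("delta_4 cutoffs:", delta4)
for mu in (-1.0, 0.0, 1.0):
    print(mu, round(power(mu, *delta3, 1.0, 1), 4), round(power(mu, *delta4, 1.0, 1), 4))
```

Both procedures have size 0.05 at $\mu = \mu_0$, but the asymmetric split gives $\delta_4$ more power than $\delta_3$ above $\mu_0$ and less below it.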


Selection of the Test Procedure 


For a given sample size $n$, the values of the constants $c_1$ and $c_2$ in Eq. (9.4.2) should be chosen so that the size and shape of the power function are appropriate for the particular problem to be solved. In some problems, it is important not to reject the null hypothesis unless the data strongly indicate that $\mu$ differs greatly from $\mu_0$. In such problems, a small value of $\alpha_0$ should be used. In other problems, not rejecting the null hypothesis $H_0$ when $\mu$ is slightly larger than $\mu_0$ is a more serious error than not rejecting $H_0$ when $\mu$ is slightly less than $\mu_0$. Then it is better to select a test having a power function such as $\pi(\mu \mid \delta_4)$ in Fig. 9.8 than to select a test having a symmetric function such as $\pi(\mu \mid \delta_3)$.

In general, the choice of a particular test procedure in a given problem should be based both on the cost of rejecting $H_0$ when $\mu = \mu_0$ and on the cost, for each possible value of $\mu$, of not rejecting $H_0$ when $\mu \ne \mu_0$. Also, when a test is being selected, the relative likelihoods of different values of $\mu$ should be considered. For example, if it is more likely that $\mu$ will be greater than $\mu_0$ than that $\mu$ will be less than $\mu_0$, then it is better to select a test for which the power function is large when $\mu > \mu_0$, and not so large when $\mu < \mu_0$, than to select one for which these relations are reversed.


Example 9.4.2

Egyptian Skulls. Suppose that, in Example 9.4.1, it is equally important to reject the null hypothesis that the mean breadth $\mu$ equals 140 when $\mu < 140$ as when $\mu > 140$. Then we should choose a test that rejects $H_0$ when the sample average $\bar{X}_n$ is either at most $c_1$ or at least $c_2$, where $c_1$ and $c_2$ are symmetric around 140. Suppose that we want a test of size $\alpha_0 = 0.05$. There are $n = 30$ skulls from 4000 B.C., so

$$c_1 = 140 - 1.96(26)^{1/2}30^{-1/2} = 138.18,$$
$$c_2 = 140 + 1.96(26)^{1/2}30^{-1/2} = 141.82.$$

The observed value of $\bar{X}_n$ is 131.37 in this case, and we would reject $H_0$ at the level of significance 0.05. ◄

Figure 9.9 The power functions for the level $\alpha_0 = 0.05$ tests in Example 9.4.3 (equal tailed) and Example 9.4.4 (likelihood ratio). The horizontal line is at height 0.05.
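The arithmetic in the skull example can be reproduced directly. The sketch below uses only the numbers stated in the example (null mean 140, variance 26, $n = 30$, size 0.05, observed average 131.37); the variable names are illustrative.

```python
from statistics import NormalDist

mu0, variance, n, alpha0 = 140.0, 26.0, 30, 0.05
observed_mean = 131.37

# Equal-tailed cutoffs: mu0 -/+ z * sigma / sqrt(n), with z the 1 - alpha0/2 quantile.
z = NormalDist().inv_cdf(1 - alpha0 / 2)
half_width = z * (variance / n) ** 0.5
c1, c2 = mu0 - half_width, mu0 + half_width

reject = observed_mean <= c1 or observed_mean >= c2
print(f"c1 = {c1:.2f}, c2 = {c2:.2f}, reject H0: {reject}")
```

The observed average lies well below $c_1$, which is why the null hypothesis is rejected.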


In Examples 9.4.1 and 9.4.2, we would probably not wish to assume that the 
variance of the skull breadths was known to be 26, but rather we would assume that 
both the mean and the variance were unknown. We will see how to handle such a 
case in Sec. 9.5. 


Other Distributions 


The principles introduced above for samples from a normal distribution can be 
extended to any random sample. The details of implementation can be more tedious 
and less satisfying for other distributions. 


Example 9.4.3

Service Times in a Queue. The manager in Example 9.3.2 models service times $X_1, \ldots, X_n$ as i.i.d. exponential random variables with parameter $\theta$ conditional on $\theta$. Suppose that she wishes to test the null hypothesis $H_0: \theta = 1/2$ versus the alternative $H_1: \theta \ne 1/2$. For the one-sided alternative $\theta > 1/2$, we found (in Example 9.3.2) that the UMP level $\alpha_0$ test was to reject $H_0$ if $T = \sum_{i=1}^n X_i$ is less than the $\alpha_0$ quantile of the gamma distribution with parameters $n$ and 1/2. By similar reasoning, the UMP level $\alpha_0$ test of $H_0$ versus the other one-sided alternative $\theta < 1/2$ would be to reject $H_0$ if $T$ is greater than the $1 - \alpha_0$ quantile of the gamma distribution with parameters $n$ and 1/2. A simple way to construct a level $\alpha_0$ test of $H_0: \theta = 1/2$ versus $H_1: \theta \ne 1/2$ would be to apply the same reasoning that we applied immediately after Eq. (9.4.2). That is, combine two one-sided tests with levels $\alpha_1$ and $\alpha_2$ where $\alpha_1 + \alpha_2 = \alpha_0$.

As a specific example, let $\alpha_1 = \alpha_2 = \alpha_0/2$, and let $G^{-1}(\cdot\,; n, 1/2)$ be the quantile function of the gamma distribution with parameters $n$ and 1/2. Then, we reject $H_0$ if $T \le G^{-1}(\alpha_0/2; n, 1/2)$ or $T \ge G^{-1}(1 - \alpha_0/2; n, 1/2)$. For the case of $\alpha_0 = 0.05$ and $n = 3$, the graph of the power function of this test appears in Fig. 9.9 together with the power function of the likelihood ratio test that will be derived in Example 9.4.4. ◄
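The equal-tailed cutoffs for $\alpha_0 = 0.05$ and $n = 3$ can be computed without special software, because the gamma c.d.f. with integer shape has a closed form. The following is a sketch under that assumption; the helper names `gamma_cdf`, `gamma_quantile`, and `power` are hypothetical, and the quantile is found by simple bisection rather than by any library routine.

```python
import math


def gamma_cdf(x, n, rate):
    """C.d.f. of the gamma distribution with integer shape n and rate `rate`:
    1 - exp(-rate*x) * sum_{k<n} (rate*x)^k / k!."""
    if x <= 0:
        return 0.0
    y = rate * x
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= y / k
        total += term
    return 1.0 - math.exp(-y) * total


def gamma_quantile(p, n, rate):
    """Invert gamma_cdf by bisection (the c.d.f. is increasing in x)."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, n, rate) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if gamma_cdf(mid, n, rate) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def power(theta, c1, c2, n):
    """Probability that T <= c1 or T >= c2 when the rate parameter is theta."""
    return gamma_cdf(c1, n, theta) + 1.0 - gamma_cdf(c2, n, theta)


n, alpha0, theta0 = 3, 0.05, 0.5
c1 = gamma_quantile(alpha0 / 2, n, theta0)
c2 = gamma_quantile(1 - alpha0 / 2, n, theta0)
print(f"c1 = {c1:.3f}, c2 = {c2:.3f}")
for theta in (0.25, 0.5, 0.55, 1.0):
    print(f"theta = {theta:4.2f}   power = {power(theta, c1, c2, n):.4f}")
```

Evaluating the power just above $\theta = 1/2$ shows it falling slightly below 0.05, which is the feature of the equal-tailed curve in Fig. 9.9 that the discussion of unbiased tests later relies on.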


An alternative test in Example 9.4.3 would be the likelihood ratio test. In Exam- 
ple 9.4.3, the likelihood ratio test requires solving some nonlinear equations. 


Example 9.4.4

Service Times in a Queue. Instead of the ad hoc two-sided test constructed in Example 9.4.3, suppose that the manager decides to find a likelihood ratio test. Suppose that $\sum_{i=1}^n X_i = t$ is observed. The likelihood function is then

$$f_n(\mathbf{x} \mid \theta) = \theta^n \exp(-t\theta), \quad \text{for } \theta > 0.$$

The M.L.E. of $\theta$ is $\hat{\theta} = n/t$, so the likelihood ratio statistic from Definition 9.1.11 is

$$\Lambda(\mathbf{x}) = \frac{(1/2)^n \exp(-t/2)}{(n/t)^n \exp(-n)} = \left(\frac{t}{2n}\right)^n \exp(n - t/2). \tag{9.4.4}$$

The likelihood ratio test rejects $H_0$ if $\Lambda(\mathbf{x}) \le c$ for some constant $c$. From (9.4.4), we see that $\Lambda(\mathbf{x}) \le c$ is equivalent to $t \le c_1$ or $t \ge c_2$, where $c_1 < c_2$ satisfy

$$\left(\frac{c_1}{2n}\right)^n \exp(n - c_1/2) = \left(\frac{c_2}{2n}\right)^n \exp(n - c_2/2).$$

In order for the test to have level $\alpha_0$, $c_1$ and $c_2$ must also satisfy

$$G(c_1; n, 1/2) + 1 - G(c_2; n, 1/2) = \alpha_0,$$

where $G(\cdot\,; n, 1/2)$ is the c.d.f. of the gamma distribution with parameters $n$ and 1/2. Solving these two equations for $c_1$ and $c_2$ would give us the likelihood ratio test. Using numerical methods, the solution is $c_1 = 1.425$ and $c_2 = 15.897$. The power function of the likelihood ratio test is plotted in Fig. 9.9 together with the power function of the equal-tailed test. ◄
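One possible way to carry out the numerical solution is nested bisection, sketched below for $n = 3$ and $\alpha_0 = 0.05$ (the case plotted in Fig. 9.9). The approach and all function names are assumptions made for illustration; the text does not say which numerical method was used. The sketch relies on the fact that $t^n \exp(-t/2)$ increases up to $t = 2n$ and decreases afterward, so each $c_1 < 2n$ has exactly one partner $c_2 > 2n$ with the same likelihood ratio.

```python
import math


def gamma_cdf(x, n, rate):
    """Gamma c.d.f. for integer shape n (closed form via a finite sum)."""
    if x <= 0:
        return 0.0
    y = rate * x
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= y / k
        total += term
    return 1.0 - math.exp(-y) * total


def log_kernel(t, n):
    """Log of t^n exp(-t/2), the part of Eq. (9.4.4) that depends on t."""
    return n * math.log(t) - t / 2.0


def matching_upper(c1, n):
    """Find c2 > 2n whose likelihood ratio equals that of c1 < 2n."""
    target = log_kernel(c1, n)
    lo = hi = 2.0 * n
    while log_kernel(hi, n) > target:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if log_kernel(mid, n) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def lrt_cutoffs(n, alpha0, theta0=0.5):
    """Bisect on c1; the size G(c1) + 1 - G(c2(c1)) increases with c1."""
    lo, hi = 1e-9, 2.0 * n
    for _ in range(200):
        c1 = (lo + hi) / 2.0
        c2 = matching_upper(c1, n)
        size = gamma_cdf(c1, n, theta0) + 1.0 - gamma_cdf(c2, n, theta0)
        if size < alpha0:
            lo = c1
        else:
            hi = c1
    c1 = (lo + hi) / 2.0
    return c1, matching_upper(c1, n)


c1, c2 = lrt_cutoffs(n=3, alpha0=0.05)
print(f"c1 = {c1:.3f}, c2 = {c2:.3f}")
```

The result can be compared with the cutoffs reported in the example.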


Composite Null Hypothesis 


From one point of view, it makes little sense to carry out a test of the hypotheses (9.4.1) in which the null hypothesis $H_0$ specifies a single exact value $\mu_0$ for the parameter $\mu$. This is particularly true if we think of $\mu$ as the limit of the averages of increasing samples of future observations. Since it is inconceivable that $\mu$ will be exactly equal to $\mu_0$ in any real problem, we know that the hypothesis $H_0$ cannot be true. Therefore, $H_0$ should be rejected as soon as it has been formulated.

This criticism is valid when it is interpreted literally. In many problems, however, the experimenter is interested in testing the null hypothesis $H_0$ that the value of $\mu$ is close to some specified value $\mu_0$ against the alternative hypothesis that $\mu$ is not close to $\mu_0$. In some of these problems, the simple hypothesis $H_0$ that $\mu = \mu_0$ can be used as an idealization or simplification for the purpose of choosing a decision. At other times, it is worthwhile to use a more realistic composite null hypothesis, which specifies that $\mu$ lies in an explicit interval around the value $\mu_0$. We shall now consider hypotheses of this type.


Example 9.4.5

Testing an Interval Null Hypothesis. Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with unknown mean $\mu$ and known variance $\sigma^2 = 1$, and suppose that the following hypotheses are to be tested:

$$H_0: 9.9 \le \mu \le 10.1, \qquad H_1: \mu < 9.9 \text{ or } \mu > 10.1. \tag{9.4.5}$$

Since the alternative hypothesis $H_1$ is two-sided, it is again appropriate to use a test procedure $\delta$ that rejects $H_0$ if either $\bar{X}_n \le c_1$ or $\bar{X}_n \ge c_2$. We shall determine the values of $c_1$ and $c_2$ for which the probability of rejecting $H_0$, when either $\mu = 9.9$ or $\mu = 10.1$, will be 0.05.

Figure 9.10 The power function $\pi(\mu \mid \delta)$ for a test of the hypotheses (9.4.5).

Let $\pi(\mu \mid \delta)$ denote the power function of $\delta$. When $\mu = 9.9$, the random variable $n^{1/2}(\bar{X}_n - 9.9)$ has the standard normal distribution. Therefore,

$$\pi(9.9 \mid \delta) = \Pr(\text{Rejecting } H_0 \mid \mu = 9.9) = \Pr(\bar{X}_n \le c_1 \mid \mu = 9.9) + \Pr(\bar{X}_n \ge c_2 \mid \mu = 9.9) = \Phi[n^{1/2}(c_1 - 9.9)] + 1 - \Phi[n^{1/2}(c_2 - 9.9)]. \tag{9.4.6}$$

Similarly, when $\mu = 10.1$, the random variable $n^{1/2}(\bar{X}_n - 10.1)$ has the standard normal distribution and

$$\pi(10.1 \mid \delta) = \Phi[n^{1/2}(c_1 - 10.1)] + 1 - \Phi[n^{1/2}(c_2 - 10.1)]. \tag{9.4.7}$$

Both $\pi(9.9 \mid \delta)$ and $\pi(10.1 \mid \delta)$ must be made equal to 0.05. Because of the symmetry of the normal distribution, it follows that if the values of $c_1$ and $c_2$ are chosen symmetrically with respect to the value 10, then the power function $\pi(\mu \mid \delta)$ will be symmetric with respect to the point $\mu = 10$. In particular, it will then be true that $\pi(9.9 \mid \delta) = \pi(10.1 \mid \delta)$.

Accordingly, let $c_1 = 10 - c$ and $c_2 = 10 + c$. Then it follows from Eqs. (9.4.6) and (9.4.7) that

$$\pi(9.9 \mid \delta) = \pi(10.1 \mid \delta) = \Phi[n^{1/2}(0.1 - c)] + 1 - \Phi[n^{1/2}(0.1 + c)]. \tag{9.4.8}$$

The value of $c$ must be chosen so that $\pi(9.9 \mid \delta) = \pi(10.1 \mid \delta) = 0.05$. Therefore, $c$ must be chosen so that

$$\Phi[n^{1/2}(0.1 + c)] - \Phi[n^{1/2}(0.1 - c)] = 0.95. \tag{9.4.9}$$

For each given value of $n$, the value of $c$ that satisfies Eq. (9.4.9) can be found by trial and error from a table of the standard normal distribution or using statistical software.

For example, if $n = 16$, then $c$ must be chosen so that

$$\Phi(0.4 + 4c) - \Phi(0.4 - 4c) = 0.95. \tag{9.4.10}$$

After trying various values of $c$, we find that Eq. (9.4.10) will be satisfied when $c = 0.527$. Hence,

$$c_1 = 10 - 0.527 = 9.473 \quad \text{and} \quad c_2 = 10 + 0.527 = 10.527.$$

Thus, when $n = 16$, the procedure $\delta$ rejects $H_0$ when either $\bar{X}_n \le 9.473$ or $\bar{X}_n \ge 10.527$. This procedure has a power function $\pi(\mu \mid \delta)$, which is symmetric with respect to the point $\mu = 10$ and for which $\pi(9.9 \mid \delta) = \pi(10.1 \mid \delta) = 0.05$. Furthermore, it is true that $\pi(\mu \mid \delta) < 0.05$ for $9.9 < \mu < 10.1$ and $\pi(\mu \mid \delta) > 0.05$ for $\mu < 9.9$ or $\mu > 10.1$. The function $\pi(\mu \mid \delta)$ is sketched in Fig. 9.10. ◄
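The trial-and-error search for $c$ can be replaced by bisection, because the boundary power in Eq. (9.4.8) decreases as $c$ grows. The sketch below is illustrative; `boundary_power` and `solve_c` are hypothetical names, and the half-width 0.1 and center 10 are taken from the hypotheses (9.4.5).

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal c.d.f.


def boundary_power(c, n, half_width=0.1):
    """Eq. (9.4.8): power at mu = 9.9 (equivalently 10.1) for cutoffs 10 -/+ c."""
    root_n = n ** 0.5
    return PHI(root_n * (half_width - c)) + 1 - PHI(root_n * (half_width + c))


def solve_c(n, alpha0=0.05):
    """Bisection for the c that makes the boundary power equal to alpha0."""
    lo, hi = 0.0, 1.0
    while boundary_power(hi, n) > alpha0:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if boundary_power(mid, n) > alpha0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


c = solve_c(n=16)
print(f"c = {c:.3f}, c1 = {10 - c:.3f}, c2 = {10 + c:.3f}")
```

For $n = 16$ this agrees with the value $c = 0.527$ found in the example to three decimal places.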


Unbiased Tests

Consider the general problem of testing the following hypotheses:

$$H_0: \theta \in \Omega_0, \qquad H_1: \theta \in \Omega_1.$$

As usual, let $\pi(\theta \mid \delta)$ denote the power function of an arbitrary test procedure $\delta$.

Definition 9.4.1

Unbiased Test. A test procedure $\delta$ is said to be unbiased if, for every $\theta \in \Omega_0$ and $\theta' \in \Omega_1$,

$$\pi(\theta \mid \delta) \le \pi(\theta' \mid \delta). \tag{9.4.11}$$

In words, $\delta$ is unbiased if its power function throughout $\Omega_1$ is at least as large as it is throughout $\Omega_0$.

If one closely examines Fig. 9.9, one sees that for values of $\theta$ slightly above 1/2, the power function of the equal-tailed test dips below 0.05 (the value of the power function at $\theta = 1/2$). This means that the test is not unbiased. This is typical in cases where the distribution of the test statistic $T$ is not symmetric but a two-sided test is created by combining two one-sided tests. It is easy to see that an unbiased test would need to have a power function with derivative equal to 0 at $\theta = 1/2$; otherwise, it would dip below 0.05 on one side or the other of $\theta = 1/2$.

In many problems, the power function of every test is differentiable as a function of $\theta$. In such cases, in order to create an unbiased level $\alpha$ test $\delta$ of $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$, we would need

$$\pi(\theta_0 \mid \delta) = \alpha, \quad \text{and} \quad \left.\frac{d}{d\theta}\pi(\theta \mid \delta)\right|_{\theta = \theta_0} = 0. \tag{9.4.12}$$

Such equations would need to be solved numerically in any real problem. Typically, researchers don’t think it is worth the trouble to solve such equations just to find an unbiased test.


Example 9.4.6

Service Times in a Queue. In Example 9.4.4, let $T = \sum_{i=1}^n X_i$. If we want an unbiased test of the form “reject $H_0$ if $T \le c_1$ or if $T \ge c_2$,” the power function will be

$$\pi(\theta \mid \delta) = G(c_1; n, \theta) + 1 - G(c_2; n, \theta),$$

where $G(\cdot\,; n, \theta)$ is the c.d.f. of $T$ given $\theta$,

$$G(x; n, \theta) = \int_0^x \frac{\theta^n}{(n-1)!}\, t^{n-1} \exp(-t\theta)\, dt,$$

for $x > 0$. Eq. (9.4.12) requires that we compute the derivative of $G$ with respect to $\theta$. The derivative with respect to $\theta$ can be passed under the integral, and the result is

$$\frac{\partial}{\partial \theta} G(x; n, \theta) = \int_0^x \frac{n\theta^{n-1}}{(n-1)!}\, t^{n-1} \exp(-t\theta)\, dt - \int_0^x t\, \frac{\theta^n}{(n-1)!}\, t^{n-1} \exp(-t\theta)\, dt. \tag{9.4.13}$$

The reader can show (see Exercise 13 in this section) that (9.4.13) can be rewritten as

$$\frac{\partial}{\partial \theta} G(x; n, \theta) = \frac{n}{\theta}\left[G(x; n, \theta) - G(x; n+1, \theta)\right]. \tag{9.4.14}$$

For $\alpha = 0.05$ and $n = 3$, the two equations we need to solve for $c_1$ and $c_2$ are

$$G(c_1; 3, 1/2) + 1 - G(c_2; 3, 1/2) = 0.05,$$
$$\frac{3}{1/2}\left[G(c_1; 3, 1/2) - G(c_1; 4, 1/2)\right] - \frac{3}{1/2}\left[G(c_2; 3, 1/2) - G(c_2; 4, 1/2)\right] = 0.$$

Solving these two equations numerically gives the same solution as the likelihood ratio test to the number of significant digits reported in Example 9.4.4. This explains why the power function of the likelihood ratio test appears not to dip below 0.05 anywhere. ◄
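The derivative condition in Eq. (9.4.12) can be checked numerically for both tests from Examples 9.4.3 and 9.4.4 by applying Eq. (9.4.14) to each cutoff. The sketch below is illustrative only: the helper names are hypothetical, the equal-tailed cutoffs are recomputed by bisection, and the likelihood ratio cutoffs are the rounded values reported in Example 9.4.4, so the second slope is only approximately zero.

```python
import math


def gamma_cdf(x, n, rate):
    """Gamma c.d.f. for integer shape n (closed form via a finite sum)."""
    if x <= 0:
        return 0.0
    y = rate * x
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= y / k
        total += term
    return 1.0 - math.exp(-y) * total


def gamma_quantile(p, n, rate):
    """Invert gamma_cdf by bisection."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, n, rate) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if gamma_cdf(mid, n, rate) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def power_slope(c1, c2, n, theta):
    """Derivative in theta of G(c1;n,theta) + 1 - G(c2;n,theta), via Eq. (9.4.14)."""
    def d_cdf(x):
        return (n / theta) * (gamma_cdf(x, n, theta) - gamma_cdf(x, n + 1, theta))
    return d_cdf(c1) - d_cdf(c2)


n, theta0, alpha0 = 3, 0.5, 0.05
equal_tailed = (gamma_quantile(alpha0 / 2, n, theta0),
                gamma_quantile(1 - alpha0 / 2, n, theta0))
likelihood_ratio = (1.425, 15.897)  # rounded cutoffs reported in Example 9.4.4

slope_equal = power_slope(*equal_tailed, n, theta0)
slope_lrt = power_slope(*likelihood_ratio, n, theta0)
print(f"equal-tailed slope at theta0:      {slope_equal:.4f}")
print(f"likelihood ratio slope at theta0:  {slope_lrt:.6f}")
```

A clearly negative slope for the equal-tailed test means its power decreases as $\theta$ moves above 1/2, so it falls below 0.05 there; a slope near zero for the likelihood ratio cutoffs is consistent with that test satisfying both equations in (9.4.12).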


Intuitively, the notion of an unbiased test sounds appealing. Since the goal of a test procedure is to reject $H_0$ when $\theta \in \Omega_1$ and not to reject $H_0$ when $\theta \in \Omega_0$, it seems desirable that the probability of rejecting $H_0$ should be at least as large when $\theta \in \Omega_1$ as it is whenever $\theta \in \Omega_0$. It can be seen that the test $\delta$ for which the power function is sketched in Fig. 9.10 is an unbiased test of the hypotheses (9.4.5). Also, among the four tests for which the power functions are sketched in Fig. 9.8, only $\delta_3$ is an unbiased test of the hypotheses (9.4.1). Although it is beyond the scope of this book, one can show that $\delta_3$ is UMP among all unbiased level $\alpha_0 = 0.05$ tests of (9.4.1).

The requirement that a test is to be unbiased can sometimes narrow the selection of a test procedure. However, unbiased procedures should be sought only under relatively special circumstances. For example, when testing the hypotheses (9.4.5), the statistician should use the unbiased test $\delta$ represented in Fig. 9.10 only under the following conditions: He believes that, for every value $a > 0$, it is just as important to reject $H_0$ when $\theta = 10.1 + a$ as to reject $H_0$ when $\theta = 9.9 - a$, and he also believes that these two values of $\theta$ are equally likely. In practice, the statistician might very well forego the use of an unbiased test in order to use a biased test that has higher power in certain regions of $\Omega_1$ that he regards as particularly important or most likely to contain the true value of $\theta$ when $H_0$ is false.


In the remainder of this chapter, we shall consider special testing situations that 
arise very often in applied work. In these situations, there do not exist UMP tests. 
We shall study the most popular tests in these situations, and we shall show that these 
tests are likelihood ratio tests. However, in more advanced courses, it can be shown 
that the t tests and F tests derived in Sections 9.5, 9.6, and 9.7 are all UMP among 
various classes of unbiased tests of their sizes. 


Summary 


For the case of testing that the mean of a normal distribution with known variance equals a specific value against the two-sided alternative, one can construct level $\alpha_0$ tests by combining the rejection regions of two one-sided tests of sizes $\alpha_1$ and $\alpha_2$ such that $\alpha_0 = \alpha_1 + \alpha_2$. A popular choice is $\alpha_1 = \alpha_2 = \alpha_0/2$. In this case, if $X_1, \ldots, X_n$ form a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$, one can test $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$ by rejecting $H_0$ if $\bar{X}_n \ge \mu_0 + \Phi^{-1}(1 - \alpha_0/2)\sigma/n^{1/2}$ or if $\bar{X}_n \le \mu_0 - \Phi^{-1}(1 - \alpha_0/2)\sigma/n^{1/2}$, where $\Phi^{-1}$ is the quantile function of the standard normal distribution. A test is unbiased if its power function is at least as large at every point in the alternative hypothesis as at every point in the null hypothesis. The normal distribution test just described, with $\alpha_1 = \alpha_2 = \alpha_0/2$, is unbiased.


Exercises 


1. Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with unknown mean $\mu$ and known variance 1, and it is desired to test the following hypotheses for a given number $\mu_0$:

$$H_0: \mu = \mu_0, \qquad H_1: \mu \ne \mu_0.$$

Consider a test procedure $\delta$ such that the hypothesis $H_0$ is rejected if either $\bar{X}_n \le c_1$ or $\bar{X}_n \ge c_2$, and let $\pi(\mu \mid \delta)$ denote the power function of $\delta$. Determine the values of the constants $c_1$ and $c_2$ such that $\pi(\mu_0 \mid \delta) = 0.10$ and the function $\pi(\mu \mid \delta)$ is symmetric with respect to the point $\mu = \mu_0$.

2. Consider again the conditions of Exercise 1, and suppose that

$$c_1 = \mu_0 - 1.96 n^{-1/2}.$$

Determine the value of $c_2$ such that $\pi(\mu_0 \mid \delta) = 0.10$.

3. Consider again the conditions of Exercise 1 and also the test procedure described in that exercise. Determine the smallest value of $n$ for which $\pi(\mu_0 \mid \delta) = 0.10$ and $\pi(\mu_0 + 1 \mid \delta) = \pi(\mu_0 - 1 \mid \delta) \ge 0.95$.


4. Suppose that $X_1, \ldots, X_n$ form a random sample from the normal distribution with unknown mean $\mu$ and known variance 1, and it is desired to test the following hypotheses:

$$H_0: 0.1 \le \mu \le 0.2, \qquad H_1: \mu < 0.1 \text{ or } \mu > 0.2.$$

Consider a test procedure $\delta$ such that the hypothesis $H_0$ is rejected if either $\bar{X}_n \le c_1$ or $\bar{X}_n \ge c_2$, and let $\pi(\mu \mid \delta)$ denote the power function of $\delta$. Suppose that the sample size is $n = 25$. Determine the values of the constants $c_1$ and $c_2$ such that $\pi(0.1 \mid \delta) = \pi(0.2 \mid \delta) = 0.07$.

5. Consider again the conditions of Exercise 4, and suppose also that $n = 25$. Determine the values of the constants $c_1$ and $c_2$ such that $\pi(0.1 \mid \delta) = 0.02$ and $\pi(0.2 \mid \delta) = 0.05$.

6. Suppose that $X_1, \ldots, X_n$ form a random sample from the uniform distribution on the interval $[0, \theta]$, where the value of $\theta$ is unknown, and it is desired to test the following hypotheses:

$$H_0: \theta \le 3, \qquad H_1: \theta > 3.$$

a. Show that for each level of significance $\alpha_0$ $(0 < \alpha_0 < 1)$, there exists a UMP test that specifies that $H_0$ should be rejected if $\max\{X_1, \ldots, X_n\} \ge c$.

b. Determine the value of $c$ for each possible value of $\alpha_0$.

7. For a given sample size $n$ and a given value of $\alpha_0$, sketch the power function of the UMP test found in Exercise 6.

8. Suppose that $X_1, \ldots, X_n$ form a random sample from the uniform distribution described in Exercise 6, but suppose now that it is desired to test the following hypotheses:

$$H_0: \theta \ge 3, \qquad H_1: \theta < 3.$$

a. Show that at each level of significance $\alpha_0$ $(0 < \alpha_0 < 1)$, there exists a UMP test that specifies that $H_0$ should be rejected if $\max\{X_1, \ldots, X_n\} \le c$.

b. Determine the value of $c$ for each possible value of $\alpha_0$.

9. For a given sample size $n$ and a given value of $\alpha_0$, sketch the power function of the UMP test found in Exercise 8.


10. Suppose that $X_1, \ldots, X_n$ form a random sample from the uniform distribution
described in Exercise 6, but suppose now that it is desired to test the following
hypotheses:

    $H_0$: $\theta = 3$,
    $H_1$: $\theta \neq 3$.    (9.4.15)

Consider a test procedure $\delta$ such that the hypothesis $H_0$ is rejected if either
$\max\{X_1, \ldots, X_n\} \leq c_1$ or $\max\{X_1, \ldots, X_n\} \geq c_2$, and let
$\pi(\theta \mid \delta)$ denote the power function of $\delta$.

a. Determine the values of the constants $c_1$ and $c_2$ such that
   $\pi(3 \mid \delta) = 0.05$ and $\delta$ is unbiased.

b. Prove that the test found in part (a) is UMP of level 0.05 for testing the
   hypotheses in (9.4.15). Hint: Compare this test to the UMP tests of level
   $\alpha_0 = 0.05$ in Exercises 6 and 8.

c. Determine the values of the constants $c_1$ and $c_2$ such that
   $\pi(3 \mid \delta) = 0.05$ and $\delta$ is unbiased.


11. Consider again the conditions of Exercise 1. Determine the values of the constants
$c_1$ and $c_2$ such that $\pi(\mu_0 \mid \delta) = 0.10$ and $\delta$ is unbiased.


12. Let $X$ have the exponential distribution with parameter $\beta$. Suppose that we
wish to test the hypotheses

    $H_0$: $\beta = 1$,
    $H_1$: $\beta \neq 1$.

We shall use a test procedure that rejects $H_0$ if either $X \leq c_1$ or $X \geq c_2$.

a. Find the equation that must be satisfied by $c_1$ and $c_2$ in order for the test
   procedure to have level of significance $\alpha_0$.

b. Find a pair of finite, nonzero values $(c_1, c_2)$ such that the test procedure has
   level of significance $\alpha_0 = 0.1$.


13. Prove Eq. (9.4.14) in Example 9.4.6. Hint: Both parts
of the integrand in Eq. (9.4.13) differ from gamma distri- 
bution p.d.f.’s by some factor that does not depend on fr. 
9.5 The t Test


We begin the treatment of several special cases of testing hypotheses about parameters
of a normal distribution. In this section, we handle the case in which both the mean and
the variance are unknown. We develop tests for hypotheses concerning the mean. These
tests will be based on the t distribution.


Testing Hypotheses about the Mean of a Normal Distribution
When the Variance Is Unknown


Example 9.5.1 Nursing Homes in New Mexico. In Example 8.6.3, we described a study of
medical in-patient days in nursing homes in New Mexico. As in that example, we shall
model the numbers of medical in-patient days as a random sample of $n = 18$ normal random
variables with unknown mean $\mu$ and unknown variance $\sigma^2$. Suppose that we are
interested in testing the hypotheses $H_0: \mu \geq 200$ versus $H_1: \mu < 200$. What
test should we use, and what are its properties? ◁


In this section we shall consider the problem of testing hypotheses about the mean of a
normal distribution when both the mean and the variance are unknown. Specifically, we
shall suppose that the random variables $X_1, \ldots, X_n$ form a random sample from a
normal distribution for which the mean $\mu$ and the variance $\sigma^2$ are unknown,
and we shall consider testing the following hypotheses:

    $H_0$: $\mu \leq \mu_0$,
    $H_1$: $\mu > \mu_0$.    (9.5.1)

The parameter space $\Omega$ in this problem comprises every two-dimensional vector
$(\mu, \sigma^2)$, where $-\infty < \mu < \infty$ and $\sigma^2 > 0$. The null hypothesis
$H_0$ specifies that the vector $(\mu, \sigma^2)$ lies in the subset $\Omega_0$ of
$\Omega$, comprising all vectors for which $\mu \leq \mu_0$ and $\sigma^2 > 0$, as
illustrated in Fig. 9.11. The alternative hypothesis $H_1$ specifies that
$(\mu, \sigma^2)$ belongs to the subset $\Omega_1$ of $\Omega$, comprising all the
vectors that do not belong to $\Omega_0$.

Figure 9.11 The subsets $\Omega_0$ and $\Omega_1$ of the parameter space $\Omega$ for
the hypotheses (9.5.1).

In Example 9.1.17 on page 543, we showed how to derive a test of the hypotheses (9.5.1)
from a one-sided confidence interval for $\mu$. To be specific, define
$\overline{X}_n = \sum_{i=1}^n X_i / n$,
$\sigma' = \left[\sum_{i=1}^n (X_i - \overline{X}_n)^2 / (n - 1)\right]^{1/2}$, and

$$U = n^{1/2}\,\frac{\overline{X}_n - \mu_0}{\sigma'}. \tag{9.5.2}$$

The test rejects $H_0$ if $U \geq c$. When $\mu = \mu_0$, it follows from Theorem 8.4.2
that the distribution of the statistic $U$ defined in Eq. (9.5.2) is the t distribution
with $n - 1$ degrees of freedom, regardless of the value of $\sigma^2$. For this reason,
tests based on $U$ are called t tests. When we want to test

    $H_0$: $\mu \geq \mu_0$,
    $H_1$: $\mu < \mu_0$,    (9.5.3)

the test is of the form “reject $H_0$ if $U \leq c$.”


Example 9.5.2 Nursing Homes in New Mexico. In Example 9.5.1, if we desired a level
$\alpha_0$ test, we could use the t test that rejects $H_0$ if the statistic $U$ in
Eq. (9.5.2) is at most equal to the constant $c$ chosen to make the size of the test
equal to $\alpha_0$. ◁


Properties of the t Tests


Theorem 9.5.1 gives some useful properties of t tests.


Theorem 9.5.1 Level and Unbiasedness of t Tests. Let $\mathbf{X} = (X_1, \ldots, X_n)$
be a random sample from the normal distribution with mean $\mu$ and variance
$\sigma^2$, let $U$ be the statistic in Eq. (9.5.2), and let $c$ be the $1 - \alpha_0$
quantile of the t distribution with $n - 1$ degrees of freedom. Let $\delta$ be the test
that rejects $H_0$ in (9.5.1) if $U \geq c$. The power function
$\pi(\mu, \sigma^2 \mid \delta)$ has the following properties:

    i.   $\pi(\mu, \sigma^2 \mid \delta) = \alpha_0$ when $\mu = \mu_0$,
    ii.  $\pi(\mu, \sigma^2 \mid \delta) < \alpha_0$ when $\mu < \mu_0$,
    iii. $\pi(\mu, \sigma^2 \mid \delta) > \alpha_0$ when $\mu > \mu_0$,
    iv.  $\pi(\mu, \sigma^2 \mid \delta) \to 0$ as $\mu \to -\infty$,
    v.   $\pi(\mu, \sigma^2 \mid \delta) \to 1$ as $\mu \to \infty$.

Furthermore, the test $\delta$ has size $\alpha_0$ and is unbiased.


Proof If $\mu = \mu_0$, then $U$ has the t distribution with $n - 1$ degrees of freedom.
Hence,

$$\pi(\mu_0, \sigma^2 \mid \delta) = \Pr(U \geq c \mid \mu_0, \sigma^2) = \alpha_0.$$

This proves (i) above. For (ii) and (iii), define

$$U^* = n^{1/2}\,\frac{\overline{X}_n - \mu}{\sigma'} \quad\text{and}\quad W = n^{1/2}\,\frac{\mu_0 - \mu}{\sigma'}.$$

Then $U = U^* - W$. First, assume that $\mu < \mu_0$ so that $W > 0$. It follows that

$$\pi(\mu, \sigma^2 \mid \delta) = \Pr(U \geq c \mid \mu, \sigma^2) = \Pr(U^* - W \geq c \mid \mu, \sigma^2)
 = \Pr(U^* \geq c + W \mid \mu, \sigma^2) < \Pr(U^* \geq c \mid \mu, \sigma^2). \tag{9.5.4}$$

Since $U^*$ has the t distribution with $n - 1$ degrees of freedom, the last probability
in (9.5.4) is $\alpha_0$. This proves (ii). For (iii), let $\mu > \mu_0$ so that $W < 0$.
The less-than in (9.5.4) becomes a greater-than, and (iii) is proven.

That the size of the test is $\alpha_0$ is immediate from parts (i) and (ii). That the
test is unbiased is immediate from parts (i) and (iii).

The proofs of (iv) and (v) are more difficult and will not be given here in detail.
Intuitively, if $\mu$ is very large, then $W$ in Eq. (9.5.4) will tend to be very
negative, and the probability will be close to 1 that $U^* \geq c + W$. Similarly, if
$\mu$ is very much less than $\mu_0$, then $W$ will tend to be very positive, and the
chance of $U^* \geq c + W$ will be close to 0. ∎


For the hypotheses of Eq. (9.5.3), very similar properties hold.


Corollary 9.5.1 t Tests for Hypotheses of Eq. (9.5.3). Let
$\mathbf{X} = (X_1, \ldots, X_n)$ be a random sample from the normal distribution with
mean $\mu$ and variance $\sigma^2$, let $U$ be the statistic in Eq. (9.5.2), and let $c$
be the $\alpha_0$ quantile of the t distribution with $n - 1$ degrees of freedom. Let
$\delta$ be the test that rejects $H_0$ in (9.5.3) if $U \leq c$. The power function
$\pi(\mu, \sigma^2 \mid \delta)$ has the following properties:

    i.   $\pi(\mu, \sigma^2 \mid \delta) = \alpha_0$ when $\mu = \mu_0$,
    ii.  $\pi(\mu, \sigma^2 \mid \delta) > \alpha_0$ when $\mu < \mu_0$,
    iii. $\pi(\mu, \sigma^2 \mid \delta) < \alpha_0$ when $\mu > \mu_0$,
    iv.  $\pi(\mu, \sigma^2 \mid \delta) \to 1$ as $\mu \to -\infty$,
    v.   $\pi(\mu, \sigma^2 \mid \delta) \to 0$ as $\mu \to \infty$.

Furthermore, the test $\delta$ has size $\alpha_0$ and is unbiased.


Example 9.5.3 Nursing Homes in New Mexico. In Examples 9.5.1 and 9.5.2, suppose that we
desire a test with level of significance $\alpha_0 = 0.1$. Then we reject $H_0$ if
$U \leq c$, where $c$ is the 0.1 quantile of the t distribution with 17 degrees of
freedom, namely, $-1.333$. Using the data from Example 8.6.3, we calculate the observed
value of $\overline{X}_{18} = 182.17$ and $\sigma' = 72.22$. The observed value of $U$
is then $(17)^{1/2}(182.17 - 200)/72.22 = -1.018$. We would not reject
$H_0: \mu \geq 200$ at level of significance 0.1, because the observed value of $U$ is
greater than $-1.333$. ◁


p-Values for t Tests The p-value from the observed data and a specific test is the
smallest $\alpha_0$ such that we would reject the null hypothesis at level of
significance $\alpha_0$. For the t tests that we have just discussed, it is
straightforward to compute the p-values.


Theorem 9.5.2 p-Values for t Tests. Suppose that we are testing either the hypotheses in
Eq. (9.5.1) or the hypotheses in Eq. (9.5.3). Let $u$ be the observed value of the
statistic $U$ in Eq. (9.5.2), and let $T_{n-1}(\cdot)$ be the c.d.f. of the t
distribution with $n - 1$ degrees of freedom. Then the p-value for the hypotheses in
Eq. (9.5.1) is $1 - T_{n-1}(u)$ and the p-value for the hypotheses in Eq. (9.5.3) is
$T_{n-1}(u)$.


Proof Let $T_{n-1}^{-1}(\cdot)$ stand for the quantile function of the t distribution
with $n - 1$ degrees of freedom. This is the inverse of the strictly increasing function
$T_{n-1}$. We would reject the null hypothesis in Eq. (9.5.1) at level $\alpha_0$ if and
only if $u \geq T_{n-1}^{-1}(1 - \alpha_0)$, which is equivalent to
$T_{n-1}(u) \geq 1 - \alpha_0$, which is equivalent to $\alpha_0 \geq 1 - T_{n-1}(u)$.
Hence, the smallest level $\alpha_0$ at which we could reject $H_0$ is
$1 - T_{n-1}(u)$. Similarly, we would reject the null hypothesis in Eq. (9.5.3) if and
only if $u \leq T_{n-1}^{-1}(\alpha_0)$, which is equivalent to
$\alpha_0 \geq T_{n-1}(u)$. ∎


Example 9.5.4 Lengths of Fibers. Suppose that the lengths in millimeters of metal fibers
produced by a certain process have the normal distribution with unknown mean $\mu$ and
unknown variance $\sigma^2$, and the following hypotheses are to be tested:

    $H_0$: $\mu \leq 5.2$,
    $H_1$: $\mu > 5.2$.    (9.5.5)

Suppose that the lengths of 15 fibers selected at random are measured, and it is found
that the sample mean $\overline{X}_{15}$ is 5.4 and $\sigma' = 0.4226$. Based on these
measurements, we shall carry out a t test at the level of significance
$\alpha_0 = 0.05$.

Since $n = 15$ and $\mu_0 = 5.2$, the statistic $U$ defined by Eq. (9.5.2) will have the
t distribution with 14 degrees of freedom when $\mu = 5.2$. It is found in the table of
the t distribution that $T_{14}^{-1}(0.95) = 1.761$. Hence, the null hypothesis $H_0$
will be rejected if $U \geq 1.761$. Since the numerical value of $U$ calculated from
Eq. (9.5.2) is 1.833, $H_0$ would be rejected at level 0.05.

With observed value $u = 1.833$ for the statistic $U$ and $n = 15$, we can compute the
p-value for the hypotheses (9.5.1) using computer software that includes the c.d.f. of
various t distributions. In particular, we find $1 - T_{14}(1.833) = 0.0441$. ◁
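
The calculation in the Lengths of Fibers example can be reproduced numerically. The
following Python sketch is illustrative only: the helper names `simpson` and `t_cdf` are
not from the text, and the c.d.f. is obtained by numerically integrating the t density
with the standard library rather than by calling a statistical package. It recomputes
the statistic U of Eq. (9.5.2) from the quoted summary statistics and then the one-sided
p-value.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def t_cdf(x, m):
    """c.d.f. of the t distribution with m degrees of freedom (numerical integration)."""
    log_c = math.lgamma((m + 1) / 2) - math.lgamma(m / 2) - 0.5 * math.log(m * math.pi)
    pdf = lambda t: math.exp(log_c - (m + 1) / 2 * math.log1p(t * t / m))
    # The density is symmetric about 0, so integrate from 0 to |x| and restore the sign.
    return 0.5 + math.copysign(simpson(pdf, 0.0, abs(x)), x)

# Summary statistics quoted in the example.
n, xbar, sigma_prime, mu0 = 15, 5.4, 0.4226, 5.2
critical_value = 1.761          # 0.95 quantile of the t distribution with 14 d.f. (from the text)

u = math.sqrt(n) * (xbar - mu0) / sigma_prime   # statistic U of Eq. (9.5.2)
p_value = 1 - t_cdf(u, n - 1)                   # p-value for hypotheses of the form (9.5.1)
reject = u >= critical_value

print(f"U = {u:.3f}, reject H0 at level 0.05: {reject}, p-value = {p_value:.4f}")
```

Up to rounding, the printed values should agree with the ones reported in the example.
In practice, a statistical package's t c.d.f. would replace the hand-written integration.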


The Complete Power Function For all values of $\mu$, the power function of a t test can
be determined if we know the distribution of $U$ defined in Eq. (9.5.2). We can rewrite
$U$ as

$$U = \frac{n^{1/2}(\overline{X}_n - \mu_0)/\sigma}{\sigma'/\sigma}. \tag{9.5.6}$$

The numerator of the right side in Eq. (9.5.6) has the normal distribution with mean
$n^{1/2}(\mu - \mu_0)/\sigma$ and variance 1. The denominator is the square root of a
$\chi^2$ random variable divided by its degrees of freedom, $n - 1$. Were it not for the
nonzero mean, the ratio would have the t distribution with $n - 1$ degrees of freedom as
we have already shown. When the mean of the numerator is not 0, $U$ has a noncentral t
distribution.


Definition 9.5.1 Noncentral t Distributions. Let $Y$ and $W$ be independent random
variables with $W$ having the normal distribution with mean $\psi$ and variance 1 and
$Y$ having the $\chi^2$ distribution with $m$ degrees of freedom. Then the distribution
of

$$X = \frac{W}{\left(Y/m\right)^{1/2}}$$

is called the noncentral t distribution with $m$ degrees of freedom and noncentrality
parameter $\psi$. We shall let $T_m(t \mid \psi)$ denote the c.d.f. of this
distribution. That is, $T_m(t \mid \psi) = \Pr(X \leq t)$.


It should be obvious that the noncentral t distribution with $m$ degrees of freedom and
noncentrality parameter $\psi = 0$ is also the t distribution with $m$ degrees of
freedom. The following result is also immediate from Definition 9.5.1.


Theorem 9.5.3 Let $X_1, \ldots, X_n$ be a random sample from the normal distribution
with mean $\mu$ and variance $\sigma^2$. The distribution of the statistic $U$ in
Eq. (9.5.2) is the noncentral t distribution with $n - 1$ degrees of freedom and
noncentrality parameter $\psi = n^{1/2}(\mu - \mu_0)/\sigma$. Let $\delta$ be the test
that rejects $H_0: \mu \leq \mu_0$ when $U \geq c$. Then the power function of $\delta$
is $\pi(\mu, \sigma^2 \mid \delta) = 1 - T_{n-1}(c \mid \psi)$. Let $\delta'$ be the test
that rejects $H_0: \mu \geq \mu_0$ when $U \leq c$. Then the power function of
$\delta'$ is $\pi(\mu, \sigma^2 \mid \delta') = T_{n-1}(c \mid \psi)$. ∎


In Exercise 11, you can prove that $1 - T_m(t \mid \psi) = T_m(-t \mid -\psi)$. There
are computer programs to calculate the c.d.f.s of noncentral t distributions, and some
statistical software packages include such programs. Figure 9.12 plots the power
functions of level 0.05 and level 0.01 t tests for various degrees of freedom and
various values of the noncentrality parameter. The horizontal axis is labeled $|\psi|$
because the same graphs can be used for both types of one-sided hypotheses. The next
example illustrates how to use Fig. 9.12 to approximate the power function.


Example 9.5.5 Lengths of Fibers. In Example 9.5.4, we tested the hypotheses (9.5.5) at
level 0.05. Suppose that we are interested in the power of our test when $\mu$ is not
equal to 5.2. In particular, suppose that we are interested in the power when
$\mu = 5.2 + \sigma/2$, one-half standard deviation above 5.2. Then the noncentrality
parameter is

$$\psi = 15^{1/2}\,\frac{5.2 + \sigma/2 - 5.2}{\sigma} = 1.936.$$

There is no curve for 14 degrees of freedom in Fig. 9.12; however, there is not much
difference between the curves for 10 and 60 degrees of freedom, so we can assume that
our answer is somewhere between those two. If we look at the level 0.05 plot in
Fig. 9.12 and move up from 1.936 (about 2) on the horizontal axis until we get a little
above the curve for degrees of freedom equal to 10, we find that the power is about 0.6.
(The actual power is 0.578.) ◁

Figure 9.12 The power functions on the alternative of one-sided level 0.05 and level
0.01 t tests with various degrees of freedom for various values of the noncentrality
parameter $\psi$. [Two panels, "Level 0.05" and "Level 0.01"; vertical axis: power of
one-sided t test; one curve per number of degrees of freedom.]


Note: Power is a Function of the Noncentrality Parameter. In Example 9.5.5, we cannot
answer a question like “What is the power of a level 0.05 test when $\mu = 5.5$?” The
reason is that the power is a function of both $\mu$ and $\sigma$ through the
noncentrality parameter. (See Exercise 6.) For each possible $\sigma$ and $\mu = 5.5$,
the noncentrality parameter is $\psi = 15^{1/2} \times 0.3/\sigma$, which varies from 0
to $\infty$ depending on $\sigma$. This is why, whenever we want a numerical value for
the power of a t test, we need either to specify both $\mu$ and $\sigma$ or to specify
how far $\mu$ is from $\mu_0$ in multiples of $\sigma$.


Choosing a Sample Size It is possible to use the power function of a test to help
determine what would be an appropriate sample size to observe.


Example 9.5.6 Lengths of Fibers. In Example 9.5.5, we found that the power of the test
was 0.578 when $\mu = 5.2 + \sigma/2$. Suppose that we want the power to be close to 0.8
when $\mu = 5.2 + \sigma/2$. It will take more than $n = 15$ observations to achieve
this. In Fig. 9.12, we can see how large a noncentrality parameter $\psi$ we need in
order for the power to reach 0.8. For degrees of freedom between 10 and 60, we need
$\psi$ to be about 2.5. But $\psi = n^{1/2}/2$ when $\mu = 5.2 + \sigma/2$. So we need
$n = 25$ approximately. Precise calculation shows that, with $n = 25$, the power of the
level 0.05 test is 0.7834 when $\mu = 5.2 + \sigma/2$. With $n = 26$, the power is
0.7981, and with $n = 27$ the power is 0.8118. ◁
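
The "precise calculation" mentioned above can be sketched with the power formula
$1 - T_{n-1}(c \mid \psi)$ for the one-sided test. The Python code below is a minimal
illustration, not the program used for the text: all function names are hypothetical,
the noncentral t c.d.f. is computed by integrating the normal c.d.f. against the
$\chi^2$ density (directly from the definition of the noncentral t distribution), and
the quantile is found by bisection. Only the standard library is assumed.

```python
import math

def simpson(f, a, b, n=1000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def t_cdf(x, m):
    """c.d.f. of the central t distribution with m degrees of freedom."""
    log_c = math.lgamma((m + 1) / 2) - math.lgamma(m / 2) - 0.5 * math.log(m * math.pi)
    pdf = lambda t: math.exp(log_c - (m + 1) / 2 * math.log1p(t * t / m))
    return 0.5 + math.copysign(simpson(pdf, 0.0, abs(x)), x)

def t_quantile(p, m):
    """Quantile of the central t distribution, by bisection on t_cdf."""
    lo, hi = -50.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, m) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def noncentral_t_cdf(t, m, psi):
    """Pr(W / sqrt(Y/m) <= t) with W ~ N(psi, 1), Y ~ chi-square(m); intended for m > 2."""
    log_c = -(m / 2) * math.log(2) - math.lgamma(m / 2)
    def integrand(y):
        if y <= 0.0:
            return 0.0
        chi2_pdf = math.exp(log_c + (m / 2 - 1) * math.log(y) - y / 2)
        normal_cdf = 0.5 * (1 + math.erf((t * math.sqrt(y / m) - psi) / math.sqrt(2)))
        return normal_cdf * chi2_pdf
    upper = m + 15 * math.sqrt(2 * m)   # chi-square mass beyond this point is negligible
    return simpson(integrand, 0.0, upper, n=4000)

def power_one_sided(n, shift_in_sd, alpha0=0.05):
    """Power of the level-alpha0 t test of H0: mu <= mu0 when mu = mu0 + shift_in_sd * sigma."""
    c = t_quantile(1 - alpha0, n - 1)
    psi = math.sqrt(n) * shift_in_sd
    return 1 - noncentral_t_cdf(c, n - 1, psi)

power_15 = power_one_sided(15, 0.5)

# Smallest sample size (searching upward from 15) whose power is at least 0.8.
smallest_n = 15
while power_one_sided(smallest_n, 0.5) < 0.8:
    smallest_n += 1

print(f"power at n = 15: {power_15:.3f}; smallest n with power >= 0.8: {smallest_n}")
```

The search simply increases $n$ until the target power is reached, which mirrors the
trial-and-error reading of Fig. 9.12 but with computed rather than graphical values.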


The Paired t Test 


In many experiments, the same variable is measured under two different conditions on the
same experimental unit, and we are interested in whether the mean value is greater in
one condition than in the other. In such cases, it is common to subtract the two
measurements and treat the differences as a random sample from a normal distribution. We
can then test hypotheses concerning the mean of the differences.


Example 9.5.7 Crash Test Dummies. The National Transportation Safety Board collects data
from crash tests concerning the amount and location of damage on dummies placed in the
tested cars. In one series of tests, one dummy was placed in the driver’s seat and
another was placed in the front passenger’s seat of each car. One variable measured was
the amount of injury to the head for each dummy. Figure 9.13 shows a plot of the pairs
of logarithms of head injury measures for dummies in the two different seats. Among
other things, interest lies in whether and/or to what extent the amount of head injury
differs between the driver’s seat and the passenger’s seat. Let $X_1, \ldots, X_n$ be the
differences between the logarithms of head injury measures for driver’s side and
passenger’s side. We can model $X_1, \ldots, X_n$ as a random sample from a normal
distribution with mean $\mu$ and variance $\sigma^2$. Suppose that we wish to test the
null hypothesis $H_0: \mu \leq 0$ against the alternative $H_1: \mu > 0$ at level
$\alpha_0 = 0.01$. There are $n = 164$ cars represented in Fig. 9.13. The test would be
to reject $H_0$ if $U \geq T_{163}^{-1}(0.99) = 2.35$.

Figure 9.13 Plot of logarithms of head injury measures for dummies on driver’s side and
passenger’s side. The line indicates where the two measures are equal.

The average of the differences of the coordinates in Fig. 9.13 is
$\overline{x}_n = 0.2199$. The value of $\sigma'$ is 0.5342. The statistic $U$ is then
5.271. This is larger than 2.35, and the null hypothesis would be rejected at level
0.01. Indeed, the p-value is less than $1.0 \times 10^{-6}$.

Suppose also that we are interested in the power function under $H_1$ of the level 0.01
test. Suppose that the mean difference between driver’s side and passenger’s side
logarithm of head injury is $\sigma/4$. Then the noncentrality parameter is
$(164)^{1/2}/4 = 3.20$. In the right panel of Fig. 9.12, it appears that the power is
just about 0.8. (In fact, it is 0.802.) ◁
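
The paired procedure amounts to forming the differences and applying the one-sample
statistic of Eq. (9.5.2) to them. The Python sketch below illustrates this; the function
name `paired_t_statistic` and the five data pairs are hypothetical (the crash test data
themselves are not reproduced here), while the last two lines use only the summary
figures quoted in the example.

```python
import math

def paired_t_statistic(first, second, mu0=0.0):
    """Statistic U of Eq. (9.5.2) computed from the paired differences first[i] - second[i]."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean = sum(diffs) / n
    # sigma' uses the divisor n - 1, as in the definition preceding Eq. (9.5.2).
    sigma_prime = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    return math.sqrt(n) * (mean - mu0) / sigma_prime

# Hypothetical paired measurements, for illustration only.
driver = [6.2, 5.9, 6.8, 6.1, 6.5]
passenger = [6.0, 5.8, 6.3, 6.2, 6.1]
u_toy = paired_t_statistic(driver, passenger)

# Summary statistics quoted in the example: n = 164, mean difference 0.2199, sigma' = 0.5342.
u_text = math.sqrt(164) * (0.2199 - 0.0) / 0.5342
psi = math.sqrt(164) / 4        # noncentrality parameter when mu = sigma / 4

print(f"toy U = {u_toy:.3f}, U from summary statistics = {u_text:.3f}, psi = {psi:.2f}")
```

Working with the differences is what makes the one-sample theory applicable: the two
measurements on the same car need not be independent of each other, only the differences
across cars are modeled as a random sample.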


Testing with a Two-Sided Alternative 


Example 9.5.8 Egyptian Skulls. In Examples 9.4.1 and 9.4.2, we modeled the breadths of
skulls from 4000 B.C. as a random sample of size $n = 30$ from a normal distribution with
unknown mean $\mu$ and known variance. We shall now generalize that model to allow the
more realistic assumption that the variance $\sigma^2$ is unknown. Suppose that we wish
to test the null hypothesis $H_0: \mu = 140$ versus the alternative hypothesis
$H_1: \mu \neq 140$. We can still calculate the statistic $U$ in Eq. (9.5.2), but now it
would make sense to reject $H_0$ if either $U \leq c_1$ or $U \geq c_2$ for suitably
chosen numbers $c_1$ and $c_2$. How should we choose $c_1$ and $c_2$, and what are the
properties of the resulting test? ◁


As before, assume that $\mathbf{X} = (X_1, \ldots, X_n)$ is a random sample from a
normal distribution for which both the mean $\mu$ and the variance $\sigma^2$ are
unknown. Suppose now that the following hypotheses are to be tested:

    $H_0$: $\mu = \mu_0$,
    $H_1$: $\mu \neq \mu_0$.    (9.5.7)

Here, the alternative hypothesis $H_1$ is two-sided.

In Example 9.1.15, we derived a level $\alpha_0$ test of the hypotheses (9.5.7) from the
confidence interval that was developed in Sec. 8.5. That test has the form “reject
$H_0$ if $|U| \geq T_{n-1}^{-1}(1 - \alpha_0/2)$,” where $T_{n-1}^{-1}$ is the quantile
function of the t distribution with $n - 1$ degrees of freedom and $U$ is defined in
Eq. (9.5.2).


Example 9.5.9 Egyptian Skulls. In Example 9.5.8, suppose that we want a level
$\alpha_0 = 0.05$ test of $H_0: \mu = 140$ versus $H_1: \mu \neq 140$. If we use the test
described above (derived in Example 9.1.15), then the two numbers $c_1$ and $c_2$ will
be of opposite signs and equal in magnitude. Specifically,
$c_1 = -T_{29}^{-1}(0.975) = -2.045$ and $c_2 = 2.045$. The observed value of
$\overline{X}_{30}$ is 131.37, and the observed value of $\sigma'$ is 5.129. The
observed value $u$ of the statistic $U$ is
$u = (30)^{1/2}(131.37 - 140)/5.129 = -9.219$. This is less than $-2.045$, so we would
reject $H_0$ at level 0.05. ◁


Example 9.5.10 Lengths of Fibers. We shall consider again the problem discussed in
Example 9.5.4, but we shall suppose now that, instead of the hypotheses (9.5.5), the
following hypotheses are to be tested:

    $H_0$: $\mu = 5.2$,
    $H_1$: $\mu \neq 5.2$.    (9.5.8)

We shall again assume that the lengths of 15 fibers are measured, and the value of $U$
calculated from the observed values is 1.833. We shall test the hypotheses (9.5.8) at
the level of significance $\alpha_0 = 0.05$.

Since $\alpha_0 = 0.05$, our critical value will be the $1 - 0.05/2 = 0.975$ quantile of
the t distribution with 14 degrees of freedom. From the table of t distributions in this
book, we find $T_{14}^{-1}(0.975) = 2.145$. So the t test specifies rejecting $H_0$ if
either $U \leq -2.145$ or $U \geq 2.145$. Since $U = 1.833$, the hypothesis $H_0$ would
not be rejected. ◁


The numerical values in Examples 9.5.4 and 9.5.10 emphasize the importance of deciding
whether the appropriate alternative hypothesis in a given problem is one-sided or
two-sided. When the hypotheses (9.5.5) were tested at the level of significance 0.05,
the hypothesis $H_0$ that $\mu \leq 5.2$ was rejected. When the hypotheses (9.5.8) were
tested at the same level of significance, and the same data were used, the hypothesis
$H_0$ that $\mu = 5.2$ was not rejected.


Power Functions of Two-Sided Tests The power function of the test $\delta$ that rejects
$H_0: \mu = \mu_0$ when $|U| \geq c$, where $c = T_{n-1}^{-1}(1 - \alpha_0/2)$, can be
found by using the noncentral t distribution. If $\mu \neq \mu_0$, then $U$ has the
noncentral t distribution with $n - 1$ degrees of freedom and noncentrality parameter
$\psi = n^{1/2}(\mu - \mu_0)/\sigma$, just as it did when we tested one-sided
hypotheses. The power function of $\delta$ is then

$$\pi(\mu, \sigma^2 \mid \delta) = T_{n-1}(-c \mid \psi) + 1 - T_{n-1}(c \mid \psi).$$

Figure 9.14 The power functions of two-sided level 0.05 and level 0.01 t tests with
various degrees of freedom for various values of the noncentrality parameter $\psi$.

Figure 9.14 plots these power functions for various degrees of freedom and noncentrality
parameters. We could use Fig. 9.14 to find the power of the test in Example 9.5.10 when
$\mu = 5.2 + \sigma/2$, that is, when $\psi = 1.936$. It appears to be about 0.45.
(The actual power is 0.438.)
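
The two-sided power formula can be evaluated numerically in the same spirit. The sketch
below is an assumption-laden illustration: the function names are hypothetical, the
noncentral t c.d.f. is computed by integrating the normal c.d.f. against the $\chi^2$
density (from the definition of the noncentral t distribution), and the critical value
2.145 is the tabled value quoted for 14 degrees of freedom.

```python
import math

def simpson(f, a, b, n=4000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def noncentral_t_cdf(t, m, psi):
    """Pr(W / sqrt(Y/m) <= t) with W ~ N(psi, 1), Y ~ chi-square(m); intended for m > 2."""
    log_c = -(m / 2) * math.log(2) - math.lgamma(m / 2)
    def integrand(y):
        if y <= 0.0:
            return 0.0
        chi2_pdf = math.exp(log_c + (m / 2 - 1) * math.log(y) - y / 2)
        normal_cdf = 0.5 * (1 + math.erf((t * math.sqrt(y / m) - psi) / math.sqrt(2)))
        return normal_cdf * chi2_pdf
    upper = m + 15 * math.sqrt(2 * m)   # chi-square mass beyond this point is negligible
    return simpson(integrand, 0.0, upper)

def two_sided_power(c, m, psi):
    """Power of the test that rejects when |U| >= c, for U noncentral t(m, psi)."""
    return noncentral_t_cdf(-c, m, psi) + 1 - noncentral_t_cdf(c, m, psi)

c = 2.145                       # 0.975 quantile of t with 14 d.f., as quoted in the text
psi = math.sqrt(15) / 2         # noncentrality parameter when mu = 5.2 + sigma / 2
power = two_sided_power(c, 14, psi)
size = two_sided_power(c, 14, 0.0)   # with psi = 0 the power reduces to the level

print(f"two-sided power at psi = {psi:.3f}: {power:.3f}; size: {size:.3f}")
```

Evaluating the same function at $\psi = 0$ is a useful sanity check, since the result
should then be close to the nominal level 0.05.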


Theorem 9.5.4 p-Values for Two-Sided t Tests. Suppose that we are testing the hypotheses
in Eq. (9.5.7). Let $u$ be the observed value of the statistic $U$, and let
$T_{n-1}(\cdot)$ be the c.d.f. of the t distribution with $n - 1$ degrees of freedom.
Then the p-value is $2[1 - T_{n-1}(|u|)]$.


Proof Let $T_{n-1}^{-1}(\cdot)$ stand for the quantile function of the t distribution
with $n - 1$ degrees of freedom. We would reject the null hypothesis in Eq. (9.5.7) at
level $\alpha_0$ if and only if $|u| \geq T_{n-1}^{-1}(1 - \alpha_0/2)$, which is
equivalent to $T_{n-1}(|u|) \geq 1 - \alpha_0/2$, which is equivalent to
$\alpha_0 \geq 2[1 - T_{n-1}(|u|)]$. Hence, the smallest level $\alpha_0$ at which we
could reject $H_0$ is $2[1 - T_{n-1}(|u|)]$. ∎


Example 9.5.11 Lengths of Fibers. In Example 9.5.10, the p-value is
$2[1 - T_{14}(1.833)] = 0.0882$. Note that this is twice the p-value when the hypotheses
were (9.5.1). ◁


For t tests, if the p-value for testing hypotheses (9.5.1) or (9.5.3) is $p$, then the
p-value for hypotheses (9.5.7) is the smaller of $2p$ and $2(1 - p)$.
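
The relation between the one-sided and two-sided p-values can be checked numerically
for the fiber data, where the observed statistic is 1.833 with 14 degrees of freedom.
The Python sketch below is illustrative only: `simpson` and `t_cdf` are hypothetical
helper names, and the t c.d.f. is obtained by numerically integrating the density with
the standard library.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def t_cdf(x, m):
    """c.d.f. of the t distribution with m degrees of freedom (numerical integration)."""
    log_c = math.lgamma((m + 1) / 2) - math.lgamma(m / 2) - 0.5 * math.log(m * math.pi)
    pdf = lambda t: math.exp(log_c - (m + 1) / 2 * math.log1p(t * t / m))
    return 0.5 + math.copysign(simpson(pdf, 0.0, abs(x)), x)

u, m = 1.833, 14                        # observed statistic and degrees of freedom

p_upper = 1 - t_cdf(u, m)               # p-value for H0: mu <= mu0
p_lower = t_cdf(u, m)                   # p-value for H0: mu >= mu0
p_two_sided = 2 * (1 - t_cdf(abs(u), m))

# The two-sided p-value equals the smaller of 2p and 2(1 - p) for either one-sided p.
same_as_rule = abs(p_two_sided - min(2 * p_upper, 2 * (1 - p_upper))) < 1e-12

print(f"one-sided: {p_upper:.4f}, two-sided: {p_two_sided:.4f}, rule holds: {same_as_rule}")
```

Because $1 - p$ for one one-sided test is the p-value of the opposite one-sided test,
the rule gives the same answer whichever one-sided p-value is used as the starting point.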


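The t Test as a Likelihood Ratio Test


We introduced likelihood ratio tests in Sec. 9.1. We can compute such tests for the
hypotheses of this section.


Example 9.5.12 Likelihood Ratio Test of One-Sided Hypotheses about the Mean of a Normal
Distribution. Consider the hypotheses (9.5.1). After the values $x_1, \ldots, x_n$ in
the random sample have been observed, the likelihood function is

$$f_n(\mathbf{x} \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right]. \tag{9.5.9}$$

In this case, $\Omega_0 = \{(\mu, \sigma^2) : \mu \leq \mu_0\}$ and
$\Omega_1 = \{(\mu, \sigma^2) : \mu > \mu_0\}$. The likelihood ratio statistic is

$$\Lambda(\mathbf{x}) = \frac{\sup_{\{(\mu, \sigma^2) : \mu \leq \mu_0\}} f_n(\mathbf{x} \mid \mu, \sigma^2)}{\sup_{(\mu, \sigma^2)} f_n(\mathbf{x} \mid \mu, \sigma^2)}. \tag{9.5.10}$$

We shall now derive an explicit form for the likelihood ratio test based on (9.5.10).
As in Sec. 7.5, we shall let $\hat{\mu}$ and $\hat{\sigma}^2$ denote the M.L.E.’s of
$\mu$ and $\sigma^2$ when it is known only that the point $(\mu, \sigma^2)$ belongs to
the parameter space $\Omega$. It was shown in Example 7.5.6 that

$$\hat{\mu} = \overline{x}_n \quad\text{and}\quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \overline{x}_n)^2.$$

It follows that the denominator of $\Lambda(\mathbf{x})$ equals

$$\sup_{(\mu, \sigma^2)} f_n(\mathbf{x} \mid \mu, \sigma^2) = \frac{1}{(2\pi\hat{\sigma}^2)^{n/2}} \exp\left(-\frac{n}{2}\right). \tag{9.5.11}$$

Similarly, we shall let $\hat{\mu}_0$ and $\hat{\sigma}_0^2$ denote the M.L.E.’s of
$\mu$ and $\sigma^2$ when the point $(\mu, \sigma^2)$ is constrained to lie in the
subset $\Omega_0$. Suppose first that the observed sample values are such that
$\overline{x}_n \leq \mu_0$. Then the point $(\hat{\mu}, \hat{\sigma}^2)$ will lie in
$\Omega_0$ so that $\hat{\mu}_0 = \hat{\mu}$ and $\hat{\sigma}_0^2 = \hat{\sigma}^2$,
and the numerator of $\Lambda(\mathbf{x})$ also equals (9.5.11). In this case,
$\Lambda(\mathbf{x}) = 1$.

Next, suppose that the observed sample values are such that $\overline{x}_n > \mu_0$.
Then the point $(\hat{\mu}, \hat{\sigma}^2)$ does not lie in $\Omega_0$. In this case,
it can be shown that $f_n(\mathbf{x} \mid \mu, \sigma^2)$ attains its maximum value
among all points $(\mu, \sigma^2) \in \Omega_0$ if $\mu$ is chosen to be as close as
possible to $\overline{x}_n$. The value of $\mu$ closest to $\overline{x}_n$ among all
points in the subset $\Omega_0$ is $\mu = \mu_0$. Hence, $\hat{\mu}_0 = \mu_0$. In turn,
it can be shown, as in Example 7.5.6, that the M.L.E. of $\sigma^2$ will be

$$\hat{\sigma}_0^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu}_0)^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu_0)^2.$$

In this case, the numerator of $\Lambda(\mathbf{x})$ is then

$$\sup_{\{(\mu, \sigma^2) : \mu \leq \mu_0\}} f_n(\mathbf{x} \mid \mu, \sigma^2) = \frac{1}{(2\pi\hat{\sigma}_0^2)^{n/2}} \exp\left(-\frac{n}{2}\right). \tag{9.5.12}$$

Taking the ratio of (9.5.12) to (9.5.11), we find that

$$\Lambda(\mathbf{x}) = \begin{cases} \left(\dfrac{\hat{\sigma}^2}{\hat{\sigma}_0^2}\right)^{n/2} & \text{if } \overline{x}_n > \mu_0, \\ 1 & \text{otherwise.} \end{cases} \tag{9.5.13}$$

Next, use the relation

$$\sum_{i=1}^n (x_i - \mu_0)^2 = \sum_{i=1}^n (x_i - \overline{x}_n)^2 + n(\overline{x}_n - \mu_0)^2$$

to write the top branch of (9.5.13) as

$$\left[1 + \frac{n(\overline{x}_n - \mu_0)^2}{\sum_{i=1}^n (x_i - \overline{x}_n)^2}\right]^{-n/2}. \tag{9.5.14}$$

If $u$ is the observed value of the statistic $U$ in Eq. (9.5.2), then one can easily
check that

$$\frac{n(\overline{x}_n - \mu_0)^2}{\sum_{i=1}^n (x_i - \overline{x}_n)^2} = \frac{u^2}{n - 1}.$$

It follows that $\Lambda(\mathbf{x})$ is a nonincreasing function of $u$. Hence, for
$k < 1$, $\Lambda(\mathbf{x}) \leq k$ if and only if $u \geq c$, where

$$c = \left(\left[\frac{1}{k^{2/n}} - 1\right](n - 1)\right)^{1/2}.$$

It follows that the likelihood ratio test is a t test. ◁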
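
The step described above as easy to check, and the resulting critical value, can be
written out as follows. This is a sketch that uses only the definition of $U$ and
$\sigma'$ from Eq. (9.5.2), with $u$ the observed value of $U$, $\overline{x}_n$ the
observed sample mean, and $0 < k < 1$; nothing beyond the text is assumed.

```latex
\begin{align*}
u^2 &= \frac{n(\overline{x}_n-\mu_0)^2}{\sigma'^2}
     = \frac{n(\overline{x}_n-\mu_0)^2}{\sum_{i=1}^n (x_i-\overline{x}_n)^2/(n-1)}
 \quad\Longrightarrow\quad
 \frac{n(\overline{x}_n-\mu_0)^2}{\sum_{i=1}^n (x_i-\overline{x}_n)^2}=\frac{u^2}{n-1},\\[4pt]
\Lambda(\mathbf{x}) &= \left(1+\frac{u^2}{n-1}\right)^{-n/2}
 \quad\text{when } \overline{x}_n>\mu_0 \ (\text{that is, when } u>0),\\[4pt]
\Lambda(\mathbf{x})\le k
 &\iff 1+\frac{u^2}{n-1}\ \ge\ k^{-2/n}
 \iff u^2\ \ge\ (n-1)\left(k^{-2/n}-1\right)
 \iff u\ \ge\ \left[(n-1)\left(k^{-2/n}-1\right)\right]^{1/2}.
\end{align*}
```

The last equivalence uses $u > 0$. When $\overline{x}_n \leq \mu_0$, the likelihood
ratio statistic equals 1, which exceeds $k$, and $u \leq 0$ lies below the (positive)
critical value, so the equivalence holds in that case as well.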


It is not difficult to adapt the argument in Example 9.5.12 to find the likelihood 
ratio tests for hypotheses (9.5.3) and (9.5.7). (See Exercises 17 and 18, for example.) 




Summary 


When $X_1, \ldots, X_n$ form a random sample from the normal distribution with unknown
mean $\mu$ and unknown variance $\sigma^2$, we can test hypotheses about $\mu$ by using
the fact that $n^{1/2}(\overline{X}_n - \mu)/\sigma'$ has the t distribution with
$n - 1$ degrees of freedom. Let $T_{n-1}^{-1}$ denote the quantile function of the t
distribution with $n - 1$ degrees of freedom. Then, to test $H_0: \mu \leq \mu_0$ versus
$H_1: \mu > \mu_0$ at level $\alpha_0$, for instance, we reject $H_0$ if
$n^{1/2}(\overline{X}_n - \mu_0)/\sigma' \geq T_{n-1}^{-1}(1 - \alpha_0)$. To test
$H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$, reject $H_0$ if
$|n^{1/2}(\overline{X}_n - \mu_0)/\sigma'| \geq T_{n-1}^{-1}(1 - \alpha_0/2)$. The power
functions of each of these tests can be written in terms of the c.d.f. of a noncentral t
distribution with $n - 1$ degrees of freedom and noncentrality parameter
$\psi = n^{1/2}(\mu - \mu_0)/\sigma$.


Exercises 


1. Use the data in Example 8.5.4, comprising a sample of $n = 10$ lactic acid
measurements in cheese. Assume, as we did there, that the lactic acid measurements are a
random sample from the normal distribution with unknown mean $\mu$ and unknown variance
$\sigma^2$. Suppose that we wish to test the following hypotheses:

    $H_0$: $\mu \leq 1.2$,
    $H_1$: $\mu > 1.2$.

a. Perform the level $\alpha_0 = 0.05$ test of these hypotheses.
b. Compute the p-value.


2. Suppose that nine observations are selected at random from the normal distribution
with unknown mean $\mu$ and unknown variance $\sigma^2$, and for these nine observations
it is found that $\overline{X}_n = 22$ and
$\sum_{i=1}^n (X_i - \overline{X}_n)^2 = 72$.

a. Carry out a test of the following hypotheses at the level of significance 0.05:

    $H_0$: $\mu \leq 20$,
    $H_1$: $\mu > 20$.

b. Carry out a test of the following hypotheses at the level of significance 0.05 by
   using the two-sided t test:

    $H_0$: $\mu = 20$,
    $H_1$: $\mu \neq 20$.

c. From the data, construct the observed confidence interval for $\mu$ with confidence
   coefficient 0.95.


3. The manufacturer of a certain type of automobile 
claims that under typical urban driving conditions the au- 
tomobile will travel on average at least 20 miles per gallon 
of gasoline. The owner of this type of automobile notes 
the mileages that she has obtained in her own urban driv- 
ing when she fills her automobile’s tank with gasoline on 
nine different occasions. She finds that the results, in miles 
per gallon, are as follows: 15.6, 18.6, 18.3, 20.1, 21.5, 18.4, 
19.1, 20.4, and 19.0. Test the manufacturer’s claim by carrying out a test at the level
of significance $\alpha_0 = 0.05$. List carefully the assumptions you make.


4. Suppose that a random sample of eight observations $X_1, \ldots, X_8$ is taken from
the normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$, and it
is desired to test the following hypotheses:

    $H_0$: $\mu = 0$,
    $H_1$: $\mu \neq 0$.

Suppose also that the sample data are such that $\sum_{i=1}^8 X_i = -11.2$ and
$\sum_{i=1}^8 X_i^2 = 43.7$. If a symmetric t test is performed at the level of
significance 0.10 so that each tail of the critical region has probability 0.05, should
the hypothesis $H_0$ be rejected or not?


5. Consider again the conditions of Exercise 4, and suppose again that a t test is to be
performed at the level of significance 0.10. Suppose now, however, that the t test is
not to be symmetric and the hypothesis $H_0$ is to be rejected if either $U \leq c_1$ or
$U \geq c_2$, where $\Pr(U \leq c_1) = 0.01$ and $\Pr(U \geq c_2) = 0.09$. For the sample
data specified in Exercise 4, should $H_0$ be rejected or not?


6. Suppose that the variables $X_1, \ldots, X_n$ form a random sample from the normal
distribution with unknown mean $\mu$ and unknown variance $\sigma^2$, and a t test at a
given level of significance $\alpha_0$ is to be carried out to test the following
hypotheses:

    $H_0$: $\mu \leq \mu_0$,
    $H_1$: $\mu > \mu_0$.

Let $\pi(\mu, \sigma^2 \mid \delta)$ denote the power function of this t test, and
assume that $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$ are values of the parameters
such that

$$\frac{\mu_1 - \mu_0}{\sigma_1} = \frac{\mu_2 - \mu_0}{\sigma_2}.$$

Show that $\pi(\mu_1, \sigma_1^2 \mid \delta) = \pi(\mu_2, \sigma_2^2 \mid \delta)$.


7. Consider the normal distribution with unknown mean $\mu$ and unknown variance
$\sigma^2$, and suppose that it is desired to test the following hypotheses:

    $H_0$: $\mu \leq \mu_0$,
    $H_1$: $\mu > \mu_0$.

Suppose that it is possible to observe only a single value of $X$ from this
distribution, but that an independent random sample of $n$ observations
$Y_1, \ldots, Y_n$ is available from the normal distribution with known mean 0 and the
same variance $\sigma^2$ as for $X$. Show how to carry out a test of the hypotheses
$H_0$ and $H_1$ based on the t distribution with $n$ degrees of freedom.


8. Suppose that the variables $X_1, \ldots, X_n$ form a random sample from the normal
distribution with unknown mean $\mu$ and unknown variance $\sigma^2$. Let $\sigma_0^2$
be a given positive number, and suppose that it is desired to test the following
hypotheses at a specified level of significance $\alpha_0$ ($0 < \alpha_0 < 1$):

    $H_0$: $\sigma^2 \leq \sigma_0^2$,
    $H_1$: $\sigma^2 > \sigma_0^2$.

Let $S_n^2 = \sum_{i=1}^n (X_i - \overline{X}_n)^2$, and suppose that the test procedure
to be used specifies that $H_0$ should be rejected if $S_n^2/\sigma_0^2 \geq c$. Also,
let $\pi(\mu, \sigma^2 \mid \delta)$ denote the power function of this procedure.
Explain how to choose the constant $c$ so that, regardless of the value of $\mu$, the
following requirements are satisfied: $\pi(\mu, \sigma^2 \mid \delta) < \alpha_0$ if
$\sigma^2 < \sigma_0^2$, $\pi(\mu, \sigma^2 \mid \delta) = \alpha_0$ if
$\sigma^2 = \sigma_0^2$, and $\pi(\mu, \sigma^2 \mid \delta) > \alpha_0$ if
$\sigma^2 > \sigma_0^2$.


9. Suppose that a random sample of 10 observations $X_1, \ldots, X_{10}$ is taken from
the normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$, and it
is desired to test the following hypotheses:

    $H_0$: $\sigma^2 \leq 4$,
    $H_1$: $\sigma^2 > 4$.

Suppose that a test of the form described in Exercise 8 is to be carried out at the
level of significance $\alpha_0 = 0.05$. If the observed value of $S_n^2$ is 60, should
the hypothesis $H_0$ be rejected or not?


10. Suppose again, as in Exercise 9, that a random sample of 10 observations is taken
from the normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$,
but suppose now that the following hypotheses are to be tested at the level of
significance 0.05:

    $H_0$: $\sigma^2 = 4$,
    $H_1$: $\sigma^2 \neq 4$.

Suppose that the null hypothesis $H_0$ is to be rejected if either $S_n^2 \leq c_1$ or
$S_n^2 \geq c_2$, where the constants $c_1$ and $c_2$ are to be chosen so that, when the
hypothesis $H_0$ is true,

$$\Pr(S_n^2 \leq c_1) = \Pr(S_n^2 \geq c_2) = 0.025.$$

Determine the values of $c_1$ and $c_2$.


11. Suppose that $U_1$ has the noncentral t distribution with $m$ degrees of freedom and
noncentrality parameter $\psi$, and suppose that $U_2$ has the noncentral t distribution
with $m$ degrees of freedom and noncentrality parameter $-\psi$. Prove that
$\Pr(U_1 \geq c) = \Pr(U_2 \leq -c)$.


12. Suppose that a random sample $X_1, \ldots, X_n$ is to be taken from the normal
distribution with unknown mean $\mu$ and unknown variance $\sigma^2$, and the following
hypotheses are to be tested:

    $H_0$: $\mu \leq 3$,
    $H_1$: $\mu > 3$.

Suppose also that the sample size $n$ is 17, and it is found from the observed values in
the sample that $\overline{X}_n = 3.2$ and
$(1/n)\sum_{i=1}^n (X_i - \overline{X}_n)^2 = 0.09$. Calculate the value of the
statistic $U$, and find the corresponding p-value.


13. Consider again the conditions of Exercise 12, but sup- 
pose now that the sample size n is 170, and it is again found 
from the observed values in the sample that X,, = 3.2 and 
(1/n) 3-"_,(X; — X,)* = 0.09. Calculate the value of the 
statistic U and find the corresponding p-value. 


14. Consider again the conditions of Exercise 12, but sup- 
pose now that the following hypotheses are to be tested: 


Ho: LL =; 
Ay: [Lh #31, 


Suppose, as in Exercise 12, that the sample size n is 17, 
and it is found from the observed values in the sample that 


X,, = 3.2 and (1/n) )~"_,(X; — X,)” = 0.09. Calculate the 
value of the statistic U and find the corresponding p-value. 


15. Consider again the conditions of Exercise 14, but sup- 
pose now that the sample size n is 170, and it is again found 
from the observed values in the sample that X,, = 3.2 and 
(1/n) -"_,(X; — X,)* = 0.09. Calculate the value of the 
statistic U and find the corresponding p-value. 


16. Consider again the conditions of Exercise 14. Sup- 
pose, as in Exercise 14, that the sample size n is 17, but sup- 
pose now that it is found from the observed values in the 
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sample that X, = 3.0 and (1/n) )~"_,(X; — X,)* = 0.09. 
Calculate the value of the statistic U and find correspond- 
ing p-value. 


17. Prove that the likelihood ratio test for hypotheses (9.5.7) is the two-sided t test
that rejects $H_0$ if $|U| \ge c$, where U is defined in Eq. (8.5.1). The argument is
slightly simpler than, but very similar to, the one given in the text for the one-sided
case.


18. Prove that the likelihood ratio test for hypotheses (9.5.3) is to reject $H_0$ if
$U \le c$, where U is defined in Eq. (8.5.1).


9.6 Comparing the Means of Two Normal Distributions


It is very common to compare two distributions to see which has the higher mean or
just to see how different the two means are. When the two distributions are normal,
the tests and confidence intervals based on the t distribution are very similar to the
ones that arose when we considered a single distribution.


The Two-Sample t Test


Example 9.6.1. Rain from Seeded Clouds. In Example 8.3.1, we were interested in whether
or not the mean log-rainfall from seeded clouds was greater than 4, which we supposed to
have been the mean log-rainfall from unseeded clouds. If we want to compare rainfalls
from seeded and unseeded clouds under otherwise similar conditions, we would normally
observe two random samples of rainfalls: one from seeded clouds and one from unseeded
clouds. We would then model these samples as being random samples from two different
normal distributions, and we would want to compare their means and possibly their
variances to see how different the distributions are.


Consider first a problem in which random samples are available from two normal
distributions with common unknown variance, and it is desired to determine which
distribution has the larger mean. Specifically, we shall assume that
$\boldsymbol{X} = (X_1, \ldots, X_m)$ form a random sample of m observations from a
normal distribution for which both the mean $\mu_1$ and the variance $\sigma^2$ are
unknown, and that $\boldsymbol{Y} = (Y_1, \ldots, Y_n)$ form an independent random sample
of n observations from another normal distribution for which both the mean $\mu_2$ and
the variance $\sigma^2$ are unknown. We will then be interested in testing hypotheses
such as

$H_0: \mu_1 \le \mu_2$ versus $H_1: \mu_1 > \mu_2$.   (9.6.1)


For each test procedure $\delta$, we shall let $\pi(\mu_1, \mu_2, \sigma^2|\delta)$ denote
the power function of $\delta$. We shall assume that the variance $\sigma^2$ is the same
for both distributions, even though the value of $\sigma^2$ is unknown. If this
assumption seems unwarranted, the two-sample t test that we shall derive next would not
be appropriate. A different test procedure is discussed later in this section for the
case in which the two populations might have different variances. Later in this section,
we shall derive the likelihood ratio test. In Sec. 9.7, we discuss some procedures for
comparing the variances of two normal distributions, including a test of the null
hypothesis that the variances are the same.

Intuitively, it makes sense to reject $H_0$ in (9.6.1) if the difference between the
sample means is large. Theorem 9.6.1 derives the distribution of a natural test statistic
to use.


Theorem 9.6.1. Two-Sample t Statistic. Assume the structure described in the preceding
paragraphs. Define

$$\bar{X}_m = \frac{1}{m}\sum_{i=1}^{m} X_i, \qquad \bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i,$$

$$S_X^2 = \sum_{i=1}^{m} (X_i - \bar{X}_m)^2, \quad\text{and}\quad S_Y^2 = \sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2.$$   (9.6.2)

Define the test statistic

$$U = \frac{(m + n - 2)^{1/2}(\bar{X}_m - \bar{Y}_n)}{\left(\frac{1}{m} + \frac{1}{n}\right)^{1/2}(S_X^2 + S_Y^2)^{1/2}}.$$   (9.6.3)

For all values of $\theta = (\mu_1, \mu_2, \sigma^2)$ such that $\mu_1 = \mu_2$, the
distribution of U is the t distribution with $m + n - 2$ degrees of freedom.


Proof. Assume that $\mu_1 = \mu_2$. Define the following two random variables:

$$Z = \frac{\bar{X}_m - \bar{Y}_n}{\left(\frac{1}{m} + \frac{1}{n}\right)^{1/2}\sigma},$$   (9.6.4)

$$W = \frac{S_X^2 + S_Y^2}{\sigma^2}.$$   (9.6.5)

The statistic U can now be represented in the form

$$U = \frac{Z}{[W/(m + n - 2)]^{1/2}}.$$   (9.6.6)

The remainder of the proof consists of proving that Z has the standard normal
distribution, that W has the $\chi^2$ distribution with $m + n - 2$ degrees of freedom,
and that Z and W are independent. The result then follows from Definition 8.4.1, the
definition of the family of t distributions.

We have assumed that $\boldsymbol{X}$ and $\boldsymbol{Y}$ are independent given $\theta$.
It follows that every function of $\boldsymbol{X}$ is independent of every function of
$\boldsymbol{Y}$. In particular, $(\bar{X}_m, S_X^2)$ is independent of
$(\bar{Y}_n, S_Y^2)$. By Theorem 8.3.1, $\bar{X}_m$ and $S_X^2$ are independent, and
$\bar{Y}_n$ and $S_Y^2$ are also independent. It follows that all four of $\bar{X}_m$,
$\bar{Y}_n$, $S_X^2$, and $S_Y^2$ are mutually independent. Hence, Z and W are also
independent. It also follows from Theorem 8.3.1 that $S_X^2/\sigma^2$ and
$S_Y^2/\sigma^2$ have, respectively, the $\chi^2$ distributions with $m - 1$ and $n - 1$
degrees of freedom. Hence, W is the sum of two independent random variables with $\chi^2$
distributions and so has the $\chi^2$ distribution with the sum of the two degrees of
freedom, namely, $m + n - 2$. Finally, $\bar{X}_m - \bar{Y}_n$ has the normal distribution
with mean $\mu_1 - \mu_2 = 0$ and variance $\sigma^2/m + \sigma^2/n$. It follows that Z
has the standard normal distribution.
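As an informal check on Theorem 9.6.1, the following Python sketch simulates many pairs
of normal samples with equal means and a common variance, computes U from Eq. (9.6.3)
for each pair, and records how often U exceeds a tabulated t quantile. The sample sizes
(m = 8, n = 10), the common mean 3 and standard deviation 2, the seed, and the function
names are illustrative assumptions, not part of the text. The cutoff 1.746 is the
tabulated 0.95 quantile of the t distribution with 16 degrees of freedom, so the
rejection rate should be near 0.05 if the theorem holds.

```python
import math
import random


def two_sample_u(x, y):
    """Two-sample t statistic U of Eq. (9.6.3), computed from raw samples."""
    m, n = len(x), len(y)
    xbar = sum(x) / m
    ybar = sum(y) / n
    sx2 = sum((xi - xbar) ** 2 for xi in x)  # S_X^2: sum of squared deviations
    sy2 = sum((yi - ybar) ** 2 for yi in y)  # S_Y^2
    return (math.sqrt(m + n - 2) * (xbar - ybar)
            / (math.sqrt(1 / m + 1 / n) * math.sqrt(sx2 + sy2)))


def rejection_rate(m, n, crit, reps=20000, seed=0):
    """Fraction of simulated data sets with U > crit when mu1 = mu2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Both samples share the (arbitrary) mean 3.0 and standard deviation 2.0.
        x = [rng.gauss(3.0, 2.0) for _ in range(m)]
        y = [rng.gauss(3.0, 2.0) for _ in range(n)]
        if two_sample_u(x, y) > crit:
            hits += 1
    return hits / reps


rate = rejection_rate(m=8, n=10, crit=1.746)
print(rate)  # expected to be close to 0.05, up to simulation error
```

Because U does not depend on the common mean or on $\sigma$ when $\mu_1 = \mu_2$,
changing the values 3.0 and 2.0 in the sketch should leave the rejection rate
essentially unchanged.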


A two-sample t test with level of significance $\alpha_0$ is the procedure $\delta$ that
rejects $H_0$ if $U \ge T_{m+n-2}^{-1}(1 - \alpha_0)$. Theorem 9.6.2 states some useful
properties of two-sample t tests analogous to those of Theorem 9.5.1. The proof is so
similar to that of Theorem 9.5.1 that we shall not present it here.


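Theorem 9.6.2. Level and Unbiasedness of Two-Sample t Tests. Let $\delta$ be the
two-sample t test defined above. The power function $\pi(\mu_1, \mu_2, \sigma^2|\delta)$
has the following properties:

i. $\pi(\mu_1, \mu_2, \sigma^2|\delta) = \alpha_0$ when $\mu_1 = \mu_2$,
ii. $\pi(\mu_1, \mu_2, \sigma^2|\delta) < \alpha_0$ when $\mu_1 < \mu_2$,
iii. $\pi(\mu_1, \mu_2, \sigma^2|\delta) > \alpha_0$ when $\mu_1 > \mu_2$,
iv. $\pi(\mu_1, \mu_2, \sigma^2|\delta) \to 0$ as $\mu_1 - \mu_2 \to -\infty$,
v. $\pi(\mu_1, \mu_2, \sigma^2|\delta) \to 1$ as $\mu_1 - \mu_2 \to \infty$.

Furthermore, the test $\delta$ has size $\alpha_0$ and is unbiased.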


Note: The Other One-Sided Hypotheses. If the hypotheses are

$H_0: \mu_1 \ge \mu_2$ versus $H_1: \mu_1 < \mu_2$,   (9.6.7)

the corresponding level $\alpha_0$ t test is to reject $H_0$ when
$U \le -T_{m+n-2}^{-1}(1 - \alpha_0)$. This test has properties analogous to those of the
other one-sided test.

P-values are computed in much the same way as they were for the one-sample t test. The
proof of Theorem 9.6.3 is virtually the same as the proof of Theorem 9.5.2 and is not
given here.


Theorem 9.6.3. p-Values for Two-Sample t Tests. Suppose that we are testing either the
hypotheses in Eq. (9.6.1) or the hypotheses in Eq. (9.6.7). Let u be the observed value
of the statistic U in Eq. (9.6.3), and let $T_{m+n-2}(\cdot)$ be the c.d.f. of the t
distribution with $m + n - 2$ degrees of freedom. Then the p-value for the hypotheses in
Eq. (9.6.1) is $1 - T_{m+n-2}(u)$ and the p-value for the hypotheses in Eq. (9.6.7) is
$T_{m+n-2}(u)$.


Example 9.6.2. Rain from Seeded Clouds. In Example 9.6.1, we actually have 26
observations of unseeded clouds to go with the 26 observations of seeded clouds. Let
$X_1, \ldots, X_{26}$ be the log-rainfall measurements from the seeded clouds, and let
$Y_1, \ldots, Y_{26}$ be the measurements from the unseeded clouds. We model all of the
measurements as independent with the $X_i$'s having the normal distribution with mean
$\mu_1$ and variance $\sigma^2$, and the $Y_j$'s having the normal distribution with mean
$\mu_2$ and variance $\sigma^2$. For now, we model the two distributions as having a
common variance. Suppose that we wish to test whether or not the mean log-rainfall from
seeded clouds is larger than the mean log-rainfall from unseeded clouds. We choose the
null and alternative hypotheses so that type I error corresponds to claiming that seeding
increases rainfall when, in fact, it does not increase rainfall. That is, the null
hypothesis is $H_0: \mu_1 \le \mu_2$ and the alternative hypothesis is
$H_1: \mu_1 > \mu_2$. We choose a level of significance of $\alpha_0 = 0.01$. Before
proceeding with the formal test, it is a good idea to look at the data first. Figure 9.15
contains histograms of the log-rainfalls of both seeded and unseeded clouds. The two
samples look different, with the seeded clouds appearing to have larger log-rainfalls.
The formal test requires us to compute the statistics

$\bar{X}_m = 5.13$, $\bar{Y}_n = 3.99$,
$S_X^2 = 63.96$, and $S_Y^2 = 67.39$.

[Figure 9.15: Histograms of log-rainfalls from unseeded (left panel) and seeded (right
panel) clouds in Example 9.6.2.]

The critical value is $T_{50}^{-1}(0.99) = 2.403$, and the test statistic is

$$U = \frac{50^{1/2}(5.13 - 3.99)}{\left(\frac{1}{26} + \frac{1}{26}\right)^{1/2}(63.96 + 67.39)^{1/2}} = 2.544,$$

which is greater than 2.403. So, we would reject the null hypothesis at level of
significance $\alpha_0 = 0.01$. The p-value is the smallest level at which we would reject
$H_0$, namely, $1 - T_{50}(2.544) = 0.007$.
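The calculation in Example 9.6.2 can be retraced with a short Python sketch. It
evaluates U from the summary statistics via Eq. (9.6.3) and obtains the t c.d.f. by
integrating the t density numerically with Simpson's rule. The function names and the
choice of Simpson's rule are assumptions made for illustration; a statistical library
would normally supply the c.d.f. Note that the means quoted in the example are rounded
to two decimals, so the recomputed statistic comes out near 2.54 rather than exactly
2.544.

```python
import math


def two_sample_u(m, xbar, sx2, n, ybar, sy2):
    """U of Eq. (9.6.3) from summary statistics (sx2, sy2 are sums of squares)."""
    return (math.sqrt(m + n - 2) * (xbar - ybar)
            / (math.sqrt(1 / m + 1 / n) * math.sqrt(sx2 + sy2)))


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(u, df, steps=2000):
    """c.d.f. T_df(u): Simpson's rule on [0, |u|] combined with symmetry about 0."""
    a = abs(u)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    upper = 0.5 + total * h / 3  # Pr(T <= |u|)
    return upper if u >= 0 else 1.0 - upper


# Summary statistics from Example 9.6.2 (seeded clouds first).
u = two_sample_u(26, 5.13, 63.96, 26, 3.99, 67.39)
p_value = 1.0 - t_cdf(u, 50)  # p-value for hypotheses (9.6.1), Theorem 9.6.3
print(round(u, 2), round(p_value, 3))
```

For the hypotheses (9.6.7), Theorem 9.6.3 gives the p-value as `t_cdf(u, df)` instead.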


Example 9.6.3. Roman Pottery in Britain. Tubb, Parker, and Nickless (1980) describe a
study of samples of pottery from the Roman era found in various locations in Great
Britain. One measurement made on each sample of pottery was the percentage of the sample
that was aluminum oxide. Suppose that we are interested in comparing the aluminum oxide
percentages at two different locations. There were m = 14 samples analyzed from
Llanederyn, with sample average of $\bar{X}_m = 12.56$ and $S_X^2 = 24.65$. Another
n = 5 samples came from Ashley Rails, with $\bar{Y}_n = 17.32$ and $S_Y^2 = 11.01$. One
of the sample sizes is too small for the histogram to be very illuminating. Suppose that
we model the data as normal random variables with two different means $\mu_1$ and
$\mu_2$ but common variance $\sigma^2$. We want to test the null hypothesis
$H_0: \mu_1 \ge \mu_2$ against the alternative hypothesis $H_1: \mu_1 < \mu_2$. The
observed value of U defined by Eq. (9.6.3) is $-6.302$. From the table of the t
distribution in this book, with $m + n - 2 = 17$ degrees of freedom, we find that
$T_{17}^{-1}(0.995) = 2.898$ and $U < -2.898$. So, we would reject $H_0$ at any level
$\alpha_0 \ge 0.005$. Indeed, the p-value associated with this value of U is
$T_{17}(-6.302) = 4 \times 10^{-6}$.


Power of the Test


For each parameter vector $\theta = (\mu_1, \mu_2, \sigma^2)$, the power function of the
two-sample t test can be computed using the noncentral t distribution introduced in
Definition 9.5.1. Almost identical reasoning to that which led to Theorem 9.5.3 proves
the following.


Theorem 9.6.4. Power of Two-Sample t Test. Assume the conditions stated earlier in this
section. Let U be defined in Eq. (9.6.6). Then U has the noncentral t distribution with
$m + n - 2$ degrees of freedom and noncentrality parameter

$$\psi = \frac{\mu_1 - \mu_2}{\sigma\left(\frac{1}{m} + \frac{1}{n}\right)^{1/2}}.$$   (9.6.8)


We can use Fig. 9.12 on page 580 to approximate power calculations if we do not
have an appropriate computer program handy.


Example 9.6.4. Roman Pottery in Britain. In Example 9.6.3, if the Llanederyn mean is
less than the Ashley Rails mean by $1.5\sigma$, then
$|\psi| = 1.5/(1/14 + 1/5)^{1/2} = 2.88$. The power of a level 0.01 test of
$H_0: \mu_1 \ge \mu_2$ appears to be about 0.65 in the right panel of Fig. 9.12. (The
actual power is 0.63.)
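When no noncentral t routine is at hand, the power in Example 9.6.4 can also be
approximated by simulation. The Python sketch below is one way to do this under stated
assumptions: it uses the usual representation of a noncentral t random variable as a
normal variable with mean $\psi$ and variance 1 divided by the square root of an
independent $\chi^2$ variable over its degrees of freedom, it draws the $\chi^2$ variable
as a gamma variable with shape df/2 and scale 2, and it takes the critical value 2.567
from a table as the 0.99 quantile of the t distribution with 17 degrees of freedom. The
function names, seed, and number of replications are illustrative.

```python
import math
import random


def noncentrality(diff_in_sigmas, m, n):
    """psi of Eq. (9.6.8) when mu1 - mu2 is expressed in units of sigma."""
    return diff_in_sigmas / math.sqrt(1 / m + 1 / n)


def simulate_power_lower(psi, df, crit, reps=100000, seed=1):
    """Monte Carlo estimate of Pr(U <= -crit) for U noncentral t with (df, psi)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        z = rng.gauss(psi, 1.0)            # normal with mean psi and variance 1
        w = rng.gammavariate(df / 2, 2.0)  # chi-square with df degrees of freedom
        if z / math.sqrt(w / df) <= -crit:
            hits += 1
    return hits / reps


# Llanederyn mean below the Ashley Rails mean by 1.5 sigma: mu1 - mu2 = -1.5 sigma.
psi = noncentrality(-1.5, 14, 5)
power = simulate_power_lower(psi, df=17, crit=2.567)
print(round(abs(psi), 2), round(power, 2))
```

The estimate should land near the value 0.63 quoted in the example, up to simulation
error.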


Two-Sided Alternatives 


The two-sample t test can easily be adapted to testing the following hypotheses at a
specified level of significance $\alpha_0$:

$H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \ne \mu_2$.   (9.6.9)

The size $\alpha_0$ two-sided t test rejects $H_0$ if $|U| \ge c$, where
$c = T_{m+n-2}^{-1}(1 - \alpha_0/2)$ and the statistic U is defined in Eq. (9.6.3). The
p-value when $U = u$ is observed equals $2[1 - T_{m+n-2}(|u|)]$. (See Exercise 9.)


Example 9.6.5. Comparing Copper Ores. Suppose that a random sample of eight specimens of
ore is collected from a certain location in a copper mine, and the amount of copper in
each of the specimens is measured in grams. We shall denote these eight amounts by
$X_1, \ldots, X_8$ and shall suppose that the observed values are such that
$\bar{X}_8 = 2.6$ and $S_X^2 = 0.32$. Suppose also that a second random sample of 10
specimens of ore is collected from another part of the mine. We shall denote the amounts
of copper in these specimens by $Y_1, \ldots, Y_{10}$ and shall suppose that the observed
values in grams are such that $\bar{Y}_{10} = 2.3$ and $S_Y^2 = 0.22$. Let $\mu_1$ denote
the mean amount of copper in all the ore at the first location in the mine, let $\mu_2$
denote the mean amount of copper in all the ore at the second location, and suppose that
the hypotheses (9.6.9) are to be tested.

We shall assume that all the observations have a normal distribution, and the variance is
the same at both locations in the mine, even though the means may be different. In this
example, the sample sizes are m = 8 and n = 10, and the value of the statistic U defined
by Eq. (9.6.3) is 3.442. Also, by the use of a table of the t distribution with 16
degrees of freedom, it is found that $T_{16}^{-1}(0.995) = 2.921$, so that the tail area
corresponding to this observed value of U is less than $2 \times 0.005$. Hence, the null
hypothesis will be rejected for any specified level of significance $\alpha_0 \ge 0.01$.
(In fact, the two-sided tail area associated with U = 3.442 is 0.003.)
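For readers who want to retrace the arithmetic behind the value of U in this example,
substituting the stated summary statistics into Eq. (9.6.3) gives the following. The
intermediate rounding is ours and is shown only as a check.

```latex
U = \frac{(8 + 10 - 2)^{1/2}\,(2.6 - 2.3)}
         {\left(\frac{1}{8} + \frac{1}{10}\right)^{1/2} (0.32 + 0.22)^{1/2}}
  = \frac{4 \times 0.3}{(0.225 \times 0.54)^{1/2}}
  = \frac{1.2}{(0.1215)^{1/2}}
  \approx 3.44 ,
```

which agrees with the value 3.442 reported above to the precision shown.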


The power function of the two-sided two-sample t test is based on the noncentral t
distribution in the same way as was the power function of the one-sample two-sided t
test. The test $\delta$ that rejects $H_0: \mu_1 = \mu_2$ when $|U| \ge c$ has power
function

$$\pi(\mu_1, \mu_2, \sigma^2|\delta) = T_{m+n-2}(-c|\psi) + 1 - T_{m+n-2}(c|\psi),$$

where $T_{m+n-2}(\cdot|\psi)$ is the c.d.f. of the noncentral t distribution with
$m + n - 2$ degrees of freedom and noncentrality parameter $\psi$ given in Eq. (9.6.8).
Figure 9.14 on page 583 can be used to approximate the power function if appropriate
software is not available.


The Two-Sample t Test as a Likelihood Ratio Test


In this section, we shall show that the two-sample t test for the hypotheses (9.6.1) is
a likelihood ratio test. After the values $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$ in the
two samples have been observed, the likelihood function
$g(\boldsymbol{x}, \boldsymbol{y}|\mu_1, \mu_2, \sigma^2)$ is

$$g(\boldsymbol{x}, \boldsymbol{y}|\mu_1, \mu_2, \sigma^2) = f_m(\boldsymbol{x}|\mu_1, \sigma^2)\, f_n(\boldsymbol{y}|\mu_2, \sigma^2).$$

Here, both $f_m(\boldsymbol{x}|\mu_1, \sigma^2)$ and $f_n(\boldsymbol{y}|\mu_2, \sigma^2)$
have the form given in Eq. (9.5.9), and the value of $\sigma^2$ is the same in both terms.
In this case, $\Omega_0 = \{(\mu_1, \mu_2, \sigma^2) : \mu_1 \le \mu_2\}$. The likelihood
ratio statistic is

$$\Lambda(\boldsymbol{x}, \boldsymbol{y}) = \frac{\sup_{\{(\mu_1, \mu_2, \sigma^2) : \mu_1 \le \mu_2\}} g(\boldsymbol{x}, \boldsymbol{y}|\mu_1, \mu_2, \sigma^2)}{\sup_{(\mu_1, \mu_2, \sigma^2)} g(\boldsymbol{x}, \boldsymbol{y}|\mu_1, \mu_2, \sigma^2)}.$$   (9.6.10)

The likelihood ratio test procedure then specifies that $H_0$ should be rejected if
$\Lambda(\boldsymbol{x}, \boldsymbol{y}) \le k$, where k is typically chosen so that the
test has a desired level $\alpha_0$. To facilitate the maximizations in (9.6.10), let

$$s_x^2 = \sum_{i=1}^{m} (x_i - \bar{x}_m)^2, \quad\text{and}\quad s_y^2 = \sum_{i=1}^{n} (y_i - \bar{y}_n)^2.$$

Then we can write

$$g(\boldsymbol{x}, \boldsymbol{y}|\mu_1, \mu_2, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{(m+n)/2}} \exp\left(-\frac{1}{2\sigma^2}\left[m(\bar{x}_m - \mu_1)^2 + n(\bar{y}_n - \mu_2)^2 + s_x^2 + s_y^2\right]\right).$$

The denominator of (9.6.10) is maximized by the overall M.L.E.'s, that is, when

$$\mu_1 = \bar{x}_m, \quad \mu_2 = \bar{y}_n, \quad\text{and}\quad \sigma^2 = \frac{1}{m + n}(s_x^2 + s_y^2).$$   (9.6.11)

For the numerator of (9.6.10), when $\bar{x}_m \le \bar{y}_n$, the parameter vector in
(9.6.11) is in $\Omega_0$, and hence the maximum also occurs at the values in
Eq. (9.6.11). Hence, $\Lambda(\boldsymbol{x}, \boldsymbol{y}) = 1$.

For the other case, when $\bar{x}_m > \bar{y}_n$, it is not difficult to see that
$\mu_1 = \mu_2$ is required in order to achieve the maximum. In these cases, the maximum
occurs when

$$\mu_1 = \mu_2 = \frac{m\bar{x}_m + n\bar{y}_n}{m + n},$$

$$\sigma^2 = \frac{mn(\bar{x}_m - \bar{y}_n)^2/(m + n) + s_x^2 + s_y^2}{m + n}.$$

Substituting all of these values into (9.6.10) yields

$$\Lambda(\boldsymbol{x}, \boldsymbol{y}) = \begin{cases} 1 & \text{if } \bar{x}_m \le \bar{y}_n, \\ (1 + v^2)^{-(m+n)/2} & \text{if } \bar{x}_m > \bar{y}_n, \end{cases}$$

where

$$v = \frac{\bar{x}_m - \bar{y}_n}{\left(\frac{1}{m} + \frac{1}{n}\right)^{1/2}(s_x^2 + s_y^2)^{1/2}}.$$   (9.6.12)

If $k < 1$, it is straightforward to show that
$\Lambda(\boldsymbol{x}, \boldsymbol{y}) \le k$ is equivalent to $v \ge k'$ for some other
constant $k'$. Finally, note that $(m + n - 2)^{1/2} v$ is the observed value of U, so
the likelihood ratio test is to reject $H_0$ when $U \ge c$, for some constant c. This is
the same as the two-sample t test. The preceding argument can easily be adapted to handle
the other one-sided hypotheses and the two-sided case. (See Exercise 13 for the two-sided
case.)
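The closed form for the likelihood ratio statistic can be checked numerically. The
Python sketch below evaluates the two suprema in (9.6.10) directly from the maximizing
parameter values given in the derivation and compares the ratio with
$(1 + v^2)^{-(m+n)/2}$. The function names are illustrative assumptions, and the data
are the copper ore summary statistics of Example 9.6.5, used here only as a convenient
test case.

```python
import math


def log_lik(m, xbar, sx2, n, ybar, sy2, mu1, mu2, sigma2):
    """Log of g(x, y | mu1, mu2, sigma^2), written via summary statistics."""
    q = m * (xbar - mu1) ** 2 + n * (ybar - mu2) ** 2 + sx2 + sy2
    return -0.5 * (m + n) * math.log(2 * math.pi * sigma2) - q / (2 * sigma2)


def likelihood_ratio(m, xbar, sx2, n, ybar, sy2):
    """Lambda(x, y) of Eq. (9.6.10) for H0: mu1 <= mu2."""
    if xbar <= ybar:
        return 1.0  # the overall M.L.E. already lies in Omega_0
    # Denominator: overall M.L.E.'s from Eq. (9.6.11).
    s2_full = (sx2 + sy2) / (m + n)
    denom = log_lik(m, xbar, sx2, n, ybar, sy2, xbar, ybar, s2_full)
    # Numerator: maximum on the boundary mu1 = mu2.
    mu = (m * xbar + n * ybar) / (m + n)
    s2_null = (m * n * (xbar - ybar) ** 2 / (m + n) + sx2 + sy2) / (m + n)
    numer = log_lik(m, xbar, sx2, n, ybar, sy2, mu, mu, s2_null)
    return math.exp(numer - denom)


def v_statistic(m, xbar, sx2, n, ybar, sy2):
    """v of Eq. (9.6.12)."""
    return (xbar - ybar) / (math.sqrt(1 / m + 1 / n) * math.sqrt(sx2 + sy2))


m, n = 8, 10
lam = likelihood_ratio(m, 2.6, 0.32, n, 2.3, 0.22)
v = v_statistic(m, 2.6, 0.32, n, 2.3, 0.22)
print(lam, (1 + v * v) ** (-(m + n) / 2), math.sqrt(m + n - 2) * v)
```

The first two printed numbers should agree, and the third is the observed value of U.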


Unequal Variances 


Known Ratio of Variances  The t test can be extended to a problem in which the variances
of the two normal distributions are not equal but the ratio of one variance to the other
is known. Specifically, suppose that $X_1, \ldots, X_m$ form a random sample from the
normal distribution with mean $\mu_1$ and variance $\sigma_1^2$, and $Y_1, \ldots, Y_n$
form an independent random sample from another normal distribution with mean $\mu_2$ and
variance $\sigma_2^2$. Suppose also that the values of $\mu_1$, $\mu_2$, $\sigma_1^2$, and
$\sigma_2^2$ are unknown but that $\sigma_2^2 = k\sigma_1^2$, where k is a known positive
constant. Then it can be shown (see Exercise 4 at the end of this section) that when
$\mu_1 = \mu_2$, the following random variable U will have the t distribution with
$m + n - 2$ degrees of freedom:

$$U = \frac{(m + n - 2)^{1/2}(\bar{X}_m - \bar{Y}_n)}{\left(\frac{1}{m} + \frac{k}{n}\right)^{1/2}\left(S_X^2 + \frac{S_Y^2}{k}\right)^{1/2}}.$$   (9.6.13)

Hence, the statistic U defined by Eq. (9.6.13) can be used for testing either the
hypotheses (9.6.1) or the hypotheses (9.6.9).
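A brief Python sketch of Eq. (9.6.13) follows. The function name is an illustrative
assumption, and the value k = 2 is a hypothetical ratio chosen only to show the effect of
k; it does not come from any example in the text. Setting k = 1 should reproduce the
equal-variance statistic of Eq. (9.6.3), which gives a simple sanity check against the
copper ore data of Example 9.6.5.

```python
import math


def u_known_ratio(m, xbar, sx2, n, ybar, sy2, k):
    """U of Eq. (9.6.13), assuming sigma_2^2 = k * sigma_1^2 with k known."""
    return (math.sqrt(m + n - 2) * (xbar - ybar)
            / (math.sqrt(1 / m + k / n) * math.sqrt(sx2 + sy2 / k)))


# k = 1 reduces to Eq. (9.6.3); copper ore data from Example 9.6.5.
u_equal = u_known_ratio(8, 2.6, 0.32, 10, 2.3, 0.22, k=1.0)
# Hypothetical known ratio k = 2, for illustration only.
u_k2 = u_known_ratio(8, 2.6, 0.32, 10, 2.3, 0.22, k=2.0)
print(round(u_equal, 3), round(u_k2, 3))
```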


The Behrens-Fisher Problem  If the values of all four parameters $\mu_1$, $\mu_2$,
$\sigma_1^2$, and $\sigma_2^2$ are unknown, and if the value of the ratio
$\sigma_1^2/\sigma_2^2$ is also unknown, then the problem of testing the hypotheses
(9.6.1) or the hypotheses (9.6.9) becomes very difficult. Even the likelihood ratio
statistic $\Lambda$ has no known distribution. This problem is known as the
Behrens-Fisher problem. Some simulation methods for the Behrens-Fisher problem will be
described in Chapter 12 (Examples 12.2.4 and 12.6.10). Various other test procedures have
been proposed, but most of them have been the subject of controversy in regard to their
appropriateness or usefulness. The most popular of the proposed methods was developed in
a series of articles by Welch (1938, 1947, 1951). Welch proposed using the statistic

$$V = \frac{\bar{X}_m - \bar{Y}_n}{\left(\frac{S_X^2}{m(m - 1)} + \frac{S_Y^2}{n(n - 1)}\right)^{1/2}}.$$   (9.6.14)

Even when $\mu_1 = \mu_2$, the distribution of V is not known in closed form. However,
Welch approximated the distribution of V by a t distribution as follows. Let

$$W = \frac{S_X^2}{m(m - 1)} + \frac{S_Y^2}{n(n - 1)},$$   (9.6.15)

and approximate the distribution of W by a gamma distribution with the same mean and
variance as W. (See Exercise 12.) If we were now to assume that W actually had this
approximating gamma distribution, then V would have the t distribution with degrees of
freedom

$$\frac{\left(\frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}\right)^2}{\frac{1}{m - 1}\left(\frac{\sigma_1^2}{m}\right)^2 + \frac{1}{n - 1}\left(\frac{\sigma_2^2}{n}\right)^2}.$$   (9.6.16)

Next, substitute the unbiased estimates $s_x^2/(m - 1)$ and $s_y^2/(n - 1)$ for
$\sigma_1^2$ and $\sigma_2^2$, respectively, in (9.6.16) to obtain the degrees of freedom
for Welch's t distribution approximation:

$$\nu = \frac{\left(\frac{s_x^2}{m(m - 1)} + \frac{s_y^2}{n(n - 1)}\right)^2}{\frac{1}{(m - 1)^3}\left(\frac{s_x^2}{m}\right)^2 + \frac{1}{(n - 1)^3}\left(\frac{s_y^2}{n}\right)^2}.$$   (9.6.17)

In Eq. (9.6.17), $s_x^2$ and $s_y^2$ are the observed values of $S_X^2$ and $S_Y^2$. To
summarize Welch's procedure, act as if V in Eq. (9.6.14) had the t distribution with
$\nu$ degrees of freedom when $\mu_1 = \mu_2$. Tests of one-sided and two-sided
hypotheses are then constructed by comparing V to various quantiles of the t distribution
with $\nu$ degrees of freedom. If $\nu$ is not an integer, round it to the nearest
integer or use a computer program that can handle t distributions with noninteger
degrees of freedom.


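Example 9.6.6. Comparing Copper Ores. Using the data from Example 9.6.5, we compute

$$V = \frac{2.6 - 2.3}{\left(\frac{0.32}{8 \times 7} + \frac{0.22}{10 \times 9}\right)^{1/2}} = 3.321,$$

$$\nu = \frac{\left(\frac{0.32}{8 \times 7} + \frac{0.22}{10 \times 9}\right)^2}{\frac{1}{7^3}\left(\frac{0.32}{8}\right)^2 + \frac{1}{9^3}\left(\frac{0.22}{10}\right)^2} = 12.49.$$

The p-value associated with the observed data for the hypotheses (9.6.9) is
$2[1 - T_{12.49}(3.321)] = 0.0058$, not much different than what we obtained in
Example 9.6.5.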
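The two quantities in this example follow directly from Eqs. (9.6.14) and (9.6.17), as
the Python sketch below illustrates. The function names are assumptions made for
illustration; as in the text, `sx2` and `sy2` denote sums of squared deviations, not
sample variances.

```python
import math


def welch_statistic(m, xbar, sx2, n, ybar, sy2):
    """V of Eq. (9.6.14); sx2 and sy2 are sums of squared deviations."""
    return (xbar - ybar) / math.sqrt(sx2 / (m * (m - 1)) + sy2 / (n * (n - 1)))


def welch_df(m, sx2, n, sy2):
    """Approximate degrees of freedom nu of Eq. (9.6.17)."""
    a = sx2 / (m * (m - 1))
    b = sy2 / (n * (n - 1))
    # a**2 / (m - 1) equals (1 / (m - 1)**3) * (sx2 / m)**2, and similarly for b.
    return (a + b) ** 2 / (a ** 2 / (m - 1) + b ** 2 / (n - 1))


# Copper ore data from Example 9.6.5.
v = welch_statistic(8, 2.6, 0.32, 10, 2.3, 0.22)
nu = welch_df(8, 0.32, 10, 0.22)
print(round(v, 3), round(nu, 2))
```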


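Likelihood Ratio Test  An alternative to the Welch approximation described above would be
to apply the large-sample approximation of Theorem 9.1.4. Using the same notation as
earlier in the section, we can write the likelihood function as

$$g(\boldsymbol{x}, \boldsymbol{y}|\mu_1, \mu_2, \sigma_1^2, \sigma_2^2) = \frac{1}{(2\pi\sigma_1^2)^{m/2}(2\pi\sigma_2^2)^{n/2}} \exp\left(-\frac{m(\bar{x}_m - \mu_1)^2 + s_x^2}{2\sigma_1^2} - \frac{n(\bar{y}_n - \mu_2)^2 + s_y^2}{2\sigma_2^2}\right).$$   (9.6.18)

The overall M.L.E.'s are

$$\hat{\mu}_1 = \bar{x}_m, \quad \hat{\mu}_2 = \bar{y}_n, \quad \hat{\sigma}_1^2 = \frac{s_x^2}{m}, \quad \hat{\sigma}_2^2 = \frac{s_y^2}{n}.$$   (9.6.19)

Under $H_0: \mu_1 = \mu_2$, we cannot find formulas for the M.L.E.'s. However, if we let
$\hat{\mu}$ stand for the common value of $\hat{\mu}_1 = \hat{\mu}_2$, we find that the
M.L.E.'s satisfy the following equations:

$$\hat{\sigma}_1^2 = \frac{1}{m}\left[s_x^2 + m(\bar{x}_m - \hat{\mu})^2\right],$$   (9.6.20)

$$\hat{\sigma}_2^2 = \frac{1}{n}\left[s_y^2 + n(\bar{y}_n - \hat{\mu})^2\right],$$   (9.6.21)

$$\hat{\mu} = \frac{\dfrac{m\bar{x}_m}{\hat{\sigma}_1^2} + \dfrac{n\bar{y}_n}{\hat{\sigma}_2^2}}{\dfrac{m}{\hat{\sigma}_1^2} + \dfrac{n}{\hat{\sigma}_2^2}}.$$   (9.6.22)

These equations can be solved recursively even though we do not have a closed-form
solution. One algorithm is the following:

1. Set $k = 0$ and pick a starting value $\hat{\mu}^{(0)}$, such as
   $(m\bar{x}_m + n\bar{y}_n)/(m + n)$.
2. Compute $\hat{\sigma}_1^{2(k)}$ and $\hat{\sigma}_2^{2(k)}$ by substituting
   $\hat{\mu}^{(k)}$ into Eqs. (9.6.20) and (9.6.21).
3. Compute $\hat{\mu}^{(k+1)}$ by substituting $\hat{\sigma}_1^{2(k)}$ and
   $\hat{\sigma}_2^{2(k)}$ into Eq. (9.6.22).
4. If $\hat{\mu}^{(k+1)}$ is close enough to $\hat{\mu}^{(k)}$, stop. Otherwise, replace
   k by k + 1 and return to step 2.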


Example 9.6.7. Comparing Copper Ores. Using the data in Example 9.6.5, we will start with
$\hat{\mu}^{(0)} = (8 \times 2.6 + 10 \times 2.3)/18 = 2.433$. Plugging this value into
Eqs. (9.6.20) and (9.6.21) gives us $\hat{\sigma}_1^{2(0)} = 0.068$ and
$\hat{\sigma}_2^{2(0)} = 0.0398$. Plugging these into Eq. (9.6.22) gives
$\hat{\mu}^{(1)} = 2.396$. After 13 iterations the values stop changing and our final
M.L.E.'s are $\hat{\mu} = 2.347$, $\hat{\sigma}_1^2 = 0.1039$, and
$\hat{\sigma}_2^2 = 0.0242$. We can then substitute these M.L.E.'s into the likelihood
function (9.6.18) to get the numerator of the likelihood ratio statistic
$\Lambda(\boldsymbol{x}, \boldsymbol{y})$. (Remember to substitute $\hat{\mu}$ for both
$\mu_1$ and $\mu_2$.) We can also substitute the overall M.L.E.'s (9.6.19) into (9.6.18)
to get the denominator of $\Lambda(\boldsymbol{x}, \boldsymbol{y})$. The result is
$\Lambda(\boldsymbol{x}, \boldsymbol{y}) = 0.01356$. Theorem 9.1.4 says that we should
compare $-2\log\Lambda(\boldsymbol{x}, \boldsymbol{y}) = 8.602$ to a critical value of
the $\chi^2$ distribution with one degree of freedom. The p-value associated with the
observed statistic is the probability that a $\chi^2$ random variable with one degree of
freedom is greater than 8.602, namely, 0.003. This is the same as the p-value that we
obtained in Example 9.6.5 when we assumed that the two variances were the same.
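The recursive algorithm and the numbers in this example can be retraced with the
following Python sketch. It is a minimal illustration under stated assumptions: the
function names, the tolerance, and the iteration cap are ours. Two shortcuts are used.
First, at either set of M.L.E.'s the exponent in (9.6.18) equals $-(m + n)/2$, so
$-2\log\Lambda$ reduces to a sum of log variance ratios. Second, a $\chi^2$ variable with
one degree of freedom is the square of a standard normal variable, so its upper tail
probability can be written with the complementary error function.

```python
import math


def constrained_mle(m, xbar, sx2, n, ybar, sy2, tol=1e-10, max_iter=1000):
    """Iterate Eqs. (9.6.20)-(9.6.22) to find the M.L.E.'s under mu1 = mu2."""
    mu = (m * xbar + n * ybar) / (m + n)  # step 1: starting value
    for _ in range(max_iter):
        s1 = (sx2 + m * (xbar - mu) ** 2) / m  # Eq. (9.6.20)
        s2 = (sy2 + n * (ybar - mu) ** 2) / n  # Eq. (9.6.21)
        new_mu = (m * xbar / s1 + n * ybar / s2) / (m / s1 + n / s2)  # Eq. (9.6.22)
        converged = abs(new_mu - mu) < tol  # step 4: stopping rule
        mu = new_mu
        if converged:
            break
    # Variances consistent with the final value of mu.
    s1 = (sx2 + m * (xbar - mu) ** 2) / m
    s2 = (sy2 + n * (ybar - mu) ** 2) / n
    return mu, s1, s2


def neg2_log_lambda(m, sx2, n, sy2, s1_null, s2_null):
    """-2 log Lambda, using the overall M.L.E.'s sx2/m and sy2/n from (9.6.19)."""
    return m * math.log(s1_null / (sx2 / m)) + n * math.log(s2_null / (sy2 / n))


def chi2_one_df_upper_tail(x):
    """Pr(chi-square with 1 d.f. > x) = Pr(|Z| > sqrt(x)) for standard normal Z."""
    return math.erfc(math.sqrt(x / 2))


# Copper ore data from Example 9.6.5.
mu_hat, s1_hat, s2_hat = constrained_mle(8, 2.6, 0.32, 10, 2.3, 0.22)
stat = neg2_log_lambda(8, 0.32, 10, 0.22, s1_hat, s2_hat)
p_value = chi2_one_df_upper_tail(stat)
print(round(mu_hat, 3), round(s1_hat, 4), round(s2_hat, 4), round(stat, 3), round(p_value, 3))
```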


For the cases of one-sided hypotheses such as (9.6.1) and (9.6.7), the likelihood ratio
statistic is a bit more complicated. For example, if $\mu_1 = \mu_2$,
$-2\log\Lambda(\boldsymbol{X}, \boldsymbol{Y})$ converges in distribution to a
distribution that is neither discrete nor continuous. We will not discuss this case
further in this book.


Summary


Suppose that we observe independent random samples from two normal distributions:
$X_1, \ldots, X_m$ having mean $\mu_1$ and variance $\sigma_1^2$, and $Y_1, \ldots, Y_n$
having mean $\mu_2$ and variance $\sigma_2^2$. For testing hypotheses about $\mu_1$ and
$\mu_2$, t tests are available if we assume that $\sigma_1^2 = \sigma_2^2$. The t tests
all make use of the statistic U defined in Eq. (9.6.3). To test $H_0: \mu_1 = \mu_2$
versus $H_1: \mu_1 \ne \mu_2$ at level $\alpha_0$, reject $H_0$ if
$|U| \ge T_{m+n-2}^{-1}(1 - \alpha_0/2)$, where $T_{m+n-2}^{-1}$ is the quantile function
of the t distribution with $m + n - 2$ degrees of freedom. To test
$H_0: \mu_1 \le \mu_2$ versus $H_1: \mu_1 > \mu_2$ at level $\alpha_0$, reject $H_0$ if
$U \ge T_{m+n-2}^{-1}(1 - \alpha_0)$. To test $H_0: \mu_1 \ge \mu_2$ versus
$H_1: \mu_1 < \mu_2$ at level $\alpha_0$, reject $H_0$ if
$U \le -T_{m+n-2}^{-1}(1 - \alpha_0)$. The power functions of these tests can be computed
using the family of noncentral t distributions. Approximate tests are available if we do
not assume that $\sigma_1^2 = \sigma_2^2$.


Exercises 


1. In Example 9.6.3, we discussed Roman pottery found at two different locations in
Great Britain. There were samples found at other locations as well. One other location,
Island Thorns, had five samples $X_1, \ldots, X_5$ with an average aluminum oxide
percentage of $\bar{X} = 18.18$ with $\sum_{i=1}^{5}(X_i - \bar{X})^2 = 12.61$. Let
$Y_1, \ldots, Y_5$ be the five sample measurements from Ashley Rails in Example 9.6.3.
Test the null hypothesis that the mean aluminum oxide percentages at Ashley Rails and
Island Thorns are the same versus the alternative that they are different at level
$\alpha_0 = 0.05$.


2. Suppose that a certain drug A was administered to eight 
patients selected at random, and after a fixed time period, 
the concentration of the drug in certain body cells of each 
patient was measured in appropriate units. Suppose that 
these concentrations for the eight patients were found to 
be as follows: 


1.23, 1.42, 1.41, 1.62, 1.55, 1.51, 1.60, and 1.76. 


Suppose also that a second drug B was administered to 
six different patients selected at random, and when the 
concentration of drug B was measured in a similar way 
for these six patients, the results were as follows: 


1.76, 1.41, 1.87, 1.49, 1.67, and 1.81. 


Assuming that all the observations have a normal distribu- 
tion with a common unknown variance, test the following 
hypotheses at the level of significance 0.10: The null hy- 
pothesis is that the mean concentration of drug A among 
all patients is at least as large as the mean concentration of 
drug B. The alternative hypothesis is that the mean con- 
centration of drug B is larger than that of drug A. 


3. Consider again the conditions of Exercise 2, but suppose now that it is desired to
test the following hypotheses: The null hypothesis is that the mean concentration of
drug A among all patients is the same as the mean concentration of drug B. The
alternative hypothesis, which is two-sided, is that the mean concentrations of the two
drugs are not the same. Find the number c so that the level 0.05 two-sided t test will
reject $H_0$ when $|U| \ge c$, where U is defined by Eq. (9.6.3). Also, perform the test.


4. Suppose that $X_1, \ldots, X_m$ form a random sample from the normal distribution with
mean $\mu_1$ and variance $\sigma_1^2$, and $Y_1, \ldots, Y_n$ form an independent random
sample from the normal distribution with mean $\mu_2$ and variance $\sigma_2^2$. Show
that if $\mu_1 = \mu_2$ and $\sigma_2^2 = k\sigma_1^2$, then the random variable U
defined by Eq. (9.6.13) has the t distribution with $m + n - 2$ degrees of freedom.


5. Consider again the conditions and observed values of Exercise 2. However, suppose now
that each observation for drug A has an unknown variance $\sigma_1^2$, and each
observation for drug B has an unknown variance $\sigma_2^2$, but it is known that
$\sigma_2^2 = (6/5)\sigma_1^2$. Test the hypotheses described in Exercise 2 at the level
of significance 0.10.


6. Suppose that $X_1, \ldots, X_m$ form a random sample from the normal distribution with
unknown mean $\mu_1$ and unknown variance $\sigma^2$, and $Y_1, \ldots, Y_n$ form an
independent random sample from another normal distribution with unknown mean $\mu_2$ and
the same unknown variance $\sigma^2$. For each constant $\lambda$
($-\infty < \lambda < \infty$), construct a t test of the following hypotheses with
$m + n - 2$ degrees of freedom:

$H_0: \mu_1 - \mu_2 = \lambda$,
$H_1: \mu_1 - \mu_2 \ne \lambda$.


7. Consider again the conditions of Exercise 2. Let $\mu_1$ denote the mean of each
observation for drug A, and let $\mu_2$ denote the mean of each observation for drug B.
It is assumed, as in Exercise 2, that all the observations have a common unknown
variance. Use the results of Exercise 6 to construct a confidence interval for
$\mu_1 - \mu_2$ with confidence coefficient 0.90.


8. In Example 9.6.5, determine the power of a level 0.01 test if
$|\mu_1 - \mu_2| = \sigma$.


9. Suppose that we wish to test the hypotheses (9.6.9). We shall use the statistic U
defined in Eq. (9.6.3) and reject $H_0$ if $|U|$ is large. Prove that the p-value when
$U = u$ is observed is $2[1 - T_{m+n-2}(|u|)]$.


10. Lyle et al. (1987) ran an experiment to study the effect of a calcium supplement on
the blood pressure of African American males. A group of 10 men received a calcium
supplement, and another group of 11 men received a placebo. The experiment lasted 12
weeks. Both before and after the 12-week period, each man had his systolic blood
pressure measured while at rest. The changes (after minus before) are given in
Table 9.2. Test the null hypothesis that the mean change in blood pressure for the
calcium supplement group is lower than the mean change in blood pressure for the placebo
group. Use level $\alpha_0 = 0.1$.


Table 9.2 Blood pressure data for Exercise 10

Calcium    7    -4
Placebo   -1    12


11. Frisby and Clatworthy (1975) studied the times that it takes subjects to fuse
random-dot stereograms. Random-dot stereograms are pairs of images that appear at first
to be random dots. After a subject looks at the pair of images from the proper distance
and her eyes cross just the right amount, a recognizable object appears from the fusion
of the two images. The experimenters were concerned with the extent to which prior
information about the recognizable object affected the time it took to fuse the images.
One group of 43 subjects was not shown a picture of the object before being asked to
fuse the images. Their average time was $\bar{X}_{43} = 8.560$ and $S_X^2 = 2745.7$. The
second group of 35 subjects was shown a picture of the object, and their sample
statistics were $\bar{Y}_{35} = 5.551$ and $S_Y^2 = 783.9$. The null hypothesis is that
the mean time of the first group is no larger than the mean time of the second group,
while the alternative hypothesis is that the first group takes longer.

a. Test the hypotheses at the level of significance $\alpha_0 = 0.01$, assuming that the
   variances are equal for the two groups.

b. Test the hypotheses at the level of significance $\alpha_0 = 0.01$, using Welch's
   approximate test.


12. Find the mean a and variance b of the random variable W in Eq. (9.6.15). Now, let a
and b be the mean and variance, respectively, of the gamma distribution with parameters
$\alpha$ and $\beta$. Prove that $2\alpha$ equals the expression in (9.6.16).


13. Let U be as defined in Eq. (9.6.3), and suppose that it is desired to test the
hypotheses in Eq. (9.6.9). Prove that each likelihood ratio test has the following form:
reject $H_0$ if $|U| \ge c$, where c is a constant. Hint: First prove that
$\Lambda(\boldsymbol{x}, \boldsymbol{y}) = (1 + v^2)^{-(m+n)/2}$, where v was defined in
Eq. (9.6.12).


9.7 The F Distributions


In this section, we introduce the family of F distributions. This family is useful in
two different hypothesis-testing situations. The first situation is when we wish to
test hypotheses about the variances of two different normal distributions. These
tests, which we shall derive in this section, are based on a statistic that has an F
distribution. The second situation will arise in Chapter 11 when we test hypotheses
concerning the means of more than two normal distributions.


Definition of the F Distribution 


Example 9.7.1. Rain from Seeded Clouds. In Example 9.6.1, we were interested in comparing
the distributions of log-rainfalls from seeded and unseeded clouds. In Example 9.6.2, we
used the two-sample t test to compare the means of these distributions under the
assumption that the variances of the two distributions were the same. It would be
good to have a procedure for testing whether or not such an assumption is warranted.


In this section, we shall introduce a family of distributions, called the F distributions,
that arises in many important problems of testing hypotheses in which two or more normal
distributions are to be compared on the basis of random samples from each of the
distributions. In particular, it arises naturally when we wish to compare the variances
of two normal distributions.


Definition
9.7.1


Theorem
9.7.1


Theorem
9.7.2


Example
9.7.2


The F distributions. Let Y and W be independent random variables such that Y has
the $\chi^2$ distribution with m degrees of freedom and W has the $\chi^2$ distribution
with n degrees of freedom, where m and n are given positive integers. Define a new random
variable X as follows:

$$X = \frac{Y/m}{W/n} = \frac{nY}{mW}. \tag{9.7.1}$$

Then the distribution of X is called the F distribution with m and n degrees of freedom.


Theorem 9.7.1 gives the general p.d.f. of an F distribution. Its proof relies on the 
methods of Sec. 3.9 and will be postponed until the end of this section. 


Probability Density Function. Let X have the F distribution with m and n degrees of
freedom. Then its p.d.f. f(x) is as follows, for x > 0:

$$f(x) = \frac{\Gamma\left[\frac{1}{2}(m+n)\right] m^{m/2} n^{n/2}}{\Gamma\left(\frac{1}{2}m\right)\Gamma\left(\frac{1}{2}n\right)} \cdot \frac{x^{(m/2)-1}}{(mx+n)^{(m+n)/2}}, \tag{9.7.2}$$

and $f(x) = 0$ for $x \le 0$.


Properties of the F Distributions


When we speak of the F distribution with m and n degrees of freedom, the order in
which the numbers m and n are given is important, as can be seen from the definition
of X in Eq. (9.7.1). When $m \ne n$, the F distribution with m and n degrees of freedom
and the F distribution with n and m degrees of freedom are two different distributions.
Theorem 9.7.2 gives a result relating the two distributions just mentioned along
with a relationship between F distributions and t distributions.


If X has the F distribution with m and n degrees of freedom, then its reciprocal 1/X
has the F distribution with n and m degrees of freedom. If Y has the t distribution
with n degrees of freedom, then $Y^2$ has the F distribution with 1 and n degrees of
freedom.


Proof The first statement follows from the representation of X as the ratio of two
random variables, in Definition 9.7.1. The second statement follows from the
representation of a t random variable in the form of Eq. (8.4.1). ■


Two short tables of quantiles for F distributions are given at the end of this book.
In these tables, we give only the 0.95 quantile and the 0.975 quantile for different
possible pairs of values of m and n. In other words, if G denotes the c.d.f. of the F
distribution with m and n degrees of freedom, then the tables give the values of $x_1$ and
$x_2$ such that $G(x_1) = 0.95$ and $G(x_2) = 0.975$. By applying Theorem 9.7.2, it is possible
to use the tables to obtain the 0.05 and 0.025 quantiles of an F distribution. Most
statistical software will compute the c.d.f. and quantiles for general F distributions.


Determining the 0.05 Quantile of an F Distribution. Suppose that a random variable
X has the F distribution with 6 and 12 degrees of freedom. We shall determine the
0.05 quantile of X, that is, the value of x such that $\Pr(X \le x) = 0.05$.


Definition 
9.7.2 


Theorem 
9.7.3 
If we let Y = 1/X, then Y will have the F distribution with 12 and 6 degrees
of freedom. It can be found from the table given at the end of this book that
$\Pr(Y \le 4.00) = 0.95$; hence, $\Pr(Y > 4.00) = 0.05$. Since Y > 4.00 if and only if
X < 0.25, it follows that $\Pr(X < 0.25) = 0.05$. Because F distributions are continuous,
$\Pr(X \le 0.25) = 0.05$, and 0.25 is the 0.05 quantile of X. ◄
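The reciprocal relation used in this example can be checked numerically. The sketch below is only an illustration: it evaluates the p.d.f. in Eq. (9.7.2), approximates the c.d.f. by Simpson's rule, and inverts the c.d.f. by bisection. The function names `f_pdf`, `f_cdf`, and `f_quantile`, the number of integration steps, and the bisection depth are assumptions made for this sketch and are not part of the text. The integration scheme assumes m > 2, so that the density vanishes at 0.

```python
import math


def f_pdf(x, m, n):
    """p.d.f. of the F distribution with m and n degrees of freedom, Eq. (9.7.2)."""
    if x <= 0:
        return 0.0
    log_const = (math.lgamma((m + n) / 2) - math.lgamma(m / 2) - math.lgamma(n / 2)
                 + (m / 2) * math.log(m) + (n / 2) * math.log(n))
    log_kernel = (m / 2 - 1) * math.log(x) - ((m + n) / 2) * math.log(m * x + n)
    return math.exp(log_const + log_kernel)


def f_cdf(x, m, n, steps=2000):
    """Approximate G(x) = Pr(X <= x) by Simpson's rule on [0, x]; steps must be even."""
    if x <= 0:
        return 0.0
    h = x / steps
    total = f_pdf(0.0, m, n) + f_pdf(x, m, n)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, m, n)
    return total * h / 3


def f_quantile(p, m, n):
    """Invert the c.d.f. by bisection after widening the bracket [0, hi]."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, m, n) < p:
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if f_cdf(mid, m, n) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


q95 = f_quantile(0.95, 12, 6)   # the table in the text gives 4.00
q05 = f_quantile(0.05, 6, 12)   # Theorem 9.7.2 implies q05 = 1 / q95
print(round(q95, 2), round(q05, 3))
```

The two quantiles are computed from two different densities, so their product being close to 1 is a genuine check of Theorem 9.7.2 rather than a consequence of how the code is written.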


Comparing the Variances of Two Normal Distributions 


Suppose that the random variables $X_1, \ldots, X_m$ form a random sample of m observations
from a normal distribution for which both the mean $\mu_1$ and the variance $\sigma_1^2$ are
unknown, and suppose also that the random variables $Y_1, \ldots, Y_n$ form an independent
random sample of n observations from another normal distribution for which both the mean
$\mu_2$ and the variance $\sigma_2^2$ are unknown. Suppose finally that the following
hypotheses are to be tested at a specified level of significance $\alpha_0$
($0 < \alpha_0 < 1$):

$$\begin{aligned} H_0 &: \sigma_1^2 \le \sigma_2^2, \\ H_1 &: \sigma_1^2 > \sigma_2^2. \end{aligned} \tag{9.7.3}$$

For each test procedure $\delta$, we shall let $\pi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2 \mid \delta)$
denote the power function of $\delta$. Later in this section, we shall derive the likelihood
ratio test. For now, define $S_X^2$ and $S_Y^2$ to be the sums of squares defined in
Eq. (9.6.2). Then $S_X^2/(m-1)$ and $S_Y^2/(n-1)$ are estimators of $\sigma_1^2$ and
$\sigma_2^2$, respectively. It makes intuitive sense that we should reject $H_0$ if the ratio
of these two estimators is large. That is, define

$$V = \frac{S_X^2/(m-1)}{S_Y^2/(n-1)}, \tag{9.7.4}$$

and reject $H_0$ if $V \ge c$, where c is chosen to make the test have a desired level of
significance.


F test. The test procedure defined above is called an F test. 


Properties of F Tests 


Distribution of V. Let V be the statistic in Eq. (9.7.4). The distribution of
$(\sigma_2^2/\sigma_1^2)V$ is the F distribution with m − 1 and n − 1 degrees of freedom. In
particular, if $\sigma_1^2 = \sigma_2^2$, then the distribution of V itself is the F
distribution with m − 1 and n − 1 degrees of freedom.


Proof We know from Theorem 8.3.1 that the random variable $S_X^2/\sigma_1^2$ has the
$\chi^2$ distribution with m − 1 degrees of freedom, and the random variable
$S_Y^2/\sigma_2^2$ has the $\chi^2$ distribution with n − 1 degrees of freedom. Furthermore,
these two random variables are independent, since they are calculated from two independent
samples. Therefore, the following random variable $V^*$ has the F distribution with m − 1
and n − 1 degrees of freedom:

$$V^* = \frac{S_X^2/[(m-1)\sigma_1^2]}{S_Y^2/[(n-1)\sigma_2^2]}. \tag{9.7.5}$$

It can be seen from Eqs. (9.7.4) and (9.7.5) that $V^* = (\sigma_2^2/\sigma_1^2)V$. This proves
the first claim in the theorem. If $\sigma_1^2 = \sigma_2^2$, then $V = V^*$, which proves the
second claim. ■
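The claim about the distribution of $(\sigma_2^2/\sigma_1^2)V$ can also be examined by simulation. The following sketch is an illustration only: the sample sizes, means, standard deviations, seed, and number of replications are arbitrary choices made for this sketch, and the value 2.71 is the tabled 0.95 quantile of the F distribution with 5 and 20 degrees of freedom that is quoted later in this section. If the theorem holds, about 95 percent of the simulated values of the rescaled statistic should fall at or below 2.71, even though the two variances are very different.

```python
import random


def sum_of_squares(sample):
    """Sum of squared deviations from the sample mean (the form of S_X^2 and S_Y^2)."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample)


def simulate_scaled_v(m, n, sigma1, sigma2, reps, rng):
    """Draw `reps` values of (sigma2^2 / sigma1^2) * V, with V as in Eq. (9.7.4)."""
    values = []
    for _ in range(reps):
        xs = [rng.gauss(3.0, sigma1) for _ in range(m)]    # the mean 3.0 is arbitrary
        ys = [rng.gauss(-1.0, sigma2) for _ in range(n)]   # the mean -1.0 is arbitrary
        v = (sum_of_squares(xs) / (m - 1)) / (sum_of_squares(ys) / (n - 1))
        values.append((sigma2 ** 2 / sigma1 ** 2) * v)
    return values


rng = random.Random(2024)
scaled = simulate_scaled_v(m=6, n=21, sigma1=2.0, sigma2=0.5, reps=20000, rng=rng)
fraction_below = sum(v <= 2.71 for v in scaled) / len(scaled)
print(fraction_below)   # expected to be close to 0.95, up to Monte Carlo error
```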
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Theorem 
9.7.4 


Example 
9.7.3 


If $\sigma_1^2 = \sigma_2^2$, it is possible to use a table of the F distribution to choose a
constant c such that $\Pr(V \ge c) = \alpha_0$, regardless of the common value of $\sigma_1^2$
and $\sigma_2^2$, and regardless of the values of $\mu_1$ and $\mu_2$. In fact, c will be the
$1 - \alpha_0$ quantile of the corresponding F distribution. We prove next that the test that
rejects $H_0$ in (9.7.3) if $V \ge c$ has level $\alpha_0$.


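Level, Power Function, and P-Values. Let V be the statistic defined in Eq. (9.7.4). Let c
be the $1 - \alpha_0$ quantile of the F distribution with m − 1 and n − 1 degrees of freedom,
and let $G_{m-1,n-1}$ be the c.d.f. of that F distribution. Let $\delta$ be the test that
rejects $H_0$ in (9.7.3) when $V \ge c$. The power function
$\pi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2 \mid \delta)$ satisfies the following properties:

i. $\pi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2 \mid \delta) = 1 - G_{m-1,n-1}\left(\frac{\sigma_2^2}{\sigma_1^2}c\right)$,

ii. $\pi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2 \mid \delta) = \alpha_0$ when $\sigma_1^2 = \sigma_2^2$,

iii. $\pi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2 \mid \delta) < \alpha_0$ when $\sigma_1^2 < \sigma_2^2$,

iv. $\pi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2 \mid \delta) > \alpha_0$ when $\sigma_1^2 > \sigma_2^2$,

v. $\pi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2 \mid \delta) \to 0$ as $\sigma_1^2/\sigma_2^2 \to 0$,

vi. $\pi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2 \mid \delta) \to 1$ as $\sigma_1^2/\sigma_2^2 \to \infty$.

The test $\delta$ has level $\alpha_0$ and is unbiased. The p-value when V = v is observed
equals $1 - G_{m-1,n-1}(v)$.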


Proof The power function is the probability of rejecting $H_0$, i.e., the probability that
$V \ge c$. Let $V^*$ be as defined in Eq. (9.7.5) so that $V^*$ has the F distribution with
m − 1 and n − 1 degrees of freedom. Then

$$\pi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2 \mid \delta) = \Pr(V \ge c) = \Pr\left(\frac{\sigma_1^2}{\sigma_2^2}V^* \ge c\right) = \Pr\left(V^* \ge \frac{\sigma_2^2}{\sigma_1^2}c\right) = 1 - G_{m-1,n-1}\left(\frac{\sigma_2^2}{\sigma_1^2}c\right), \tag{9.7.6}$$

which proves property (i). Property (ii) follows from Theorem 9.7.3. For property
(iii), let $\sigma_1^2 < \sigma_2^2$ in Eq. (9.7.6). Since $(\sigma_2^2/\sigma_1^2)c > c$, the
expression on the far right of (9.7.6) is less than $1 - G_{m-1,n-1}(c) = \alpha_0$. Similarly,
if $\sigma_1^2 > \sigma_2^2$, the expression on the far right of (9.7.6) is greater than
$1 - G_{m-1,n-1}(c) = \alpha_0$, proving property (iv). Properties (v) and (vi) follow from
property (i) and elementary properties of c.d.f.'s, namely, Property 3.3.2. The fact that
$\delta$ has level $\alpha_0$ follows from properties (ii) and (iii). The fact that $\delta$ is
unbiased follows from properties (ii) and (iv). Finally, the p-value is the smallest
$\alpha_0$ such that we would reject $H_0$ at level $\alpha_0$ if V = v were observed. We reject
$H_0$ at level $\alpha_0$ if and only if $v \ge G_{m-1,n-1}^{-1}(1 - \alpha_0)$, which is
equivalent to $\alpha_0 \ge 1 - G_{m-1,n-1}(v)$. Hence, $1 - G_{m-1,n-1}(v)$ is the smallest
$\alpha_0$ such that we would reject $H_0$. ■


Performing an F Test. Suppose that six observations $X_1, \ldots, X_6$ are selected at
random from a normal distribution for which both the mean $\mu_1$ and the variance
$\sigma_1^2$ are unknown, and it is found that $S_X^2 = 30$. Suppose also that 21
observations, $Y_1, \ldots, Y_{21}$, are selected at random from another normal distribution
for which both the mean $\mu_2$ and the variance $\sigma_2^2$ are unknown, and that it is
found that $S_Y^2 = 40$. We shall carry out an F test of the hypotheses (9.7.3).


Example 
9.7.4 
In this example, m = 6 and n = 21. Therefore, when $H_0$ is true, the statistic V
defined by Eq. (9.7.4) will have the F distribution with 5 and 20 degrees of freedom.
It follows from Eq. (9.7.4) that the value of V for the given samples is

$$V = \frac{30/5}{40/20} = 3.$$

It is found from the tables given at the end of this book that the 0.95 quantile of the
F distribution with 5 and 20 degrees of freedom is 2.71, and the 0.975 quantile of
that distribution is 3.29. Hence, the tail area corresponding to the value V = 3 is less
than 0.05 and greater than 0.025. The hypothesis $H_0$ that $\sigma_1^2 \le \sigma_2^2$ would
therefore be rejected at the level of significance $\alpha_0 = 0.05$, and $H_0$ would not be
rejected at the level of significance $\alpha_0 = 0.025$. (Using a computer program to
evaluate the c.d.f. of an F distribution provides the p-value equal to 0.035.) Finally,
suppose that it is important to reject $H_0$ if $\sigma_1^2$ is three times as large as
$\sigma_2^2$. We would then want the power function to be high when
$\sigma_1^2 = 3\sigma_2^2$. We use a computer program to compute

$$1 - G_{5,20}\left(\frac{1}{3} \times 2.71\right) = 0.498.$$

Even if $\sigma_1^2$ is three times as large as $\sigma_2^2$, the level 0.05 test only has
about a 50 percent chance of rejecting $H_0$. ◄
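The p-value and the power quoted in this example come from evaluating the c.d.f. $G_{5,20}$. The sketch below shows one way such numbers might be reproduced without special software. It is an illustration under stated assumptions: the helper `f_cdf` is a hypothetical name, and it approximates the c.d.f. by applying Simpson's rule to the p.d.f. in Eq. (9.7.2), which is adequate here because the first degrees of freedom exceed 2.

```python
import math


def f_cdf(x, m, n, steps=2000):
    """Simpson's-rule approximation to the c.d.f. of the F distribution with m and n d.f."""
    if x <= 0:
        return 0.0
    log_c = (math.lgamma((m + n) / 2) - math.lgamma(m / 2) - math.lgamma(n / 2)
             + (m / 2) * math.log(m) + (n / 2) * math.log(n))

    def pdf(t):
        # Density from Eq. (9.7.2), evaluated on the log scale for stability.
        if t <= 0:
            return 0.0
        return math.exp(log_c + (m / 2 - 1) * math.log(t)
                        - ((m + n) / 2) * math.log(m * t + n))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    total += sum((4 if i % 2 else 2) * pdf(i * h) for i in range(1, steps))
    return total * h / 3


m, n = 6, 21
v = (30 / (m - 1)) / (40 / (n - 1))            # Eq. (9.7.4) with S_X^2 = 30, S_Y^2 = 40
p_value = 1 - f_cdf(v, m - 1, n - 1)           # Theorem 9.7.4: 1 - G_{5,20}(v)
c = 2.71                                       # tabled 0.95 quantile for 5 and 20 d.f.
power_at_3 = 1 - f_cdf(c / 3, m - 1, n - 1)    # property (i) with sigma_1^2 = 3 sigma_2^2
print(v, round(p_value, 3), round(power_at_3, 3))
```

Evaluating the same expression for other ratios $\sigma_1^2/\sigma_2^2$ traces out the whole power function in property (i) of Theorem 9.7.4.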


Two-Sided Alternative 


Suppose that we wish to test the hypotheses

$$\begin{aligned} H_0 &: \sigma_1^2 = \sigma_2^2, \\ H_1 &: \sigma_1^2 \ne \sigma_2^2. \end{aligned} \tag{9.7.7}$$

It would make sense to reject $H_0$ if either $V \le c_1$ or $V \ge c_2$, where V is defined
in Eq. (9.7.4) and $c_1$ and $c_2$ are constants such that
$\Pr(V \le c_1) + \Pr(V \ge c_2) = \alpha_0$ when $\sigma_1^2 = \sigma_2^2$. The most
convenient choice of $c_1$ and $c_2$ is the one that makes
$\Pr(V \le c_1) = \Pr(V \ge c_2) = \alpha_0/2$. That is, choose $c_1$ and $c_2$ to be the
$\alpha_0/2$ and $1 - \alpha_0/2$ quantiles of the appropriate F distribution.


Rain from Seeded Clouds. In Example 9.6.2, we compared the means of log-rainfalls
from seeded and unseeded clouds under the assumption that the two variances were
the same. We can now test the null hypothesis that the two variances are the same
against the alternative hypothesis that the two variances are different at level of
significance $\alpha_0 = 0.05$. Using the statistics given in Example 9.6.2, the value of V
is 63.96/67.39 = 0.9491, since m = n. We need to compare this to the 0.025 and 0.975
quantiles of the F distribution with 25 and 25 degrees of freedom. Since our table of
F distribution quantiles does not have rows or columns for 25 degrees of freedom,
we can either interpolate between 20 and 30 degrees of freedom or use a computer
program to compute these quantiles. The quantiles are 0.4484 and 2.2303. Since V is
between these two numbers, we would not reject the null hypothesis at level
$\alpha_0 = 0.05$. ◄


When $m \ne n$, the two-sided F test constructed above is not unbiased. (See
Exercise 19.) Also, if $m \ne n$, it is not possible to write the two-sided F test described
above in the form “reject the null hypothesis if $T \ge c$” using the same statistic T for
each significance level $\alpha_0$. Nevertheless, we can still compute the smallest
$\alpha_0$ such that the two-sided F test with level of significance $\alpha_0$ would reject
$H_0$. The proof of the following result is left to Exercise 15 in this section.


P-Value of Equal-Tailed Two-Sided F Test. Let V be as defined in (9.7.4). Suppose that
we wish to test the hypotheses (9.7.7). Let $\delta_{\alpha_0}$ be the equal-tailed two-sided
F test that rejects $H_0$ when $V \le c_1$ or $V \ge c_2$, where $c_1$ and $c_2$ are,
respectively, the $\alpha_0/2$ and $1 - \alpha_0/2$ quantiles of the appropriate F
distribution. Then the smallest $\alpha_0$ such that $\delta_{\alpha_0}$ rejects $H_0$ when
V = v is observed is

$$2 \min\{1 - G_{m-1,n-1}(v),\; G_{m-1,n-1}(v)\}. \tag{9.7.8}$$
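The equal-tailed critical values and the p-value formula (9.7.8) can be sketched in code as follows. This is an illustration only: `f_cdf`, `f_quantile`, and `two_sided_p_value` are hypothetical helper names, the c.d.f. is approximated by Simpson's rule applied to Eq. (9.7.2), and quantiles are found by bisection. The critical values are computed for 25 and 25 degrees of freedom as in Example 9.7.4, while formula (9.7.8) is illustrated with the data of Example 9.7.3 (m = 6, n = 21, V = 3), treated here as if the alternative were two-sided.

```python
import math


def f_cdf(x, m, n, steps=2000):
    """Simpson's-rule approximation to the c.d.f. of the F distribution with m and n d.f."""
    if x <= 0:
        return 0.0
    log_c = (math.lgamma((m + n) / 2) - math.lgamma(m / 2) - math.lgamma(n / 2)
             + (m / 2) * math.log(m) + (n / 2) * math.log(n))

    def pdf(t):
        if t <= 0:
            return 0.0
        return math.exp(log_c + (m / 2 - 1) * math.log(t)
                        - ((m + n) / 2) * math.log(m * t + n))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    total += sum((4 if i % 2 else 2) * pdf(i * h) for i in range(1, steps))
    return total * h / 3


def f_quantile(p, m, n):
    """Bisection on the c.d.f.; the upper end of the bracket is doubled until it is large enough."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, m, n) < p:
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if f_cdf(mid, m, n) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def two_sided_p_value(v, m, n):
    """Formula (9.7.8) for sample sizes m and n: 2 * min{1 - G(v), G(v)}, G = G_{m-1,n-1}."""
    g = f_cdf(v, m - 1, n - 1)
    return 2 * min(1 - g, g)


# Critical values for 25 and 25 degrees of freedom at level 0.05 (Example 9.7.4).
c1 = f_quantile(0.025, 25, 25)
c2 = f_quantile(0.975, 25, 25)
reject = not (c1 < 0.9491 < c2)
# Formula (9.7.8) applied to V = 3 with sample sizes 6 and 21.
p_two_sided = two_sided_p_value(3.0, 6, 21)
print(round(c1, 4), round(c2, 4), reject, round(p_two_sided, 3))
```

Because $1 - G_{5,20}(3)$ is the smaller of the two tail areas, the two-sided p-value in the illustration is simply twice the one-sided p-value reported in Example 9.7.3.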


The F Test as a Likelihood Ratio Test


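Next, we shall show that the F test for hypotheses (9.7.3) is a likelihood ratio test.
After the values $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$ in the two samples have been
observed, the likelihood function $g(x, y \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$ is

$$g(x, y \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2) = f_m(x \mid \mu_1, \sigma_1^2)\, f_n(y \mid \mu_2, \sigma_2^2).$$

Here, both $f_m(x \mid \mu_1, \sigma_1^2)$ and $f_n(y \mid \mu_2, \sigma_2^2)$ have the general form
given in Eq. (9.5.9). For the hypotheses in (9.7.3), $\Omega_0$ contains all parameters
$\theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$ with $\sigma_1^2 \le \sigma_2^2$, and $\Omega_1$
contains all $\theta$ with $\sigma_1^2 > \sigma_2^2$. The likelihood ratio statistic is

$$\Lambda(x, y) = \frac{\sup_{\{(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2):\, \sigma_1^2 \le \sigma_2^2\}} g(x, y \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)}{\sup_{(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)} g(x, y \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)}. \tag{9.7.9}$$

The likelihood ratio test then specifies that $H_0$ should be rejected if
$\Lambda(x, y) \le k$, where k is typically chosen to make the test have a desired level
$\alpha_0$.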
To facilitate the maximizations in (9.7.9), let

$$s_x^2 = \sum_{i=1}^{m} (x_i - \bar{x}_m)^2, \quad \text{and} \quad s_y^2 = \sum_{i=1}^{n} (y_i - \bar{y}_n)^2.$$

Then we can write

$$g(x, y \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2) = \frac{1}{(2\pi)^{(m+n)/2} \sigma_1^m \sigma_2^n} \exp\left( -\frac{1}{2\sigma_1^2}\left[ m(\bar{x}_m - \mu_1)^2 + s_x^2 \right] - \frac{1}{2\sigma_2^2}\left[ n(\bar{y}_n - \mu_2)^2 + s_y^2 \right] \right).$$

For both the numerator and denominator of (9.7.9), we need $\mu_1 = \bar{x}_m$ and
$\mu_2 = \bar{y}_n$ in order to maximize the likelihood. If $s_x^2/m \le s_y^2/n$, then the
numerator is maximized at $\sigma_1^2 = s_x^2/m$ and $\sigma_2^2 = s_y^2/n$. These values also
maximize the denominator. Hence, $\Lambda(x, y) = 1$ if $s_x^2/m \le s_y^2/n$. For the other
case (the numerator when $s_x^2/m > s_y^2/n$), it is straightforward to show that
$\sigma_1^2 = \sigma_2^2$ is required in order to achieve the maximum. In these cases, the
maximum occurs when

$$\sigma_1^2 = \sigma_2^2 = \frac{s_x^2 + s_y^2}{m+n}.$$

Substituting all of these values into (9.7.9) yields

$$\Lambda(x, y) = \begin{cases} 1 & \text{if } s_x^2/m \le s_y^2/n, \\ d\, w^{m/2} (1 - w)^{n/2} & \text{if } s_x^2/m > s_y^2/n, \end{cases}$$

where

$$w = \frac{s_x^2}{s_x^2 + s_y^2}, \quad \text{and} \quad d = \frac{(m+n)^{(m+n)/2}}{m^{m/2} n^{n/2}}.$$

Note that $s_x^2/m \le s_y^2/n$ if and only if $w \le m/(m+n)$. Next, use the fact that the
function $h(w) = w^{m/2}(1 - w)^{n/2}$ is decreasing for $m/(m+n) < w < 1$. Finally, note
that $h(m/[m+n]) = 1/d$. For k < 1, it follows that $\Lambda(x, y) \le k$ if and only if
$w \ge k'$ for some other constant $k'$. This, in turn, is equivalent to
$s_x^2/s_y^2 \ge k''$. Since $s_x^2/s_y^2$ is a positive constant times the observed value of
V, the likelihood ratio test rejects $H_0$ when V is large. This is the same as the F test.
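The piecewise form of $\Lambda(x, y)$ can be explored numerically. The sketch below is an illustration only: the function name `likelihood_ratio` and the trial values of $s_x^2$ are assumptions made here, while m = 6, n = 21, and $s_y^2 = 40$ are borrowed from Example 9.7.3. It shows $\Lambda$ staying at 1 while $s_x^2/m \le s_y^2/n$ and then decreasing as V grows, which is why rejecting for small $\Lambda$ amounts to rejecting for large V.

```python
def likelihood_ratio(sx2, sy2, m, n):
    """Lambda(x, y) for the hypotheses (9.7.3), using w and d as defined in the text."""
    if sx2 / m <= sy2 / n:
        return 1.0
    w = sx2 / (sx2 + sy2)
    d = (m + n) ** ((m + n) / 2) / (m ** (m / 2) * n ** (n / 2))
    return d * w ** (m / 2) * (1 - w) ** (n / 2)


m, n = 6, 21
sy2 = 40.0
for sx2 in (5.0, 30.0, 60.0, 120.0):
    v = (sx2 / (m - 1)) / (sy2 / (n - 1))    # Eq. (9.7.4)
    print(f"V = {v:5.2f}   Lambda = {likelihood_ratio(sx2, sy2, m, n):.4f}")
```

The value 1/d equals h(m/[m + n]), so the two branches agree at the boundary $s_x^2/m = s_y^2/n$ and $\Lambda$ is continuous there.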

One can easily adapt the above argument for the case in which the inequalities
are reversed in the hypotheses. When the hypotheses are (9.7.7), that is, the
alternative is two-sided, one can show (see Exercise 16) that the size $\alpha_0$ likelihood
ratio test will reject $H_0$ if either $V \le c_1$ or $V \ge c_2$. Unfortunately, it is usually
tedious to compute the necessary values $c_1$ and $c_2$. For this reason, people often
abandon the strict likelihood ratio criterion in this case and simply let $c_1$ and $c_2$ be
the $\alpha_0/2$ and $1 - \alpha_0/2$ quantiles of the appropriate F distribution.


Derivation of the p.d.f. of an F distribution 


Since the random variables Y and W in Definition 9.7.1 are independent, their joint
p.d.f. g(y, w) is the product of their individual p.d.f.’s. Furthermore, since both Y and
W have $\chi^2$ distributions, it follows from the p.d.f. of the $\chi^2$ distribution, as
given in Eq. (8.2.1), that g(y, w) has the following form, for y > 0 and w > 0:

$$g(y, w) = c\, y^{(m/2)-1} w^{(n/2)-1} e^{-(y+w)/2}, \tag{9.7.10}$$

where

$$c = \frac{1}{2^{(m+n)/2}\, \Gamma\left(\frac{1}{2}m\right) \Gamma\left(\frac{1}{2}n\right)}. \tag{9.7.11}$$

We shall now change variables from Y and W to X and W, where X is defined
by Eq. (9.7.1). The joint p.d.f. h(x, w) of X and W is obtained by first replacing y in
Eq. (9.7.10) with its expression in terms of x and w and then multiplying the result by
$|\partial y/\partial x|$. It follows from Eq. (9.7.1) that $y = (m/n)xw$ and
$\partial y/\partial x = (m/n)w$. Hence, the joint p.d.f. h(x, w) has the following form, for
x > 0 and w > 0:

$$h(x, w) = c \left(\frac{m}{n}\right)^{m/2} x^{(m/2)-1} w^{[(m+n)/2]-1} \exp\left[ -\frac{1}{2}\left(\frac{m}{n}x + 1\right) w \right]. \tag{9.7.12}$$

Here, the constant c is again given by Eq. (9.7.11).
The marginal p.d.f. f(x) of X can be obtained for each value of x > 0 from the
relation

$$f(x) = \int_0^\infty h(x, w)\, dw. \tag{9.7.13}$$

It follows from Theorem 5.7.3 that

$$\int_0^\infty w^{[(m+n)/2]-1} \exp\left[ -\frac{1}{2}\left(\frac{m}{n}x + 1\right) w \right] dw = \frac{\Gamma\left[\frac{1}{2}(m+n)\right]}{\left[\frac{1}{2}\left(\frac{m}{n}x + 1\right)\right]^{(m+n)/2}}. \tag{9.7.14}$$
From Eqs. (9.7.11) to (9.7.14), we can conclude that the p.d.f. f(x) has the form given
in Eq. (9.7.2).
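The final substitution is short enough to write out. The following worked step is a sketch that uses only Eqs. (9.7.11) to (9.7.14): the factor $2^{(m+n)/2}$ in c cancels against the factor $\frac{1}{2}$ in the denominator of (9.7.14), and the remaining powers of n combine.

```latex
\begin{aligned}
f(x) &= c \left(\frac{m}{n}\right)^{m/2} x^{(m/2)-1}
        \frac{\Gamma\left[\frac{1}{2}(m+n)\right]}
             {\left[\frac{1}{2}\left(\frac{m}{n}x + 1\right)\right]^{(m+n)/2}} \\
     &= \frac{\Gamma\left[\frac{1}{2}(m+n)\right]}
             {\Gamma\left(\frac{1}{2}m\right)\Gamma\left(\frac{1}{2}n\right)}
        \left(\frac{m}{n}\right)^{m/2} x^{(m/2)-1}
        \left(\frac{n}{mx+n}\right)^{(m+n)/2} \\
     &= \frac{\Gamma\left[\frac{1}{2}(m+n)\right] m^{m/2} n^{n/2}}
             {\Gamma\left(\frac{1}{2}m\right)\Gamma\left(\frac{1}{2}n\right)}
        \cdot \frac{x^{(m/2)-1}}{(mx+n)^{(m+n)/2}} .
\end{aligned}
```

The last line is Eq. (9.7.2), since $(m/n)^{m/2}\, n^{(m+n)/2} = m^{m/2} n^{n/2}$.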


Summary 


If Y and W are independent with Y having the $\chi^2$ distribution with m degrees of
freedom and W having the $\chi^2$ distribution with n degrees of freedom, then (Y/m)/(W/n)
has the F distribution with m and n degrees of freedom. Suppose that we observe two
independent random samples from two normal distributions with possibly different 
variances. The ratio V of the usual unbiased estimators of the two variances will have 
an F distribution when the two variances are equal. Tests of hypotheses about the two 
variances can be constructed by comparing V to various quantiles of F distributions. 


Exercises 


1. Consider again the situation described in Exercise 11 of
Sec. 9.6. Test the null hypothesis that the variance of the
fusion time for subjects who saw a picture of the object
is no smaller than the variance for subjects who did not see a
picture. The alternative hypothesis is that the variance for
subjects who saw a picture is smaller than the variance
for subjects who did not see a picture. Use a level of
significance of 0.05.


2. Suppose that a random variable X has the F distribu- 
tion with three and eight degrees of freedom. Determine 
the value of c such that Pr(X > c) = 0.975. 


3. Suppose that a random variable X has the F distribution with one and eight degrees of
freedom. Use the table of the t distribution to determine the value of c such that
Pr(X > c) = 0.3.


4. Suppose that a random variable X has the F distribution with m and n degrees of
freedom (n > 2). Show that E(X) = n/(n − 2). Hint: Find the value of E(1/Z), where
Z has the $\chi^2$ distribution with n degrees of freedom.


5. What is the value of the median of the F distribution 
with m and n degrees of freedom when m =n? 


6. Suppose that a random variable X has the F distribution with m and n degrees of
freedom. Show that the random variable mX/(mX + n) has the beta distribution
with parameters $\alpha = m/2$ and $\beta = n/2$.


7. Consider two different normal distributions for which
both the means $\mu_1$ and $\mu_2$ and the variances $\sigma_1^2$ and $\sigma_2^2$
are unknown, and suppose that it is desired to test the
following hypotheses:

$$\begin{aligned} H_0 &: \sigma_1^2 \le \sigma_2^2, \\ H_1 &: \sigma_1^2 > \sigma_2^2. \end{aligned}$$

Suppose further that a random sample consisting of 16 observations for the first normal
distribution yields the values $\sum_{i=1}^{16} X_i = 84$ and $\sum_{i=1}^{16} X_i^2 = 563$, and
an independent random sample consisting of 10 observations from the second normal
distribution yields the values $\sum_{i=1}^{10} Y_i = 18$ and $\sum_{i=1}^{10} Y_i^2 = \ldots$

a. What are the M.L.E.’s of $\sigma_1^2$ and $\sigma_2^2$?


b. If an F test is carried out at the level of significance
0.05, is the hypothesis $H_0$ rejected or not?


8. Consider again the conditions of Exercise 7, but suppose now that it is desired to test
the following hypotheses:

$$\begin{aligned} H_0 &: \sigma_1^2 \le 3\sigma_2^2, \\ H_1 &: \sigma_1^2 > 3\sigma_2^2. \end{aligned}$$

Describe how to carry out an F test of these hypotheses.


9. Consider again the conditions of Exercise 7, but suppose now that it is desired to test
the following hypotheses:

$$\begin{aligned} H_0 &: \sigma_1^2 = \sigma_2^2, \\ H_1 &: \sigma_1^2 \ne \sigma_2^2. \end{aligned}$$

Suppose also that the statistic V is defined by Eq. (9.7.4), and it is desired to reject
$H_0$ if either $V \le c_1$ or $V \ge c_2$, where the constants $c_1$ and $c_2$ are chosen so
that when $H_0$ is true, $\Pr(V \le c_1) = \Pr(V \ge c_2) = 0.025$. Determine the values of
$c_1$ and $c_2$ when m = 16 and n = 10, as in Exercise 7.


10. Suppose that a random sample consisting of 16 observations is available from the normal
distribution for which both the mean $\mu_1$ and the variance $\sigma_1^2$ are unknown, and
an independent random sample consisting of 10 observations is available from the normal
distribution for which both the mean $\mu_2$ and the variance $\sigma_2^2$ are also unknown.
For each constant r > 0, construct a test of the following hypotheses at the level of
significance 0.05:

$$H_0: \frac{\sigma_1^2}{\sigma_2^2} = r, \qquad H_1: \frac{\sigma_1^2}{\sigma_2^2} \ne r.$$


11. Consider again the conditions of Exercise 10. Use the
results of that exercise to construct a confidence interval
for $\sigma_1^2/\sigma_2^2$ with confidence coefficient 0.95.


12. Suppose that a random variable Y has the $\chi^2$ distribution with $m_0$ degrees of
freedom, and let c be a constant such that Pr(Y > c) = 0.05. Explain why, in the table of
0.95 quantile of the F distribution, the entry for $m = m_0$ and $n = \infty$ will be equal
to $c/m_0$.


13. The final column in the table of the 0.95 quantile of the
F distribution contains values for which $m = \infty$. Explain
how to derive the entries in this column from a table of
the $\chi^2$ distribution.


14. Consider again the conditions of Exercise 7. Find the
power function of the F test when $\sigma_1^2 = 2\sigma_2^2$.


15. Prove Theorem 9.7.5. Also, compute the p-value for 
Example 9.7.4 using the formula in Eq. (9.7.8). 


16. Let V be as defined in Eq. (9.7.4). We wish to determine the size $\alpha_0$ likelihood
ratio test of the hypotheses (9.7.7). Prove that the likelihood ratio test will reject
$H_0$ if either $V \le c_1$ or $V \ge c_2$, where $\Pr(V \le c_1) + \Pr(V \ge c_2) = \alpha_0$
when $\sigma_1^2 = \sigma_2^2$.

17. Prove that the test found in Exercise 9 is not a likeli- 
hood ratio test. 


18. Let $\delta$ be the two-sided F test that rejects $H_0$ in (9.7.3)
when either $V \le c_1$ or $V \ge c_2$ with $c_1 < c_2$. Prove that the
power function of $\delta$ is

$$\pi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2 \mid \delta) = G_{m-1,n-1}\left(\frac{\sigma_2^2}{\sigma_1^2} c_1\right) + 1 - G_{m-1,n-1}\left(\frac{\sigma_2^2}{\sigma_1^2} c_2\right).$$


19. Suppose that $X_1, \ldots, X_{11}$ form a random sample from
the normal distribution with unknown mean $\mu_1$ and unknown
variance $\sigma_1^2$. Suppose also that $Y_1, \ldots, Y_{21}$ form
an independent random sample from the normal distribution
with unknown mean $\mu_2$ and unknown variance $\sigma_2^2$.
Suppose that we wish to test the hypotheses in Eq. (9.7.7).
Let $\delta$ be the equal-tailed two-sided F test with level of
significance $\alpha_0 = 0.5$.

a. Compute the power function of $\delta$ when $\sigma_1^2 = 1.01\sigma_2^2$.

b. Compute the power function of $\delta$ when $\sigma_1^2 = \sigma_2^2/1.01$.

c. Show that $\delta$ is not an unbiased test. (You will probably
need computer software that computes the function $G_{m-1,n-1}$.
And try to minimize the amount of rounding you do.)


*9.8 Bayes Test Procedures 


Here we summarize how one tests hypotheses from the Bayesian perspective. The
general idea is to choose the action (reject $H_0$ or not) that leads to the smaller
posterior expected loss. We assume that the loss of making an incorrect decision is
larger than the loss of making a correct decision. Many of the Bayes test procedures
have the same forms as the tests we have already seen, but their interpretations are
different.


Simple Null and Alternative Hypotheses 


Example 
9.8.1 


Service Times in a Queue. In Example 9.2.1, a manager was trying to decide which of
two joint distributions better describes customer service times. She was comparing
the two joint p.d.f.’s $f_1$ and $f_0$ in Eqs. (9.2.1) and (9.2.2), respectively. Suppose that
there are costs involved in making a bad choice. For example, if she chooses a joint
distribution that models the service times as shorter than they really tend to be,
there may be a cost due to customers becoming frustrated and taking their business
elsewhere. On the other hand, if she chooses a joint distribution that models the
service times as longer than they really tend to be, there may be a cost due to hiring
additional unnecessary servers. How should the manager weigh these costs together
with available evidence about how long she believes service times tend to be in order
to choose between the two joint distributions? ◄
Consider a general problem in which the parameter space consists of two values
$\Omega = \{\theta_0, \theta_1\}$. If $\theta = \theta_i$ (for i = 0, 1), let $X_1, \ldots, X_n$
form a random sample from a distribution for which the p.d.f. or the p.f. is $f_i(x)$.
Suppose that it is desired to test the following simple hypotheses:

$$\begin{aligned} H_0 &: \theta = \theta_0, \\ H_1 &: \theta = \theta_1. \end{aligned} \tag{9.8.1}$$

We shall let $d_0$ denote the decision not to reject the hypothesis $H_0$ and let $d_1$
denote the decision to reject $H_0$. Also, we shall assume that the losses resulting from
choosing an incorrect decision are as follows: If decision $d_1$ is chosen when $H_0$ is
actually the true hypothesis (type I error), then the loss is $w_0$ units; if decision
$d_0$ is chosen when $H_1$ is actually the true hypothesis (type II error), then the loss is
$w_1$ units. If the decision $d_0$ is chosen when $H_0$ is the true hypothesis or if the
decision $d_1$ is chosen when $H_1$ is the true hypothesis, then the correct decision has
been made and the loss is 0. Thus, for i = 0, 1 and j = 0, 1, the loss $L(\theta_i, d_j)$ that
occurs when $\theta_i$ is the true value of $\theta$ and the decision $d_j$ is chosen is given
by the following table:

$$\begin{array}{c|cc} & d_0 & d_1 \\ \hline \theta_0 & 0 & w_0 \\ \theta_1 & w_1 & 0 \end{array} \tag{9.8.2}$$

Next, suppose that the prior probability that $H_0$ is true is $\xi_0$, and the prior
probability that $H_1$ is true is $\xi_1 = 1 - \xi_0$. Then the expected loss $r(\delta)$ of
each test procedure $\delta$ will be

$$r(\delta) = \xi_0 E(\text{Loss} \mid \theta = \theta_0) + \xi_1 E(\text{Loss} \mid \theta = \theta_1). \tag{9.8.3}$$

If $\alpha(\delta)$ and $\beta(\delta)$ again denote the probabilities of the two types of
errors for the procedure $\delta$, and if the table of losses just given is used, it follows
that

$$\begin{aligned} E(\text{Loss} \mid \theta = \theta_0) &= w_0 \Pr(\text{Choosing } d_1 \mid \theta = \theta_0) = w_0 \alpha(\delta), \\ E(\text{Loss} \mid \theta = \theta_1) &= w_1 \Pr(\text{Choosing } d_0 \mid \theta = \theta_1) = w_1 \beta(\delta). \end{aligned} \tag{9.8.4}$$

Hence,

$$r(\delta) = \xi_0 w_0 \alpha(\delta) + \xi_1 w_1 \beta(\delta). \tag{9.8.5}$$

A procedure $\delta$ for which this expected loss $r(\delta)$ is minimized is called a Bayes
test procedure.

Since $r(\delta)$ is simply a linear combination of the form $a\alpha(\delta) + b\beta(\delta)$
with $a = \xi_0 w_0$ and $b = \xi_1 w_1$, a Bayes test procedure can immediately be determined
from Theorem 9.2.1. Thus, a Bayes procedure will not reject $H_0$ whenever
$\xi_0 w_0 f_0(x) > \xi_1 w_1 f_1(x)$ and will reject $H_0$ whenever
$\xi_0 w_0 f_0(x) < \xi_1 w_1 f_1(x)$. We can either reject $H_0$ or not if
$\xi_0 w_0 f_0(x) = \xi_1 w_1 f_1(x)$. For simplicity, in the remainder of this section,
we shall assume that $H_0$ is rejected whenever $\xi_0 w_0 f_0(x) = \xi_1 w_1 f_1(x)$.
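The decision rule just described can be sketched in a few lines of code. Everything specific in the sketch is hypothetical and chosen only for illustration: the function names, the two exponential models standing in for $f_0$ and $f_1$, the data, and the values of the prior probability and the losses. The rule itself follows the text: reject $H_0$ when $\xi_0 w_0 f_0(x) \le \xi_1 w_1 f_1(x)$, with equality counted as rejection.

```python
import math


def bayes_test_rejects(x, f0, f1, xi0, w0, w1):
    """Return True if the Bayes test rejects H0, i.e. xi0*w0*f0(x) <= xi1*w1*f1(x)."""
    xi1 = 1.0 - xi0
    return xi0 * w0 * f0(x) <= xi1 * w1 * f1(x)


def exponential_joint_pdf(rate):
    """Joint p.d.f. of i.i.d. exponential observations with the given rate (illustrative)."""
    def pdf(x):
        return rate ** len(x) * math.exp(-rate * sum(x))
    return pdf


f0 = exponential_joint_pdf(1.0)    # hypothetical model under H0
f1 = exponential_joint_pdf(2.0)    # hypothetical model under H1 (shorter times)
short_times = [0.2, 0.4, 0.1, 0.3]
long_times = [1.5, 0.9, 2.2, 1.1]

print(bayes_test_rejects(short_times, f0, f1, xi0=0.5, w0=1.0, w1=1.0))
print(bayes_test_rejects(long_times, f0, f1, xi0=0.5, w0=1.0, w1=1.0))
# Multiplying both losses by the same positive constant leaves the decision unchanged.
print(bayes_test_rejects(long_times, f0, f1, xi0=0.5, w0=10.0, w1=10.0))
```

With equal prior probabilities and equal losses the rule reduces to comparing the two joint densities directly, so the short service times favor the hypothetical model under $H_1$ and the long ones favor the model under $H_0$.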


Note: Bayes Test Depends Only on the Ratio of Costs. Notice that choosing $\delta$ to
minimize $r(\delta)$ in Eq. (9.8.5) is not affected if we multiply $w_0$ and $w_1$ by the same
positive constant, such as $1/w_0$. That is, the Bayes test $\delta$ is also the test that
minimizes

$$r^*(\delta) = \xi_0 \alpha(\delta) + \xi_1 \frac{w_1}{w_0} \beta(\delta).$$

So, a decision maker does not need to choose both of the two costs of error, but
rather just the ratio of the two costs. One can think of choosing the ratio of costs as
a replacement for specifying a level of significance when selecting a test procedure.


Example 
9.8.2 
Service Times in a Queue. Suppose that the manager believes that each of the two
models for service times is equally likely before observing any data so that
$\xi_0 = \xi_1 = 1/2$. The model with joint p.d.f. $f_1$ predicts both extremely large service
times and extremely small service times to be more likely than does the model with joint
p.d.f. $f_0$. Suppose that the cost of modeling extremely large service times as being less
likely than they really are is the same as the cost of modeling extremely large service
times to be more likely than they really are. The ratio of the cost of type II error $w_1$
to the cost of type I error $w_0$ is then $w_1/w_0 = 1$. The Bayes test is then to choose
$d_1$ (reject $H_0$) if $f_0(x) < f_1(x)$. This is equivalent to $f_1(x)/f_0(x) > 1$. ◄


Tests Based on the Posterior Distribution 


From the Bayesian viewpoint, it is more natural to base a test on the posterior
distribution of $\theta$ rather than on the prior distribution and the probabilities of error
as we did in the preceding discussion. Fortunately, the same test procedure arises
regardless of how one derives it. For example, Exercise 5 in this section asks you to
prove that the test derived by minimizing a linear combination of error probabilities
is the same as what one would obtain by minimizing the posterior expected value
of the loss. The same is true in general when the losses are bounded, but the proof
is more difficult. For the remainder of this section, we shall take the more natural
approach of trying to minimize the posterior expected value of the loss directly.

Return again to the general situation in which the null hypothesis is
$H_0: \theta \in \Omega_0$ and the alternative hypothesis is $H_1: \theta \in \Omega_1$, where
$\Omega_0 \cup \Omega_1$ is the entire parameter space. As we did above, we shall let $d_0$
denote the decision not to reject the null hypothesis $H_0$ and let $d_1$ denote the decision
to reject $H_0$. As before, we shall assume that we incur a loss of $w_0$ by making decision
$d_1$ when $H_0$ is actually true, and a loss of $w_1$ is incurred if we make decision $d_0$
when $H_1$ is true. (More realistic loss functions are available, but this simple type of
loss will suffice for an introduction.) The loss function $L(\theta, d_i)$ can be summarized
in the following table:

$$\begin{array}{l|cc} & d_0 & d_1 \\ \hline \text{If } H_0 \text{ is true} & 0 & w_0 \\ \text{If } H_1 \text{ is true} & w_1 & 0 \end{array} \tag{9.8.6}$$


We shall now take the approach outlined in Exercise 5. Suppose that $\xi(\theta \mid x)$ is the
posterior p.d.f. for $\theta$. Then the posterior expected loss $r(d_i \mid x)$ for choosing
decision $d_i$ (i = 0, 1) is

$$r(d_i \mid x) = \int L(\theta, d_i)\, \xi(\theta \mid x)\, d\theta.$$

We can write a simpler formula for this posterior expected loss for each of i = 0, 1:

$$\begin{aligned} r(d_0 \mid x) &= \int_{\Omega_1} w_1\, \xi(\theta \mid x)\, d\theta = w_1 [1 - \Pr(H_0 \text{ true} \mid x)], \\ r(d_1 \mid x) &= \int_{\Omega_0} w_0\, \xi(\theta \mid x)\, d\theta = w_0 \Pr(H_0 \text{ true} \mid x). \end{aligned}$$

The Bayes test procedure is to choose the decision that has the smaller posterior
expected loss, that is, choose $d_0$ if $r(d_0 \mid x) < r(d_1 \mid x)$, and choose $d_1$ if
$r(d_0 \mid x) \ge r(d_1 \mid x)$. Using the expressions above, it is easy to see that the
inequality $r(d_0 \mid x) \ge r(d_1 \mid x)$ (when to reject $H_0$) can be rewritten as

$$\Pr(H_0 \text{ true} \mid x) \le \frac{w_1}{w_0 + w_1}, \tag{9.8.7}$$

just as in part (c) of Exercise 5.

The test procedure that rejects $H_0$ when (9.8.7) holds is the Bayes test in all
situations in which the loss function is given by the table in (9.8.6). This result holds
whether or not the distributions have monotone likelihood ratio, and it even applies
when the alternative is two-sided or when the parameter is discrete rather than
continuous. Furthermore, the Bayes test produces the same result if one were to
switch the names of $H_0$ and $H_1$, as well as the losses $w_0$ and $w_1$ and the names of
the decisions $d_0$ and $d_1$. (See Exercise 11 in this section.)

Despite the generality of (9.8.7), it is instructive to examine what the procedure 
looks like in special cases that we have already encountered. 
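Before turning to those special cases, the general rule can be stated compactly in code. The sketch below is an illustration only: the function names and the numerical losses are hypothetical. It computes the two posterior expected losses for the loss table (9.8.6) from a given value of Pr($H_0$ true | x) and confirms that choosing the smaller one agrees with the threshold form (9.8.7).

```python
def posterior_expected_losses(p_h0, w0, w1):
    """Return (r(d0|x), r(d1|x)) for the loss table (9.8.6), given Pr(H0 true | x)."""
    return w1 * (1.0 - p_h0), w0 * p_h0


def bayes_decision(p_h0, w0, w1):
    """Choose d1 (reject H0) when r(d0|x) >= r(d1|x); otherwise choose d0."""
    r_d0, r_d1 = posterior_expected_losses(p_h0, w0, w1)
    return "d1" if r_d0 >= r_d1 else "d0"


w0, w1 = 4.0, 1.0               # hypothetical losses: a type I error costs four times as much
threshold = w1 / (w0 + w1)      # right-hand side of (9.8.7)
for p_h0 in (0.05, 0.15, 0.50):
    agrees = (bayes_decision(p_h0, w0, w1) == "d1") == (p_h0 <= threshold)
    print(p_h0, bayes_decision(p_h0, w0, w1), agrees)
```

A larger $w_0$ relative to $w_1$ lowers the threshold, so stronger posterior evidence against $H_0$ is required before rejecting it.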


One-Sided Hypotheses 


Suppose that the family of distributions has a monotone likelihood ratio and that the
hypotheses are

$$\begin{aligned} H_0 &: \theta \le \theta_0, \\ H_1 &: \theta > \theta_0. \end{aligned} \tag{9.8.8}$$

We shall prove next that the Bayes procedure that rejects $H_0$ when (9.8.7) holds is a
one-sided test as in Theorem 9.3.1.


Suppose that $f_n(x \mid \theta)$ has a monotone likelihood ratio in the statistic T = r(X).
Let the hypotheses be as in Eq. (9.8.8), and assume that the loss function is of the form

$$\begin{array}{c|cc} & d_0 & d_1 \\ \hline \theta \le \theta_0 & 0 & w_0 \\ \theta > \theta_0 & w_1 & 0 \end{array}$$

where $w_0, w_1 > 0$ are constants. Then a test procedure that minimizes the posterior
expected loss is to reject $H_0$ when $T \ge c$ for some constant c (possibly infinite).


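Proof According to Bayes’ theorem for parameters and samples, (7.2.7), the posterior
p.d.f. $\xi(\theta \mid x)$ can be expressed as

$$\xi(\theta \mid x) = \frac{f_n(x \mid \theta)\, \xi(\theta)}{\int_\Omega f_n(x \mid \psi)\, \xi(\psi)\, d\psi}.$$

The ratio of the posterior expected loss from making decision $d_0$ to the posterior
expected loss from making decision $d_1$ after observing X = x is

$$\ell(x) = \frac{\int_{\theta_0}^{\infty} w_1\, \xi(\theta \mid x)\, d\theta}{\int_{-\infty}^{\theta_0} w_0\, \xi(\psi \mid x)\, d\psi} = \frac{w_1 \int_{\theta_0}^{\infty} f_n(x \mid \theta)\, \xi(\theta)\, d\theta}{w_0 \int_{-\infty}^{\theta_0} f_n(x \mid \psi)\, \xi(\psi)\, d\psi}. \tag{9.8.9}$$

What we need to prove is that $\ell(x) \ge 1$ is equivalent to $T \ge c$. It suffices to show
that $\ell(x)$ is a nondecreasing function in T = r(x). Let $x_1$ and $x_2$ be two possible
observations with the property that $r(x_1) \le r(x_2)$. We want to prove that
$\ell(x_1) \le \ell(x_2)$.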

We can write

$$\ell(x_1) - \ell(x_2) = \frac{w_1 \int_{\theta_0}^{\infty} f_n(x_1 \mid \theta)\,\xi(\theta)\,d\theta}{w_0 \int_{-\infty}^{\theta_0} f_n(x_1 \mid \psi)\,\xi(\psi)\,d\psi} - \frac{w_1 \int_{\theta_0}^{\infty} f_n(x_2 \mid \theta)\,\xi(\theta)\,d\theta}{w_0 \int_{-\infty}^{\theta_0} f_n(x_2 \mid \psi)\,\xi(\psi)\,d\psi}. \tag{9.8.10}$$

We can put the two fractions on the right side of Eq. (9.8.10) over the common
denominator $w_0^2 \int_{-\infty}^{\theta_0} f_n(x_2 \mid \psi)\,\xi(\psi)\,d\psi \int_{-\infty}^{\theta_0} f_n(x_1 \mid \psi)\,\xi(\psi)\,d\psi$. The numerator of the
resulting fraction is $w_0 w_1$ times

$$\int_{\theta_0}^{\infty} f_n(x_1 \mid \theta)\,\xi(\theta)\,d\theta \int_{-\infty}^{\theta_0} f_n(x_2 \mid \psi)\,\xi(\psi)\,d\psi - \int_{\theta_0}^{\infty} f_n(x_2 \mid \theta)\,\xi(\theta)\,d\theta \int_{-\infty}^{\theta_0} f_n(x_1 \mid \psi)\,\xi(\psi)\,d\psi. \tag{9.8.11}$$

We only need to show that (9.8.11) is at most 0. The difference in (9.8.11) can be
written as the double integral

$$\int_{\theta_0}^{\infty} \int_{-\infty}^{\theta_0} \xi(\theta)\,\xi(\psi) \left[ f_n(x_1 \mid \theta) f_n(x_2 \mid \psi) - f_n(x_2 \mid \theta) f_n(x_1 \mid \psi) \right] d\psi\,d\theta. \tag{9.8.12}$$

Notice that for all $\theta$ and $\psi$ in this double integral, $\theta \ge \theta_0 \ge \psi$. Since $r(x_1) \le r(x_2)$,
monotone likelihood ratio implies that

$$\frac{f_n(x_1 \mid \theta)}{f_n(x_1 \mid \psi)} - \frac{f_n(x_2 \mid \theta)}{f_n(x_2 \mid \psi)} \le 0.$$

If one multiplies both sides of this last expression by the product of the two
denominators, the result is

$$f_n(x_1 \mid \theta) f_n(x_2 \mid \psi) - f_n(x_2 \mid \theta) f_n(x_1 \mid \psi) \le 0. \tag{9.8.13}$$

Notice that the left side of Eq. (9.8.13) appears inside the square brackets in the
integrand of (9.8.12). Since this is nonpositive, it implies that (9.8.12) is at most 0,
and so (9.8.11) is at most 0. ∎


Example 9.8.3

Calorie Counts on Food Labels. In Example 7.3.10 on page 400, we were interested in
the percentage differences between the observed and advertised calorie counts for
nationally prepared foods. We modeled the differences $X_1, \ldots, X_{20}$ as normal random
variables with mean $\theta$ and variance 100. The prior for $\theta$ was a normal distribution
with mean 0 and variance 60. The family of normal distributions has a monotone
likelihood ratio in the statistic $\bar{X}_{20} = \frac{1}{20} \sum_{i=1}^{20} X_i$. The posterior distribution of $\theta$ is the
normal distribution with mean

$$\mu_1 = \frac{100 \times 0 + 20 \times 60 \times \bar{x}_{20}}{100 + 20 \times 60} = 0.923\,\bar{x}_{20}$$

and variance $v_1^2 = 4.62$. Suppose that we wish to test the null hypothesis $H_0: \theta \le 0$
versus the alternative $H_1: \theta > 0$. The posterior probability that $H_0$ is true is

$$\Pr(\theta \le 0 \mid x) = \Phi\left(\frac{0 - \mu_1}{v_1}\right) = \Phi(-0.429\,\bar{x}_{20}).$$

The Bayes test will reject $H_0$ if this probability is at most $w_1/(w_0 + w_1)$. Since $\Phi$
is a strictly increasing function, $\Phi(-0.429\,\bar{x}_{20}) \le w_1/(w_0 + w_1)$ if and only if
$\bar{x}_{20} \ge -\Phi^{-1}(w_1/(w_0 + w_1))/0.429$. This is in the form of a one-sided test. ◄
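The calculation in Example 9.8.3 can be reproduced with the standard library alone. This is a sketch under stated assumptions: the function names are illustrative, the cutoff formula relies on the prior mean being 0 as in the example, and the losses $w_0 = 19$, $w_1 = 1$ in the final lines are hypothetical values chosen only so that $w_1/(w_0 + w_1) = 0.05$.

```python
from statistics import NormalDist

n, sigma2 = 20, 100.0              # sample size and known variance of each X_i
prior_mean, prior_var = 0.0, 60.0  # normal prior for theta


def posterior(xbar):
    """Posterior mean and variance of theta given the sample mean xbar."""
    denom = sigma2 + n * prior_var
    mean = (sigma2 * prior_mean + n * prior_var * xbar) / denom
    var = sigma2 * prior_var / denom
    return mean, var


def prob_h0(xbar):
    """Pr(theta <= 0 | x), the posterior probability that H0 is true."""
    mean, var = posterior(xbar)
    return NormalDist(mean, var ** 0.5).cdf(0.0)


def rejects(xbar, w0, w1):
    """Bayes test (9.8.7) for H0: theta <= 0."""
    return prob_h0(xbar) <= w1 / (w0 + w1)


def cutoff(w0, w1):
    """Smallest sample mean that leads to rejection (uses prior_mean == 0)."""
    mean_at_one, var = posterior(1.0)
    slope = mean_at_one / var ** 0.5   # about 0.429
    return -NormalDist().inv_cdf(w1 / (w0 + w1)) / slope


# Hypothetical losses giving the threshold w1/(w0 + w1) = 0.05.
print(posterior(1.0), cutoff(19.0, 1.0))
```

With these losses the test rejects exactly when the sample mean is at least the printed cutoff, which is the one-sided form described above.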
Figure 9.16 Plot of $\Pr(|\theta| \le d \mid x)$ against $d$ for Example 9.8.4. The dotted lines
indicate that the median of the posterior distribution of $|\theta|$ is 1.455.


Two-Sided Alternatives 


On page 571, we argued that the hypotheses

$$\begin{aligned} H_0 &: \theta = \theta_0, \\ H_1 &: \theta \ne \theta_0 \end{aligned} \tag{9.8.14}$$

might be a useful surrogate for the null hypothesis that $\theta$ is close to $\theta_0$ against the
alternative that it is not close. If the prior distribution of $\theta$ is continuous, then the
posterior distribution will usually be continuous as well. In such cases, the posterior
probability that $H_0$ is true will be 0, and $H_0$ would be rejected without having to refer
to the data. If one believed that $\theta = \theta_0$ with positive probability, one should use a
prior distribution that is not continuous, but we shall not take that approach here.
(See a more advanced text, such as Schervish, 1995, section 4.2, for treatment of that
approach.) Instead, we can calculate the posterior probability that $\theta$ is close to $\theta_0$. If
this probability is too small, we can reject the null hypothesis that $\theta$ is close to $\theta_0$. To
be specific, let $d > 0$, and consider the hypotheses

$$\begin{aligned} H_0 &: |\theta - \theta_0| \le d, \\ H_1 &: |\theta - \theta_0| > d. \end{aligned} \tag{9.8.15}$$

Many experimenters might choose to test the hypotheses in (9.8.14) rather than those
in (9.8.15) because they are not ready to specify a particular value of $d$. In such cases,
one could calculate the posterior probability of $|\theta - \theta_0| \le d$ for all $d$ and draw a little
plot.

Example 9.8.4

Calorie Counts on Food Labels. Suppose that we wish to test the hypotheses (9.8.15)
with $\theta_0 = 0$ in the situation described in Example 9.8.3. In Example 7.3.10, we found
that the posterior distribution of $\theta$ was the normal distribution with mean 0.1154 and
variance 4.62. We can easily calculate

$$\Pr(|\theta - 0| \le d \mid x) = \Pr(-d \le \theta \le d \mid x) = \Phi\left(\frac{d - 0.1154}{4.62^{1/2}}\right) - \Phi\left(\frac{-d - 0.1154}{4.62^{1/2}}\right)$$

for every value of $d$ that we want. Figure 9.16 shows a plot of the posterior probability
that $|\theta|$ is at most $d$ for all values of $d$ between 0 and 5. In particular, we see that
$\Pr(|\theta| \le 5 \mid x)$ is very close to 1. If 5 percent is considered a small discrepancy, then we
can be pretty sure that $|\theta|$ is small. On the other hand, $\Pr(|\theta| > 1 \mid x)$ is greater than
0.6. If 1 percent is considered large, then there is a substantial chance that $|\theta|$ is large. ◄
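The curve described in Example 9.8.4 is easy to tabulate. The sketch below assumes only the posterior quoted in the example (normal with mean 0.1154 and variance 4.62); the function names and the bisection search for the median of $|\theta|$ are illustrative choices, not part of the text.

```python
from statistics import NormalDist

# Posterior of theta quoted in Example 9.8.4.
posterior = NormalDist(mu=0.1154, sigma=4.62 ** 0.5)


def prob_close(d, theta0=0.0):
    """Posterior probability that |theta - theta0| <= d."""
    return posterior.cdf(theta0 + d) - posterior.cdf(theta0 - d)


def median_abs(lo=0.0, hi=20.0, tol=1e-6):
    """Bisection for the d with prob_close(d) = 0.5 (prob_close increases in d)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prob_close(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for d in (0.5, 1, 2, 3, 5):
    print(f"d = {d:>3}: Pr(|theta| <= d | x) = {prob_close(d):.3f}")
print(f"median of |theta|: {median_abs():.3f}")
```

Because the inputs are the rounded values printed in the example, the computed median agrees with the figure's 1.455 only to about two decimal places.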


Figure 9.17 Histogram of parathion measurements on 77 celery samples.
Note: What Counts as a Meaningful Difference? The method illustrated in
Example 9.8.4 raises a useful point. In order to complete the test procedure, we need
to decide what counts as a meaningful difference between $\theta$ and $\theta_0$. Otherwise, we
cannot say whether or not the probability is large that a meaningful difference
exists. Forcing experimenters to think about what counts as a meaningful difference is
a good idea. Testing the hypotheses (9.8.14) at a fixed level, such as 0.05, does not
require anyone to think about what counts as a meaningful difference. Indeed, if an
experimenter did bother to decide what counted as a meaningful difference, it is not
clear how to make use of that information in choosing a significance level at which
to test the hypotheses in (9.8.14).


Testing the Mean of a Normal Distribution with Unknown Variance 


In Sec. 8.6, we considered the case in which a random sample is drawn from a normal
distribution with unknown mean and variance. We introduced a family of conjugate
prior distributions and found that the posterior distribution of a linear function of the
mean $\mu$ is a $t$ distribution. If we wish to test the null hypothesis that $\mu$ lies in an interval
using (9.8.7) as the condition for rejecting the null hypothesis, then we only need a
table or computer program to calculate the c.d.f. of an arbitrary $t$ distribution. Most
statistical software packages allow calculation of the c.d.f. and the quantile function of
an arbitrary $t$ distribution, and hence we can perform Bayes tests of null hypotheses
of the form $\mu \le \mu_0$, $\mu \ge \mu_0$, or $d_1 \le \mu \le d_2$.


Example 9.8.5

Pesticide Residue on Celery. Sharpe and Van Middelem (1955) describe an experiment
in which $n = 77$ samples of parathion residue were measured on celery after the
vegetable had been taken from fields sprayed with parathion. Figure 9.17 shows
a histogram of the observations. (Each concentration $Z$ in parts per million was
transformed to $X = 100(Z - 0.7)$ for ease of recording.) Suppose that we model the
$X$ values as normal with mean $\mu$ and variance $\sigma^2$. We will use an improper prior for
$\mu$ and $\sigma^2$. The sample average is $\bar{x}_{77} = 50.23$, and

$$s_n^2 = \sum_{i=1}^{77} (x_i - \bar{x}_{77})^2 = 34106.$$

As we saw in Eq. (8.6.21), this means that the posterior distribution of

$$\frac{n^{1/2}(\mu - \bar{x}_n)}{(s_n^2/(n-1))^{1/2}} = \frac{77^{1/2}(\mu - 50.23)}{(34106/76)^{1/2}} = 0.4142\mu - 20.81$$

is the $t$ distribution with 76 degrees of freedom. Suppose that we are interested in
testing the null hypothesis $H_0: \mu \ge 55$ against the alternative $H_1: \mu < 55$. Suppose
that our losses are described by (9.8.6). Then we should reject $H_0$ if its posterior
probability is at most $\alpha_0 = w_1/(w_0 + w_1)$. If we let $T_{n-1}$ stand for the c.d.f. of the $t$
distribution with $n - 1$ degrees of freedom, we can write this probability as

$$\Pr(\mu \ge 55 \mid x) = \Pr\left( \frac{n^{1/2}(\mu - \bar{x}_n)}{(s_n^2/(n-1))^{1/2}} \ge \frac{n^{1/2}(55 - \bar{x}_n)}{(s_n^2/(n-1))^{1/2}} \,\middle|\, x \right) = 1 - T_{n-1}\left( \frac{n^{1/2}(55 - \bar{x}_n)}{(s_n^2/(n-1))^{1/2}} \right). \tag{9.8.16}$$

Simple manipulation shows that this last probability is at most $\alpha_0$ if and only if
$U \le -T_{n-1}^{-1}(1 - \alpha_0)$, where $U$ is the random variable in Eq. (9.5.2) that was used to
define the $t$ test. Indeed, the level $\alpha_0$ $t$ test of $H_0$ versus $H_1$ is precisely to reject $H_0$
if $U \le -T_{n-1}^{-1}(1 - \alpha_0)$. For the data in this example, the probability in Eq. (9.8.16) is
$1 - T_{76}(1.974) = 0.026$. ◄
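The probability in Eq. (9.8.16) can be evaluated without a statistics package. The sketch below is one possible approach, not the book's: it integrates the $t$ density with Simpson's rule, and the names `t_pdf` and `t_cdf` are illustrative. The data summaries are the rounded values printed in Example 9.8.5, so the argument comes out near 1.976 rather than exactly 1.974.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """c.d.f. by Simpson's rule on [0, |x|], using symmetry about 0 (steps is even)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3
    return 0.5 + half if x >= 0 else 0.5 - half


n, xbar, s2 = 77, 50.23, 34106.0   # summaries quoted in Example 9.8.5
mu0 = 55.0
arg = math.sqrt(n) * (mu0 - xbar) / math.sqrt(s2 / (n - 1))
post_prob_h0 = 1.0 - t_cdf(arg, n - 1)   # Eq. (9.8.16)
print(f"argument = {arg:.3f}, Pr(mu >= 55 | x) = {post_prob_h0:.3f}")
```

For any losses with $w_1/(w_0 + w_1)$ above this posterior probability, for instance the hypothetical threshold 0.05, the Bayes test would reject $H_0$.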


Note: Look at Your Data. The histogram in Fig. 9.17 has a strange feature. Can you 
specify what it is? If you take a course in data analysis, you will probably learn some 
methods for dealing with data having features like this. 


Note: Bayes Tests for One-Sided Nulls with Improper Priors Are $t$ Tests. In
Example 9.8.5, we saw that the Bayes test for one-sided hypotheses was the level $\alpha_0$ $t$ test
for the same hypotheses where $\alpha_0 = w_1/(w_0 + w_1)$. This holds in general for normal
data with improper priors. It also follows that the p-values in these cases must be the
same as the posterior probabilities that the null hypotheses are true. (See Exercise 7
in this section.)


Comparing the Means of Two Normal Distributions 


Next, consider the case in which we shall observe two independent normal random
samples with common variance $\sigma^2$: $X_1, \ldots, X_m$ with mean $\mu_1$ and $Y_1, \ldots, Y_n$ with
mean $\mu_2$. In order to use the Bayesian approach, we need the posterior distribution
of $\mu_1 - \mu_2$. We could introduce a family of conjugate prior distributions for the three
parameters $\mu_1$, $\mu_2$, and $\tau = 1/\sigma^2$, and then proceed as we did in Sec. 8.6. For simplicity,
we shall only handle the case of improper priors in this section, although there are
proper conjugate priors that will lead to more general results. The usual improper
prior for each parameter $\mu_1$ and $\mu_2$ is the constant function 1, and the usual improper
prior for $\tau$ is $1/\tau$ for $\tau > 0$. If we combine these as if the parameters were independent,
the improper prior p.d.f. would be $\xi(\mu_1, \mu_2, \tau) = 1/\tau$ for $\tau > 0$. We can now find the
posterior joint distribution of the parameters.


Theorem 9.8.2. Suppose that $X_1, \ldots, X_m$ form a random sample from a normal distribution with
mean $\mu_1$ and precision $\tau$ while $Y_1, \ldots, Y_n$ form a random sample from a normal
distribution with mean $\mu_2$ and precision $\tau$. Suppose that the parameters have the
improper prior with "p.d.f." $\xi(\mu_1, \mu_2, \tau) = 1/\tau$ for $\tau > 0$. The posterior distribution
of

$$(m + n - 2)^{1/2}\, \frac{\mu_1 - \mu_2 - (\bar{x}_m - \bar{y}_n)}{\left( \frac{1}{m} + \frac{1}{n} \right)^{1/2} (s_x^2 + s_y^2)^{1/2}} \tag{9.8.17}$$

is the $t$ distribution with $m + n - 2$ degrees of freedom, where $s_x^2$ and $s_y^2$ are the
observed values of $S_X^2$ and $S_Y^2$, respectively. ∎


The proof of Theorem 9.8.2 is left as Exercise 8 because it is very similar to results
proven in Sec. 8.6.

For testing the hypotheses

$$\begin{aligned} H_0 &: \mu_1 - \mu_2 \le 0, \\ H_1 &: \mu_1 - \mu_2 > 0, \end{aligned}$$

we need the posterior probability that $\mu_1 - \mu_2 \le 0$, which is easily obtained from
the posterior distribution. Using the same idea as in Eq. (9.8.16), we can write
$\Pr(\mu_1 - \mu_2 \le 0 \mid x, y)$ as the probability that the random variable in (9.8.17) is at most
$-u$, where $u$ is the observed value of the random variable $U$ in Eq. (9.6.3). It follows
that

$$\Pr(\mu_1 - \mu_2 \le 0 \mid x, y) = T_{m+n-2}(-u),$$

where $T_{m+n-2}$ is the c.d.f. of the $t$ distribution with $m + n - 2$ degrees of freedom.
Hence, the posterior probability that $H_0$ is true is less than $w_1/(w_0 + w_1)$ if and only
if

$$T_{m+n-2}(-u) < \frac{w_1}{w_0 + w_1}.$$

This, in turn, is true if and only if

$$-u < T_{m+n-2}^{-1}\left( \frac{w_1}{w_0 + w_1} \right).$$

This is true if and only if

$$u > T_{m+n-2}^{-1}\left( 1 - \frac{w_1}{w_0 + w_1} \right). \tag{9.8.18}$$

If $\alpha_0 = w_1/(w_0 + w_1)$, then the Bayes test procedure that rejects $H_0$ when Eq. (9.8.18)
occurs is the same as the level $\alpha_0$ two-sample $t$ test derived in Sec. 9.6. Put another
way, the one-sided level $\alpha_0$ two-sample $t$ test rejects the null hypothesis $H_0$ if and only
if the posterior probability that $H_0$ is true (based on the improper prior) is at most $\alpha_0$.
It follows from Exercise 7 that the posterior probability of the null hypothesis being
true must equal the p-value in this case.


Example 9.8.6

Roman Pottery in Britain. In Example 9.6.3, we observed 14 samples of Roman pottery
from Llanederyn in Great Britain and another five samples from Ashley Rails, and
we were interested in whether the mean aluminum oxide percentage in Llanederyn
$\mu_1$ was larger than that in Ashley Rails $\mu_2$. We tested $H_0: \mu_1 \ge \mu_2$ against $H_1: \mu_1 < \mu_2$
and found that the p-value was $4 \times 10^{-6}$. If we had used an improper prior for the
parameters, then $\Pr(\mu_1 \ge \mu_2 \mid x) = 4 \times 10^{-6}$. ◄


Two-Sided Alternatives with Unknown Variance To test the hypothesis that the
mean $\mu$ of a normal distribution is close to $\mu_0$, we could specify a specific value $d$ and
test

$$\begin{aligned} H_0 &: |\mu - \mu_0| \le d, \\ H_1 &: |\mu - \mu_0| > d. \end{aligned}$$

If we do not feel comfortable selecting a single value of $d$ to represent "close," we
could compute $\Pr(|\mu - \mu_0| \le d \mid x)$ for all $d$ and draw a plot as we did in Example 9.8.4.
The case of testing that two means are close together can be dealt with in the same
way.


Figure 9.18 Plot of $\Pr(|\mu_1 - \mu_2| \le d \mid x)$ against $d$. The dotted lines indicate that
the median of the posterior distribution of $|\mu_1 - \mu_2|$ is 4.76.


Example 9.8.7

Roman Pottery in Britain. In Example 9.8.6, we tested one-sided hypotheses about the
difference in aluminum oxide contents in samples of pottery from two sites in Great
Britain. Unless we are specifically looking for a difference in a particular direction,
it might make more sense to test hypotheses of the form

$$\begin{aligned} H_0 &: |\mu_1 - \mu_2| \le d, \\ H_1 &: |\mu_1 - \mu_2| > d, \end{aligned} \tag{9.8.19}$$

where $d$ is some critical difference that is worth detecting. As we did in Example 9.8.4,
we can draw a plot that allows us to test all hypotheses of the form (9.8.19)
simultaneously. We just plot $\Pr(|\mu_1 - \mu_2| \le d \mid x)$ against $d$. The posterior distribution of $\mu_1 - \mu_2$
was found in Eq. (9.8.17), using the improper prior. In this case, the following random
variable has the $t$ distribution with 17 degrees of freedom:

$$(m + n - 2)^{1/2}\, \frac{\mu_1 - \mu_2 - (\bar{x}_m - \bar{y}_n)}{\left( \frac{1}{m} + \frac{1}{n} \right)^{1/2} (s_x^2 + s_y^2)^{1/2}} = 17^{1/2}\, \frac{\mu_1 - \mu_2 - (12.56 - 17.32)}{\left( \frac{1}{14} + \frac{1}{5} \right)^{1/2} (24.65 + 11.01)^{1/2}} = 1.33(\mu_1 - \mu_2 + 4.76),$$

where the data summaries come from Example 9.6.3. It follows that

$$\begin{aligned} \Pr(|\mu_1 - \mu_2| \le d \mid x) &= \Pr(1.33(-d + 4.76) \le 1.33(\mu_1 - \mu_2 + 4.76) \le 1.33(d + 4.76) \mid x) \\ &= T_{17}(1.33(d + 4.76)) - T_{17}(1.33(-d + 4.76)), \end{aligned}$$

where $T_{17}$ is the c.d.f. of the $t$ distribution with 17 degrees of freedom. Figure 9.18 is
the plot of this posterior probability against $d$. ◄
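A simulation offers an independent check on the curve in Example 9.8.7. The sketch below swaps the closed-form $t$ c.d.f. for Monte Carlo draws from the joint posterior, using the structure outlined in Exercise 8: $\tau$ has a gamma posterior, and given $\tau$ the two means are independent normals. The function names, the seed, and the number of draws are arbitrary choices, and the data summaries are the values quoted from Example 9.6.3.

```python
import random
import statistics

# Data summaries quoted in Example 9.8.7 (from Example 9.6.3).
m, n = 14, 5
xbar, ybar = 12.56, 17.32
s2x, s2y = 24.65, 11.01


def sample_difference(rng):
    """One posterior draw of mu1 - mu2 under the improper prior 1/tau."""
    shape = (m + n - 2) / 2
    rate = (s2x + s2y) / 2
    tau = rng.gammavariate(shape, 1.0 / rate)  # gammavariate takes a scale, not a rate
    mu1 = rng.gauss(xbar, (m * tau) ** -0.5)
    mu2 = rng.gauss(ybar, (n * tau) ** -0.5)
    return mu1 - mu2


rng = random.Random(0)  # fixed seed so the run is repeatable
abs_diffs = [abs(sample_difference(rng)) for _ in range(100_000)]


def prob_close(d):
    """Monte Carlo estimate of Pr(|mu1 - mu2| <= d | x, y)."""
    return sum(a <= d for a in abs_diffs) / len(abs_diffs)


for d in (3, 4, 4.76, 5, 6):
    print(f"d = {d}: {prob_close(d):.3f}")
print(f"estimated median of |mu1 - mu2|: {statistics.median(abs_diffs):.2f}")
```

The estimated median should land close to the 4.76 marked in Figure 9.18, up to simulation error.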


Comparing the Variances of Two Normal Distributions 


In order to test hypotheses concerning the variances of two normal distributions, we
can make use of the posterior distribution of the ratio of the two variances. Suppose
that $X_1, \ldots, X_m$ is a random sample from the normal distribution with mean $\mu_1$
and variance $\sigma_1^2$, and $Y_1, \ldots, Y_n$ is a random sample from the normal distribution
with mean $\mu_2$ and variance $\sigma_2^2$. If we model the $X$ data and associated parameters
as independent of the $Y$ data and associated parameters, then we can perform two
separate analyses just like the one in Sec. 8.6. In particular, we let $\tau_i = 1/\sigma_i^2$ for $i = 1, 2$,
and the joint posterior distribution will have $(\mu_1, \tau_1)$ independent of $(\mu_2, \tau_2)$ and
each pair will have a normal-gamma distribution just as in Sec. 8.6. For convenience,
we shall only do the remaining calculations using improper priors. With improper
priors, the posterior distribution of $\tau_1$ is the gamma distribution with parameters
$(m - 1)/2$ and $s_x^2/2$, where $s_x^2$ is defined in Theorem 9.8.2. We also showed in Sec. 8.6
(using Exercise 1 in Sec. 5.7) that $\tau_1 s_x^2$ has the $\chi^2$ distribution with $m - 1$ degrees of
freedom. Similarly, $\tau_2 s_y^2$ has the $\chi^2$ distribution with $n - 1$ degrees of freedom. Since
$\tau_1 s_x^2/(m - 1)$ and $\tau_2 s_y^2/(n - 1)$ are independent, their ratio has the $F$ distribution with
$m - 1$ and $n - 1$ degrees of freedom. That is, the posterior distribution of

$$\frac{\tau_1 s_x^2/(m - 1)}{\tau_2 s_y^2/(n - 1)} = \frac{s_x^2/[(m - 1)\sigma_1^2]}{s_y^2/[(n - 1)\sigma_2^2]} \tag{9.8.20}$$

is the $F$ distribution with $m - 1$ and $n - 1$ degrees of freedom. Notice that the
expression on the right side of Eq. (9.8.20) is the same as the random variable $V^*$
in Eq. (9.7.5). This is another case in which the sampling distribution of a random
variable is the same as its posterior distribution. It will then follow that level $\alpha_0$ tests
of one-sided hypotheses about $\sigma_1^2/\sigma_2^2$ based on the sampling distribution of $V^*$ will be
the same as Bayes tests of the form (9.8.7) so long as $\alpha_0 = w_1/(w_0 + w_1)$. The reader
can prove this in Exercise 9.
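The posterior probability of a one-sided hypothesis about the two variances can also be approximated by simulation, which avoids needing an $F$ table. This is a rough sketch that samples the two independent gamma posteriors described above; the function name is illustrative and the sample sizes and sums of squares in the usage lines are hypothetical, since the section gives no data for this case.

```python
import random


def prob_var1_le_var2(m, s2x, n, s2y, draws=100_000, seed=0):
    """Monte Carlo estimate of Pr(sigma1^2 <= sigma2^2 | data) under improper priors.

    tau_i = 1 / sigma_i^2 has a gamma posterior with shape (size - 1)/2 and
    rate s2/2, so sigma1^2 <= sigma2^2 exactly when tau1 >= tau2.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        tau1 = rng.gammavariate((m - 1) / 2, 2.0 / s2x)  # scale = 1 / rate
        tau2 = rng.gammavariate((n - 1) / 2, 2.0 / s2y)
        hits += tau1 >= tau2
    return hits / draws


# Hypothetical summaries: equal samples give about 1/2, while a much larger
# sum of squares in the first sample makes H0: sigma1^2 <= sigma2^2 unlikely.
print(prob_var1_le_var2(10, 50.0, 10, 50.0))
print(prob_var1_le_var2(10, 90.0, 10, 10.0))
```

Comparing the returned probability with $w_1/(w_0 + w_1)$ then gives the Bayes decision in (9.8.7) for $H_0: \sigma_1^2 \le \sigma_2^2$.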


Summary 


From a Bayesian perspective, one chooses a test procedure by minimizing the
posterior expected loss. When the loss has the simple form of (9.8.6), then the Bayes
test procedure is to reject $H_0$ when its posterior probability is at most $w_1/(w_0 + w_1)$.
In many one-sided cases, with improper priors, this procedure turns out to be the
same as the most commonly used level $\alpha_0 = w_1/(w_0 + w_1)$ test. In two-sided cases,
as an alternative to testing $H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$, one can draw a plot of
$\Pr(|\theta - \theta_0| \le d \mid x)$ against $d$. One then needs to decide which values of $d$ count as
meaningful differences.


Exercises 


1. Suppose that a certain industrial process can be either
in control or out of control, and that at any specified time
the prior probability that it will be in control is 0.9, and
the prior probability that it will be out of control is 0.1. A
single observation X of the output of the process is to be
taken, and it must be decided immediately whether the
process is in control or out of control. If the process is
in control, then X will have the normal distribution with
mean 50 and variance 1. If the process is out of control,
then X will have the normal distribution with mean 52 and
variance 1.

If it is decided that the process is out of control when
in fact it is in control, then the loss from unnecessarily
stopping the process will be $1000. If it is decided that the
process is in control when in fact it is out of control, then
the loss from continuing the process will be $18,000. If a
correct decision is made, then the loss will be 0. It is desired
to find a test procedure for which the expected loss will be
a minimum. For what values of X should it be decided that
the process is out of control?


2. A single observation $X$ is to be taken from a continuous
distribution for which the p.d.f. is either $f_0$ or $f_1$, where

$$f_0(x) = \begin{cases} 1 & \text{for } 0 < x < 1, \\ 0 & \text{otherwise,} \end{cases}$$

and

$$f_1(x) = \begin{cases} 4x^3 & \text{for } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases}$$

On the basis of the observation $X$, it must be decided
whether $f_0$ or $f_1$ is the correct p.d.f. Suppose that the prior
probability that $f_0$ is correct is 2/3 and the prior probability
that $f_1$ is correct is 1/3. Suppose also that the loss from
choosing the correct decision is 0, the loss from deciding
that $f_1$ is correct when in fact $f_0$ is correct is 1 unit, and
the loss from deciding that $f_0$ is correct when in fact $f_1$ is
correct is 4 units. If the expected loss is to be minimized,
for what values of $X$ should it be decided that $f_0$ is correct?


3. Suppose that a failure in a certain electronic system can 
occur because of either a minor or a major defect. Suppose 
also that 80 percent of the failures are caused by minor 
defects, and 20 percent of the failures are caused by major 
defects. When a failure occurs, n independent soundings 
X_1, ..., X_n are made on the system. If the failure was
caused by a minor defect, these soundings form a random 
sample from the Poisson distribution with mean 3. If the 
failure was caused by a major defect, these soundings form 
a random sample from a Poisson distribution for which 
the mean is 7. The cost of deciding that the failure was 
caused by a major defect when it was actually caused 
by a minor defect is $400. The cost of deciding that the 
failure was caused by a minor defect when it was actually 
caused by a major defect is $2500. The cost of choosing a 
correct decision is 0. For a given set of observed values of 
X_1, ..., X_n, which decision minimizes the expected cost?


4. Suppose that the proportion $p$ of defective items in a
large manufactured lot is unknown, and it is desired to test
the following simple hypotheses:

$$\begin{aligned} H_0 &: p = 0.3, \\ H_1 &: p = 0.4. \end{aligned}$$

Suppose that the prior probability that $p = 0.3$ is 1/4, and
the prior probability that $p = 0.4$ is 3/4; also suppose that
the loss from choosing an incorrect decision is 1 unit, and
the loss from choosing a correct decision is 0. Suppose that
a random sample of $n$ items is selected from the lot. Show
that the Bayes test procedure is to reject $H_0$ if and only if
the proportion of defective items in the sample is greater
than

$$\frac{\log\left(\frac{7}{6}\right) + \frac{1}{n}\log\left(\frac{1}{3}\right)}{\log\left(\frac{14}{9}\right)}.$$

5. Suppose that we wish to test the hypotheses (9.8.1). Let
the loss function have the form of (9.8.2).

a. Prove that the posterior probability of $\theta = \theta_0$ is
$\xi_0 f_0(x)/[\xi_0 f_0(x) + \xi_1 f_1(x)]$.

b. Prove that a test that minimizes $r(\delta)$ also minimizes
the posterior expected value of the loss given $X = x$
for all $x$.

c. Prove that the following test is one of the tests
described in part (b): "reject $H_0$ if $\Pr(H_0 \text{ true} \mid x) \le
w_1/(w_0 + w_1)$."


6. Prove that the conclusion of Theorem 9.8.1 still holds
when the loss function is given by

                             $d_0$              $d_1$
  $\theta \le \theta_0$        0            $w_0(\theta)$
  $\theta > \theta_0$      $w_1(\theta)$          0

for arbitrary positive functions $w_0(\theta)$ and $w_1(\theta)$. Hint:
Replicate the proof of Theorem 9.8.1, but replace the
constants $w_0$ and $w_1$ by the functions above and keep them
inside of the integrals instead of factoring them out.


7. Suppose that we have a situation in which the Bayes
test that rejects $H_0$ when $\Pr(H_0 \text{ true} \mid x) \le \alpha_0$ is the same
as the level $\alpha_0$ test of $H_0$ for all $\alpha_0$. (Example 9.8.5 has this
property, but so do many other situations.) Prove that the
p-value equals the posterior probability that $H_0$ is true.


8. In this exercise you will prove Theorem 9.8.2.

a. Prove that the joint p.d.f. of the data given the
parameters $\mu_1$, $\mu_2$, and $\tau$ can be written as a constant
times

$$\tau^{(m+n)/2} \exp\left( -0.5\,m\tau(\mu_1 - \bar{x}_m)^2 - 0.5\,n\tau(\mu_2 - \bar{y}_n)^2 - 0.5(s_x^2 + s_y^2)\tau \right).$$

b. Multiply the prior p.d.f. times the p.d.f. in part (a).
Bayes' theorem for random variables says that the
result is proportional (as a function of the parameters)
to the posterior p.d.f.

i. Show that the posterior p.d.f., as a function of
$\mu_1$ for fixed $\mu_2$ and $\tau$, is the p.d.f. of the normal
distribution with mean $\bar{x}_m$ and variance $(m\tau)^{-1}$.

ii. Show that the posterior p.d.f., as a function of
$\mu_2$ for fixed $\mu_1$ and $\tau$, is the p.d.f. of the normal
distribution with mean $\bar{y}_n$ and variance
$(n\tau)^{-1}$.

iii. Show that, conditional on $\tau$, $\mu_1$ and $\mu_2$ are
independent with the two normal distributions found
above.

iv. Show that the marginal posterior distribution
of $\tau$ is the gamma distribution with parameters
$(m + n - 2)/2$ and $(s_x^2 + s_y^2)/2$.

c. Show that the conditional distribution of

$$Z = \tau^{1/2}\, \frac{\mu_1 - \mu_2 - (\bar{x}_m - \bar{y}_n)}{\left( \frac{1}{m} + \frac{1}{n} \right)^{1/2}}$$

given $\tau$ is a standard normal distribution and hence
$Z$ is independent of $\tau$.

d. Show that the distribution of $W = (s_x^2 + s_y^2)\tau$ is the
gamma distribution with parameters $(m + n - 2)/2$
and 1/2, which is the same as the $\chi^2$ distribution with
$m + n - 2$ degrees of freedom.

e. Prove that $Z/(W/(m + n - 2))^{1/2}$ has the $t$ distribution
with $m + n - 2$ degrees of freedom and that it
equals the expression in Eq. (9.8.17).

9. Suppose that $X_1, \ldots, X_m$ form a random sample from
the normal distribution with mean $\mu_1$ and variance $\sigma_1^2$,
and $Y_1, \ldots, Y_n$ form a random sample from the normal
distribution with mean $\mu_2$ and variance $\sigma_2^2$. Suppose that
we use the usual improper prior and that we wish to test
the hypotheses

$$\begin{aligned} H_0 &: \sigma_1^2 \le \sigma_2^2, \\ H_1 &: \sigma_1^2 > \sigma_2^2. \end{aligned}$$

a. Prove that the level $\alpha_0$ $F$ test is the same as the test
in (9.8.7) when $\alpha_0 = w_1/(w_0 + w_1)$.

b. Prove that the p-value for the $F$ test is the posterior
probability that $H_0$ is true.


10. Consider again the situation in Example 9.6.2. Let $\mu_1$
be the mean of log-rainfall from seeded clouds, and let $\mu_2$
be the mean of log-rainfall from unseeded clouds. Use the
improper prior for the parameters.

a. Find the posterior distribution of $\mu_1 - \mu_2$.

b. Draw a graph of the posterior probability that
$|\mu_1 - \mu_2| \le d$ as a function of $d$.
11. Let $\theta$ be a general parameter taking values in a parameter
space $\Omega$. Let $\Omega' \cup \Omega'' = \Omega$ be a partition of $\Omega$ into two
disjoint sets $\Omega'$ and $\Omega''$. We want to choose between two
decisions: $d'$ says that $\theta \in \Omega'$, and $d''$ says that $\theta \in \Omega''$. We
have the following loss function:

                                $d'$      $d''$
  If $\theta \in \Omega'$         0       $w'$
  If $\theta \in \Omega''$      $w''$       0

We have two choices for expressing this decision problem
as a hypothesis-testing problem. One choice would be to
define $H_0: \theta \in \Omega'$ and $H_1: \theta \in \Omega''$. The other choice would
be to define $H_0: \theta \in \Omega''$ and $H_1: \theta \in \Omega'$. In this problem,
we show that the Bayes test makes the same decision
regardless of which hypothesis we call the null and which
we call the alternative.

a. For each choice, say how we would define each of
the following in order to make this problem fit the
hypothesis-testing framework described in this section:
$w_0$, $w_1$, $d_0$, $d_1$, $\Omega_0$, and $\Omega_1$.

b. Now suppose that we can observe data $X = x$ and
compute the posterior distribution of $\theta$, $\xi(\theta \mid x)$. Show
that, for each of the two setups constructed in the
previous part, the Bayes test chooses the same decision
$d'$ or $d''$. That is, observing $x$ leads to choosing
$d'$ in the first setup if and only if observing $x$ leads to
choosing $d'$ in the second setup. Similarly, observing
$x$ leads to choosing $d''$ in the first setup if and only if
observing $x$ leads to choosing $d''$ in the second setup.


*9.9 Foundational Issues 


We discuss the relationship between significance level and sample size. We also 
distinguish between results that are significant in the statistical sense and those that 
are significant in a practical sense. 


The Relationship between Level of Significance and Sample Size 


In many statistical applications, it has become standard practice for an experimenter
to specify a level of significance $\alpha_0$, and then to find a test procedure with a large
power function on the alternative hypothesis among all procedures whose size $\alpha(\delta) \le
\alpha_0$. Alternatively, the experimenter will compute a p-value and report whether or
not it was less than $\alpha_0$. For the case of testing simple null and alternative hypotheses,
the Neyman-Pearson lemma explicitly describes how to construct such a procedure.
Furthermore, it has become traditional in many applications to choose the level of
significance $\alpha_0$ to be 0.10, 0.05, or 0.01. The selected level depends on how serious the
consequences of an error of type I are judged to be. The value of $\alpha_0$ most commonly
used is 0.05. If the consequences of an error of type I are judged to be relatively mild
in a particular problem, the experimenter may choose $\alpha_0$ to be 0.10. On the other
hand, if these consequences are judged to be especially serious, the experimenter
may choose $\alpha_0$ to be 0.01.

Because these values of $\alpha_0$ have become established in statistical practice, the
choice of $\alpha_0 = 0.01$ is sometimes made by an experimenter who wishes to use a
cautious test procedure, or one that will not reject $H_0$ unless the sample data provide
strong evidence that $H_0$ is not true. We shall now show, however, that when the sample
size $n$ is large, the choice of $\alpha_0 = 0.01$ can actually lead to a test procedure that will
reject $H_0$ for certain samples that, in fact, provide stronger evidence for $H_0$ than they
do for $H_1$.

To illustrate this property, suppose, as in Example 9.2.5, that a random sample
is taken from the normal distribution with unknown mean $\theta$ and known variance 1,
and that the hypotheses to be tested are

$$\begin{aligned} H_0 &: \theta = 0, \\ H_1 &: \theta = 1. \end{aligned}$$

It follows from the discussion in Example 9.2.5 that, among all test procedures for
which $\alpha(\delta) \le 0.01$, the probability of type II error $\beta(\delta)$ will be a minimum for the
procedure $\delta^*$ that rejects $H_0$ when $\bar{X}_n \ge k'$, where $k'$ is chosen so that $\Pr(\bar{X}_n \ge k' \mid \theta =
0) = 0.01$. When $\theta = 0$, the random variable $\bar{X}_n$ has the normal distribution with mean
0 and variance $1/n$. Therefore, it can be found from a table of the standard normal
distribution that $k' = 2.326\,n^{-1/2}$.

Furthermore, it follows from Eq. (9.2.12) that this test procedure $\delta^*$ is
equivalent to rejecting $H_0$ when $f_1(x)/f_0(x) \ge k$, where $k = \exp(2.326\,n^{1/2} - 0.5n)$. The
probability of an error of type I will be $\alpha(\delta^*) = 0.01$. Also, by an argument
similar to the one leading to Eq. (9.2.15), the probability of an error of type II will be
$\beta(\delta^*) = \Phi(2.326 - n^{1/2})$, where $\Phi$ denotes the c.d.f. of the standard normal
distribution. For $n = 1$, 25, and 100, the values of $\beta(\delta^*)$ and $k$ are as follows:

  $n$     $\alpha(\delta^*)$     $\beta(\delta^*)$              $k$
  1       0.01                   0.91                           6.21
  25      0.01                   0.0038                         0.42
  100     0.01                   $8 \times 10^{-15}$            $2.5 \times 10^{-12}$


It can be seen from this tabulation that when $n = 1$, the null hypothesis $H_0$ will be
rejected only if the likelihood ratio $f_1(x)/f_0(x)$ exceeds the value $k = 6.21$. In other
words, $H_0$ will not be rejected unless the observed values $x_1, \ldots, x_n$ in the sample are
at least 6.21 times as likely under $H_1$ as they are under $H_0$. In this case, the procedure
$\delta^*$ therefore satisfies the experimenter's desire to use a test that is cautious about
rejecting $H_0$.

If $n = 100$, however, the procedure $\delta^*$ will reject $H_0$ whenever the likelihood
ratio exceeds the value $k = 2.5 \times 10^{-12}$. Therefore, $H_0$ will be rejected for certain
observed values $x_1, \ldots, x_n$ that are actually millions of times more likely under $H_0$
than they are under $H_1$. The reason for this result is that the value of $\beta(\delta^*)$ that can be
achieved when $n = 100$, which is $8 \times 10^{-15}$, is extremely small relative to the specified
value $\alpha_0 = 0.01$. Hence, the procedure $\delta^*$ actually turns out to be much more cautious
about an error of type II than it is about an error of type I. We can see from this
discussion that a value of $\alpha_0$ that is an appropriate choice for a small value of $n$ might
be unnecessarily large for a large value of $n$. Hence, it would be sensible to let the
level of significance $\alpha_0$ decrease as the sample size increases.
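The entries in the tabulation can be recomputed directly from the formulas $k' = 2.326\,n^{-1/2}$, $k = \exp(2.326\,n^{1/2} - 0.5n)$, and $\beta(\delta^*) = \Phi(2.326 - n^{1/2})$. The sketch below uses only the standard library; the function names are illustrative, and the lower tail is computed with `math.erfc` so that the tiny value at $n = 100$ is not lost to rounding.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.99)  # about 2.326


def lower_tail(x):
    """Phi(x) for a standard normal, accurate for very negative x."""
    return 0.5 * math.erfc(-x / math.sqrt(2))


def level_01_test(n):
    """Return (k_prime, k, beta) for the level 0.01 test delta* with sample size n."""
    k_prime = z / math.sqrt(n)                 # reject when the sample mean >= k_prime
    k = math.exp(z * math.sqrt(n) - 0.5 * n)   # equivalent likelihood-ratio cutoff
    beta = lower_tail(z - math.sqrt(n))        # type II error probability at theta = 1
    return k_prime, k, beta


for n in (1, 25, 100):
    k_prime, k, beta = level_01_test(n)
    print(f"n = {n:>3}: k' = {k_prime:.4f}, k = {k:.3g}, beta = {beta:.2g}")
```

The output shows the likelihood-ratio cutoff dropping below 1 well before $n = 100$, which is the point of the discussion above.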
Suppose now that the experimenter regards an error of type I to be much more
serious than an error of type II, and she therefore desires to use a test procedure for
which the value of the linear combination $100\alpha(\delta) + \beta(\delta)$ will be a minimum. Then it
follows from Theorem 9.2.1 that she should reject $H_0$ if and only if the likelihood
ratio exceeds the value $k = 100$, regardless of the sample size $n$. In other words,
the procedure that minimizes the value of $100\alpha(\delta) + \beta(\delta)$ will not reject $H_0$ unless
the observed values $x_1, \ldots, x_n$ are at least 100 times as likely under $H_1$ as they are
under $H_0$.

From this discussion, it seems more reasonable for the experimenter to take the
values of both $\alpha(\delta)$ and $\beta(\delta)$ into account when choosing a test procedure, rather
than to fix a value of $\alpha(\delta)$ and minimize $\beta(\delta)$. For example, one could minimize the
value of a linear combination of the form $a\alpha(\delta) + b\beta(\delta)$. In Sec. 9.8, we saw how the
Bayesian point of view also leads to the conclusion that one should try to minimize
a linear combination of this form. Lehmann (1958) suggested choosing a number $k$
and requiring that $\beta(\delta) = k\alpha(\delta)$. Both the Bayesian method and Lehmann's method
have the advantage of forcing the probabilities of both type I and type II errors to
decrease as one obtains more data. Similar problems with fixing the significance level
of a test arise when hypotheses are composite, as we illustrate later in this section.
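For the normal example with $H_0: \theta = 0$ and $H_1: \theta = 1$, the effect of fixing the likelihood-ratio cutoff at $k = 100$ can be computed explicitly. The sketch below relies on the fact that the likelihood ratio here equals $\exp(n\bar{x}_n - n/2)$, so that rejecting when it exceeds $k$ is the same as rejecting when $\bar{x}_n > 1/2 + \log(k)/n$; the function names are illustrative.

```python
import math


def upper_tail(x):
    """Pr(Z >= x) for a standard normal Z, accurate in the far tail."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def errors_with_lr_cutoff(n, k=100.0):
    """Error probabilities of the test that rejects H0 when f1(x)/f0(x) > k.

    For N(theta, 1) data with H0: theta = 0 and H1: theta = 1, the likelihood
    ratio is exp(n * xbar - n / 2), so the test rejects when xbar > 1/2 + log(k)/n.
    """
    cutoff = 0.5 + math.log(k) / n
    alpha = upper_tail(math.sqrt(n) * cutoff)          # xbar ~ N(0, 1/n) under H0
    beta = upper_tail(math.sqrt(n) * (1.0 - cutoff))   # xbar ~ N(1, 1/n) under H1
    return alpha, beta


for n in (1, 25, 100):
    alpha, beta = errors_with_lr_cutoff(n)
    print(f"n = {n:>3}: alpha = {alpha:.2g}, beta = {beta:.2g}, "
          f"100*alpha + beta = {100 * alpha + beta:.2g}")
```

In this example the minimized value of $100\alpha(\delta) + \beta(\delta)$ shrinks rapidly as $n$ grows, and at $n = 100$ both error probabilities are far below 0.01, in contrast to the fixed-level test.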


Statistically Significant Results 


When the observed data lead to rejecting a null hypothesis $H_0$ at level $\alpha_0$, it is often
said that one has obtained a result that is statistically significant at level $\alpha_0$. When
this occurs, it does not mean that the experimenter should behave as if $H_0$ is false.
Similarly, if the data do not lead to rejecting $H_0$, the result is not statistically significant
at level $\alpha_0$, but the experimenter should not necessarily become convinced that $H_0$
is true. Indeed, qualifying "significant" with the term "statistically" is a warning that
a statistically significant result might be different from a practically significant result.
Consider, once again, Example 9.5.10 on page 582, in which the hypotheses to be
tested are

$$H_0: \mu = 5.2, \qquad H_1: \mu \neq 5.2.$$


It is extremely important for the experimenter to distinguish a statistically significant
result from any claim that the parameter $\mu$ is significantly different from the
hypothesized value 5.2. Even if the data suggest that $\mu$ is not equal to 5.2, this does not
necessarily provide any evidence that the actual value of $\mu$ is significantly different
from 5.2. For a given set of data, the tail area corresponding to the observed value of
the test statistic $U$ might be very small, and yet the data might suggest that the actual
value of $\mu$ is so close to 5.2 that, for practical purposes, the experimenter would not
regard $\mu$ as being significantly different from 5.2.

The situation just described can arise when the statistic $U$ is based on a very large
random sample. Suppose, for instance, that in Example 9.5.10 the lengths of 20,000
fibers in a random sample are measured, rather than the lengths of only 15 fibers. For
a given level of significance, say, $\alpha_0 = 0.05$, let $\pi(\mu, \sigma^2 \mid \delta)$ denote the power function
of the $t$ test based on these 20,000 observations. Then $\pi(5.2, \sigma^2 \mid \delta) = 0.05$ for every
value of $\sigma^2 > 0$. However, because of the very large number of observations on which
the test is based, the power $\pi(\mu, \sigma^2 \mid \delta)$ will be very close to 1 for each value of $\mu$ that
differs only slightly from 5.2 and for a moderate value of $\sigma^2$. In other words, even
if the value of $\mu$ differs only slightly from 5.2, the probability is close to 1 that one
would obtain a statistically significant result. For example, with $n = 20{,}000$, the power
of the level 0.05 test when $|\mu - 5.2| = 0.03\sigma$ is 0.99.
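
The quoted power can be checked approximately with a short Python sketch. It assumes
that the deviation is 0.03 standard deviations ($0.03\sigma$) and it replaces the $t$ test
by its normal approximation, which should be adequate for a sample this large. The
function name `approx_two_sided_power` is hypothetical.

```python
import math
from statistics import NormalDist

def approx_two_sided_power(n, delta_in_sd, level=0.05):
    """Normal approximation to the power of a two-sided level-`level` t test
    when |mu - mu0| equals `delta_in_sd` standard deviations."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - level / 2.0)    # two-sided critical value
    shift = delta_in_sd * math.sqrt(n)           # standardized shift of the mean
    # Probability of landing in either rejection region under the shift.
    return std_normal.cdf(shift - z) + std_normal.cdf(-shift - z)

print(round(approx_two_sided_power(20000, 0.03), 3))   # large sample
print(round(approx_two_sided_power(15, 0.03), 3))      # original sample size
```

With only 15 observations the same deviation leaves the power barely above the level of
the test, which is why the issue arises only for very large samples.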

As explained in Sec. 9.4, it is inconceivable that the mean length $\mu$ of all the
fibers in the entire population will be exactly 5.2. However, $\mu$ may be very close to
5.2, and when it is, the experimenter will not want to reject the null hypothesis $H_0$.
Nevertheless, it is very likely that the $t$ test based on the sample of 20,000 fibers will
lead to a statistically significant result. Therefore, when an experimenter analyzes a
powerful test based on a very large sample, he must exercise caution in interpreting
the actual significance of a "statistically significant" result. He knows in advance that
there is a high probability of rejecting $H_0$ even when the true value of $\mu$ differs only
slightly from the value 5.2 specified under $H_0$.

One way to handle this situation, as discussed earlier in this section, is to rec- 
ognize that a level of significance much smaller than the traditional value of 0.05 or 
0.01 is appropriate for a problem with a large sample size. Another way is to replace 
the single value of $\mu$ in the null hypothesis by an interval, as we did on pages 571
and 610. A third way is to regard the statistical problem as one of estimation rather 
than one of testing hypotheses. 

When a large random sample is available, the sample mean and the sample variance
will be excellent estimators of the parameters $\mu$ and $\sigma^2$. Before the experimenter
chooses any decision involving the unknown values of $\mu$ and $\sigma^2$, she should calculate
and consider the values of these estimators as well as the value of the statistic $U$.


Summary 


When we reject a null hypothesis, we say that we have obtained a statistically
significant result. The power function of a level $\alpha_0$ test becomes very large, even for
parameter values close to the null hypothesis, as the size of the sample increases. For
the case of simple hypotheses, the probability of type II error can become very small
while the probability of type I error stays as large as $\alpha_0$. One way to avoid this is
to let the level of significance decrease as the sample size increases. If one rejects a
null hypothesis at a particular level of significance $\alpha_0$, one must be careful to check
whether the data actually suggest any deviation of practical importance from the null
hypothesis.


Exercises 


1. Suppose that a single observation $X$ is taken from the
normal distribution with unknown mean $\mu$ and known
variance 1. Suppose that it is known that the value of
$\mu$ must be $-5$, 0, or 5, and it is desired to test the following
hypotheses at the level of significance 0.05:

$$H_0: \mu = 0, \qquad H_1: \mu = -5 \text{ or } \mu = 5.$$

Suppose also that the test procedure to be used specifies
rejecting $H_0$ when $|X| > c$, where the constant $c$ is chosen
so that $\Pr(|X| > c \mid \mu = 0) = 0.05$.

a. Find the value of $c$, and show that if $X = 2$, then $H_0$
will be rejected.

b. Show that if $X = 2$, then the value of the likelihood
function at $\mu = 0$ is 12.2 times as large as its value at
$\mu = 5$ and is $5.9 \times 10^9$ times as large as its value at
$\mu = -5$.


2. Suppose that a random sample of 10,000 observations is
taken from the normal distribution with unknown mean
$\mu$ and known variance 1, and it is desired to test the
following hypotheses at the level of significance 0.05:

$$H_0: \mu = 0, \qquad H_1: \mu \neq 0.$$

Suppose also that the test procedure specifies rejecting
$H_0$ when $|\bar{X}_n| \ge c$, where the constant $c$ is chosen so that
$\Pr(|\bar{X}_n| \ge c \mid \mu = 0) = 0.05$. Find the probability that the
test will reject $H_0$ if (a) the actual value of $\mu$ is 0.01, and
(b) the actual value of $\mu$ is 0.02.


3. Consider again the conditions of Exercise 2, but suppose
now that it is desired to test the following hypotheses:

$$H_0: \mu \le 0, \qquad H_1: \mu > 0.$$

Suppose also that in the random sample of 10,000 observations,
the sample mean $\bar{X}_n$ is 0.03. At what level of
significance is this result just significant?


4. Suppose that $X_1, \ldots, X_n$ comprise a random sample
from the normal distribution with unknown mean $\theta$ and
known variance 1. Suppose that it is desired to test the
same hypotheses as in Exercise 3. This time, however,
the test procedure $\delta$ will be chosen so as to minimize
$19\pi(0 \mid \delta) + 1 - \pi(0.5 \mid \delta)$.

a. Find the value $c_n$ so that the test procedure $\delta$ rejects
$H_0$ if $\bar{X}_n \ge c_n$ for each value $n = 1$, $n = 100$, and $n = 10{,}000$.

b. For each value of $n$ in part (a), find the size of the test
procedure $\delta$.


5. Suppose that $X_1, \ldots, X_n$ comprise a random sample
from the normal distribution with unknown mean $\theta$ and
variance 1. Suppose that it is desired to test the same
hypotheses as in Exercise 3. This time, however, the test
procedure $\delta$ will be chosen so that $19\pi(0 \mid \delta) = 1 - \pi(0.5 \mid \delta)$.

a. Find the value $c_n$ so that the test procedure $\delta$ rejects
$H_0$ if $\bar{X}_n \ge c_n$ for each value $n = 1$, $n = 100$, and $n = 10{,}000$.

b. For each value of $n$ in part (a), find the size of the test
procedure $\delta$.


9.10 Supplementary Exercises 


1. I will flip a coin three times and let $X$ stand for the
number of times that the coin comes up heads. Let $\theta$ stand
for the probability that the coin comes up heads on a single
flip, and assume that the flips are independent given $\theta$. I
wish to test the null hypothesis $H_0: \theta = 1/2$ against the
alternative hypothesis $H_1: \theta = 3/4$. Find the test $\delta$ that
minimizes $\alpha(\delta) + \beta(\delta)$, the sum of the type I and type II
error probabilities, and find the two error probabilities for
the test.


2. Suppose that a sequence of Bernoulli trials is to be
carried out with an unknown probability $\theta$ of success on
each trial, and the following hypotheses are to be tested:

$$H_0: \theta = 0.1, \qquad H_1: \theta = 0.2.$$

Let $X$ denote the number of trials required to obtain a
success, and suppose that $H_0$ is to be rejected if $X \le 5$.
Determine the probabilities of errors of type I and type II.


3. Consider again the conditions of Exercise 2. Suppose 
that the losses from errors of type I and type II are equal, 
and the prior probabilities that $H_0$ and $H_1$ are true are
equal. Determine the Bayes test procedure based on the 
observation X. 


4. Suppose that a single observation $X$ is to be drawn from
the following p.d.f.:

$$f(x \mid \theta) = \begin{cases} 2(1 - \theta)x + \theta & \text{for } 0 \le x \le 1, \\ 0 & \text{otherwise,} \end{cases}$$

where the value of $\theta$ is unknown ($0 \le \theta \le 2$). Suppose also
that the following hypotheses are to be tested:

$$H_0: \theta = 2, \qquad H_1: \theta = 0.$$

Determine the test procedure $\delta$ for which $\alpha(\delta) + 2\beta(\delta)$ is
a minimum, and calculate this minimum value.


5. Consider again the conditions of Exercise 4, and suppose
that $\alpha(\delta)$ is required to be a given value $\alpha_0$ ($0 < \alpha_0 < 1$).
Determine the test procedure $\delta$ for which $\beta(\delta)$ will be
a minimum, and calculate this minimum value.


6. Consider again the conditions of Exercise 4, but suppose
now that the following hypotheses are to be tested:

$$H_0: \theta \ge 1, \qquad H_1: \theta < 1.$$

a. Determine the power function of the test $\delta$ that specifies
rejecting $H_0$ if $X > 0.9$.

b. What is the size of the test $\delta$?


7. Consider again the conditions of Exercise 4. Show that
the p.d.f. $f(x \mid \theta)$ has a monotone likelihood ratio in the
statistic $r(X) = -X$, and determine a UMP test of the
following hypotheses at the level of significance $\alpha_0 = 0.05$:

$$H_0: \theta \le \tfrac{1}{2}, \qquad H_1: \theta > \tfrac{1}{2}.$$

8. Suppose that a box contains a large number of chips of
three different colors, red, brown, and blue, and it is desired
to test the null hypothesis $H_0$ that chips of the three
colors are present in equal proportions against the alternative
hypothesis $H_1$ that they are not present in equal
proportions. Suppose that three chips are to be drawn at
random from the box, and $H_0$ is to be rejected if and only
if at least two of the chips have the same color.

a. Determine the size of the test.

b. Determine the power of the test if 1/7 of the chips
are red, 2/7 are brown, and 4/7 are blue.


9. Suppose that a single observation X is to be drawn from 
an unknown distribution P, and that the following simple 
hypotheses are to be tested: 


$H_0$: $P$ is the uniform distribution on the interval [0, 1],

$H_1$: $P$ is the standard normal distribution.

Determine the most powerful test of size 0.01, and calculate
the power of the test when $H_1$ is true.


10. Suppose that the 12 observations $X_1, \ldots, X_{12}$ form
a random sample from the normal distribution with unknown
mean $\mu$ and unknown variance $\sigma^2$. Describe how
to carry out a $t$ test of the following hypotheses at the level
of significance $\alpha_0 = 0.005$:

$$H_0: \mu \ge 3, \qquad H_1: \mu < 3.$$

11. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with unknown mean $\theta$ and known
variance 1, and it is desired to test the following hypotheses:

$$H_0: \theta \le 0, \qquad H_1: \theta > 0.$$

Suppose also that it is decided to use a UMP test for which
the power is 0.95 when $\theta = 1$. Determine the size of this
test if $n = 16$.


12. Suppose that eight observations $X_1, \ldots, X_8$ are drawn
at random from a distribution with the following p.d.f.:

$$f(x \mid \theta) = \begin{cases} \theta x^{\theta - 1} & \text{for } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases}$$

Suppose also that the value of $\theta$ is unknown ($\theta > 0$), and
it is desired to test the following hypotheses:

$$H_0: \theta \le 1, \qquad H_1: \theta > 1.$$

Show that the UMP test at the level of significance $\alpha_0 = 0.05$
specifies rejecting $H_0$ if $\sum_{i=1}^{8} \log X_i \ge -3.981$.

13. Suppose that $X_1, \ldots, X_n$ form a random sample from
the $\chi^2$ distribution with unknown degrees of freedom $\theta$
($\theta = 1, 2, \ldots$), and it is desired to test the following
hypotheses at a given level of significance $\alpha_0$ ($0 < \alpha_0 < 1$):

$$H_0: \theta \le 8, \qquad H_1: \theta \ge 9.$$

Show that there exists a UMP test, and the test specifies
rejecting $H_0$ if $\sum_{i=1}^{n} \log X_i \ge k$ for some appropriate
constant $k$.


14. Suppose that $X_1, \ldots, X_{10}$ form a random sample from
a normal distribution for which both the mean and the
variance are unknown. Construct a statistic that does not
depend on any unknown parameters and has the $F$ distribution
with three and five degrees of freedom.


15. Suppose that $X_1, \ldots, X_m$ form a random sample from
the normal distribution with unknown mean $\mu_1$ and unknown
variance $\sigma_1^2$, and that $Y_1, \ldots, Y_n$ form an independent
random sample from the normal distribution with
unknown mean $\mu_2$ and unknown variance $\sigma_2^2$. Suppose
also that it is desired to test the following hypotheses with
the usual $F$ test at the level of significance $\alpha_0 = 0.05$:

$$H_0: \sigma_1^2 \le \sigma_2^2, \qquad H_1: \sigma_1^2 > \sigma_2^2.$$

Assuming that $m = 16$ and $n = 21$, show that the power of
the test when $\sigma_1^2 = 2\sigma_2^2$ is given by $\Pr(V^* > 1.1)$, where $V^*$
is a random variable having the $F$ distribution with 15 and
20 degrees of freedom.


16. Suppose that the nine observations $X_1, \ldots, X_9$ form
a random sample from the normal distribution with unknown
mean $\mu_1$ and unknown variance $\sigma^2$, and the nine
observations $Y_1, \ldots, Y_9$ form an independent random
sample from the normal distribution with unknown mean
$\mu_2$ and the same unknown variance $\sigma^2$. Let $S_X^2$ and $S_Y^2$ be
as defined in Eq. (9.6.2) (with $m = n = 9$), and let

$$T = \max\left\{ \frac{S_X^2}{S_Y^2}, \frac{S_Y^2}{S_X^2} \right\}.$$

Determine the value of the constant $c$ such that $\Pr(T > c) = 0.05$.


17. An unethical experimenter desires to test the follow- 
ing hypotheses: 


$$H_0: \theta = \theta_0, \qquad H_1: \theta \neq \theta_0.$$

She draws a random sample $X_1, \ldots, X_n$ from a distribution
with the p.d.f. $f(x \mid \theta)$, and carries out a test of size $\alpha$. If
this test does not reject $H_0$, she discards the sample, draws
a new independent random sample of $n$ observations, and
repeats the test based on the new sample. She continues
drawing new independent samples in this way until she
obtains a sample for which $H_0$ is rejected.

a. What is the overall size of this testing procedure?

b. If $H_0$ is true, what is the expected number of samples
that the experimenter will have to draw until she
rejects $H_0$?


18. Suppose that $X_1, \ldots, X_n$ form a random sample from
the normal distribution with unknown mean $\mu$ and unknown
precision $\tau$, and the following hypotheses are to
be tested:

$$H_0: \mu \le 3, \qquad H_1: \mu > 3.$$

Suppose that the prior joint distribution of $\mu$ and $\tau$ is
the normal-gamma distribution, as described in Theorem
8.6.1, with $\mu_0 = 3$, $\lambda_0 = 1$, $\alpha_0 = 1$, and $\beta_0 = 1$. Suppose finally
that $n = 17$, and it is found from the observed values
in the sample that $\bar{X}_n = 3.2$ and $\sum_{i=1}^{n} (X_i - \bar{X}_n)^2 = 17$.
Determine both the prior probability and the posterior
probability that $H_0$ is true.


19. Consider a problem of testing hypotheses in which the
following hypotheses about an arbitrary parameter $\theta$ are
to be tested:

$$H_0: \theta \in \Omega_0, \qquad H_1: \theta \in \Omega_1.$$

Suppose that $\delta$ is a test procedure of size $\alpha$ ($0 < \alpha < 1$)
based on some vector of observations $X$, and let $\pi(\theta \mid \delta)$
denote the power function of $\delta$. Show that if $\delta$ is unbiased,
then $\pi(\theta \mid \delta) \ge \alpha$ at every point $\theta \in \Omega_1$.


20. Consider again the conditions of Exercise 19. Suppose
now that we have a two-dimensional vector $\theta = (\theta_1, \theta_2)$,
where $\theta_1$ and $\theta_2$ are real-valued parameters. Suppose also
that $A$ is a particular circle in the $\theta_1\theta_2$-plane, and that the
hypotheses to be tested are as follows:

$$H_0: \theta \in A, \qquad H_1: \theta \notin A.$$

Show that if the test procedure $\delta$ is unbiased and of size $\alpha$,
and if its power function $\pi(\theta \mid \delta)$ is a continuous function
of $\theta$, then it must be true that $\pi(\theta \mid \delta) = \alpha$ at each point $\theta$
on the boundary of the circle $A$.


21. Consider again the conditions of Exercise 19. Suppose
now that $\theta$ is a real-valued parameter, and the following
hypotheses are to be tested:

$$H_0: \theta = \theta_0, \qquad H_1: \theta \neq \theta_0.$$

Assume that $\theta_0$ is an interior point of the parameter
space $\Omega$. Show that if the test procedure $\delta$ is unbiased and
if its power function $\pi(\theta \mid \delta)$ is a differentiable function of
$\theta$, then $\pi'(\theta_0 \mid \delta) = 0$, where $\pi'(\theta_0 \mid \delta)$ denotes the derivative
of $\pi(\theta \mid \delta)$ evaluated at the point $\theta = \theta_0$.


22. Suppose that the differential brightness $\theta$ of a certain
star has an unknown value, and it is desired to test the
following simple hypotheses:

$$H_0: \theta = 0, \qquad H_1: \theta = 10.$$

The statistician knows that when he goes to the observatory
at midnight to measure $\theta$, there is probability 1/2 that
the meteorological conditions will be good, and he will be
able to obtain a measurement $X$ having the normal distribution
with mean $\theta$ and variance 1. He also knows that
there is probability 1/2 that the meteorological conditions
will be poor, and he will obtain a measurement $Y$ having
the normal distribution with mean $\theta$ and variance 100. The
statistician also learns whether the meteorological conditions
were good or poor.

a. Construct the most powerful test that has conditional
size $\alpha = 0.05$, given good meteorological conditions,
and one that has conditional size $\alpha = 0.05$, given poor
meteorological conditions.

b. Construct the most powerful test that has conditional
size $\alpha = 2.0 \times 10^{-7}$, given good meteorological
conditions, and one that has conditional size
$\alpha = 0.0999998$, given poor meteorological conditions.
(You will need a computer program to do this.)

c. Show that the overall size of both the test found in
part (a) and the test found in part (b) is 0.05, and
determine the power of each of these two tests.


23. Consider again the situation described in Exercise 22.
This time, assume that there is a loss function of the form
(9.8.6). Also, assume that the prior probability of $\theta = 0$ is
$\xi_0$ and the prior probability of $\theta = 10$ is $\xi_1$.

a. Find the formula for the Bayes test for a general loss
function of the form (9.8.6).

b. Prove that the test in part (a) of Exercise 22 is not a
special case of the Bayes test found in part (a) of the
present exercise.

c. Prove that the test in part (b) of Exercise 22 is (up to
rounding error) a special case of the Bayes test found
in part (a) of the present exercise.


24. Let $X_1, \ldots, X_n$ be i.i.d. with the Poisson distribution
having mean $\theta$. Let $Y = \sum_{i=1}^{n} X_i$.

a. Suppose that we wish to test the hypotheses $H_0: \theta \ge 1$
versus $H_1: \theta < 1$. Show that the test "reject $H_0$ if
$Y = 0$" is uniformly most powerful level $\alpha_0$ for some
number $\alpha_0$. Also find $\alpha_0$.

b. Find the power function of the test from part (a).


25. Consider a family of distributions with parameter $\theta$
and monotone likelihood ratio in a statistic $T$. We learned
how to find a uniformly most powerful level $\alpha_0$ test $\delta_c$ of
the null hypothesis $H_{0,c}: \theta \le c$ versus $H_{1,c}: \theta > c$ for every
$c$. We also know that these tests are equivalent to a coefficient
$1 - \alpha_0$ confidence interval, where the confidence
interval contains $c$ if and only if $\delta_c$ does not reject $H_{0,c}$.
The confidence interval is called uniformly most accurate
coefficient $1 - \alpha_0$. Based on the equivalence of the tests
and the confidence interval, figure out what the definition
of "uniformly most accurate coefficient $1 - \alpha_0$" must be.
Write the definition in terms of the conditional probability
that the interval covers $\theta_1$, given that $\theta = \theta_2$ for various
pairs of values $\theta_1$ and $\theta_2$.
10.1 Tests of Goodness-of-Fit


In some problems, we have one specific distribution in mind for the data we will 
observe. If that one distribution is not appropriate, we do not necessarily have a 
parametric family of alternative distributions in mind. In these cases, and others, we 
can still test the null hypothesis that the data come from the one specific distribution
against the alternative hypothesis that the data do not come from that distribution. 


Description of Nonparametric Problems 


Failure Times of Ball Bearings. In Example 5.6.9, we observed the failure times of 23 
ball bearings, and we modeled the logarithms of these failure times as normal random 
variables. Suppose that we are not so confident that the normal distribution is a good 
model for the logarithms of the failure times. Is there a way to test the null hypothesis 
that a normal distribution is a good model against the alternative that no normal 
distribution is a good model? Is there a way to estimate features of the distribution 
of failure times (such as the median, variance, etc.) if we are unwilling to model the 
data as normal random variables?


In each of the problems of estimation and testing hypotheses that we considered 
in Chapters 7, 8, and 9, we have assumed that the observations that are available to the 
statistician come from distributions for which the exact form is known, even though 
the values of some parameters are unknown. For example, it might be assumed 
that the observations form a random sample from a Poisson distribution for which 
the mean is unknown, or it might be assumed that the observations come from 
two normal distributions for which the means and variances are unknown. In other 
words, we have assumed that the observations come from a certain parametric family 
of distributions, and a statistical inference must be made about the values of the 
parameters defining that family. 

In many of the problems to be discussed in this chapter, we shall not assume that 
the available observations come from a particular parametric family of distributions. 
Rather, we shall study inferences that can be made about the distribution from which 
the observations come, without making special assumptions about the form of that 
distribution. As one example, we might simply assume that the observations form 
a random sample from a continuous distribution, without specifying the form of
this distribution any further, and we might then investigate the possibility that this 
distribution is a normal distribution. As a second example, we might be interested 
in making an inference about the value of the median of the distribution from which 
the sample was drawn, and we might assume only that this distribution is continuous. 
As a third example, we might be interested in investigating the possibility that two 
independent random samples actually come from the same distribution, and we 
might assume only that both distributions from which the samples are taken are 
continuous. 

Problems in which the possible distributions of the observations are not re- 
stricted to a specific parametric family are called nonparametric problems, and the 
statistical methods that are applicable in such problems are called nonparametric 
methods. 


Categorical Data 


Blood Types. In Example 5.9.3, we learned about a study of blood types among a 
sample of 6004 white Californians. Suppose that the actual counts of people with the 
four blood types are given in Table 10.1. We might be interested in whether or not 
these data are consistent with a theory that predicts a particular set of probabilities 
for the blood types. Table 10.2 gives theoretical probabilities for the four blood types. 
How can we go about testing the null hypothesis that the theoretical probabilities in 
Table 10.2 are the probabilities with which the data in Table 10.1 were sampled?


In this section and the next four sections, we shall consider statistical problems 
based on data such that each observation can be classified as belonging to one of 
a finite number of possible categories or types. Observations of this type are called 
categorical data. Since there are only a finite number of possible categories in these 
problems, and since we are interested in making inferences about the probabilities of 
these categories, these problems actually involve just a finite number of parameters. 
However, as we shall see, methods based on categorical data can be usefully applied 
in both parametric and nonparametric problems. 


Table 10.1 Counts of blood types for white Californians

   A       B       AB      O
   2162    738     228     2876


Table 10.2 Theoretical probabilities of blood types for white Californians

   A       B       AB      O
   1/3     1/8     1/24    1/2
The $\chi^2$ Test


Suppose that a large population consists of items of $k$ different types, and let $p_i$
denote the probability that an item selected at random will be of type $i$ ($i = 1, \ldots, k$).
Example 10.1.2 is of this type with $k = 4$. Of course, $p_i \ge 0$ for $i = 1, \ldots, k$ and
$\sum_{i=1}^{k} p_i = 1$. Let $p_1^0, \ldots, p_k^0$ be specific numbers such that $p_i^0 > 0$ for $i = 1, \ldots, k$
and $\sum_{i=1}^{k} p_i^0 = 1$, and suppose that the following hypotheses are to be tested:

$$H_0: p_i = p_i^0 \text{ for } i = 1, \ldots, k, \qquad H_1: p_i \neq p_i^0 \text{ for at least one value of } i. \tag{10.1.1}$$

We shall assume that a random sample of size $n$ is to be taken from the given
population. That is, $n$ independent observations are to be taken, and there is
probability $p_i$ that each observation will be of type $i$ ($i = 1, \ldots, k$). On the basis of these
$n$ observations, the hypotheses (10.1.1) are to be tested.

For $i = 1, \ldots, k$, we shall let $N_i$ denote the number of observations in the random
sample that are of type $i$. Thus, $N_1, \ldots, N_k$ are nonnegative integers such that
$\sum_{i=1}^{k} N_i = n$. Indeed, $(N_1, \ldots, N_k)$ has the multinomial distribution (see Sec. 5.9)
with parameters $n$ and $p = (p_1, \ldots, p_k)$. When the null hypothesis $H_0$ is true, the
expected number of observations of type $i$ is $np_i^0$ ($i = 1, \ldots, k$). The difference
between the actual number of observations $N_i$ and the expected number $np_i^0$ will tend
to be smaller when $H_0$ is true than when $H_0$ is not true. It seems reasonable, therefore,
to base a test of the hypotheses (10.1.1) on values of the differences $N_i - np_i^0$ for
$i = 1, \ldots, k$ and reject $H_0$ when the magnitudes of these differences are relatively
large.

In 1900, Karl Pearson proved the following result, whose proof will not be given
here.


Theorem 10.1.1  $\chi^2$ Statistic. The following statistic

$$Q = \sum_{i=1}^{k} \frac{(N_i - np_i^0)^2}{np_i^0} \tag{10.1.2}$$

has the property that if $H_0$ is true and the sample size $n \to \infty$, then $Q$ converges in
distribution to the $\chi^2$ distribution with $k - 1$ degrees of freedom. (See Definition 6.3.1.)


Theorem 10.1.1 says that if $H_0$ is true and the sample size $n$ is large, the distribution
of $Q$ will be approximately the $\chi^2$ distribution with $k - 1$ degrees of freedom. The
discussion that we have presented indicates that $H_0$ should be rejected when $Q \ge c$,
where $c$ is an appropriate constant. If it is desired to carry out the test at the level of
significance $\alpha_0$, then $c$ should be chosen to be the $1 - \alpha_0$ quantile of the $\chi^2$ distribution
with $k - 1$ degrees of freedom. This test is called the $\chi^2$ test of goodness-of-fit.


Note: General form of $\chi^2$ test statistic. The form of the statistic $Q$ in (10.1.2) is
common to all $\chi^2$ tests including those that will be introduced later in this chapter.
The form is a sum of terms, each of which is the square of the difference between
an observed count and an expected count divided by the expected count:
$\sum (\text{observed} - \text{expected})^2 / \text{expected}$. The expected counts are computed under the
assumption that the null hypothesis is true.

Whenever the value of each expected count, $np_i^0$ ($i = 1, \ldots, k$), is not too small,
the $\chi^2$ distribution will be a good approximation to the actual distribution of $Q$.
Specifically, the approximation will be very good if $np_i^0 \ge 5$ for $i = 1, \ldots, k$, and the
approximation should still be satisfactory if $np_i^0 \ge 1.5$ for $i = 1, \ldots, k$.
We shall now illustrate the use of the $\chi^2$ test of goodness-of-fit by some examples.


Blood Types. In Example 10.1.2, we have specified a hypothetical vector of
probabilities $(p_1^0, \ldots, p_4^0)$ for the four blood types in Table 10.2. We can use the data in
Table 10.1 to test the null hypothesis $H_0$ that the probabilities $(p_1, \ldots, p_4)$ of the four
blood types equal $(p_1^0, \ldots, p_4^0)$. The four expected counts under $H_0$ are

$$np_1^0 = 6004 \times \tfrac{1}{3} = 2001.3, \qquad np_2^0 = 6004 \times \tfrac{1}{8} = 750.5,$$

$$np_3^0 = 6004 \times \tfrac{1}{24} = 250.2, \qquad \text{and} \qquad np_4^0 = 6004 \times \tfrac{1}{2} = 3002.0.$$

The $\chi^2$ test statistic is then

$$Q = \frac{(2162 - 2001.3)^2}{2001.3} + \frac{(738 - 750.5)^2}{750.5} + \frac{(228 - 250.2)^2}{250.2} + \frac{(2876 - 3002.0)^2}{3002.0} = 20.37.$$

To test $H_0$ at level $\alpha_0$, we would compare $Q$ to the $1 - \alpha_0$ quantile of the $\chi^2$ distribution
with three degrees of freedom. Alternatively, we can compute the $p$-value, which
would be the smallest $\alpha_0$ at which we could reject $H_0$. In the case of the $\chi^2$ goodness-of-fit
test, the $p$-value equals $1 - \mathrm{X}^2_{k-1}(Q)$, where $\mathrm{X}^2_{k-1}$ is the c.d.f. of the $\chi^2$ distribution
with $k - 1$ degrees of freedom. In this example, $k = 4$ and the $p$-value is $1.42 \times 10^{-4}$.
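
The blood-type calculation can be reproduced with a few lines of Python. This is only a
sketch using the standard library; the function names are hypothetical, and the tail
probability uses the closed form that holds specifically for three degrees of freedom.
Because the code keeps the expected counts unrounded, its value of $Q$ differs from the
hand calculation in the second decimal place.

```python
import math

def chi_square_statistic(observed, null_probs):
    """Pearson's Q = sum (N_i - n p_i^0)^2 / (n p_i^0)."""
    n = sum(observed)
    return sum((obs - n * p) ** 2 / (n * p)
               for obs, p in zip(observed, null_probs))

def chi2_upper_tail_3df(q):
    """Pr(chi-square with 3 d.f. > q), closed form valid for 3 d.f. only."""
    return (math.erfc(math.sqrt(q / 2.0))
            + math.sqrt(2.0 * q / math.pi) * math.exp(-q / 2.0))

counts = [2162, 738, 228, 2876]          # Table 10.1: A, B, AB, O
probs = [1 / 3, 1 / 8, 1 / 24, 1 / 2]    # Table 10.2
q = chi_square_statistic(counts, probs)
p_value = chi2_upper_tail_3df(q)
print(round(q, 2), p_value)
```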


Montana Outlook Poll. The Bureau of Business and Economic Research at the Uni- 
versity of Montana conducted a poll of opinions of Montana residents in May 1992. 
Among other things, respondents were asked whether their personal financial status 
was worse, the same, or better than one year ago. Table 10.3 displays some results. We 
might be interested in whether the respondents’ answers are uniformly distributed 
over the three possible responses. That is, we can test the null hypothesis that the 
probabilities of the three responses are all equal to 1/3. We calculate 


$$Q = \frac{(58 - 189/3)^2}{189/3} + \frac{(64 - 189/3)^2}{189/3} + \frac{(67 - 189/3)^2}{189/3} = 0.6667.$$

Since 0.6667 is the 0.283 quantile of the $\chi^2$ distribution with two degrees of freedom,
we would only reject the null at levels greater than $1 - 0.283 = 0.717$.
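
As a quick check of the poll calculation, the sketch below recomputes $Q$ from the counts
in Table 10.3. It relies on the fact that, for two degrees of freedom, the $\chi^2$ c.d.f.
has the simple form $1 - e^{-q/2}$; the variable names are illustrative.

```python
import math

observed = [58, 64, 67]                  # Table 10.3: worse, same, better
n = sum(observed)
expected = n / 3                         # equal probabilities under H0
q = sum((count - expected) ** 2 / expected for count in observed)

# For 2 degrees of freedom the chi-square c.d.f. is 1 - exp(-q/2).
cdf_at_q = 1.0 - math.exp(-q / 2.0)
p_value = 1.0 - cdf_at_q
print(round(q, 4), round(cdf_at_q, 3), round(p_value, 3))
```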


Table 10.3 Responses to personal financial status question from Montana Outlook Poll

   Worse   Same    Better  Total
   58      64      67      189


Testing Hypotheses about a Proportion. Suppose that the proportion $p$ of defective
items in a large population of manufactured items is unknown and that the following
hypotheses are to be tested:

$$H_0: p = 0.1, \qquad H_1: p \neq 0.1. \tag{10.1.3}$$

Suppose also that in a random sample of 100 items, it is found that 16 are defective.
We shall test the hypotheses (10.1.3) by carrying out a $\chi^2$ test of goodness-of-fit.

Since there are only two types of items in this example, namely, defective items
and nondefective items, we know that $k = 2$. Furthermore, if we let $p_1$ denote the
unknown proportion of defective items and let $p_2$ denote the unknown proportion
of nondefective items, then the hypotheses (10.1.3) can be rewritten in the following
form:

$$H_0: p_1 = 0.1 \text{ and } p_2 = 0.9, \qquad H_1: \text{The hypothesis } H_0 \text{ is not true.} \tag{10.1.4}$$

For the sample size $n = 100$, the expected number of defective items if $H_0$ is
true is $np_1^0 = 10$, and the expected number of nondefective items is $np_2^0 = 90$. Let $N_1$
denote the number of defective items in the sample, and let $N_2$ denote the number
of nondefective items in the sample. Then, when $H_0$ is true, the distribution of the
statistic $Q$ defined by Eq. (10.1.2) will be approximately the $\chi^2$ distribution with one
degree of freedom.

In this example, $N_1 = 16$ and $N_2 = 84$, and it is found that the value of $Q$ is 4. It
can now be determined, either from interpolation in a table of the $\chi^2$ distribution
with one degree of freedom or from statistical software, that the tail area ($p$-value)
corresponding to the value $Q = 4$ is approximately 0.0455. Hence, the null hypothesis
$H_0$ would be rejected at levels of significance greater than 0.0455, but not at smaller
levels. For hypotheses about a single proportion, we developed tests in Sec. 9.1. (See
Exercise 11 in Sec. 9.1, for example.) You can compare the test from Sec. 9.1 to the
test in this example in Exercise 1 at the end of this section.
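
The value $Q = 4$ and its tail area can be verified with the following sketch. It uses
the identity that a $\chi^2$ random variable with one degree of freedom is the square of a
standard normal variable, so its upper tail probability at $q$ is $\operatorname{erfc}(\sqrt{q/2})$.
The variable names are illustrative.

```python
import math

n = 100
null_probs = [0.1, 0.9]        # defective, nondefective under H0
observed = [16, 84]

q = sum((obs - n * p) ** 2 / (n * p) for obs, p in zip(observed, null_probs))

# With 1 degree of freedom, Pr(chi-square > q) = erfc(sqrt(q / 2)).
p_value = math.erfc(math.sqrt(q / 2.0))
print(round(q, 4), round(p_value, 4))
```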


Testing Hypotheses about a Continuous Distribution 


Consider a random variable X that takes values in the interval 0 < X < 1 but has an 
unknown p.d.f. over this interval. Suppose that a random sample of 100 observations 
is taken from this unknown distribution, and it is desired to test the null hypothe- 
sis that the distribution is the uniform distribution on the interval [0, 1] against the 
alternative hypothesis that the distribution is not uniform. This problem is a nonpara- 
metric problem, since the distribution of X might be any continuous distribution on 
the interval [0, 1]. However, as we shall now show, the $\chi^2$ test of goodness-of-fit can
be applied to this problem. 

Suppose that we divide the interval [0, 1] into 20 subintervals of equal length, 
namely, the interval [0, 0.05), the interval [0.05, 0.10), and so on. If the actual distri- 
bution is a uniform distribution, then the probability that each observation will fall 
within the ith subinterval is 1/20, fori = 1, ..., 20. Since the sample size in this exam- 
ple isn = 100, it follows that the expected number of observations in each subinterval 
is 5. If N; denotes the number of observations in the sample that actually fall within 
the ith subinterval, then the statistic Q defined by Eq. (10.1.2) can be rewritten simply 
as follows: 


1 20 
O= Yi; aS). (10.1.5) 
i=l 
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If the null hypothesis is true, and the distribution from which the observations 
were taken is indeed a uniform distribution, then Q will have approximately the 
x? distribution with 19 degrees of freedom. 
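The simplification that leads to Eq. (10.1.5) can be written out in one line. With n = 100 and null probability $p_i^0 = 1/20$ for each subinterval, every expected count $n p_i^0$ equals 5, so the general statistic of Eq. (10.1.2) reduces as follows:

```latex
Q = \sum_{i=1}^{20} \frac{(N_i - n p_i^0)^2}{n p_i^0}
  = \sum_{i=1}^{20} \frac{(N_i - 5)^2}{5}
  = \frac{1}{5} \sum_{i=1}^{20} (N_i - 5)^2 .
```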

The method that has been presented in this example obviously can be applied 
to every continuous distribution. To test whether a random sample of observations 
comes from a particular distribution, the following procedure can be adopted: 


i. Partition the entire real line, or any particular interval that has probability 1,
into a finite number k of disjoint subintervals. Generally, k is chosen so that the
expected number of observations in each subinterval is at least 5 if $H_0$ is true.

ii. Determine the probability $p_i^0$ that the particular hypothesized distribution
would assign to the ith subinterval, and calculate the expected number $n p_i^0$
of observations in the ith subinterval (i = 1, ..., k).

iii. Count the number $N_i$ of observations in the sample that fall within the ith
subinterval (i = 1, ..., k).

iv. Calculate the value of Q as defined by Eq. (10.1.2). If the hypothesized
distribution is correct, Q will have approximately the $\chi^2$ distribution with $k - 1$
degrees of freedom.
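Steps i through iv can be sketched as a small routine. This is a minimal illustration under stated assumptions, not a reference implementation: the names `gof_test` and `chi2_sf` are hypothetical, the tail-area helper handles integer degrees of freedom only, and the data are simulated pseudo-random numbers rather than observations from the text.

```python
import math
import random

def chi2_sf(x, df):
    """P(X > x) for a chi-square variable with integer df (finite closed-form sums)."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-h) * sum_{j=0}^{m-1} h^j / j!
        return math.exp(-h) * sum(h ** j / math.factorial(j) for j in range(df // 2))
    # Odd df = 2m + 1: erfc(sqrt(h)) + exp(-h) * sum_{j=1}^{m} h^(j - 1/2) / Gamma(j + 1/2)
    m = (df - 1) // 2
    return math.erfc(math.sqrt(h)) + math.exp(-h) * sum(
        h ** (j - 0.5) / math.gamma(j + 0.5) for j in range(1, m + 1))

def gof_test(sample, cdf, cut_points):
    """Chi-square test of a fully specified continuous c.d.f.

    Step i: cut_points are the interior boundaries, giving the cells
    (-inf, c_1), [c_1, c_2), ..., [c_{k-1}, inf).
    """
    n = len(sample)
    k = len(cut_points) + 1
    # Step ii: null probability of each cell from the hypothesized c.d.f.
    F = [0.0] + [cdf(c) for c in cut_points] + [1.0]
    probs = [F[i + 1] - F[i] for i in range(k)]
    # Step iii: observed count in each cell.
    counts = [0] * k
    for x in sample:
        counts[sum(1 for c in cut_points if x >= c)] += 1
    # Step iv: Pearson statistic and its tail area with k - 1 degrees of freedom.
    q = sum((counts[i] - n * probs[i]) ** 2 / (n * probs[i]) for i in range(k))
    return counts, q, chi2_sf(q, k - 1)

# Illustration: 100 simulated values tested against the uniform distribution on [0, 1].
random.seed(0)
sample = [random.random() for _ in range(100)]
cuts = [j / 20 for j in range(1, 20)]
counts, q, p_value = gof_test(sample, lambda x: min(max(x, 0.0), 1.0), cuts)
print(counts, round(q, 3), round(p_value, 3))
```

With 20 equal cells and n = 100, each expected count is 5, which matches the guideline in step i.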


Example 10.1.6

Failure Times of Ball Bearings. Return to Example 10.1.1. Suppose that we wish to use
the $\chi^2$ test to test the null hypothesis that the logarithms of the lifetimes are an i.i.d.
sample from the normal distribution with mean log(50) = 3.912 and variance 0.25.
In order to have the expected count in each interval be at least 5, we can use at most
k = 4 intervals. We shall make these intervals each have probability 0.25 under the
null hypothesis. That is, we shall divide the intervals at the 0.25, 0.5, and 0.75 quantiles
of the hypothesized normal distribution. These quantiles are

$$\begin{aligned}
3.912 + 0.5\,\Phi^{-1}(0.25) &= 3.912 + 0.5 \times (-0.674) = 3.575,\\
3.912 + 0.5\,\Phi^{-1}(0.5) &= 3.912 + 0.5 \times 0 = 3.912,\\
3.912 + 0.5\,\Phi^{-1}(0.75) &= 3.912 + 0.5 \times 0.674 = 4.249,
\end{aligned}$$

because the 0.25 and 0.75 quantiles of the standard normal distribution are ±0.674.
The observed logarithms are

2.88 3.36 3.50 3.73 3.74 3.82 3.88 3.95
3.95 3.99 4.02 4.22 4.23 4.23 4.23 4.43
4.53 4.59 4.66 4.66 4.85 4.85 5.16

The numbers of observations in each of the four intervals are then 3, 4, 8, and 8. We
then calculate

$$Q = \frac{(3 - 23 \times 0.25)^2}{23 \times 0.25} + \frac{(4 - 23 \times 0.25)^2}{23 \times 0.25} + \frac{(8 - 23 \times 0.25)^2}{23 \times 0.25} + \frac{(8 - 23 \times 0.25)^2}{23 \times 0.25} = 3.609.$$

Our table of the $\chi^2$ distribution with three degrees of freedom indicates that 3.609
is between the 0.6 and 0.7 quantiles, so we would not reject the null hypothesis at
levels less than 0.3 and would reject the null hypothesis at levels greater than 0.4.
(Actually, the p-value is 0.307.) ◄
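The calculation in this example can be reproduced from the 23 observed logarithms. The sketch below uses Python's `statistics.NormalDist` for the quantiles of the hypothesized normal distribution and a closed-form tail area that is valid only for three degrees of freedom; the variable names are illustrative.

```python
import math
from statistics import NormalDist

logs = [2.88, 3.36, 3.50, 3.73, 3.74, 3.82, 3.88, 3.95,
        3.95, 3.99, 4.02, 4.22, 4.23, 4.23, 4.23, 4.43,
        4.53, 4.59, 4.66, 4.66, 4.85, 4.85, 5.16]

# Hypothesized distribution: normal with mean log(50) and variance 0.25.
null = NormalDist(mu=math.log(50), sigma=0.5)
cuts = [null.inv_cdf(p) for p in (0.25, 0.5, 0.75)]

# Count the observations in the four equal-probability intervals.
counts = [0, 0, 0, 0]
for x in logs:
    counts[sum(1 for c in cuts if x > c)] += 1

n = len(logs)
expected = n * 0.25
q = sum((c - expected) ** 2 / expected for c in counts)

# Tail area of the chi-square distribution with three degrees of freedom.
p_value = math.erfc(math.sqrt(q / 2)) + math.sqrt(2 * q / math.pi) * math.exp(-q / 2)
print(counts, round(q, 3), round(p_value, 3))
```

Choosing the cut points as quantiles of the hypothesized distribution is what makes the four expected counts equal.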
One arbitrary feature of the procedure just described is the way in which the
subintervals are chosen. Two statisticians working on the same problem might very 
well choose the subintervals in two different ways. Generally speaking, it is a good 
policy to choose the subintervals so that the expected numbers of observations in 
the individual subintervals are approximately equal, and also to choose as many 
subintervals as possible without allowing the expected number of observations in 
any subinterval to become small. This is what we did in Example 10.1.6. 


Likelihood Ratio Tests for Proportions 


In Examples 10.1.3 and 10.1.4, we used the $\chi^2$ goodness-of-fit test to test hypotheses
of the form (10.1.4). Although $\chi^2$ tests are commonly used in such examples, we could
actually use parametric tests in these examples. For example, the vector of responses
in Table 10.3 can be thought of as the observed value of a multinomial random
vector with parameters 189 and $p = (p_1, p_2, p_3)$. (See Sec. 5.9.) The hypotheses in
Eq. (10.1.4) are then of the form

$$H_0\colon\ p = p^0 \quad \text{versus} \quad H_1\colon\ H_0 \text{ is not true}.$$

As such, we can use the method of likelihood ratio tests for testing the hypotheses.
Specifically, we shall apply Theorem 9.1.4. The likelihood function from a multinomial
vector $x = (N_1, \ldots, N_k)$ is

$$f(x \mid p) = \binom{n}{N_1, \ldots, N_k} p_1^{N_1} \cdots p_k^{N_k}. \tag{10.1.6}$$

In order to apply Theorem 9.1.4, the parameter space must be an open set in
k-dimensional space. This is not true for the multinomial distribution if we let p be the
parameter. The set of probability vectors lies in a $(k - 1)$-dimensional subset of
k-dimensional space because the coordinates are constrained to add up to 1. However,
we can just as effectively treat the vector $\theta = (p_1, \ldots, p_{k-1})$ as the parameter because
$p_k = 1 - p_1 - \cdots - p_{k-1}$ is a function of $\theta$. As long as we believe that all coordinates
of p are strictly between 0 and 1, the set of possible values of the $(k - 1)$-dimensional
parameter $\theta$ is open. The likelihood function (10.1.6) can then be rewritten as

$$g(x \mid \theta) = \binom{n}{N_1, \ldots, N_k} \theta_1^{N_1} \cdots \theta_{k-1}^{N_{k-1}} (1 - \theta_1 - \cdots - \theta_{k-1})^{N_k}. \tag{10.1.7}$$

If $H_0$ is true, there is only one possible value for (10.1.7), namely,

$$\binom{n}{N_1, \ldots, N_k} (p_1^0)^{N_1} \cdots (p_k^0)^{N_k},$$

which is then the numerator of the likelihood ratio statistic $\Lambda(x)$ from
Definition 9.1.11. The denominator of $\Lambda(x)$ is found by maximizing (10.1.7). It is not difficult
to show that the M.L.E.'s are $\hat\theta_i = N_i/n$ for i = 1, ..., k - 1. The large-sample
likelihood ratio test statistic is then

$$-2 \log \Lambda(x) = -2 \sum_{i=1}^{k} N_i \log\left(\frac{n p_i^0}{N_i}\right).$$

The large-sample test rejects $H_0$ at level of significance $\alpha_0$ if this statistic is greater
than the $1 - \alpha_0$ quantile of the $\chi^2$ distribution with $k - 1$ degrees of freedom.

Example 10.1.7

Blood Types. Using the data in Table 10.1, we can test the null hypothesis that the
vector of probabilities equals the vector of numbers in Table 10.2. The values of $n p_i^0$
for i = 1, 2, 3, 4 were already calculated in Example 10.1.3. The test statistic is

$$-2\left[2162 \log\left(\frac{2001.3}{2162}\right) + 738 \log\left(\frac{750.5}{738}\right) + 228 \log\left(\frac{250.2}{228}\right) + 2876 \log\left(\frac{3002.0}{2876}\right)\right] = 20.16.$$

The p-value is the probability that a $\chi^2$ random variable with three degrees of
freedom is greater than 20.16, namely, $1.57 \times 10^{-4}$. This is nearly the same as the
p-value from the $\chi^2$ test in Example 10.1.3. ◄
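The likelihood ratio statistic for the blood-type data can be checked numerically. The sketch below takes the observed counts to be 2162, 738, 228, and 2876 and the null probabilities to be 1/3, 1/8, 1/24, and 1/2. Both are assumptions: the tables themselves lie outside this passage, and the values are inferred from the expected counts 2001.3, 750.5, 250.2, and 3002.0 with n = 6004. The function name `lrt_stat` is illustrative.

```python
import math

def lrt_stat(observed, probs):
    """-2 log Lambda(x) = 2 * sum N_i * log(N_i / (n * p_i^0)) for multinomial counts."""
    n = sum(observed)
    return 2.0 * sum(o * math.log(o / (n * p)) for o, p in zip(observed, probs) if o > 0)

# Assumed counts and null probabilities (see the lead-in for how they were inferred).
observed = [2162, 738, 228, 2876]
null_probs = [1 / 3, 1 / 8, 1 / 24, 1 / 2]

g = lrt_stat(observed, null_probs)

# Tail area of the chi-square distribution with k - 1 = 3 degrees of freedom.
tail = math.erfc(math.sqrt(g / 2)) + math.sqrt(2 * g / math.pi) * math.exp(-g / 2)
print(round(g, 2), tail)
```

Cells with a zero count are skipped because their contribution $N_i \log(\cdot)$ is taken to be zero.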




Discussion of the Test Procedure 


The $\chi^2$ test of goodness-of-fit is subject to the criticisms of tests of hypotheses
that were presented in Sec. 9.9. In particular, the null hypothesis $H_0$ in the $\chi^2$ test
specifies the distribution of the observations exactly, but it is not likely that the actual
distribution of the observations will be exactly the same as that of a random sample
from this specific distribution. Therefore, if the $\chi^2$ test is based on a very large number
of observations, we can be almost certain that the tail area corresponding to the
observed value of Q will be very small. For this reason, a very small tail area should
not be regarded as strong evidence against the hypothesis $H_0$ without further analysis.
Before a statistician concludes that the hypothesis $H_0$ is unsatisfactory, he should be
certain that there exist reasonable alternative distributions for which the observed
values provide a much better fit. For example, the statistician might calculate the
values of the statistic Q for a few reasonable alternative distributions in order to be
certain that, for at least one of these distributions, the tail area corresponding to the
calculated value of Q is substantially larger than it is for the distribution specified by
$H_0$.
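The effect of sample size described above can be illustrated with a small numerical sketch. All numbers here are hypothetical: the null hypothesis assigns probability 0.25 to each of four categories, the actual distribution differs from it only slightly, and the counts are idealized by setting them equal to their expected values under the actual distribution. In that idealized setting, Q grows in proportion to n.

```python
import math

def chi2_tail_3df(q):
    """P(X > q) for the chi-square distribution with three degrees of freedom."""
    return math.erfc(math.sqrt(q / 2.0)) + math.sqrt(2.0 * q / math.pi) * math.exp(-q / 2.0)

null = [0.25, 0.25, 0.25, 0.25]      # hypothetical null probabilities
actual = [0.26, 0.24, 0.25, 0.25]    # hypothetical actual probabilities, close to the null

results = []
for n in (100, 10_000, 1_000_000):
    counts = [n * p for p in actual]  # idealized counts, equal to their expected values
    q = sum((c - n * p0) ** 2 / (n * p0) for c, p0 in zip(counts, null))
    results.append((n, q, chi2_tail_3df(q)))

for n, q, tail in results:
    print(n, round(q, 2), tail)
```

The same small discrepancy that is invisible at n = 100 produces an extremely small tail area at n = 1,000,000, which is why a small tail area alone says little about how badly the null distribution fits.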

A particular feature of the $\chi^2$ test of goodness-of-fit is that the procedure is
designed to test the null hypothesis $H_0$ that $p_i = p_i^0$ for i = 1, ..., k against the general
alternative that $H_0$ is not true. If it is desired to use a test procedure that is especially
effective for detecting certain types of deviations of the actual values of $p_1, \ldots, p_k$
from the hypothesized values $p_1^0, \ldots, p_k^0$, then the statistician should design special
tests that have higher power for these types of alternatives and lower power for
alternatives of lesser interest. This topic will not be discussed in this book.

Because the random variables $N_1, \ldots, N_k$ in Eq. (10.1.2) are discrete, the $\chi^2$
approximation to the distribution of Q can sometimes be improved by introducing a
correction for continuity of the type described in Sec. 6.4. However, we shall not use
the correction in this book.


Summary 


The $\chi^2$ test of goodness-of-fit was introduced as a method for testing the null
hypothesis that our data form an i.i.d. sample from a specific distribution against the
alternative hypothesis that the data have some other distribution. The test is most
natural when the specific distribution is discrete. Suppose that there are k possible
values for each observation, and we observe $N_i$ with value i for i = 1, ..., k. Suppose
that the null hypothesis says that the probability of the ith possible value is $p_i^0$ for
i = 1, ..., k. Then we compute

$$Q = \sum_{i=1}^{k} \frac{(N_i - n p_i^0)^2}{n p_i^0},$$

where $n = \sum_{i=1}^{k} N_i$ is the sample size. When the null hypothesis says that the data
have a continuous distribution, then one must first create a corresponding discrete
distribution. One does this by dividing the real line into finitely many (say, k)
intervals, calculating the probabilities $p_1^0, \ldots, p_k^0$ of the intervals, and then proceeding
as if all we learned from the data were which interval each observation fell into.
This converts the original data into discrete data with k possible values. For
example, the value of $N_i$ used in the formula for Q is the number of observations
that fell into the ith interval. All of the $\chi^2$ test statistics in this text have the form
$\sum (\text{observed} - \text{expected})^2 / \text{expected}$, where "observed" stands for an observed count
and "expected" stands for the expected value of the observed count under the
assumption that the null hypothesis is true.


Exercises 


1. Consider the hypotheses being tested in Example 
10.1.5. Use a test procedure of the form outlined in Exer- 
cise 11 of Sec. 9.1 and compare the result to the numerical 
result obtained in Example 10.1.5. 


2. Show that if $p_i^0 = 1/k$ for i = 1, ..., k, then the statistic
Q defined by Eq. (10.1.2) can be written in the form

$$Q = \left(\frac{k}{n} \sum_{i=1}^{k} N_i^2\right) - n.$$


3. Investigate the "randomness" of your favorite pseudo-random
number generator as follows. Simulate 200
pseudo-random numbers between 0 and 1 and divide the
unit interval into k = 10 intervals of length 0.1 each. Apply
the $\chi^2$ test of the hypothesis that each of the 10 intervals
has the same probability of containing a pseudo-random
number.


4. According to a simple genetic principle, if both the
mother and the father of a child have genotype Aa, then
there is probability 1/4 that the child will have genotype
AA, probability 1/2 that she will have genotype Aa, and
probability 1/4 that she will have genotype aa. In a random
sample of 24 children having both parents with genotype
Aa, it is found that 10 have genotype AA, 10 have genotype
Aa, and four have genotype aa. Investigate whether the
simple genetic principle is correct by carrying out a $\chi^2$ test
of goodness-of-fit.


5. Suppose that in a sequence of n Bernoulli trials, the
probability p of success on each trial is unknown. Suppose
also that $p_0$ is a given number in the interval (0, 1), and it
is desired to test the following hypotheses:

$$H_0\colon\ p = p_0, \qquad H_1\colon\ p \ne p_0.$$

Let $\bar X_n$ denote the proportion of successes in the n trials,
and suppose that the given hypotheses are to be tested by
using a $\chi^2$ test of goodness-of-fit.

a. Show that the statistic Q defined by Eq. (10.1.2) can
be written in the form

$$Q = \frac{n(\bar X_n - p_0)^2}{p_0(1 - p_0)}.$$

b. Assuming that $H_0$ is true, prove that as $n \to \infty$, the
c.d.f. of Q converges to the c.d.f. of the $\chi^2$ distribution
with one degree of freedom. Hint: Show that $Q = Z^2$,
where it is known from the central limit theorem that
Z is a random variable whose c.d.f. converges to the
c.d.f. of the standard normal distribution.


6. It is known that 30 percent of small steel rods produced
by a standard process will break when subjected to a load
of 3000 pounds. In a random sample of 50 similar rods
produced by a new process, it was found that 21 of them broke
when subjected to a load of 3000 pounds. Investigate the
hypothesis that the breakage rate for the new process is
the same as the rate for the old process by carrying out a
$\chi^2$ test of goodness-of-fit.


7. In a random sample of 1800 observed values from the
interval (0, 1), it was found that 391 values were between
0 and 0.2, 490 values were between 0.2 and 0.5, 580 values
were between 0.5 and 0.8, and 339 values were between
0.8 and 1. Test the hypothesis that the random sample was
drawn from the uniform distribution on the interval [0, 1]
by carrying out a $\chi^2$ test of goodness-of-fit at the level of
significance 0.01.
8. Suppose that the distribution of the heights of men who
reside in a certain large city is the normal distribution for
which the mean is 68 inches and the standard deviation
is 1 inch. Suppose also that when the heights of 500 men
who reside in a certain neighborhood of the city were
measured, the distribution in Table 10.4 was obtained. Test
the hypothesis that, with regard to height, these 500 men
form a random sample from all the men who reside in the
city.

Table 10.4 Data for Exercise 8

Height                       Number of men
Less than 66 in.             18
Between 66 and 67.5 in.      177
Between 67.5 and 68.5 in.    198
Between 68.5 and 70 in.      102
Greater than 70 in.          5


9. The 50 values in Table 10.5 are intended to be a random
sample from the standard normal distribution.

a. Carry out a $\chi^2$ test of goodness-of-fit by dividing the
real line into five intervals, each of which has
probability 0.2 under the standard normal distribution.

b. Carry out a $\chi^2$ test of goodness-of-fit by dividing the
real line into 10 intervals, each of which has
probability 0.1 under the standard normal distribution.

Table 10.5 Data for Exercise 9

-1.28  -1.22  -0.45  -0.35   0.72
-0.32  -0.80  -1.66   1.39   0.38
 1.38  -1.26   0.49  -0.14  -0.85
 2.33  -0.34  -1.96  -0.64  -1.32
 1.14   0.64   3.44   1.67   0.85
 0.41  -0.01   0.67  -1.13  -0.41
-0.49   0.36  -1.24  -0.04  -0.11
 1.05   0.04   0.76   0.61  -2.04
 0.35   2.82  -0.46  -0.63  -1.61
 0.64   0.56  -0.11   0.13  -1.81


10.2 Goodness-of-Fit for Composite Hypotheses 


We can extend the goodness-of-fit test to deal with the case in which the null 
hypothesis is that the distribution of our data belongs to a particular parametric 
family. The alternative hypothesis is that the data have a distribution that is not a 
member of that parametric family. There are two changes to the test procedure 
in going from the case of a simple null hypothesis to the case of a composite 
null hypothesis. First, in the test statistic Q, the probabilities $p_i^0$ are replaced by
estimated probabilities based on the parametric family. Second, the degrees of 
freedom are reduced by the number of parameters. 


Composite Null Hypotheses 


Example 10.2.1

Failure Times of Ball Bearings. In Example 10.1.6, we tested the null hypothesis that
the logarithms of ball bearing lifetimes have the normal distribution with mean 3.912
and variance 0.25. Suppose that we are not even sure that a normal distribution is
a good model for the log-lifetimes. Is there a way for us to test the composite null
hypothesis that the distribution of log-lifetimes is a member of the normal family? ◄


We shall consider again a large population that consists of items of k different
types and again let $p_i$ denote the probability that an item selected at random will
be of type i (i = 1, ..., k). We shall suppose now, however, that instead of testing
the simple null hypothesis that the parameters $p_1, \ldots, p_k$ have specific values, we
are interested in testing the composite null hypothesis that the values of $p_1, \ldots, p_k$
belong to some specified subset of possible values. In particular, we shall consider
problems in which the null hypothesis specifies that the parameters $p_1, \ldots, p_k$ can
actually be represented as functions of a smaller number of parameters.


Example 10.2.2

Genetics. Consider a gene (such as in Example 1.6.4 on page 23) that has two
different alleles. Each individual in a given population must have one of three possible
genotypes. If the alleles arrive independently from the two parents, and if every
parent has the same probability $\theta$ of passing the first allele to each offspring, then the
probabilities $p_1$, $p_2$, and $p_3$ of the three different genotypes can be represented in the
following form:

$$p_1 = \theta^2, \qquad p_2 = 2\theta(1 - \theta), \qquad p_3 = (1 - \theta)^2. \tag{10.2.1}$$

Here, the value of the parameter $\theta$ is unknown and can lie anywhere in the interval
$0 < \theta < 1$. For each value of $\theta$ in this interval, it can be seen that $p_i > 0$ for i = 1, 2, or 3,
and $p_1 + p_2 + p_3 = 1$. In this problem, a random sample is taken from the population,
and the statistician must use the observed numbers of individuals who have each of
the three genotypes to determine whether it is reasonable to believe that there is
some value of $\theta$ in the interval $0 < \theta < 1$ such that $p_1$, $p_2$, and $p_3$ can be represented
in the hypothesized form (10.2.1).

If a gene has three different alleles, each individual in the population must have
one of six possible genotypes. Once again, if the alleles pass independently from the
parents, and if each parent has probabilities $\theta_1$ and $\theta_2$ of passing the first and second
alleles, respectively, to an offspring, then the probabilities $p_1, \ldots, p_6$ of the different
genotypes can be represented in the following form, for some values of $\theta_1$ and $\theta_2$ such
that $\theta_1 > 0$, $\theta_2 > 0$, and $\theta_1 + \theta_2 < 1$:

$$\begin{aligned}
p_1 &= \theta_1^2, & p_2 &= \theta_2^2, & p_3 &= (1 - \theta_1 - \theta_2)^2, & p_4 &= 2\theta_1\theta_2,\\
p_5 &= 2\theta_1(1 - \theta_1 - \theta_2), & p_6 &= 2\theta_2(1 - \theta_1 - \theta_2).
\end{aligned} \tag{10.2.2}$$

Again, for all values of $\theta_1$ and $\theta_2$ satisfying the stated conditions, it can be verified
that $p_i > 0$ for i = 1, ..., 6 and $\sum_{i=1}^{6} p_i = 1$. On the basis of the observed numbers
$N_1, \ldots, N_6$ of individuals having each genotype in a random sample, the
statistician must decide whether or not to reject the null hypothesis that the probabilities
$p_1, \ldots, p_6$ can be represented in the form (10.2.2) for some values of $\theta_1$ and $\theta_2$.
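The claim that the genotype probabilities add to 1 can be verified directly, because in each case the sum is a perfect square. For the three-allele case, write $\theta_3 = 1 - \theta_1 - \theta_2$; this symbol is introduced here only as shorthand and is not part of the text's notation.

```latex
\begin{aligned}
\text{Two alleles:}\quad
p_1 + p_2 + p_3 &= \theta^2 + 2\theta(1-\theta) + (1-\theta)^2
                 = [\theta + (1-\theta)]^2 = 1, \\
\text{Three alleles:}\quad
\sum_{i=1}^{6} p_i &= \theta_1^2 + \theta_2^2 + \theta_3^2
                    + 2\theta_1\theta_2 + 2\theta_1\theta_3 + 2\theta_2\theta_3
                    = (\theta_1 + \theta_2 + \theta_3)^2 = 1 .
\end{aligned}
```

Positivity of each $p_i$ follows because every factor is strictly positive under the stated constraints.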


In formal terms, in a problem like those in Example 10.2.2, we are interested in
testing the hypothesis that for i = 1, ..., k, each probability $p_i$ can be represented as
a particular function $\pi_i(\theta)$ of a vector of parameters $\theta = (\theta_1, \ldots, \theta_s)$. It is assumed
that $s < k - 1$ and no component of the vector $\theta$ can be expressed as a function
of the other $s - 1$ components. We shall let $\Omega$ denote the s-dimensional parameter
space of all possible values of $\theta$. Furthermore, we shall assume that the functions
$\pi_1(\theta), \ldots, \pi_k(\theta)$ always form a feasible set of values of $p_1, \ldots, p_k$ in the sense that
for every value of $\theta \in \Omega$, $\pi_i(\theta) > 0$ for i = 1, ..., k and $\sum_{i=1}^{k} \pi_i(\theta) = 1$.

The hypotheses to be tested can be written in the following form:

$$\begin{aligned}
H_0\colon\ & \text{There exists a value of } \theta \in \Omega \text{ such that } p_i = \pi_i(\theta) \text{ for } i = 1, \ldots, k,\\
H_1\colon\ & \text{The hypothesis } H_0 \text{ is not true.}
\end{aligned} \tag{10.2.3}$$

The assumption that $s < k - 1$ guarantees that the hypothesis $H_0$ actually restricts
the values of $p_1, \ldots, p_k$ to a proper subset of the set of all possible values of these
probabilities. In other words, as the vector $\theta$ runs through all the values in the set $\Omega$,
the vector $[\pi_1(\theta), \ldots, \pi_k(\theta)]$ runs through only a proper subset of the possible values
of $(p_1, \ldots, p_k)$.


The $\chi^2$ Test for Composite Null Hypotheses


In order to carry out a $\chi^2$ test of goodness-of-fit of the hypotheses (10.2.3), the statistic
Q defined by Eq. (10.1.2) must be modified because the expected number $n p_i^0$ of
observations of type i in a random sample of n observations is no longer completely
specified by the null hypothesis $H_0$. The modification that is used is simply to replace
$n p_i^0$ by the M.L.E. of this expected number under the assumption that $H_0$ is true. In
other words, if $\hat\theta$ denotes the M.L.E. of the parameter vector $\theta$ based on the observed
numbers $N_1, \ldots, N_k$, then the statistic Q is defined as follows:

$$Q = \sum_{i=1}^{k} \frac{[N_i - n\pi_i(\hat\theta)]^2}{n\pi_i(\hat\theta)}. \tag{10.2.4}$$

Again, it is reasonable to base a test of the hypotheses (10.2.3) on this statistic
Q by rejecting $H_0$ if Q > c, where c is an appropriate constant. In 1924, R. A. Fisher
proved the following result, whose precise statement and proof are not given here.
(See Schervish 1995, theorem 7.133.)


Theorem 10.2.1

$\chi^2$ Test for Composite Null. Suppose that the null hypothesis $H_0$ in (10.2.3) is true and
certain regularity conditions are satisfied. Then as the sample size $n \to \infty$, the c.d.f.
of Q in (10.2.4) converges to the c.d.f. of the $\chi^2$ distribution with $k - 1 - s$ degrees
of freedom. ■


When the sample size n is large and the null hypothesis $H_0$ is true, the distribution
of Q will be approximately a $\chi^2$ distribution. To determine the number of degrees of
freedom, we must subtract s from the number $k - 1$ used in Sec. 10.1 because we are
now estimating the s parameters $\theta_1, \ldots, \theta_s$ when we compare the observed number
$N_i$ with the expected number $n\pi_i(\hat\theta)$ for i = 1, ..., k. In order that this result will hold,
it is necessary to satisfy the following regularity conditions: First, the M.L.E. $\hat\theta$ of the
vector $\theta$ must occur at a point where the partial derivatives of the likelihood function
with respect to each of the parameters $\theta_1, \ldots, \theta_s$ equal 0. Furthermore, these partial
derivatives must satisfy certain conditions of the type alluded to in Sec. 8.8 when we
discussed the asymptotic properties of M.L.E.'s.


Example 10.2.3

Genetics. As examples of the use of the statistic Q defined by Eq. (10.2.4), consider
the two types of genetics problems described in Example 10.2.2. In a problem of the
first type, k = 3, and it is desired to test the null hypothesis $H_0$ that the probabilities
$p_1$, $p_2$, and $p_3$ can be represented in the form (10.2.1) against the alternative $H_1$ that
$H_0$ is not true. In this problem, s = 1. Therefore, when $H_0$ is true, the distribution of
the statistic Q defined by Eq. (10.2.4) will be approximately the $\chi^2$ distribution with
one degree of freedom.

In a problem of the second type, k = 6, and it is desired to test the null hypothesis
$H_0$ that the probabilities $p_1, \ldots, p_6$ can be represented in the form (10.2.2) against
the alternative $H_1$ that $H_0$ is not true. In this problem, s = 2. Therefore, when $H_0$
is true, the distribution of Q will be approximately the $\chi^2$ distribution with three
degrees of freedom. ◄
Determining the Maximum Likelihood Estimates


When the null hypothesis $H_0$ in (10.2.3) is true, the likelihood function $L(\theta)$ for the
observed numbers $N_1, \ldots, N_k$ will be

$$L(\theta) = \binom{n}{N_1, \ldots, N_k} [\pi_1(\theta)]^{N_1} \cdots [\pi_k(\theta)]^{N_k}. \tag{10.2.5}$$

Thus,

$$\log L(\theta) = \log \binom{n}{N_1, \ldots, N_k} + \sum_{i=1}^{k} N_i \log \pi_i(\theta). \tag{10.2.6}$$

The M.L.E. $\hat\theta$ will be the value of $\theta$ for which $\log L(\theta)$ is a maximum. The multinomial
coefficient in (10.2.6) does not affect the maximization, and we shall ignore it for the
remainder of this section.


Example 10.2.4

Genetics. In the first parts of Examples 10.2.2 and 10.2.3, k = 3 and $H_0$ specifies that
the probabilities $p_1$, $p_2$, and $p_3$ can be represented in the form (10.2.1). In this case,

$$\begin{aligned}
\log L(\theta) &= N_1 \log(\theta^2) + N_2 \log[2\theta(1 - \theta)] + N_3 \log[(1 - \theta)^2]\\
&= (2N_1 + N_2) \log \theta + (2N_3 + N_2) \log(1 - \theta) + N_2 \log 2.
\end{aligned} \tag{10.2.7}$$

It can be found by differentiation that the value of $\theta$ for which $\log L(\theta)$ is a
maximum is

$$\hat\theta = \frac{2N_1 + N_2}{2(N_1 + N_2 + N_3)} = \frac{2N_1 + N_2}{2n}. \tag{10.2.8}$$

The value of the statistic Q defined by Eq. (10.2.4) can now be calculated from
the observed numbers $N_1$, $N_2$, and $N_3$. As previously mentioned, when $H_0$ is true and
n is large, the distribution of Q will be approximately the $\chi^2$ distribution with one
degree of freedom. Hence, the tail area corresponding to the observed value of Q
can be found from that $\chi^2$ distribution. ◄
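The two-allele calculation can be sketched end to end: estimate $\theta$ by the closed form $(2N_1 + N_2)/(2n)$, plug the fitted probabilities into Q, and refer Q to one degree of freedom. The counts (30, 50, 20) below are hypothetical and chosen only for illustration, and the function names are not from the text.

```python
import math

def log_lik(theta, n1, n2, n3):
    """log L(theta) without the multinomial coefficient."""
    return ((2 * n1 + n2) * math.log(theta)
            + (2 * n3 + n2) * math.log(1 - theta)
            + n2 * math.log(2))

def genetics_test(n1, n2, n3):
    """M.L.E. of theta, statistic Q with fitted probabilities, and tail area with 1 df."""
    n = n1 + n2 + n3
    theta_hat = (2 * n1 + n2) / (2 * n)
    probs = [theta_hat ** 2, 2 * theta_hat * (1 - theta_hat), (1 - theta_hat) ** 2]
    q = sum((o - n * p) ** 2 / (n * p) for o, p in zip((n1, n2, n3), probs))
    # Degrees of freedom: k - 1 - s = 3 - 1 - 1 = 1.
    tail = math.erfc(math.sqrt(q / 2))
    return theta_hat, q, tail

# Hypothetical genotype counts, for illustration only.
theta_hat, q, tail = genetics_test(30, 50, 20)
print(round(theta_hat, 4), round(q, 4), round(tail, 4))
```

A quick check that the closed form really is the maximizer is to compare `log_lik` at `theta_hat` with its value at nearby points.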


Testing Whether a Distribution Is Normal 


Consider now a problem in which a random sample $X_1, \ldots, X_n$ is taken from some
continuous distribution for which the p.d.f. is unknown, and it is desired to test the null
hypothesis $H_0$ that this distribution is a normal distribution against the alternative
hypothesis $H_1$ that the distribution is not normal. To perform a $\chi^2$ test of
goodness-of-fit in this problem, divide the real line into k subintervals and count the number $N_i$
of observations in the random sample that fall into the ith subinterval (i = 1, ..., k).

If $H_0$ is true, and if $\mu$ and $\sigma^2$ denote the unknown mean and variance of the
normal distribution, then the parameter vector $\theta$ is the two-dimensional vector
$\theta = (\mu, \sigma^2)$. The probability $\pi_i(\theta)$, or $\pi_i(\mu, \sigma^2)$, that an observation will fall within the ith
subinterval, is the probability assigned to that subinterval by the normal distribution
with mean $\mu$ and variance $\sigma^2$. In other words, if the ith subinterval is the interval
from $a_i$ to $b_i$, then

$$\pi_i(\mu, \sigma^2) = \int_{a_i}^{b_i} \frac{1}{(2\pi)^{1/2}\sigma} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right] dx
= \Phi\left(\frac{b_i - \mu}{\sigma}\right) - \Phi\left(\frac{a_i - \mu}{\sigma}\right), \tag{10.2.9}$$

where $\Phi(\cdot)$ is the standard normal c.d.f., and $\Phi(-\infty) = 0$ and $\Phi(\infty) = 1$.

It is important to note that in order to calculate the value of the statistic Q defined
by Eq. (10.2.4), the M.L.E.'s $\hat\mu$ and $\hat\sigma^2$ must be found by using the numbers $N_1, \ldots, N_k$
of observations in the different subintervals. The M.L.E.'s should not be found by
using the observed values of $X_1, \ldots, X_n$ themselves. In other words, $\hat\mu$ and $\hat\sigma^2$ will
be the values of $\mu$ and $\sigma^2$ that maximize the likelihood function

$$L(\mu, \sigma^2) = [\pi_1(\mu, \sigma^2)]^{N_1} \cdots [\pi_k(\mu, \sigma^2)]^{N_k}. \tag{10.2.10}$$

Because of the complicated nature of the function $\pi_i(\mu, \sigma^2)$, as given by Eq.
(10.2.9), a lengthy numerical computation would usually be required to determine
the values of $\mu$ and $\sigma^2$ that maximize $L(\mu, \sigma^2)$. On the other hand, we know that
the M.L.E.'s of $\mu$ and $\sigma^2$ based on the n observed values $X_1, \ldots, X_n$ in the original
sample are simply the sample mean $\bar X_n$ and the sample variance $S_n^2/n$. Furthermore, if
the estimators that maximize the likelihood function $L(\mu, \sigma^2)$ are used to calculate
the statistic Q, then we know that when $H_0$ is true, the distribution of Q will be
approximately the $\chi^2$ distribution with $k - 3$ degrees of freedom. On the other hand,
if the M.L.E.'s $\bar X_n$ and $S_n^2/n$, which are based on the observed values in the original
sample, are used to calculate Q, then this $\chi^2$ approximation to the distribution of
Q will not be appropriate. Because of the simple nature of the estimators $\bar X_n$ and
$S_n^2/n$, we shall use these estimators to calculate Q, but we shall describe how their
use modifies the distribution of Q.

In 1954, H. Chernoff and E. L. Lehmann established the following general result,
which we shall not prove here.


Theorem 10.2.2

Let $X_1, \ldots, X_n$ be a random sample from a distribution with a p-dimensional
parameter $\theta$. Let $\hat\theta_n$ denote the M.L.E. as defined in Definition 7.5.2. Partition the real
line into $k > p + 1$ disjoint intervals $I_1, \ldots, I_k$. Let $N_i$ be the number of observations
that fall into $I_i$ for i = 1, ..., k. Let $\pi_i(\theta) = \Pr(X_i \in I_i \mid \theta)$. Let

$$Q' = \sum_{i=1}^{k} \frac{[N_i - n\pi_i(\hat\theta_n)]^2}{n\pi_i(\hat\theta_n)}. \tag{10.2.11}$$

Assume the regularity conditions needed for asymptotic normality of the M.L.E.
Then, as $n \to \infty$, the c.d.f. of $Q'$ converges to a c.d.f. that lies between the c.d.f. of the
$\chi^2$ distribution with $k - p - 1$ degrees of freedom and the c.d.f. of the $\chi^2$ distribution
with $k - 1$ degrees of freedom. ■


For the case of testing that the distribution is normal, suppose that we use the
M.L.E.'s $\bar X_n$ and $S_n^2/n$ and calculate the statistic $Q'$ in Eq. (10.2.11) instead of the
statistic Q in Eq. (10.2.4). If the null hypothesis $H_0$ is true, then as $n \to \infty$, the c.d.f.
of $Q'$ converges to a c.d.f. that lies between the c.d.f. of the $\chi^2$ distribution with $k - 3$
degrees of freedom and the c.d.f. of the $\chi^2$ distribution with $k - 1$ degrees of freedom.
It follows that if the value of $Q'$ is calculated in this simplified way, then the tail area
corresponding to this value of $Q'$ is actually larger than the tail area found from a
table of the $\chi^2$ distribution with $k - 3$ degrees of freedom. In fact, the appropriate tail
area lies somewhere between the tail area found from a table of the $\chi^2$ distribution
with $k - 3$ degrees of freedom and the larger tail area found from a table of the $\chi^2$
distribution with $k - 1$ degrees of freedom. Thus, when the value of $Q'$ is calculated
in this simplified way, the corresponding tail area will be bounded by two values that
can be obtained from a table of the $\chi^2$ distribution.
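The distinction between the two kinds of M.L.E.'s can be made concrete with a rough numerical sketch. It maximizes the grouped-data likelihood of Eq. (10.2.10) by a crude grid search and compares the result with estimates computed from the original observations. The cut points, counts, and raw-data estimates (4.150 and 0.2722) are those of the ball bearing example in this chapter; the grid ranges, step size, and function names are arbitrary illustrative choices, and a grid search is used here only for simplicity, not because the text prescribes it.

```python
import math

CUTS = (3.575, 3.912, 4.249)     # interval boundaries from the ball bearing example
COUNTS = (3, 4, 8, 8)            # observed counts in the four intervals

def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def cell_probs(mu, sigma):
    """Interval probabilities pi_i(mu, sigma^2), as in Eq. (10.2.9)."""
    F = [0.0] + [Phi((c - mu) / sigma) for c in CUTS] + [1.0]
    return [F[i + 1] - F[i] for i in range(len(COUNTS))]

def grouped_loglik(mu, sigma):
    """log of the grouped-data likelihood in Eq. (10.2.10)."""
    return sum(n_i * math.log(p) for n_i, p in zip(COUNTS, cell_probs(mu, sigma)))

# Crude grid search over mu in [3.5, 5.0] and sigma in [0.2, 1.5], step 0.01.
best_ll, best_mu, best_sigma = max(
    (grouped_loglik(3.5 + 0.01 * i, 0.2 + 0.01 * j), 3.5 + 0.01 * i, 0.2 + 0.01 * j)
    for i in range(151) for j in range(131))

# Grouped likelihood evaluated at the estimates based on the original observations.
raw_ll = grouped_loglik(4.150, math.sqrt(0.2722))
print(round(best_mu, 2), round(best_sigma, 2), round(best_ll, 3), round(raw_ll, 3))
```

The grid maximizer differs from the raw-data estimates, which is why the $k - 3$ degrees of freedom apply only when the grouped-data maximizer is used.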
Example 10.2.5

Failure Times of Ball Bearings. Return to Example 10.2.1. We are now in a position
to try to test the composite null hypothesis that the logarithms of ball bearing
lifetimes have some normal distribution. We shall divide the real line into the same
subintervals that we used in Example 10.1.6, namely, $(-\infty, 3.575]$, $(3.575, 3.912]$,
$(3.912, 4.249]$, and $(4.249, \infty)$. The counts for the four intervals are still 3, 4, 8, and
8. We shall use Theorem 10.2.2, which allows us to use the M.L.E.'s based on the
original data. This yields $\hat\mu = 4.150$ and $\hat\sigma^2 = 0.2722$. The probabilities of the four
intervals are

$$\begin{aligned}
\pi_1(\hat\mu, \hat\sigma^2) &= \Phi\left(\frac{3.575 - 4.150}{(0.2722)^{1/2}}\right) = 0.1350,\\
\pi_2(\hat\mu, \hat\sigma^2) &= \Phi\left(\frac{3.912 - 4.150}{(0.2722)^{1/2}}\right) - \Phi\left(\frac{3.575 - 4.150}{(0.2722)^{1/2}}\right) = 0.1888,\\
\pi_3(\hat\mu, \hat\sigma^2) &= \Phi\left(\frac{4.249 - 4.150}{(0.2722)^{1/2}}\right) - \Phi\left(\frac{3.912 - 4.150}{(0.2722)^{1/2}}\right) = 0.2511,\\
\pi_4(\hat\mu, \hat\sigma^2) &= 1 - \Phi\left(\frac{4.249 - 4.150}{(0.2722)^{1/2}}\right) = 0.4251.
\end{aligned}$$

This makes the value of $Q'$ equal to

$$Q' = \frac{(3 - 23 \times 0.1350)^2}{23 \times 0.1350} + \frac{(4 - 23 \times 0.1888)^2}{23 \times 0.1888} + \frac{(8 - 23 \times 0.2511)^2}{23 \times 0.2511} + \frac{(8 - 23 \times 0.4251)^2}{23 \times 0.4251} = 1.211.$$

The tail area corresponding to 1.211 needs to be computed for $\chi^2$ distributions with
$k - 1 = 3$ and $k - 3 = 1$ degrees of freedom. For one degree of freedom, the p-value
is 0.2711, and for three degrees of freedom the p-value is 0.7504. So, our actual
p-value lies in the interval [0.2711, 0.7504]. Although this interval is wide, it tells us not to
reject $H_0$ at level $\alpha_0$ if $\alpha_0 < 0.2711$. ◄
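The bounds on the p-value in this example can be reproduced with a short sketch. It starts from the rounded estimates 4.150 and 0.2722 quoted in the example, so the computed probabilities and statistic agree with the printed values only up to rounding. The variable names are illustrative.

```python
import math
from statistics import NormalDist

counts = [3, 4, 8, 8]
cuts = [3.575, 3.912, 4.249]
n = sum(counts)

# Fitted normal distribution, using the rounded M.L.E.'s quoted in the example.
fit = NormalDist(mu=4.150, sigma=math.sqrt(0.2722))
F = [0.0] + [fit.cdf(c) for c in cuts] + [1.0]
probs = [F[i + 1] - F[i] for i in range(4)]

q_prime = sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, probs))

# Lower bound: tail area with k - 3 = 1 degree of freedom.
p_low = math.erfc(math.sqrt(q_prime / 2))
# Upper bound: tail area with k - 1 = 3 degrees of freedom.
p_high = p_low + math.sqrt(2 * q_prime / math.pi) * math.exp(-q_prime / 2)
print([round(p, 4) for p in probs], round(q_prime, 3), round(p_low, 4), round(p_high, 4))
```

Reporting both bounds, rather than a single tail area, reflects the fact that the limiting distribution of $Q'$ is only known to lie between the two $\chi^2$ distributions.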


Note: Testing Composite Hypotheses about an Arbitrary Distribution. Theorem
10.2.2 is very general and applies to both continuous and discrete distributions.
Suppose, for example, that a random sample of n observations is taken from a discrete
distribution for which the possible values are the nonnegative integers 0, 1, 2, ....
Suppose also that it is desired to test the null hypothesis $H_0$ that this distribution is a
Poisson distribution against the alternative hypothesis $H_1$ that the distribution is not
Poisson. Finally, suppose that the nonnegative integers 0, 1, 2, ... are divided into k
classes such that each observation will lie in one of these classes.

It is known from Exercise 5 of Sec. 7.5 that if $H_0$ is true, then the sample
mean $\bar X_n$ is the M.L.E. of the unknown mean $\theta$ of the Poisson distribution based
on the n observed values in the original sample. Therefore, if the estimator $\hat\theta = \bar X_n$
is used to calculate the statistic $Q'$ defined by Eq. (10.2.11) rather than the Q in
Eq. (10.2.4), then the approximate distribution of $Q'$ when $H_0$ is true lies between
the $\chi^2$ distribution with $k - 2$ degrees of freedom and the $\chi^2$ distribution with $k - 1$
degrees of freedom.


Prussian Army Deaths. In Example 7.3.14, we modeled the numbers of deaths by
horsekick in Prussian army units as Poisson random variables. Suppose that we wish
to test the null hypothesis that the numbers are a random sample from some Poisson
distribution versus the alternative hypothesis that they are not a Poisson random
sample. The numbers of counts reported in Example 7.3.14 are repeated here:


Count                     0      1      2      3      ≥4
Number of Observations    144    91     32     11     2


The likelihood function, assuming that the data form a random sample from a Poisson
distribution, is proportional (as a function of $\theta$) to $\exp(-280\theta)\,\theta^{196}$. The M.L.E. is
$\hat{\theta}_n = 196/280 = 0.7$. We can use the $k = 5$ classes above to compute the statistic $Q'$.
The five class probabilities are


Count                      0         1         2         3         ≥4
$\pi_i(\hat{\theta}_n)$    0.4966    0.3476    0.1217    0.0283    0.0058


Then

\[
\begin{aligned}
Q' &= \frac{(144 - 280 \times 0.4966)^2}{280 \times 0.4966}
    + \frac{(91 - 280 \times 0.3476)^2}{280 \times 0.3476}
    + \frac{(32 - 280 \times 0.1217)^2}{280 \times 0.1217} \\
   &\quad + \frac{(11 - 280 \times 0.0283)^2}{280 \times 0.0283}
    + \frac{(2 - 280 \times 0.0058)^2}{280 \times 0.0058} = 1.979.
\end{aligned}
\]

The tail areas corresponding to the observed $Q'$ and degrees of freedom four and
three are, respectively, 0.7396 and 0.5768. We would not be able to reject $H_0$ at level
$\alpha_0$ for $\alpha_0 < 0.5768$. ◄
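The arithmetic in this example can be checked with a short script. The sketch below is illustrative only: the helper names `chi2_sf` and `poisson_class_probs` are assumptions, not part of the text. The tail-area helper uses a closed-form series that is valid only for integer degrees of freedom, and the class probabilities are computed at full precision rather than rounded to four decimals.

```python
import math

def chi2_sf(x, df):
    """Upper tail area P(X > x) of the chi-square distribution, integer df >= 1."""
    h = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-h) * sum_{j < df/2} h^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= h / j
            total += term
        return math.exp(-h) * total
    # Odd df: erfc(sqrt(h)) + exp(-h) * sum_{j=1}^{(df-1)/2} h^(j-1/2) / Gamma(j+1/2)
    total = math.erfc(math.sqrt(h))
    term = math.sqrt(h) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-h) * term
        term *= h / (j + 0.5)
    return total

def poisson_class_probs(theta, k):
    """Probabilities of the classes 0, 1, ..., k-2 and '>= k-1' under Poisson(theta)."""
    probs = [math.exp(-theta) * theta**x / math.factorial(x) for x in range(k - 1)]
    probs.append(1.0 - sum(probs))
    return probs

counts = [144, 91, 32, 11, 2]        # observed counts for 0, 1, 2, 3, >= 4
n = sum(counts)                      # 280 observations
theta_hat = 196 / n                  # M.L.E. based on the original observations
k = len(counts)
probs = poisson_class_probs(theta_hat, k)

# Q' = sum of (observed - expected)^2 / expected over the k classes
q_prime = sum((obs - n * p) ** 2 / (n * p) for obs, p in zip(counts, probs))
p_upper = chi2_sf(q_prime, k - 1)    # tail area with k - 1 = 4 degrees of freedom
p_lower = chi2_sf(q_prime, k - 2)    # tail area with k - 2 = 3 degrees of freedom

print(f"Q' = {q_prime:.3f}, p-value between {p_lower:.4f} and {p_upper:.4f}")
```

Because $\hat{\theta}_n$ comes from the original observations rather than from the grouped counts, the two tail areas only bracket the p-value, which is why both are computed.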


Summary 


If we want to test the composite hypothesis that our data have a distribution from
a parametric family, we must estimate the parameter $\theta$. We do this by first dividing
the real numbers into $k$ disjoint intervals. Then we reduce the data to the counts
$N_1, \ldots, N_k$ of how many observations fall into each of the $k$ intervals. We then
construct the likelihood function $L(\theta) = \prod_{i=1}^{k} \pi_i(\theta)^{N_i}$, where $\pi_i(\theta)$ is the probability
that one observation falls into the $i$th interval. We estimate $\theta$ to be the value $\hat{\theta}$ that
maximizes $L(\theta)$. We then compute the test statistic $Q = \sum_{i=1}^{k} [N_i - n\pi_i(\hat{\theta})]^2 / [n\pi_i(\hat{\theta})]$,
which has the form $\sum (\text{observed} - \text{expected})^2 / \text{expected}$. In order to test the null
hypothesis at level $\alpha_0$, we compare $Q$ to the $1 - \alpha_0$ quantile of the $\chi^2$ distribution
with $k - 1 - s$ degrees of freedom, where $s$ is the dimension of $\theta$. Alternatively, we
can find the usual M.L.E. $\hat{\theta}$ based on the original observations. In this case, we need
to compare $Q$ to a number between the $1 - \alpha_0$ quantile of the $\chi^2$ distribution with
$k - 1 - s$ degrees of freedom and the $1 - \alpha_0$ quantile of the $\chi^2$ distribution with $k - 1$
degrees of freedom.
Exercises


1. The 41 numbers in Table 10.6 are average sulfur dioxide
contents over the years 1969-71 (micrograms per cubic
meter) measured in the air in 41 U.S. cities. The data
appear on pp. 619-620 of Sokal and Rohlf (1981).


a. Test the null hypothesis that these data arise from a 
normal distribution. 


b. Test the null hypothesis that these data arise from a 
lognormal distribution. 


Table 10.6 Sulfur dioxide in the air of 41 U.S. cities


10    13    12    17     56    36    29
14    10    24    110    28    17    8
30    9     47    35     29    14    56
14    11    46    11     23    65    26
69    61    94    10     18    9     10
28    31    26    29     31    16


2. At the fifth hockey game of the season at a certain
arena, 200 people were selected at random and asked how
many of the previous four games they had attended. The
results are given in Table 10.7. Test the hypothesis that
these 200 observed values can be regarded as a random
sample from a binomial distribution; that is, there exists
a number $\theta$ ($0 < \theta < 1$) such that the probabilities are as
follows:

\[
p_0 = (1 - \theta)^4, \quad p_1 = 4\theta(1 - \theta)^3, \quad p_2 = 6\theta^2(1 - \theta)^2, \quad
p_3 = 4\theta^3(1 - \theta), \quad p_4 = \theta^4.
\]


Table 10.7 Data for Exercise 2 


Number of games Number of 
previously attended people 
0 33 
1 67 
2 66 
3 15 
4 19 


3. Consider a genetics problem in which each individual
in a certain population must have one of six genotypes,
and it is desired to test the null hypothesis $H_0$ that the
probabilities of the six genotypes can be represented in
the form specified in Eq. (10.2.2).


a. Suppose that in a random sample of $n$ individuals,
the observed numbers of individuals having the six
genotypes are $N_1, \ldots, N_6$. Find the M.L.E.’s of $\theta_1$ and
$\theta_2$ when the null hypothesis $H_0$ is true.


b. Suppose that in a random sample of 150 individuals,
the observed numbers are as follows:

\[
N_1 = 2, \quad N_2 = 36, \quad N_3 = 14, \quad N_4 = 36, \quad N_5 = 20, \quad N_6 = 42.
\]

Determine the value of $Q$ and the corresponding tail
area.


4. Consider again the sample consisting of the heights of
500 men given in Exercise 8 of Sec. 10.1. Suppose that
before these heights were grouped into the intervals given
in that exercise, it was found that for the 500 observed
heights in the original sample, the sample mean was $\bar{X}_n = 67.6$
and the sample variance was $S_n^2/n = 1.00$. Test the
hypothesis that these observed heights form a random
sample from a normal distribution.


5. In a large city, 200 persons were selected at random, 
and each person was asked how many tickets he purchased 
that week in the state lottery. The results are given in 
Table 10.8. Suppose that among the seven persons who 
had purchased five or more tickets, three persons had 
purchased exactly five tickets, two persons had purchased 
six tickets, one person had purchased seven tickets, and 
one person had purchased 10 tickets. Test the hypothesis 
that these 200 observations form a random sample from a 
Poisson distribution. 


Table 10.8 Data for Exercise 5 


Number of tickets Number of 
previously purchased persons 
0 52 
1 60 
2 55 
3 18 
4 8 
5 or more 7 


6. Rutherford and Geiger (1910) counted the numbers 
of alpha particles emitted by a certain mass of polonium 
during 2608 disjoint time periods, each of which lasted 
7.5 seconds. The results are given in Table 10.9. Test the 
hypothesis that these 2608 observations form a random 
sample from a Poisson distribution. 


Table 10.9 Data for Exercise 6 from Rutherford and Geiger (1910)


Number of              Number of
particles emitted      time periods
0                      57
1                      203
2                      383
3                      525
4                      532
5                      408
6                      273
7                      139
8                      45
9                      27
10                     10
11                     4
12                     0
13                     1
14                     1
15 or more             0
Total                  2608


7. Test the hypothesis that the 50 observations in Table
10.10 form a random sample from a normal distribution.


Table 10.10 Data for Exercise 7


9.69     8.93     7.61     8.12     -2.74
2.78     7.47     8.46     7.89     5.93
5.21     2.62     0.22     -0.59    8.77
4.07     5.15     8.32     6.01     0.68
10.22 5.05 1451 13.05
9.0 20 80 8.6 52
16.80 8.07 0.66 401 oo


8. Test the hypothesis that the 50 observations in Table
10.11 form a random sample from an exponential distribution.


Table 10.11 Data for Exercise 8


0.91    1.22    1.28    0.22    2.33
0.90    0.86    1.45    1.22    0.55
0.16    2.02    1.59    1.73    0.49
1.62    0.56    0.53    0.50    0.24
1.28    0.06    0.19    0.29    0.74
1.16    0.22    0.91    0.04    1.41
3.65    3.41    0.07    0.51    1.27
0.61    0.31    0.22    0.37    0.06
1.75    0.89    0.79    1.28    0.57
0.76    0.05    1.53    1.86    1.28


10.3 Contingency Tables


When each observation in our sample is a bivariate discrete random vector (a pair
of discrete random variables), then there is a simple way to test the hypothesis that
the two random variables are independent. The test is another form of $\chi^2$ test like
the ones used earlier in this chapter.


Independence in Contingency Tables


Example 10.3.1
College Survey. Suppose that 200 students are selected at random from the entire
enrollment at a large university, and each student in the sample is classified both
according to the curriculum in which he is enrolled and according to his preference for
either of two candidates A and B in a forthcoming election. Suppose that the results
are as presented in Table 10.12. We might be interested in whether the choices of
curriculum and candidate are independent of each other. To be more precise, suppose
that a student is selected at random from the entire enrollment at the university.
Independence means that for each $i$ and $j$, the probability that such a randomly
chosen student prefers candidate $j$ and is in curriculum $i$ equals the product of the
probability that he prefers candidate $j$ times the probability that he is enrolled in
curriculum $i$. ◄


Table 10.12 Classification of students by curriculum and candidate preference

                                             Candidate preferred
Curriculum                                A      B      Undecided    Totals
Engineering and science                   24     23     12           59
Humanities and social sciences            24     14     10           48
Fine arts                                 17     8      13           38
Industrial and public administration      27     19     9            55
Totals                                    92     64     44           200


Tables of data like Table 10.12 are very common and have a special name.


Definition 10.3.1
Contingency Tables. A table in which each observation is classified in two or more
ways is called a contingency table.


In Table 10.12, only two classifications are considered for each student, namely,
the curriculum in which he is enrolled and the candidate he prefers. Such a table is
called a two-way contingency table.

In general, we shall consider a two-way contingency table containing $R$ rows and
$C$ columns. For $i = 1, \ldots, R$ and $j = 1, \ldots, C$, we shall let $p_{ij}$ denote the probability
that an individual selected at random from a given population will be classified in
the $i$th row and the $j$th column of the table. Furthermore, we shall let $p_{i+}$ denote the
marginal probability that the individual will be classified in the $i$th row of the table
and $p_{+j}$ denote the marginal probability that the individual will be classified in the
$j$th column of the table. Thus,

\[
p_{i+} = \sum_{j=1}^{C} p_{ij} \quad \text{and} \quad p_{+j} = \sum_{i=1}^{R} p_{ij}.
\]

Furthermore, since the sum of the probabilities for all the cells of the table must be 1,
we have

\[
\sum_{i=1}^{R} \sum_{j=1}^{C} p_{ij} = \sum_{i=1}^{R} p_{i+} = \sum_{j=1}^{C} p_{+j} = 1.
\]

Suppose now that a random sample of $n$ individuals is taken from the given
population. For $i = 1, \ldots, R$ and $j = 1, \ldots, C$, we shall let $N_{ij}$ denote the number
of individuals who are classified in the $i$th row and the $j$th column of the table.
Furthermore, we shall let $N_{i+}$ denote the total number of individuals classified in the
$i$th row and $N_{+j}$ denote the total number of individuals classified in the $j$th column.
Thus,

\[
N_{i+} = \sum_{j=1}^{C} N_{ij} \quad \text{and} \quad N_{+j} = \sum_{i=1}^{R} N_{ij}. \tag{10.3.1}
\]

Also,

\[
\sum_{i=1}^{R} \sum_{j=1}^{C} N_{ij} = \sum_{i=1}^{R} N_{i+} = \sum_{j=1}^{C} N_{+j} = n. \tag{10.3.2}
\]

On the basis of these observations, the following hypotheses are to be tested:

\[
\begin{aligned}
H_0 &: p_{ij} = p_{i+}\,p_{+j} \quad \text{for } i = 1, \ldots, R \text{ and } j = 1, \ldots, C, \\
H_1 &: \text{The hypothesis } H_0 \text{ is not true.}
\end{aligned} \tag{10.3.3}
\]


The $\chi^2$ Test of Independence


The $\chi^2$ tests described in Sec. 10.2 can be applied to the problem of testing the
hypotheses (10.3.3). Each individual in the population from which the sample is taken
must belong in one of the $RC$ cells of the contingency table. Under the null hypothesis
$H_0$, the unknown probabilities $p_{ij}$ of these cells have been expressed as functions
of the unknown parameters $p_{i+}$ and $p_{+j}$. Since $\sum_{i=1}^{R} p_{i+} = 1$ and $\sum_{j=1}^{C} p_{+j} = 1$,
the actual number of unknown parameters to be estimated when $H_0$ is true is
$s = (R - 1) + (C - 1)$, or $s = R + C - 2$.

For $i = 1, \ldots, R$ and $j = 1, \ldots, C$, let $\hat{E}_{ij}$ denote the M.L.E., when $H_0$ is true,
of the expected number of observations that will be classified in the $i$th row and the
$j$th column of the table. In this problem, the statistic $Q$ defined by Eq. (10.2.4) will
have the following form:

\[
Q = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(N_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}}. \tag{10.3.4}
\]

Furthermore, since the contingency table contains $RC$ cells, and since $s = R + C - 2$
parameters are to be estimated when $H_0$ is true, it follows that when $H_0$ is true and
$n \to \infty$, the c.d.f. of $Q$ converges to the c.d.f. of the $\chi^2$ distribution for which the
number of degrees of freedom is $RC - 1 - s = (R - 1)(C - 1)$.

Next, we shall consider the form of the estimator $\hat{E}_{ij}$. The expected number
of observations in the $i$th row and the $j$th column is simply $np_{ij}$. When $H_0$ is true,
$p_{ij} = p_{i+}p_{+j}$. Therefore, if $\hat{p}_{i+}$ and $\hat{p}_{+j}$ denote the M.L.E.’s of $p_{i+}$ and $p_{+j}$, then it
follows that $\hat{E}_{ij} = n\hat{p}_{i+}\hat{p}_{+j}$. Next, since $p_{i+}$ is the probability that an observation will
be classified in the $i$th row, $\hat{p}_{i+}$ is simply the proportion of observations in the sample
that are classified in the $i$th row; that is, $\hat{p}_{i+} = N_{i+}/n$. Similarly, $\hat{p}_{+j} = N_{+j}/n$, and it
follows that

\[
\hat{E}_{ij} = n \left( \frac{N_{i+}}{n} \right) \left( \frac{N_{+j}}{n} \right) = \frac{N_{i+} N_{+j}}{n}. \tag{10.3.5}
\]

If we substitute this value of $\hat{E}_{ij}$ into Eq. (10.3.4), we can calculate the value
of $Q$ from the observed values of $N_{ij}$. The null hypothesis $H_0$ should be rejected if
$Q > c$, where $c$ is an appropriately chosen constant. When $H_0$ is true, and the sample
size $n$ is large, the distribution of $Q$ will be approximately the $\chi^2$ distribution with
$(R - 1)(C - 1)$ degrees of freedom.
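Eqs. (10.3.4) and (10.3.5) translate directly into a few lines of code. The following is a minimal sketch, assuming the table is stored as a list of rows of counts; the function name `independence_statistic` is illustrative and not from the text. It is applied to the counts of Table 10.12.

```python
def independence_statistic(table):
    """Return (Q, expected, df) for a two-way table of counts given as a list of rows."""
    row_totals = [sum(row) for row in table]          # N_i+
    col_totals = [sum(col) for col in zip(*table)]    # N_+j
    n = sum(row_totals)
    # Eq. (10.3.5): E_ij = N_i+ * N_+j / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Eq. (10.3.4): Q = sum over all cells of (N_ij - E_ij)^2 / E_ij
    q = sum((obs - exp) ** 2 / exp
            for obs_row, exp_row in zip(table, expected)
            for obs, exp in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return q, expected, df

# Counts from Table 10.12 (rows: the four curricula; columns: A, B, undecided).
table_10_12 = [
    [24, 23, 12],
    [24, 14, 10],
    [17, 8, 13],
    [27, 19, 9],
]

q, expected, df = independence_statistic(table_10_12)
print(f"Q = {q:.2f} with {df} degrees of freedom")
```

The returned value of $Q$ would then be compared with a quantile of the $\chi^2$ distribution with `df` degrees of freedom. The sketch does not check whether the expected counts are large enough for the approximation to be reasonable.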
Example 10.3.2
College Survey. Suppose that we wish to test the hypotheses (10.3.3) on the basis of
the data in Table 10.12. By using the totals given in the table, we find that $N_{1+} = 59$,
$N_{2+} = 48$, $N_{3+} = 38$, and $N_{4+} = 55$, and also $N_{+1} = 92$, $N_{+2} = 64$, and $N_{+3} = 44$.
Because $n = 200$, it follows from Eq. (10.3.5) that the $4 \times 3$ table of values of $\hat{E}_{ij}$
is as shown in Table 10.13.


Table 10.13 Expected cell counts for Example 10.3.2

                                             Candidate preferred
Curriculum                                A        B        Undecided    Totals
Engineering and science                   27.14    18.88    12.98        59
Humanities and social sciences            22.08    15.36    10.56        48
Fine arts                                 17.48    12.16    8.36         38
Industrial and public administration      25.30    17.60    12.10        55
Totals                                    92       64       44           200


The values of $N_{ij}$ given in Table 10.12 can now be compared with the values of
$\hat{E}_{ij}$ in Table 10.13. The value of $Q$ defined by Eq. (10.3.4) turns out to be 6.68. Since
$R = 4$ and $C = 3$, the corresponding tail area is to be found from a table of the $\chi^2$
distribution with $(R - 1)(C - 1) = 6$ degrees of freedom. Its value is larger than 0.3.
Therefore, we would only reject $H_0$ at level $\alpha_0$ if $\alpha_0 > 0.3$. ◄


Example 10.3.3
Montana Outlook Poll. In Example 10.1.4, we examined the surveyed opinions of
Montana residents on their personal financial status. Survey participants were also
asked for their income range. Table 10.14 gives a cross-tabulation of
the answers to both questions. We can use the $\chi^2$ test to test the null hypothesis that
income is independent of opinion on personal financial status. Table 10.15 gives the
expected counts for each cell of Table 10.14 under the null hypothesis. We can now
compute the test statistic $Q = 5.210$ with $(3 - 1) \times (3 - 1) = 4$ degrees of freedom.
The p-value associated with this value of $Q$ is 0.266, so we would only reject the null
hypothesis at a level $\alpha_0$ greater than 0.266. ◄


Table 10.14 Responses to two questions from Montana Outlook Poll

                        Personal financial status
Income range         Worse    Same    Better    Total
Under $20,000        20       15      12        47
$20,000-$35,000      24       27      32        83
Over $35,000         14       22      23        59
Total                58       64      67        189


Table 10.15 Expected cell counts for Table 10.14 under the assumption of independence

                        Personal financial status
Income range         Worse    Same     Better    Total
Under $20,000        14.42    15.92    16.66     47
$20,000-$35,000      25.47    28.11    29.42     83
Over $35,000         18.11    19.98    20.92     59
Total                58       64       67        189


Summary 


We learned how to test the null hypothesis that two discrete random variables
are independent based on a random sample of $n$ pairs. First, form a contingency
table of the counts for every pair of possible observed values. Then, estimate the
two marginal distributions of the two random variables. Under the null hypothesis
that the random variables are independent, the expected count for value $i$ of
the first variable and value $j$ of the second variable is $n$ times the product of the
two estimated marginal probabilities. We then form the $\chi^2$ statistic $Q$ by summing
$(\text{observed} - \text{expected})^2/\text{expected}$ over all of the cells in the contingency table. The
number of degrees of freedom is $(R - 1)(C - 1)$, where $R$ is the number of rows in the table and
$C$ is the number of columns.


Exercises

1. Chase and Dummer (1992) studied the attitudes of
school-aged children in Michigan. The children were
asked which of the following was most important to them:
good grades, athletic ability, or popularity. Additional
information about each child was also collected, and
Table 10.16 shows the results for 478 children classified
by sex and their response to the survey question. Test the
null hypothesis that a child’s answer to the survey question
is independent of his or her sex.


Table 10.16 Data for Exercise 1 from Chase and Dummer (1992)


         Good grades    Athletic ability    Popularity
Boys     117            60                  50
Girls    130            30                  91


2. Show that the statistic $Q$ defined by Eq. (10.3.4) can be
rewritten in the form

\[
Q = \left( \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{N_{ij}^2}{\hat{E}_{ij}} \right) - n.
\]


3. Show that if $C = 2$, the statistic $Q$ defined by Eq.
(10.3.4) can be rewritten in the form

\[
Q = \frac{n}{N_{+2}} \left( \sum_{i=1}^{R} \frac{N_{i1}^2}{\hat{E}_{i1}} - N_{+1} \right).
\]


4. Suppose that an experiment is carried out to see if
there is any relation between a man’s age and whether
he wears a moustache. Suppose that 100 men, 18 years
of age or older, are selected at random, and each man
is classified according to whether or not he is between
18 and 30 years of age and also according to whether
or not he wears a moustache. The observed numbers are
given in Table 10.17. Test the hypothesis that there is no
relationship between a man’s age and whether he wears a
moustache.


Table 10.17 Data for Exercise 4


                      Wears a        Does not wear
                      moustache      a moustache
Between 18 and 30     12             28
Over 30               8              52


5. Suppose that 300 persons are selected at random from
a large population, and each person in the sample is
classified according to blood type, O, A, B, or AB, and also
according to Rh, positive or negative. The observed
numbers are given in Table 10.18. Test the hypothesis that the
two classifications of blood types are independent.


Table 10.18 Data for Exercise 5 


O A B AB 
Rh positive 82 89 54 19 
Rh negative 13 27 7 9 


6. Suppose that a store carries two different brands, A 
and B, of a certain type of breakfast cereal. Suppose that 
during a one-week period the store noted whether each 
package of this type of cereal that was purchased was 
brand A or brand B and also noted whether the purchaser 
was a man or a woman. (A purchase made by a child 
or by a man and a woman together was not counted.) 
Suppose that 44 packages were purchased, and that the 
results were as shown in Table 10.19. Test the hypothesis 
that the brand purchased and the sex of the purchaser are 
independent. 


Table 10.19 Data for Exercise 6 


Brand A Brand B 
Men 9 6 
Women 13 16 


7. Consider a two-way contingency table with three rows
and three columns. Suppose that, for $i = 1, 2, 3$ and
$j = 1, 2, 3$, the probability $p_{ij}$ that an individual selected at
random from a given population will be classified in the
$i$th row and the $j$th column of the table is as given in Table 10.20.


Table 10.20 Data for Exercise 7


0.15 0.09 0.06
0.15 0.09 0.06
0.20 0.12 0.08


a. Show that the rows and columns of this table are
independent by verifying that the values $p_{ij}$ satisfy
the null hypothesis $H_0$ in Eq. (10.3.3).


b. Generate a random sample of 300 observations from
the given population using a uniform pseudo-random
number generator. Select 300 pseudo-random
numbers between 0 and 1 and proceed as follows:
Since $p_{11} = 0.15$, classify a pseudo-random number
$x$ in the first cell if $x < 0.15$. Since $p_{11} + p_{12} = 0.24$,
classify a pseudo-random number $x$ in the second cell
if $0.15 < x < 0.24$. Continue in this way for all nine
cells. For example, since the sum of all probabilities
except $p_{33}$ is 0.92, a pseudo-random number $x$ will
be classified in the lower-right cell of the table if
$x > 0.92$.

c. Consider the $3 \times 3$ table of observed values $N_{ij}$
generated in part (b). Pretend that the probabilities $p_{ij}$
were unknown, and test the hypotheses (10.3.3).


8. If all the students in a class carry out Exercise 7
independently of each other and use different pseudo-random
numbers, then the different values of the statistic $Q$
obtained by the different students should form a random
sample from the $\chi^2$ distribution with four degrees of
freedom. If the values of $Q$ for all the students in the class are
available to you, test the hypothesis that these values form
such a random sample.


9. Consider a three-way contingency table of size $R \times C \times T$.
For $i = 1, \ldots, R$, $j = 1, \ldots, C$, and $k = 1, \ldots, T$, let
$p_{ijk}$ denote the probability that an individual selected at
random from a given population will fall into the $(i, j, k)$
cell of the table. Let

\[
p_{i++} = \sum_{j=1}^{C} \sum_{k=1}^{T} p_{ijk}, \quad
p_{+j+} = \sum_{i=1}^{R} \sum_{k=1}^{T} p_{ijk}, \quad
p_{++k} = \sum_{i=1}^{R} \sum_{j=1}^{C} p_{ijk}.
\]

On the basis of a random sample of $n$ observations from
the given population, construct a test of the following
hypotheses:

\[
\begin{aligned}
H_0 &: p_{ijk} = p_{i++}\,p_{+j+}\,p_{++k} \quad \text{for all values of } i, j, \text{ and } k, \\
H_1 &: \text{The hypothesis } H_0 \text{ is not true.}
\end{aligned}
\]


10. Consider again the conditions of Exercise 9. For
$i = 1, \ldots, R$ and $j = 1, \ldots, C$, let

\[
p_{ij+} = \sum_{k=1}^{T} p_{ijk}.
\]

On the basis of a random sample of $n$ observations from
the given population, construct a test of the following
hypotheses:

\[
\begin{aligned}
H_0 &: p_{ijk} = p_{ij+}\,p_{++k} \quad \text{for all values of } i, j, \text{ and } k, \\
H_1 &: \text{The hypothesis } H_0 \text{ is not true.}
\end{aligned}
\]


10.4 Tests of Homogeneity


Imagine that we select subjects from several different populations, and that we
observe a discrete random variable for each subject. We might be interested in
whether or not the distribution of that discrete random variable is the same in each
population. There is a $\chi^2$ test of this hypothesis that is very similar to the $\chi^2$ test of
independence.


Samples from Several Populations


Example 10.4.1
College Survey. Consider again the problem described in Example 10.3.1. There we
assumed that a random sample of 200 students was drawn from the entire enrollment
at a large university, and each student was classified in a contingency table according
to the curriculum in which he is enrolled and according to his preference for either of
two political candidates A and B. The resulting table appears in Table 10.12.

Suppose, now, that instead of sampling 200 students at random, we had actually
sampled separately from each of the four curricula. That is, suppose that we had
sampled 59 students at random from those enrolled in engineering and science
along with 48 students selected at random from those enrolled in humanities and
social sciences and 38 from those enrolled in fine arts and 55 from those enrolled in
industrial and public administration. After the students are sampled, those in each
curriculum are then classified according to whether they prefer candidate A or B, or
are undecided. Suppose that the responses within each curriculum are the same as
those reported in Table 10.12.

We might still be interested in investigating whether there is a relationship
between the curriculum in which a student is enrolled and the candidate he prefers.
This time, we might word the question of interest as follows: Are the distributions of
candidate preferences within the different curricula the same or do the students in
different curricula have different distributions of preferences among the candidates? ◄


In Example 10.4.1, we are assuming that we have obtained a table of values 
identical to Table 10.12; we are assuming now that this table was obtained by taking 
four different random samples from the different populations of students defined by 
the four rows of the table. This is in contrast to Example 10.3.1, in which we assumed 
that all students were drawn from one population and then classified according to the 
values of two variables: preference and curriculum. In the present context, we are 
interested in testing the hypothesis that, in all four populations, the same proportion 
of students prefers candidate A, the same proportion prefers candidate B, and the 
same proportion is undecided. 

In general, we shall consider a problem in which random samples are taken from
$R$ different populations, and each observation in each sample can be classified as one
of $C$ different types. Thus, the data obtained from the $R$ samples can be represented
in an $R \times C$ table. For $i = 1, \ldots, R$ and $j = 1, \ldots, C$, we shall let $p_{ij}$ denote the
probability that an observation chosen at random from the $i$th population will be of
type $j$. Thus,

\[
\sum_{j=1}^{C} p_{ij} = 1 \quad \text{for } i = 1, \ldots, R.
\]

The hypotheses to be tested are as follows:

\[
\begin{aligned}
H_0 &: p_{1j} = p_{2j} = \cdots = p_{Rj} \quad \text{for } j = 1, \ldots, C, \\
H_1 &: \text{The hypothesis } H_0 \text{ is not true.}
\end{aligned} \tag{10.4.1}
\]

The null hypothesis $H_0$ in (10.4.1) states that all the distributions from which the $R$
different samples are drawn are actually alike, that is, that the $R$ distributions are
identical. If the null hypothesis in (10.4.1) were true, then combining the $R$ populations
would produce one homogeneous population with regard to the distribution
of the random variables we are studying. For this reason, a test of the hypotheses
(10.4.1) is called a test of homogeneity of the $R$ distributions.

For $i = 1, \ldots, R$, we shall let $N_{i+}$ denote the number of observations in the
random sample from the $i$th population; for $j = 1, \ldots, C$, we shall let $N_{ij}$ denote
the number of observations in this random sample that are of type $j$. Thus,

\[
\sum_{j=1}^{C} N_{ij} = N_{i+} \quad \text{for } i = 1, \ldots, R.
\]

Furthermore, if we let $n$ denote the total number of observations in all $R$ samples
and $N_{+j}$ denote the total number of observations of type $j$ in the $R$ samples, then all
the relations in Eqs. (10.3.1) and (10.3.2) will again be satisfied.


The $\chi^2$ Test of Homogeneity


We shall now develop a test procedure for the hypotheses (10.4.1). Suppose for the
moment that the probabilities $p_{ij}$ are known, and consider the following statistic
calculated from the observations in the $i$th random sample:

\[
\sum_{j=1}^{C} \frac{(N_{ij} - N_{i+}p_{ij})^2}{N_{i+}p_{ij}}.
\]

This statistic is just the standard $\chi^2$ statistic, introduced in Eq. (10.1.2), for the random
sample of $N_{i+}$ observations from the $i$th population. Therefore, when the sample size
$N_{i+}$ is large, the distribution of this statistic will be approximately the $\chi^2$ distribution
with $C - 1$ degrees of freedom.

If we now sum this statistic over the $R$ different samples, we obtain the following
statistic:

\[
\sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(N_{ij} - N_{i+}p_{ij})^2}{N_{i+}p_{ij}}. \tag{10.4.2}
\]

Since the observations in the $R$ samples are drawn independently, the distribution
of the statistic (10.4.2) will be the distribution of the sum of $R$ independent random
variables, each of which has approximately the $\chi^2$ distribution with $C - 1$ degrees of
freedom. Hence, the distribution of the statistic (10.4.2) will be approximately the $\chi^2$
distribution with $R(C - 1)$ degrees of freedom.

Since the probabilities $p_{ij}$ are not actually known, their values must be estimated
from the observed numbers in the $R$ random samples. When the null hypothesis $H_0$ is
true, the $R$ random samples are actually drawn from the same distribution. Therefore,
the M.L.E. of the probability that an observation in each of these samples will be of
type $j$ is simply the proportion of all the observations in the $R$ samples that are of
type $j$. In other words, the M.L.E. of $p_{ij}$ is the same for all values of $i$ ($i = 1, \ldots, R$),
and this estimator is $\hat{p}_{ij} = N_{+j}/n$. When this M.L.E. is substituted into (10.4.2), we
obtain the statistic

\[
Q = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(N_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}}, \tag{10.4.3}
\]

where

\[
\hat{E}_{ij} = \frac{N_{i+} N_{+j}}{n}. \tag{10.4.4}
\]

It can be seen that Eqs. (10.4.3) and (10.4.4) are precisely the same as Eqs. (10.3.4)
and (10.3.5). Thus, the statistic $Q$ to be used for the test of homogeneity in this section
is precisely the same as the statistic $Q$ to be used for the test of independence in
Sec. 10.3. We shall now show that the number of degrees of freedom is also precisely
the same for the test of homogeneity as for the test of independence.

Because the distributions of the $R$ populations are alike when $H_0$ is true, and
because $\sum_{j=1}^{C} p_{ij} = 1$ for this common distribution, we have estimated $C - 1$
parameters in this problem. Therefore, the statistic $Q$ will have approximately the $\chi^2$
distribution with $R(C - 1) - (C - 1) = (R - 1)(C - 1)$ degrees of freedom. This number
is the same as that found in Sec. 10.3.

In summary, consider Table 10.12 again. The statistical analysis of this table will
be the same for either of the following two procedures: The 200 observations are
drawn as a single random sample from the entire enrollment of the university, and
a test of independence is carried out; or the 200 observations are drawn as separate
random samples from four different groups of students, and a test of homogeneity is
carried out. In either case, in a problem of this type with $R$ rows and $C$ columns, we
should calculate the statistic $Q$ defined by Eqs. (10.4.3) and (10.4.4), and we should
assume that its distribution when $H_0$ is true will be approximately the $\chi^2$ distribution
with $(R - 1)(C - 1)$ degrees of freedom.
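The agreement of the two statistics can also be checked numerically. The sketch below computes the statistic (10.4.2) with each $p_{ij}$ replaced by the pooled estimate $N_{+j}/n$, one sample (row) at a time, and compares it with the statistic of Eq. (10.3.4) for the counts of Table 10.12. The function names are illustrative assumptions, not notation from the text.

```python
def homogeneity_statistic(samples):
    """Statistic (10.4.2) with p_ij replaced by the pooled estimate N_+j / n."""
    n = sum(sum(row) for row in samples)
    col_totals = [sum(col) for col in zip(*samples)]   # N_+j
    p_hat = [c / n for c in col_totals]                # common distribution under H0
    q = 0.0
    for row in samples:                                # one chi-square term per sample
        size = sum(row)                                # N_i+
        q += sum((obs - size * p) ** 2 / (size * p) for obs, p in zip(row, p_hat))
    df = (len(samples) - 1) * (len(col_totals) - 1)    # R(C-1) - (C-1)
    return q, df

def independence_statistic(table):
    """Statistic of Eq. (10.3.4) with E_ij = N_i+ * N_+j / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum((table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i in range(len(row_totals))
               for j in range(len(col_totals)))

# Counts from Table 10.12, read as four separate samples (one per curriculum).
table_10_12 = [[24, 23, 12], [24, 14, 10], [17, 8, 13], [27, 19, 9]]

q_hom, df = homogeneity_statistic(table_10_12)
q_ind = independence_statistic(table_10_12)
print(f"homogeneity: {q_hom:.4f}, independence: {q_ind:.4f}, df = {df}")
```

The two values agree up to floating-point rounding because $N_{i+}\hat{p}_{ij} = N_{i+}N_{+j}/n = \hat{E}_{ij}$ in every cell.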


Note: Why the two $\chi^2$ tests look so similar. The reason that the same calculation
is appropriate for both the $\chi^2$ test of independence and the $\chi^2$ test of homogeneity
is the following: First, consider the situation of Sec. 10.3, in which one sample is
drawn and the random variables corresponding to rows and columns are measured.
Independence of the row and column variables is equivalent to the conditional
distribution of the column variable given a value of the row variable being the
same for every value of the row variable. Hence, the test of independence tests that
the conditional distributions of the column variable are the same for each value of
the row variable. Next, think of the row variable as defining subpopulations (for
example, different curricula in Table 10.12). The conditional distributions of the
column variable given each value of the row variable are the distributions of the
column variable within each subpopulation. The test of homogeneity tests that the
distributions within the subpopulations are the same when the samples are drawn
separately from each subpopulation rather than drawn at random from the entire
population.
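The equivalence used in this note can be illustrated with a small numerical sketch. The marginal probabilities below are hypothetical values chosen only for illustration (they do not come from the text). The joint table is built as the product of the marginals, and the conditional distribution of the column variable is then computed within each row using exact fractions.

```python
from fractions import Fraction as F

# Hypothetical marginal distributions (illustrative values, not from the text).
row_marginal = [F(1, 5), F(4, 5)]                 # p_1+, p_2+
col_marginal = [F(1, 2), F(3, 10), F(1, 5)]       # p_+1, p_+2, p_+3

# Under independence, p_ij = p_i+ * p_+j.
joint = [[r * c for c in col_marginal] for r in row_marginal]

# Conditional distribution of the column variable given row i: p_ij / p_i+.
conditionals = [[p / sum(row) for p in row] for row in joint]

for i, cond in enumerate(conditionals, start=1):
    print(f"row {i}:", [str(p) for p in cond])
```

Every row yields the same conditional distribution, namely the column marginal, which is the homogeneity condition (10.4.1) for the subpopulations defined by the rows.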


Comparing Two or More Proportions 


Example 10.4.2
Television Survey. Suppose that independent samples are drawn from adults in several
cities. Each sampled person is asked whether or not they watched a particular
television program. Suppose that we want to test the null hypothesis $H_0$ that the
proportion of adults who watched a certain television program was the same in each of
the cities. To be specific, suppose that there are $R$ different cities ($R \geq 2$). Suppose
that for $i = 1, \ldots, R$, a random sample of $N_{i+}$ adults is selected from city $i$, the
number in the sample who watched the program is $N_{i1}$, and the number who did not watch
the program is $N_{i2} = N_{i+} - N_{i1}$. These data can be presented in an $R \times 2$ table such as
Table 10.21. The hypotheses to be tested will have the same form as the hypotheses
(10.4.1). Hence, when the null hypothesis $H_0$ is true, that is, when the proportion of
adults who watched the program is the same in all $R$ cities, the statistic $Q$ defined
by Eqs. (10.4.3) and (10.4.4) will have approximately the $\chi^2$ distribution with $R - 1$
degrees of freedom. ◄


Table 10.21 Form of table for comparing two or more proportions


         Watched       Did not       Sample
City     program       watch         size
1        $N_{11}$      $N_{12}$      $N_{1+}$
2        $N_{21}$      $N_{22}$      $N_{2+}$
...      ...           ...           ...
R        $N_{R1}$      $N_{R2}$      $N_{R+}$


The reasoning in Example 10.4.2 extends to other problems in which we wish to
compare a collection of proportions.


Example 10.4.3
A Clinical Trial. The data in Table 2.1 (see Example 2.1.4 on page 57) are the numbers
of subjects in four different treatment groups in a clinical trial together with the
numbers who did or did not relapse after treatment. We might wish to test the null
hypothesis that the probability of no relapse is the same in all four treatment groups.
We can easily compute the statistic $Q$ in Eq. (10.4.3) to be 10.80. This is the 0.987
quantile of the $\chi^2$ distribution with three degrees of freedom. That is, the p-value is
0.013, and the null hypothesis of equal probabilities would be rejected at every level
$\alpha_0 > 0.013$. ◄


Correlated 2 × 2 Tables


We shall now describe a type of problem in which the use of the $\chi^2$ test of homogeneity
would not be appropriate. Suppose that 100 persons were selected at random in
a certain city, and that each person was asked whether she thought the service
provided by the fire department in the city was satisfactory. Shortly after this survey was
carried out, a large fire occurred in the city. Suppose that after this fire, the same 100
persons were again asked whether they thought that the service provided by the fire
department was satisfactory. The results are presented in Table 10.22.


Table 10.22 Correlated 2 × 2 table


                    Satisfactory    Unsatisfactory
Before the fire     80              20
After the fire      72              28


Table 10.22 has the same general appearance as other tables we have been
considering in this section. However, it would not be appropriate to carry out a $\chi^2$
test of homogeneity for this table, because the observations taken before the fire
and the observations taken after the fire are not independent. Although the total
number of observations in Table 10.22 is 200, only 100 independently chosen persons
were questioned in the surveys. It is reasonable to believe that a particular person’s
opinion before the fire and her opinion after the fire are dependent. For this reason,
Table 10.22 is called a correlated 2 × 2 table.


Table 10.23 2 × 2 table for correlated responses


                           After the fire
Before the fire     Satisfactory    Unsatisfactory
Satisfactory        70              10
Unsatisfactory      2               18

The proper way to display the opinions of the 100 persons in the random sample 
is shown in Table 10.23. It is not possible to construct Table 10.23 from the data 
in Table 10.22 alone. The entries in Table 10.22 are simply the marginal totals of 
Table 10.23. However, in order to construct Table 10.23, it is necessary to go back to 
the original data and, for each person in the sample, to consider her opinion before 
the fire and her opinion after the fire. 

Furthermore, it usually is not appropriate to carry out either a $\chi^2$ test of independence
or a $\chi^2$ test of homogeneity for Table 10.23, because the hypotheses that are
tested by either of these procedures usually are not those in which a researcher would
be interested in this type of problem. In fact, in this problem a researcher would basically
be interested in the answers to one or both of the following two questions:
First, what proportion of the persons in the city changed their opinions about the fire
department after the fire occurred? Second, among those persons in the city who did
change their opinions after the fire, were the changes predominantly in one direction
rather than the other?

Table 10.23 provides information pertaining to both these questions. According 
to Table 10.23, the number of persons in the sample who changed their opinions after 
the fire was 10 + 2 = 12. Furthermore, among the 12 persons who did change their 
opinions, the opinions of 10 of them were changed from satisfactory to unsatisfactory 
and the opinions of two of them were changed from unsatisfactory to satisfactory. On 
the basis of these statistics, it is possible to make inferences about the corresponding 
proportions for the entire population of the city. 

In this example, the M.L.E. $\hat{\theta}$ of the proportion of the population who changed
their opinions after the fire is 0.12. Also, among those who did change their opinions,
the M.L.E. $\hat{p}_{12}$ of the proportion who changed from satisfactory to unsatisfactory is
5/6. Of course, if $\hat{\theta}$ is very small in a particular problem, then there is little interest in
the value of $\hat{p}_{12}$.
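
The arithmetic behind these two estimates, and the relation between Tables 10.22 and 10.23, can be sketched in a few lines of Python. The dictionary layout and the variable names (`theta_hat`, `p12_hat`) are our own choices for illustration; the counts are those of Table 10.23.

```python
# Counts from Table 10.23: keys are (opinion before the fire, opinion after the fire).
table = {
    ("satisfactory", "satisfactory"): 70,
    ("satisfactory", "unsatisfactory"): 10,
    ("unsatisfactory", "satisfactory"): 2,
    ("unsatisfactory", "unsatisfactory"): 18,
}
n = sum(table.values())

# The marginal totals of Table 10.23 reproduce the "Satisfactory" column of Table 10.22.
before_satisfactory = sum(v for (before, after), v in table.items() if before == "satisfactory")
after_satisfactory = sum(v for (before, after), v in table.items() if after == "satisfactory")

# Persons who changed their opinion are the two off-diagonal cells.
changed = table[("satisfactory", "unsatisfactory")] + table[("unsatisfactory", "satisfactory")]
theta_hat = changed / n                                        # proportion who changed
p12_hat = table[("satisfactory", "unsatisfactory")] / changed  # share of changes toward unsatisfactory

print(before_satisfactory, after_satisfactory, theta_hat, round(p12_hat, 4))
```

Note that the margins can be computed from the cell counts, but not the other way around, which is why Table 10.22 alone is not sufficient.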
Summary


When we sample discrete random variables from several populations, we might
be interested in the null hypothesis that the distribution of the random variables
is the same in all populations. We can perform a $\chi^2$ test of this null hypothesis
as follows: Create a new variable with values equal to the names of the different
populations. Next, treat each observation as if it consisted of the original discrete
random variable together with the new “population name” variable. Finally, compute
the $\chi^2$ test statistic $Q$ from Sec. 10.3 with the same degrees of freedom. For the type
of data considered in this section, the “population name” for each observation is
known before sampling begins, and hence it is not a random variable. Whether the
population name is known ahead of time or is observed as part of the sampled data
(as in Sec. 10.3), the mechanics of the $\chi^2$ test are the same.
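
The mechanics described in this summary can be sketched as follows. This is a minimal Python illustration, assuming that $Q$ has the usual form $\sum_{i,j} (N_{ij} - \hat{E}_{ij})^2 / \hat{E}_{ij}$ with $\hat{E}_{ij} = N_{i+} N_{+j} / n$ and $(R - 1)(C - 1)$ degrees of freedom; Sec. 10.3 is not reproduced here, so that form is stated as an assumption. The function name and the small table of counts are hypothetical and serve only as an illustration.

```python
def chi_square_homogeneity(counts):
    """Return (Q, degrees of freedom) for an R x C table of counts.

    Rows are the populations and columns are the categories of the discrete variable.
    Expected counts are (row total) * (column total) / (grand total).
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    q = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            q += (observed - expected) ** 2 / expected
    dof = (len(counts) - 1) * (len(col_totals) - 1)
    return q, dof

# Hypothetical counts for two populations and two categories (illustration only).
example = [[10, 20], [20, 10]]
q, dof = chi_square_homogeneity(example)
print(round(q, 4), dof)
```

For an $R \times 2$ table the degrees of freedom reduce to $R - 1$, in agreement with the comparison of proportions discussed in this section.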


Exercises 


1. The survey of Chase and Dummer (1992) discussed in 
Exercise 1 of Sec. 10.3 was actually collected by sampling 
from three subpopulations according to the locations of 
the schools: rural, suburban, and urban. Table 10.24 shows 
the responses to the survey question classified by school 
location. Test the null hypothesis that the distribution of 
responses is the same in all three types of school location. 


Table 10.24 Data for Exercise 1 from Chase and Dummer (1992)

             Good       Athletic
             grades     ability     Popularity

Rural          57          42           50
Suburban       87          22           42
Urban         103          26           49


2. An examination was given to 500 high school seniors in 
each of two large cities, and their grades were recorded as 
low, medium, or high. The results are given in Table 10.25. 
Test the hypothesis that the distributions of scores among 
seniors in the two cities are the same. 


Table 10.25 Data for Exercise 2 


Low Medium High 


City A 103 145 252 
City B 140 136 224 


3. Every Tuesday afternoon during the school year, a certain
university brought in a visiting speaker to present a
lecture on some topic of current interest. On the day after
the fourth lecture of the year, random samples of 70
freshmen, 70 sophomores, 60 juniors, and 50 seniors were
selected from the student body at the university, and each
of these students was asked how many of the four lectures
she had attended. The results are given in Table 10.26. Test
the hypothesis that freshmen, sophomores, juniors, and
seniors at the university attended the lectures with equal
frequency.


Table 10.26 Data for Exercise 3

                 Number of lectures attended
                0      1      2      3      4

Freshmen       10     16     27      6     11
Sophomores     14     19     20      4     13
Juniors        15     15     17      4      9
Seniors        19      8      6      5     12


4. Suppose that five persons shoot at a target. Suppose
also that for $i = 1, \ldots, 5$, person $i$ shoots $n_i$ times and hits
the target $y_i$ times, and that the values of $n_i$ and $y_i$ are
as given in Table 10.27. Test the hypothesis that the five
persons are equally good marksmen.


Table 10.27 Data for Exercise 4

$i$     $n_i$     $y_i$

1        17         8
2        16
3        10         7
4        24        13
5        16        10


5. A manufacturing plant has preliminary contracts with
three different suppliers of machines. Each supplier delivered
15 machines, which were used in the plant for four
months in preliminary production. It turned out that one
of the machines from supplier 1 was defective, seven of
the machines from supplier 2 were defective, and seven
of the machines from supplier 3 were defective. The plant
statistician decided to test the null hypothesis $H_0$ that the
three suppliers provided the same quality. Therefore, he
set up Table 10.28 and carried out a $\chi^2$ test. By summing
the values in the bottom row of Table 10.28, he found that
the value of the $\chi^2$ statistic was 24/5 with two degrees of
freedom. He then found from a table of the $\chi^2$ distribution
that $H_0$ should be accepted when the level of significance
is 0.05. Criticize this procedure and provide a meaningful
analysis of the observed data.


Table 10.28 Data for Exercise 5

                                                 Supplier
                                            1       2       3

Number of defectives $N_i$                  1       7       7
Expected number of defectives $E_i$
  under $H_0$                               5       5       5
$(N_i - E_i)^2 / E_i$                     16/5     4/5     4/5
6. Suppose that 100 students in a physical education class
shoot at a target with a bow and arrow, and 27 students 
hit the target. These 100 students are then given a demon- 
stration on the proper technique for shooting with the bow 
and arrow. After the demonstration, they again shoot at 
the target. This time 35 students hit the target. What addi- 
tional information, if any, is needed in order to investigate 
the hypothesis that the demonstration was helpful? 


7. As people entered a certain meeting, $n$ persons were selected
at random, and each was asked either to name one
of two political candidates she favored in a forthcoming
election or to say “undecided” if she had no real preference.
During the meeting, the people heard a speech on
behalf of one of the candidates. After the meeting, each of
the same $n$ persons was again asked to express her opinion.
Describe a method for evaluating the effectiveness of
the speaker.


10.5 Simpson’s Paradox 


When tabulating discrete data, we need to be careful about aggregating groups.
Suppose that a survey has two questions. If we construct a single table of responses
to the two questions that includes both men and women, we might get a very
different picture than if we construct separate tables for the responses of men and
women.


An Example of the Paradox 


Example 10.5.1

Comparing Treatments in an Aggregated Table. Suppose that an experiment is carried
out in order to compare a new treatment for a particular disease with the standard
treatment for the disease. In the experiment, 80 subjects suffering from the disease
are treated, 40 subjects receiving the new treatment and 40 receiving the standard
treatment. After a certain period of time, it is observed how many of the subjects in
each group have improved and how many have not. Suppose that the overall results
for all 80 patients are as shown in Table 10.29.

According to this table, 20 of the 40 subjects who received the new treatment
improved, and 24 of the 40 subjects who received the standard treatment improved.
Thus, 50 percent of the subjects improved under the new treatment, whereas 60
percent improved under the standard treatment. On the basis of these results, the
new treatment appears inferior to the standard treatment.

Table 10.29 Results of experiment comparing two treatments

                                                      Percent
All patients           Improved     Not improved      improved

New treatment             20             20              50
Standard treatment        24             16              60


Table 10.30 Table 10.29 disaggregated by sex

                                                      Percent
Men only               Improved     Not improved      improved

New treatment             12             18              40
Standard treatment         3              7              30

Women only

New treatment              8              2              80
Standard treatment        21              9              70


Many contingency tables, such as Table 10.29, summarize the results of a study 
in only one of several possible ways. The next example looks at the same data from 
a different point of view and draws a different conclusion. 


Comparing Treatments in a Disaggregated Table. In order to investigate more carefully
the efficacy of the new treatment in Example 10.5.1, we might compare it with the 
standard treatment just for the men in the sample and, separately, just for the women 
in the sample. The results in Table 10.29 can thus be partitioned into two tables, 
one pertaining just to men and the other just to women. This process of splitting 
the overall data into disjoint components pertaining to different subgroups of the 
population is called disaggregation. 

Suppose that when the values in Table 10.29 are disaggregated by considering 
the men and the women separately, the results are as shown in Table 10.30. It can be 
verified that when the data in these separate tables are combined, or aggregated, we 
again obtain Table 10.29. However, Table 10.30 contains a big surprise because the 
new treatment appears to be superior to the standard treatment both for men and 
for women. Specifically, 40 percent of the men (12 out of 30) who received the new 
treatment improved, but only 30 percent of the men (3 out of 10) who received the 
standard treatment improved. Furthermore, 80 percent of the women (8 out of 10) 
who received the new treatment improved, but only 70 percent of the women (21 out 
of 30) who received the standard treatment improved.


Tables 10.29 and 10.30 together yield somewhat anomalous results. According 
to Table 10.30, the new treatment is superior to the standard treatment both for men 
and for women, but according to Table 10.29, the new treatment is inferior to the 
standard treatment when all the subjects are aggregated. This type of result is known
as Simpson's paradox. 

It should be emphasized that Simpson’s paradox is not a phenomenon that occurs 
because we are working with small samples. The small numbers in Tables 10.29 and 
10.30 were used merely for convenience in this explanation. Each of the entries in 
these tables could be multiplied by 1000 or by 1,000,000 without changing the results. 
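
The reversal can be reproduced directly from the counts in Table 10.30. The following Python sketch computes the improvement rates within each sex and after aggregation; the data layout and the helper name `rate` are our own, and the last lines illustrate the remark that multiplying every entry by a constant leaves the rates unchanged.

```python
# (improved, not improved) counts from Table 10.30
data = {
    "men":   {"new": (12, 18), "standard": (3, 7)},
    "women": {"new": (8, 2),   "standard": (21, 9)},
}

def rate(improved, not_improved):
    """Proportion of subjects who improved."""
    return improved / (improved + not_improved)

# Improvement rates within each sex (the disaggregated tables).
by_sex = {sex: {t: rate(*c) for t, c in groups.items()} for sex, groups in data.items()}

# Improvement rates after adding the men's and women's counts (the aggregated table).
aggregated = {}
for treatment in ("new", "standard"):
    improved = sum(data[sex][treatment][0] for sex in data)
    not_improved = sum(data[sex][treatment][1] for sex in data)
    aggregated[treatment] = rate(improved, not_improved)

print(by_sex)
print(aggregated)

# Scaling every count by the same factor does not change any rate.
scale = 1000
scaled_rate = rate(12 * scale, 18 * scale)
```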


The Paradox Explained 


Of course, Simpson’s paradox is not actually a paradox; it is merely a result that is 
surprising and puzzling to someone who has not seen or thought about it before. It 
can be seen from Table 10.30 that in the example we are considering, women have 
a higher rate of improvement from the disease than men have, regardless of which 
treatment they receive. Furthermore, most of the women in the sample received the 
standard treatment while most of the men received the new treatment. Specifically, 
among the 40 men in the sample, 30 received the new treatment, and only 10 received 
the standard treatment, whereas among the 40 women in the sample, these numbers 
are reversed. 

The new treatment looks bad in the aggregated table because most of the people 
who weren’t going to respond well to either treatment got the new treatment while 
most of the people who were going to respond well to either treatment got the 
standard treatment. Even though the numbers of men and women in the experiment 
were equal, a high proportion of the women and a low proportion of the men received 
the standard treatment. Since women have a much higher rate of improvement than 
men, it is found in the aggregated Table 10.29 that the standard treatment manifests 
a higher overall rate of improvement than does the new treatment. 

Simpson’s paradox demonstrates dramatically the dangers in making inferences 
from an aggregated table like Table 10.29. To make sure that Simpson’s paradox 
cannot occur in an experiment like the one just described, the proportions of men 
and women among the subjects who receive the new treatment must be the same, or 
approximately the same, as the proportions of men and women among the subjects 
who receive the standard treatment. It is not necessary that there be equal numbers 
of men and women in the sample. 

We can express Simpson’s paradox in probability terms. Let $A$ denote the event
that a subject chosen for the experiment will be a man, and let $A^c$ denote the event
that the subject will be a woman. Also, let $B$ denote the event that a subject will
receive the new treatment, and let $B^c$ denote the event that the subject will receive
the standard treatment. Finally, let $I$ denote the event that a subject will improve.
Simpson’s paradox then reflects the fact that it is possible for all three of the following
inequalities to hold simultaneously:

$$
\begin{aligned}
\Pr(I \mid A \cap B) &> \Pr(I \mid A \cap B^c), \\
\Pr(I \mid A^c \cap B) &> \Pr(I \mid A^c \cap B^c), \\
\Pr(I \mid B) &< \Pr(I \mid B^c).
\end{aligned}
\tag{10.5.1}
$$

The discussion that we have just given in regard to the prevention of Simpson’s
paradox can be expressed as follows: If $\Pr(A \mid B) = \Pr(A \mid B^c)$, then it is not possible
for all three inequalities in (10.5.1) to hold (see Exercise 5). Similarly, if $\Pr(B \mid A) =
\Pr(B \mid A^c)$, then it is not possible for all three inequalities in (10.5.1) to hold (see
Exercise 3).
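
A numerical illustration of these probability statements, using the rates of Table 10.30, is sketched below in Python. It is a check on one set of numbers, not a proof. The function name `overall_rate` and the grid of proportions are our own choices; the function simply applies the law of total probability, mixing the men's and women's improvement rates by the proportion of men in a treatment group.

```python
def overall_rate(rate_men, rate_women, prop_men):
    """Law of total probability: mix the two conditional rates by the proportion of men."""
    return rate_men * prop_men + rate_women * (1.0 - prop_men)

# Conditional improvement rates from Table 10.30.
new = {"men": 0.40, "women": 0.80}        # Pr(I | A, B) and Pr(I | A^c, B)
standard = {"men": 0.30, "women": 0.70}   # Pr(I | A, B^c) and Pr(I | A^c, B^c)

# In the example, Pr(A | B) = 30/40 while Pr(A | B^c) = 10/40.
pr_I_given_B = overall_rate(new["men"], new["women"], prop_men=30 / 40)
pr_I_given_Bc = overall_rate(standard["men"], standard["women"], prop_men=10 / 40)
print(pr_I_given_B, pr_I_given_Bc)

# If both treatment groups had the same proportion of men, the new treatment
# would keep its advantage for these rates, whatever that common proportion is.
balanced_differences = [
    overall_rate(new["men"], new["women"], w)
    - overall_rate(standard["men"], standard["women"], w)
    for w in (0.0, 0.25, 0.5, 0.75, 1.0)
]
print(balanced_differences)
```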


The possibility of Simpson’s paradox lurks within every contingency table. Even
though we might take care to design a particular experiment so that Simpson’s
paradox cannot occur when we disaggregate with respect to men and women, it is
always possible that there is some other variable, such as the age of the subject or the
intensity and the stage of the disease, with respect to which disaggregation would lead
us to a conclusion directly opposite to that indicated by the aggregated table. Once an
experiment is designed to prevent Simpson’s paradox with respect to disaggregations
that can be identified in advance, subjects are generally assigned randomly to the
possible treatments in the hopes of minimizing the chance that Simpson’s paradox
will arise with respect to an unforeseen disaggregation.


Comparing Treatments in an Aggregated Table. In the example of this section, it would 
be sensible to assign 20 men and 20 women to each of the two treatments. Which 
20 men and which 20 women get assigned to each treatment would be determined 
by randomization in order to minimize the chance of an unforeseen occurrence of 
Simpson’s paradox. 

If there were other information, such as severity of disease, that were available at 
the start of the experiment, the groups of men and women should each be partitioned 
according to that additional information before being randomly assigned to the 
treatments. For example, suppose that 12 men and 8 women have more severe cases
of the disease before the experiment begins. We should then assign 6 of the men and 4
of the women with more severe cases to each treatment. We should also assign 14 of the
men and 16 of the women with less severe cases to each treatment. This balances the
factors (sex, severity, and treatment) that are expected to affect the experimental
outcome. If there is another unforeseen factor that will affect the outcome, it is
still possible, but unlikely, that the random assignment described above will allow
Simpson’s paradox to arise with regard to that one factor. If there are dozens of
additional important factors, some degree of imbalance will be inevitable even with
a randomized assignment.
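
The balanced random assignment described in this example can be sketched as follows. This is a minimal Python illustration under our own assumptions: subjects are identified by an index within their stratum, each stratum has an even number of subjects, and the function name `stratified_assignment` and the seed are hypothetical. With 40 men and 40 women in total, the less severe strata contain 28 men and 32 women.

```python
import random

def stratified_assignment(strata_sizes, seed=0):
    """Randomly split each stratum in half between the two treatments.

    strata_sizes maps a stratum label to its (even) number of subjects.
    Returns a dict mapping (stratum, subject index) to "new" or "standard".
    """
    rng = random.Random(seed)
    assignment = {}
    for stratum, size in strata_sizes.items():
        subjects = list(range(size))
        rng.shuffle(subjects)          # random order within the stratum
        half = size // 2
        for position, subject in enumerate(subjects):
            assignment[(stratum, subject)] = "new" if position < half else "standard"
    return assignment

strata = {
    ("men", "more severe"): 12,
    ("men", "less severe"): 28,
    ("women", "more severe"): 8,
    ("women", "less severe"): 32,
}
assignment = stratified_assignment(strata)

# Tally how many subjects of each stratum ended up in each treatment.
counts = {}
for (stratum, _subject), treatment in assignment.items():
    counts[(stratum, treatment)] = counts.get((stratum, treatment), 0) + 1
print(counts)
```

Randomizing within strata balances sex and severity exactly; any factor not used to form the strata is balanced only on average.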


Summary 


Simpson’s paradox occurs when the relationship between the two categorical vari- 
ables in every part of a disaggregated table is the opposite of the relationship between 
those same two variables in the aggregated table. 


Exercises 


1. Consider two populations I and II. Suppose that 80 percent
of the men and 30 percent of the women in population
I have a certain characteristic, and that only 60 percent of
the men and 10 percent of the women in population II
have the characteristic. Explain how, under these conditions,
it might be true that the proportion of population II
having the characteristic is larger than the proportion of
population I having the characteristic.


2. Suppose that $A$ and $B$ are events such that $0 < \Pr(A) < 1$
and $0 < \Pr(B) < 1$. Show that $\Pr(A \mid B) = \Pr(A \mid B^c)$ if and
only if $\Pr(B \mid A) = \Pr(B \mid A^c)$.


3. Show that all three inequalities in (10.5.1) cannot hold
if $\Pr(B \mid A) = \Pr(B \mid A^c)$.


4. Suppose that each adult subject in an experiment is
given either treatment I or treatment II. Prove that the
proportion of men among the subjects who receive treatment
I is equal to the proportion of men among the subjects
who receive treatment II if and only if the proportion
of all men in the experiment who receive treatment I is
equal to the proportion of all women who receive treatment I.


5. Show that all three inequalities in (10.5.1) cannot hold
if $\Pr(A \mid B) = \Pr(A \mid B^c)$.


6. It was believed that a certain university was discriminating
against women in its admissions policy because
30 percent of all the male applicants to the university were
admitted, whereas only 20 percent of all the female applicants
were admitted. In order to determine which of the
five colleges in the university were most responsible for
this discrimination, the admissions rates for each college
were analyzed separately. Surprisingly, it was found that
in each college the proportion of female applicants who
were admitted to the college was actually larger than the
proportion of male applicants who were admitted. Discuss
and explain this result.


7. In an experiment involving 800 subjects, each subject
received either treatment I or treatment II, and each subject
was classified into one of the following four categories:
older males, younger males, older females, and younger
females. At the end of the experiment, it was determined
for each subject whether the treatment that the subject
had received was helpful or not. The results for each of
the four categories of subjects are given in Table 10.31.


a. Show that treatment II is more helpful than treatment I
within each of the four categories of subjects.


b. Show that if these four categories are aggregated into
only the two categories, older subjects and younger
subjects, then treatment I is more helpful than treatment II
within each of these categories.


c. Show that if the two categories in part (b) are aggregated
into a single category containing all 800 subjects,
then treatment II again appears to be more
helpful than treatment I.


Table 10.31 Data for Exercise 7

                      Helpful     Not helpful

Older males
  Treatment I           120           120
  Treatment II           20            10

Younger males
  Treatment I            60            20
  Treatment II           40            10

Older females
  Treatment I            10            50
  Treatment II           20            50

Younger females
  Treatment I            10            10
  Treatment II          160            90


* 10.6 Kolmogorov-Smirnov Tests 


In Sec. 10.1, we used the $\chi^2$ test to test the null hypothesis that a random sample
came from a particular continuous distribution against the alternative hypothesis 
that the sample did not come from that distribution. A more suitable test for these 
hypotheses is introduced in this section. This test can also be extended to test the 
null hypothesis that two independent samples came from the same distribution 
against the alternative hypothesis that they came from two different distributions. 


The Sample Distribution Function 


Example 10.6.1

Failure Times of Ball Bearings. In Example 10.1.6, we used a $\chi^2$ goodness-of-fit test
to test the null hypothesis that the log-failure times of ball bearings came from the
normal distribution with mean 3.912 and variance 0.25. That test required us to
choose a somewhat arbitrary partition of the real line in order to convert the
log-failure times into count data. Is there a test procedure for such problems that does
not require an arbitrary aggregation into intervals that may have no physical meaning
in the application?


The first step in trying to answer the question in Example 10.6.1 is to construct an
estimator of the distribution of the random sample that does not rely on the assumption
that the distribution was normal. Suppose that the random variables $X_1, \ldots, X_n$
form a random sample from some continuous distribution, and let $x_1, \ldots, x_n$ denote
the observed values of $X_1, \ldots, X_n$. Since the observations come from a continuous
distribution, there is probability 0 that any two of the observed values $x_1, \ldots, x_n$ will
be equal. Therefore, we shall assume for simplicity that all $n$ values are different. We
shall consider now a function $F_n(x)$, which is constructed from the values $x_1, \ldots, x_n$
and will serve as an estimate of the c.d.f. from which the sample was drawn.


Figure 10.1 Sample c.d.f. of log-failure times of ball bearings together with the c.d.f. of
the normal distribution with mean 3.912 and variance 0.25. (Horizontal axis: log-failure
time; vertical axis: sample c.d.f.)


Sample (Empirical) Distribution Function. Let $x_1, \ldots, x_n$ be the observed values of a
random sample $X_1, \ldots, X_n$. For each number $x$ ($-\infty < x < \infty$), define the value
$F_n(x)$ as the proportion of observed values in the sample that are less than or equal
to $x$. In other words, if exactly $k$ of the observed values in the sample are less than
or equal to $x$, then $F_n(x) = k/n$. The function $F_n(x)$ defined in this way is called the
sample distribution function, or simply the sample c.d.f. Sometimes $F_n(x)$ is called the
empirical c.d.f.


The sample c.d.f. for the data discussed in Example 10.6.1 appears in Fig. 10.1 
together with the hypothesized normal c.d.f. mentioned in that example. 

In general, the sample c.d.f. $F_n(x)$ can be regarded as the c.d.f. of a discrete
distribution that assigns probability $1/n$ to each of the $n$ values $x_1, \ldots, x_n$. Thus, $F_n(x)$
will be a step function with a jump of magnitude $1/n$ at each point $x_i$ ($i = 1, \ldots, n$).
If we let $y_1 < y_2 < \cdots < y_n$ denote the values of the order statistics of the sample, as
defined in Definition 7.8.2, then $F_n(x) = 0$ for $x < y_1$; $F_n(x)$ jumps to the value $1/n$ at
$x = y_1$ and remains at $1/n$ for $y_1 \leq x < y_2$; $F_n(x)$ jumps to the value $2/n$ at $x = y_2$ and
remains at $2/n$ for $y_2 \leq x < y_3$; and so on.
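
The step-function description above translates directly into code. The following Python sketch builds the sample c.d.f. from a list of observations; the function name `sample_cdf` and the four data values are hypothetical and are used only to show the jumps of size $1/n$.

```python
from bisect import bisect_right

def sample_cdf(observations):
    """Return the sample c.d.f. F_n as a function of x."""
    ordered = sorted(observations)      # the order statistics y_1 <= ... <= y_n
    n = len(ordered)

    def F_n(x):
        # bisect_right counts how many ordered values are less than or equal to x
        return bisect_right(ordered, x) / n

    return F_n

# Hypothetical sample of n = 4 distinct values (illustration only).
F_n = sample_cdf([2.0, 0.5, 1.5, 3.0])
for x in (0.4, 0.5, 1.0, 1.5, 2.9, 3.0, 10.0):
    print(x, F_n(x))
```

Because the comparison is "less than or equal to," the function is right-continuous: it already takes its new value at each observed point.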

Now let $F(x)$ denote the c.d.f. of the distribution from which the random sample
$X_1, \ldots, X_n$ was drawn. For each given number $x$ ($-\infty < x < \infty$), the probability
that any particular observation $X_i$ will be less than or equal to $x$ is $F(x)$. Therefore,
it follows from the law of large numbers that as $n \to \infty$, the proportion $F_n(x)$ of
observations in the sample that are less than or equal to $x$ will converge in probability
to $F(x)$. In symbols,

$$
F_n(x) \xrightarrow{p} F(x) \quad \text{for } -\infty < x < \infty. \tag{10.6.1}
$$

The relation (10.6.1) expresses the fact that at each point $x$, the sample c.d.f. $F_n(x)$
will converge to the actual c.d.f. $F(x)$ of the distribution from which the random
sample was taken. A collection of sample c.d.f.’s is sketched in Fig. 10.2 for a few
different sized samples from the same distribution.


Figure 10.2 The sample c.d.f. $F_n(x)$ for $n = 4, 8, 16$.


Figure 10.3 The value of $D_n$.


An even stronger result, known as the Glivenko-Cantelli lemma, states that
$F_n(x)$ will converge to $F(x)$ uniformly over all values of $x$. The proof is beyond the
scope of this book.


Theorem 10.6.1

Glivenko-Cantelli Lemma. Let $F_n$ be the sample c.d.f. from an i.i.d. sample $X_1, \ldots, X_n$
from the c.d.f. $F$. Define

$$
D_n = \sup_{-\infty < x < \infty} |F_n(x) - F(x)|. \tag{10.6.2}
$$

Then $D_n \xrightarrow{p} 0$.


A value of $D_n$ is illustrated in Fig. 10.3 for a typical example. Before the values of
$X_1, \ldots, X_n$ have been observed, the value of $D_n$ is a random variable.

Theorem 10.6.1 implies that when the sample size $n$ is large, the sample c.d.f.
$F_n(x)$ is quite likely to be close to the c.d.f. $F(x)$ over the entire real line. In this sense,
when the c.d.f. $F(x)$ is unknown, the sample c.d.f. $F_n(x)$ can be considered to be an
estimator of $F(x)$. In another sense, however, $F_n(x)$ is not a very reasonable estimator
of $F(x)$. As we explained earlier, $F_n(x)$ will be the c.d.f. of a discrete distribution that
is concentrated on $n$ points, whereas we are assuming in this section that the unknown
c.d.f. $F(x)$ is the c.d.f. of a continuous distribution. Some type of smoothed version of
$F_n(x)$, from which the jumps have been removed, might yield a reasonable estimator
of $F(x)$, but we shall not pursue this topic further here.


The Kolmogorov-Smirnov Test of a Simple Hypothesis 


Suppose now that we wish to test the simple null hypothesis that the unknown c.d.f.
$F(x)$ is actually a particular continuous c.d.f. $F^*(x)$ against the general alternative
that the actual c.d.f. is not $F^*(x)$. In other words, suppose that we wish to test the
following hypotheses:

$$
\begin{aligned}
H_0 &: F(x) = F^*(x) \quad \text{for } -\infty < x < \infty, \\
H_1 &: \text{The hypothesis } H_0 \text{ is not true.}
\end{aligned}
\tag{10.6.3}
$$

This problem is a nonparametric problem because the unknown distribution from
which the random sample is taken might be any continuous distribution.

In Sec. 10.1, we described how the $\chi^2$ test of goodness-of-fit can be used to
test hypotheses having the form (10.6.3). That test, however, requires grouping the
observations into a finite number of intervals in an arbitrary manner. We shall now
describe a test of the hypotheses (10.6.3) that does not require such grouping.

As before, we shall let $F_n(x)$ denote the sample c.d.f. Also, we shall now let $D_n^*$
denote the following statistic:

$$
D_n^* = \sup_{-\infty < x < \infty} |F_n(x) - F^*(x)|. \tag{10.6.4}
$$

In other words, $D_n^*$ is the maximum difference between the sample c.d.f. $F_n(x)$ and
the hypothesized c.d.f. $F^*(x)$. When the null hypothesis $H_0$ in (10.6.3) is true, the
probability distribution of $D_n^*$ will be a certain distribution that is the same for every
possible continuous c.d.f. $F^*(x)$ and does not depend on the particular c.d.f. $F^*(x)$
being studied in a specific problem. (See Exercise 13.) Tables of this distribution, for
various values of the sample size $n$, have been developed and are presented in many
published collections of statistical tables.
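
Because $F_n$ is a step function and $F^*$ is nondecreasing, the supremum in Eq. (10.6.4) can be found by looking only at the order statistics, comparing $F^*$ with the value of $F_n$ at each point and with its value just to the left of that point. The following Python sketch does this under the assumption, made throughout this section, that the observed values are distinct. The function names and the uniform example are our own and serve only as an illustration.

```python
def ks_statistic(observations, hypothesized_cdf):
    """Return sup_x |F_n(x) - F*(x)| for distinct observations and a continuous F*."""
    ordered = sorted(observations)
    n = len(ordered)
    d = 0.0
    for i, y in enumerate(ordered, start=1):
        f_star = hypothesized_cdf(y)
        # F_n equals i/n at y and (i - 1)/n just to the left of y.
        d = max(d, abs(i / n - f_star), abs((i - 1) / n - f_star))
    return d

def uniform_cdf(x):
    """c.d.f. of the uniform distribution on the interval [0, 1]."""
    return min(1.0, max(0.0, x))

# Hypothetical sample of two values tested against the uniform c.d.f.
d = ks_statistic([0.25, 0.75], uniform_cdf)
print(d)
```

Checking the left limit as well as the value at each point matters: the largest gap often occurs just before a jump of $F_n$.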

It follows from the Glivenko-Cantelli lemma that the value of $D_n^*$ will tend to
be small if the null hypothesis $H_0$ is true, and $D_n^*$ will tend to be larger if the actual
c.d.f. $F(x)$ is different from $F^*(x)$. Therefore, a reasonable test procedure for the
hypotheses (10.6.3) is to reject $H_0$ if $n^{1/2} D_n^* > c$, where $c$ is an appropriate constant.

It is convenient to express the test procedure in terms of $n^{1/2} D_n^*$ rather than
simply $D_n^*$, because of the following result, which was established in the 1930s by
A. N. Kolmogorov and N. V. Smirnov.


If the null hypothesis $H_0$ is true, then for each given value $t > 0$,

$$
\lim_{n \to \infty} \Pr(n^{1/2} D_n^* \leq t) = 1 - 2 \sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 t^2}. \tag{10.6.5}
$$

Thus, if the null hypothesis $H_0$ is true, then as $n \to \infty$, the c.d.f. of $n^{1/2} D_n^*$ will
converge to the c.d.f. given by the infinite series on the right side of Eq. (10.6.5). For
each value of $t > 0$, we shall let $H(t)$ denote the value on the right side of Eq. (10.6.5).
The values of $H(t)$ are given in Table 10.32.
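
The series in Eq. (10.6.5) converges quickly for the values of $t$ that matter in practice, so $H(t)$ is easy to evaluate numerically. The following Python sketch truncates the sum after a fixed number of terms; the function name `kolmogorov_cdf` and the default of 100 terms are our own choices, adequate for the range of $t$ covered by Table 10.32 but not for $t$ very close to 0.

```python
import math

def kolmogorov_cdf(t, terms=100):
    """Approximate H(t) = 1 - 2 * sum_{i>=1} (-1)^(i-1) * exp(-2 i^2 t^2) by a truncated sum."""
    if t <= 0:
        return 0.0
    total = 0.0
    for i in range(1, terms + 1):
        total += (-1) ** (i - 1) * math.exp(-2.0 * i * i * t * t)
    return 1.0 - 2.0 * total

# Values to compare with the corresponding entries of Table 10.32.
for t in (0.50, 1.00, 1.50, 2.00):
    print(t, round(kolmogorov_cdf(t), 4))
```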


Table 10.32 The c.d.f. $H$ in Eq. (10.6.5)

$t$      $H(t)$        $t$      $H(t)$

0.30     0.0000        1.20     0.8878
0.35     0.0003        1.25     0.9121
0.40     0.0028        1.30     0.9319
0.45     0.0126        1.35     0.9478
0.50     0.0361        1.40     0.9603
0.55     0.0772        1.45     0.9702
0.60     0.1357        1.50     0.9778
0.65     0.2080        1.60     0.9880
0.70     0.2888        1.70     0.9938
0.75     0.3728        1.80     0.9969
0.80     0.4559        1.90     0.9985
0.85     0.5347        2.00     0.9993
0.90     0.6073        2.10     0.9997
0.95     0.6725        2.20     0.9999
1.00     0.7300        2.30     0.9999
1.05     0.7798        2.40     1.0000
1.10     0.8223        2.50     1.0000
1.15     0.8580


Definition 10.6.2

Kolmogorov-Smirnov test. A test procedure that rejects $H_0$ when $n^{1/2} D_n^* > c$ is called
a Kolmogorov-Smirnov test.


It follows from Eq. (10.6.5) that when the sample size $n$ is large, the constant
$c$ can be chosen from Table 10.32 to achieve, at least approximately, any specified
level of significance $\alpha_0$ ($0 < \alpha_0 < 1$). In fact, we should choose $c$ to be the $1 - \alpha_0$
quantile $H^{-1}(1 - \alpha_0)$ of the distribution $H$. For example, by examining Table 10.32,
we see that $H(1.36) \approx 0.95$, so $H^{-1}(1 - 0.05) = 1.36$. Therefore, if the null hypothesis
$H_0$ is true, then $\Pr(n^{1/2} D_n^* > 1.36) = 0.05$. It follows that the level of significance of a
Kolmogorov-Smirnov test with $c = 1.36$ will be 0.05.
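
Instead of reading the constant $c$ from a table, one can solve $H(c) = 1 - \alpha_0$ numerically. The following Python sketch does so by bisection, using a truncated version of the series in Eq. (10.6.5). The function names, the search interval from 0.3 to 3.0, and the tolerance are our own assumptions; the interval covers the levels of significance ordinarily used but not values of $\alpha_0$ extremely close to 0 or 1.

```python
import math

def kolmogorov_cdf(t, terms=100):
    """Truncated series for H(t) from Eq. (10.6.5)."""
    if t <= 0:
        return 0.0
    total = 0.0
    for i in range(1, terms + 1):
        total += (-1) ** (i - 1) * math.exp(-2.0 * i * i * t * t)
    return 1.0 - 2.0 * total

def critical_value(alpha0, lo=0.3, hi=3.0, tol=1e-6):
    """Find c with H(c) = 1 - alpha0 by bisection, using the fact that H is increasing."""
    target = 1.0 - alpha0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kolmogorov_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c = critical_value(0.05)
print(round(c, 2))   # to compare with the value 1.36 read from Table 10.32
```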


Example 10.6.2

Testing Whether a Sample Comes from a Standard Normal Distribution. Suppose that it
is desired to test the null hypothesis that a certain random sample of 25 observations
was drawn from a standard normal distribution against the alternative that the
random sample was drawn from some other continuous distribution. The 25 observed
values in the sample, in order from the smallest to the largest, are designated as
$y_1, \ldots, y_{25}$ and are listed in Table 10.33. The table also includes the value $F_n(y_i)$ of
the sample c.d.f. and the value $\Phi(y_i)$ of the c.d.f. of the standard normal distribution.


Table 10.33 Calculations for Kolmogorov-Smirnov test

$i$      $y_i$     $F_n(y_i)$   $\Phi(y_i)$

 1      -2.46       0.04         0.0069
 2      -2.11       0.08         0.0174
 3      -1.23       0.12         0.1093
 4      -0.99       0.16         0.1611
 5      -0.42       0.20         0.3372
 6      -0.39       0.24         0.3483
 7      -0.21       0.28         0.4168
 8      -0.15       0.32         0.4404
 9      -0.10       0.36         0.4602
10      -0.07       0.40         0.4721
11      -0.02       0.44         0.4920
12       0.27       0.48         0.6064
13       0.40       0.52         0.6554
14       0.42       0.56         0.6628
15       0.44       0.60         0.6700
16       0.70       0.64         0.7580
17       0.81       0.68         0.7910
18       0.88       0.72         0.8106
19       1.07       0.76         0.8577
20       1.39       0.80         0.9177
21       1.40       0.84         0.9192
22       1.47       0.88         0.9292
23       1.62       0.92         0.9474
24       1.64       0.96         0.9495
25       1.76       1.00         0.9608


Figure 10.4 The value of $D_n^*$ in Example 10.6.2.


By examining the values in Table 10.33, we find that $D_n^*$, which is the largest
difference between $F_n(x)$ and $\Phi(x)$, occurs when we pass from $i = 4$ to $i = 5$, that is,
as $x$ increases from the point $x = -0.99$ toward the point $x = -0.42$. The comparison
of $F_n(x)$ and $\Phi(x)$ over this interval is illustrated in Fig. 10.4, from which we
see that $D_n^* = 0.3372 - 0.16 = 0.1772$. Since $n = 25$ in this example, it follows that
$n^{1/2} D_n^* = 0.886$. From Table 10.32, we find that $H(0.886) = 0.6$. Hence, the tail area
corresponding to the observed value of $n^{1/2} D_n^*$ is 0.4, and we would not reject the
null hypothesis at levels $\alpha_0$ smaller than 0.4.


It is important to emphasize again that when the sample size $n$ is large, even a
small value of the tail area corresponding to the observed value of $n^{1/2} D_n^*$ would not
necessarily indicate that the true c.d.f. $F(x)$ was much different from the hypothesized
c.d.f. $\Phi(x)$. When $n$ itself is large, even a small difference between the c.d.f. $F(x)$ and
the c.d.f. $\Phi(x)$ would be sufficient to generate a large value of $n^{1/2} D_n^*$. Therefore,
before a statistician rejects the null hypothesis, he should make certain that there is
a plausible alternative c.d.f. with which the sample c.d.f. $F_n(x)$ provides closer agreement.
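
The calculation in Example 10.6.2 can be checked numerically. The sketch below is a minimal illustration, not part of the original text: it computes $D_n^*$ for the 25 observations of Table 10.33 against the standard normal c.d.f., using the fact that the supremum is attained at one of the jumps of the sample c.d.f. The helper names `std_normal_cdf` and `ks_one_sample` are hypothetical.

```python
import math

def std_normal_cdf(x):
    """C.d.f. of the standard normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_one_sample(values, cdf):
    """Return D_n^* = sup_x |F_n(x) - cdf(x)| for a continuous cdf.

    The supremum is attained at an observation, either just before or at
    the jump of the sample c.d.f., so both sides of each jump are checked.
    """
    ys = sorted(values)
    n = len(ys)
    d = 0.0
    for i, y in enumerate(ys, start=1):
        f = cdf(y)
        d = max(d, abs(i / n - f), abs((i - 1) / n - f))
    return d

# The 25 ordered observations listed in Table 10.33.
y = [-2.46, -2.11, -1.23, -0.99, -0.42, -0.39, -0.21, -0.15, -0.10,
     -0.07, -0.02, 0.27, 0.40, 0.42, 0.44, 0.70, 0.81, 0.88, 1.07,
     1.39, 1.40, 1.47, 1.62, 1.64, 1.76]

d_star = ks_one_sample(y, std_normal_cdf)
statistic = math.sqrt(len(y)) * d_star
print(round(d_star, 4), round(statistic, 3))
```

The two printed values should agree with $D_n^* = 0.1772$ and $n^{1/2} D_n^* = 0.886$ up to rounding of the tabulated $\Phi(y_i)$.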


The Kolmogorov-Smirnov Test for Two Samples 


Calcium Supplements and Blood Pressure. Exercise 10 in Sec. 9.6 contains data from 
a study of the effect of a calcium supplement on blood pressure. A group of m = 10 
men received a calcium supplement, and another group of n = 11 men received a 
placebo. At the end of the study, the differences were calculated between each man’s 
blood pressures at the start and at the end of a 12-week period. Suppose that we are 
not willing to assume that the distributions of the measured differences are normal 
distributions. Can we still construct a procedure for testing the null hypothesis that 
the distributions of differences in the treatment and placebo groups are the same 
versus the alternative hypothesis that the distributions are different? ◄


Consider a problem in which a random sample of $m$ observations $X_1, \ldots, X_m$ is
taken from a distribution for which the c.d.f. $F(x)$ is unknown, and an independent
random sample of $n$ observations $Y_1, \ldots, Y_n$ is taken from another distribution for
which the c.d.f. $G(x)$ is also unknown. We shall assume that both $F(x)$ and $G(x)$ are
continuous functions and that it is desired to test the hypothesis that these functions
are identical, without specifying their common form. Thus, the following hypotheses
are to be tested:

$$
\begin{aligned}
H_0 &: F(x) = G(x) \quad \text{for } -\infty < x < \infty, \\
H_1 &: \text{The hypothesis } H_0 \text{ is not true.}
\end{aligned}
\tag{10.6.6}
$$

We shall let $F_m(x)$ denote the sample c.d.f. calculated from the observed values
of $X_1, \ldots, X_m$, and let $G_n(x)$ denote the sample c.d.f. calculated from the observed
values of $Y_1, \ldots, Y_n$. Furthermore, we shall consider the statistic $D_{mn}$, which is
defined as follows:

$$D_{mn} = \sup_{-\infty < x < \infty} |F_m(x) - G_n(x)|. \tag{10.6.7}$$

The value of $D_{mn}$ is illustrated in Fig. 10.5 for a typical example in which $m = 5$ and
$n = 3$.

When the null hypothesis $H_0$ is true and $F(x)$ and $G(x)$ are identical functions,
the sample c.d.f.'s $F_m(x)$ and $G_n(x)$ will tend to be close to each other. In fact, when
$H_0$ is true, it follows from the Glivenko-Cantelli lemma that

$$D_{mn} \xrightarrow{p} 0, \quad \text{as both } m \to \infty \text{ and } n \to \infty. \tag{10.6.8}$$


It seems reasonable, therefore, to use a test procedure that specifies rejecting $H_0$
when $D_{mn}$ is large. The following theorem, whose proof is beyond the scope of this
text, gives us the asymptotic distribution of $D_{mn}$, which we can use to construct an
approximate test.


Figure 10.5 A representation of $F_m(x)$, $G_n(x)$, and $D_{mn}$ for $m = 5$ and $n = 3$.


Theorem
10.6.3


Definition
10.6.3


Example
10.6.4


Two-Sample Kolmogorov-Smirnov Statistic. For each value of $t > 0$, let $H(t)$ denote the
right side of Eq. (10.6.5). If the null hypothesis $H_0$ in (10.6.6) is true, then

$$\lim_{m \to \infty,\, n \to \infty} \Pr\left[ \left( \frac{mn}{m+n} \right)^{1/2} D_{mn} \le t \right] = H(t). \tag{10.6.9}$$


Values of the function H(t) are given in Table 10.32. The large-sample approxi- 
mate test of the hypotheses in (10.6.6) makes use of the statistic in (10.6.9). 


Two-Sample Kolmogorov-Smirnov Test. A test procedure that rejects $H_0$ when

$$\left( \frac{mn}{m+n} \right)^{1/2} D_{mn} > c, \tag{10.6.10}$$

where $c$ is an appropriate constant, is called a Kolmogorov-Smirnov two-sample test.


Hence, when the sample sizes $m$ and $n$ are large, the constant $c$ in the relation (10.6.10)
can be chosen from Table 10.32 to achieve, at least approximately, any specified level
of significance. For example, if $m$ and $n$ are large, and the test is to be carried out at
the level of significance 0.05, then it follows from Table 10.32 that we should choose
$c = H^{-1}(0.95) = 1.36$.


Calcium Supplements and Blood Pressure. Return to the situation described in
Example 10.6.3. We are interested in whether or not the changes in blood pressure for
men treated with a calcium supplement have the same distribution as the changes in
blood pressure for men treated with a placebo. Figure 10.6 displays the sample c.d.f.'s
of the measured changes in the treatment and placebo groups. It is not difficult to
see that the maximum difference occurs for $5 < x < 7$. In fact, $D_{mn} = 0.409$, and the
test statistic is $(110/21)^{1/2} \times 0.409 = 0.936$. From Table 10.32, we see that $H(0.936)$ is
about 0.654. So we would reject the null hypothesis that the two samples were drawn
from the same population at every level $\alpha_0 \ge 0.346$. ◄
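
The two-sample statistic is straightforward to compute directly from its definition. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the small data set with $m = 5$ and $n = 3$ is made up for the purpose (it is not the data behind Fig. 10.5 or Example 10.6.4). The last lines simply plug the reported value $D_{mn} = 0.409$ into the scaling factor.

```python
import math

def ks_two_sample(xs, ys):
    """Return D_mn = sup_x |F_m(x) - G_n(x)| for two samples."""
    m, n = len(xs), len(ys)
    d = 0.0
    # Both sample c.d.f.'s are step functions that change only at observed
    # values, so it suffices to compare them at every pooled observation.
    for t in sorted(set(xs) | set(ys)):
        f = sum(1 for x in xs if x <= t) / m
        g = sum(1 for y in ys if y <= t) / n
        d = max(d, abs(f - g))
    return d

def scaled_statistic(d, m, n):
    """The test statistic (mn / (m + n))^(1/2) * D_mn."""
    return math.sqrt(m * n / (m + n)) * d

# Made-up data with m = 5 and n = 3 (NOT taken from the text).
xs = [0.1, 0.4, 0.7, 1.2, 1.9]
ys = [0.9, 1.5, 2.4]
d_toy = ks_two_sample(xs, ys)

# Plugging in the values reported in Example 10.6.4.
stat_example = scaled_statistic(0.409, 10, 11)
print(d_toy, round(stat_example, 3))
```

Evaluating both sample c.d.f.'s at every pooled observation costs on the order of $(m + n)^2$ operations, which is adequate for samples of the sizes used in this section; a merge of the two sorted samples would be the natural refinement for larger data.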


Summary 


We introduced Kolmogorov-Smirnov tests for testing the null hypotheses that a random
sample arose from a particular distribution and that two independent random
samples arose from the same distribution. For the one-sample test, we compute $D_n^*$,
the largest difference between the sample c.d.f. and the null hypothesis c.d.f., and
we reject the null hypothesis at level $\alpha_0$ if $n^{1/2} D_n^* \ge H^{-1}(1 - \alpha_0)$, where $H$ is the
c.d.f. shown in Table 10.32. For the two-sample test, we compute $D_{mn}$, the largest
difference between the two sample c.d.f.'s from the two different samples. We then
reject the null hypothesis that the two samples arose from the same distribution at
level $\alpha_0$ if $(mn/(m+n))^{1/2} D_{mn} \ge H^{-1}(1 - \alpha_0)$.


[Figure 10.6: the sample c.d.f.'s of the measured changes in the treatment and placebo
groups (vertical axis: empirical d.f.; horizontal axis: differences in blood pressure).]


Exercises 


1. Suppose that the ordered values in a random sample
of five observations are $y_1 < y_2 < y_3 < y_4 < y_5$. Let $F_n(x)$
denote the sample c.d.f. constructed from these values,
let $F(x)$ be a continuous c.d.f., and let $D_n$ be defined by
Eq. (10.6.2). Prove that the minimum possible value of $D_n$
is 0.1, and prove that $D_n = 0.1$ if and only if $F(y_1) = 0.1$,
$F(y_2) = 0.3$, $F(y_3) = 0.5$, $F(y_4) = 0.7$, and $F(y_5) = 0.9$.


2. Consider again the conditions of Exercise 1. Prove that
$D_n \le 0.2$ if and only if $F(y_1) \le 0.2 \le F(y_2) \le 0.4 \le F(y_3) \le
0.6 \le F(y_4) \le 0.8 \le F(y_5)$.


3. Use the data in Example 10.1.6. In that example, we 
used a $\chi^2$ goodness-of-fit test to test the null hypothesis
that the logarithms of failure times for ball bearings had 
the normal distribution with mean 3.912 and variance 0.25. 
Now, use the Kolmogorov-Smirnov test to test that same 
null hypothesis. 


4. Use the Kolmogorov-Smirnov test to test the hypothesis
that the 25 values in Table 10.34 form a random sample
from the uniform distribution on the interval [0, 1].


Table 10.34 Data for Exercise 4

0.42 0.06 0.88 0.40 0.90
0.38 0.78 0.71 0.57 0.66
0.48 0.35 0.16 0.22 0.08
0.11 0.29 0.79 0.75 0.82
0.30 0.23 0.01 0.41 0.09


5. Use the Kolmogorov-Smirnov test to test the hypothesis
that the 25 values given in Exercise 4 form a random
sample from the continuous distribution for which the
p.d.f. $f(x)$ is as follows:

$$
f(x) =
\begin{cases}
3/2 & \text{for } 0 < x \le 1/2, \\
1/2 & \text{for } 1/2 < x < 1, \\
0 & \text{otherwise.}
\end{cases}
$$


6. Consider again the conditions of Exercises 4 and 5. Suppose
that the prior probability is 1/2 that the 25 values
given in Table 10.34 were obtained from the uniform
distribution on the interval [0, 1], and 1/2 that they were
obtained from the distribution for which the p.d.f. is as given
in Exercise 5. Find the posterior probability that they were
obtained from a uniform distribution.


7. Use the Kolmogorov-Smirnov test to test the hypothesis
that the 50 values in Table 10.35 form a random sample
from the normal distribution for which the mean is 26 and
the variance is 4.


Table 10.35 Data for Exercise 8

25.088 26.615 25.468 27.453 23.845
25.996 26.516 28.240 25.980 30.432
26.560 25.844 26.964 23.382 25.282
24.432 23.593 24.644 26.849 26.801
26.303 23.016 27.378 25.351 23.601
24.317 29.778 29.585 22.147 28.352
29.263 27.924 21.579 25.320 28.129
28.478 23.896 26.020 23.750 24.904
24.078 27.228 27.433 23.341 28.923
24.466 25.153 25.893 26.796 24.743


8. Use the Kolmogorov-Smirnov test to test the hypothesis
that the 50 values given in Table 10.35 form a random
sample from the normal distribution for which the mean
is 24 and the variance is 4.


9. Suppose that 25 observations are selected at random 
from a distribution for which the c.d.f. F(x) is unknown, 
and that the values given in Table 10.36 are obtained. Sup- 
pose also that 20 observations are selected at random from 
another distribution for which the c.d.f. G(x) is unknown, 
and the values given in Table 10.37 are obtained. Use the 
Kolmogorov-Smirnov test to test the hypothesis that F(x)
and G(x) are identical functions. 


Table 10.36 First sample for Exercise 9 


0.61 0.29 0.06 0.59 —1.73 
—0.74 0.51 —0.56 —0.39 1.64 
0.05 —0.06 0.64 —0.82 0.31 
1.77 1.09 —1.28 2.36 1.31 
1.05 —0.32 —0.40 1.06 —2.47 


Table 10.37 Second sample for Exercise 9 


2.20 1.66 1.38 0.20 
0.36 0.00 0.96 1.56 
0.44 1.50 —0.30 0.66 
2.31 3.29 —0.27 —0.37 
0.38 0.70 0.52 —0.71 


10. Consider again the conditions of Exercise 9. Let $X$
denote a random variable for which the c.d.f. is $F(x)$, and
let $Y$ denote a random variable for which the c.d.f. is $G(x)$.
Use the Kolmogorov-Smirnov test to test the hypothesis
that the random variables $X + 2$ and $Y$ have the same
distribution.


11. Consider again the conditions of Exercises 9 and 10. 
Use the Kolmogorov-Smirnov test to test the hypothesis 
that the random variables X and 3Y have the same distri- 
bution. 


12. In Example 9.6.3, we compared two samples of alu- 
minum oxide measurements taken from Roman-era pot- 
tery that was found in two different locations in Britain. 
The m = 14 measurements taken from the Llanederyn re- 
gion are 


10.1, 10.9, 11.1, 11.5, 11.6, 12.4, 12.5, 12.7, 
13.1, 13.4, 13.8, 13.8, 14.4, 14.6. 


The n = 5 measurements from Ashley Rails are 
14.8, 16.7, 17.7, 18.3, 19.1. 


Use the Kolmogorov-Smirnov two-sample test to test the 
null hypothesis that the two distributions from which these 
samples are drawn are the same. 


13. Suppose that $X_1, \ldots, X_n$ form a random sample with
unknown c.d.f. $F$. Prove the claim made after Eq. (10.6.4)
that the distribution of the statistic $D_n^*$, given that the null
hypothesis in (10.6.3) is true, is the same for all continuous
$F^*$. Hint: Let $Z_i = F^*(X_i)$ for $i = 1, \ldots, n$, and consider
testing the null hypothesis that $Z_1, \ldots, Z_n$ have the
uniform distribution on the interval [0, 1]. Show that the
statistic $D_n^*$ for this modified problem is identical to the
original $D_n^*$.


14. Perform the Kolmogorov-Smirnov test of the null hy- 
pothesis in Example 10.6.1. Report the result of the test 
by giving the p-value. The sample data appear in Exam- 
ple 10.1.6. 


* 10.7 Robust Estimation 


In many statistical problems, we might not feel comfortable assuming that the
distribution of our data $X$ is a member of a single parametric family. Suppose that
we consider using an estimator $T = r(X)$ of some parameter $\theta$. It might be that
$T$ has good properties if $X$ is a random sample from, say, a normal distribution.
On the other hand, we might be concerned about how $T$ would behave if $X$ were
actually a sample from a different distribution. In this section, we introduce a new
class of distributions and several new statistics. We then compare the behaviors
of these statistics (and some old ones) when the data arise from one of the new
distributions (and from some old ones). An estimator is called robust if it performs
well, compared to other estimators, regardless of the distribution that gives rise to
the data.


Example 
10.7.1 


Example 
10.7.2

Estimating the Median


Rain from Seeded Clouds. In Fig. 8.3, we presented the histogram of log-rainfalls from 
26 seeded clouds, which is slightly asymmetric. A scientist might be uncomfortable 
treating the log-rainfalls as normal random variables. Nevertheless, one may still wish 
to estimate the median or some other feature of the distribution of log-rainfalls. One 
might wish to use a method of estimation that does not rely for its justification on the 
assumption that the data form a random sample from a normal distribution. ◄


Suppose that the random variables $X_1, \ldots, X_n$ form a random sample from a
continuous distribution for which the p.d.f. $f(x)$ is unknown, but may be assumed
to be a symmetric function with respect to some unknown point $\theta$ ($-\infty < \theta < \infty$).
Because of this symmetry, the point $\theta$ will be a median of the unknown distribution.
We shall estimate the value of $\theta$ from the observations $X_1, \ldots, X_n$.

If we know that the observations actually come from a normal distribution, then
the sample mean $\overline{X}_n$ will be the M.L.E. of $\theta$. Without any strong prior information
indicating that the value of $\theta$ might be quite different from the observed value of $\overline{X}_n$,
we may assume that $\overline{X}_n$ will be a reasonable estimator of $\theta$. Suppose, however, that
the observations might come from a distribution for which the p.d.f. $f(x)$ has much
thicker tails than the p.d.f. of a normal distribution; that is, suppose that as $x \to \infty$ or
$x \to -\infty$, the p.d.f. $f(x)$ might come down to 0 much more slowly than does the p.d.f.
of a normal distribution. In this case, the sample mean $\overline{X}_n$ may be a poor estimator of
$\theta$ because its M.S.E. may be much larger than that of some other possible estimator.


Shifted Cauchy Sample. If the underlying distribution is the Cauchy distribution
centered at an unknown point $\theta$, as defined in Example 7.6.5, then the M.S.E. of $\overline{X}_n$ will
be infinite. In this case, the M.L.E. of $\theta$ will have a finite M.S.E. and will be a much
better estimator than $\overline{X}_n$. In fact, for a large value of $n$, the M.S.E. of the M.L.E.
is approximately $2/n$, no matter what the true value of $\theta$ is. However, as pointed
out in Example 7.6.5, this estimator is very complicated and must be determined by
a numerical calculation for each given set of observations. A relatively simple and
reasonable estimator for this problem is the sample median, which was defined in
Example 7.9.3. It can be shown that the M.S.E. of the sample median for a large value
of $n$ is approximately $2.47/n$ when the data have the Cauchy distribution. ◄
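
The constant 2.47 can be recovered from the large-sample variance $1/[4nf^2(\theta)]$ of the sample median, the same approximation that appears in Eq. (10.7.3). The short derivation below assumes that the Cauchy p.d.f. centered at $\theta$ has the standard form $f(x) = 1/\{\pi[1 + (x - \theta)^2]\}$:

```latex
\[
f(\theta) = \frac{1}{\pi}
\quad\Longrightarrow\quad
\frac{1}{4 n f^2(\theta)} = \frac{\pi^2}{4n} \approx \frac{2.47}{n}.
\]
```

Compared with the value $2/n$ quoted for the M.L.E., the sample median therefore has large-sample efficiency of about $2/2.47 \approx 0.81$ in this problem.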


It follows from Example 10.7.2 and the preceding discussion that if we could
assume that the underlying distribution is normal or nearly normal, then we might
use the sample mean as an estimator of $\theta$. On the other hand, if we believe that the
underlying distribution is Cauchy or nearly Cauchy, then we might use the sample
median. However, we typically do not know whether the underlying distribution is
nearly normal, is nearly Cauchy, or does not correspond closely to either of these
types of distributions. For this reason, we should try to find an estimator of $\theta$ that will
have a small M.S.E. for several different possible types of distributions. An estimator
that performs well for several different types of distributions, even though it may not
be the best available estimator for any particular type of distribution, is called a robust
estimator. In this section, we shall define a class of distributions called contaminated
normals that we shall use for assessing the performance of various estimators. We
shall also introduce special types of robust estimators known as trimmed means and
M-estimators. The term robust was introduced by G. E. P. Box in 1953, and the term
trimmed mean was introduced by J. W. Tukey in 1962. However, the first mathematical
treatment of trimmed means was given by P. Daniell in 1920. M-estimators were
introduced by Huber (1964).

Definition
10.7.1 


Figure 10.7 p.d.f.'s of
standard normal distribution
and $\epsilon = 0.05$ contaminated
normal with mean 0 and
variance 100.


Contaminated Normal Distributions 


One reason that experimenters might be hesitant to behave as if their data were 
sampled from a normal distribution is the possibility that random errors might occur 
in the data. Once in a while, a data value is recorded incorrectly or is collected under 
circumstances that are different from those under study. The one observation (or 
possibly a few) will have a distribution that might be much different from that of the 
majority of the observations. For example, suppose that the bulk of the data in which
we are interested comprise a sample from the normal distribution with unknown
mean $\mu$ and variance $\sigma^2$. But suppose that, for each observation, there is a small
probability $\epsilon$ that the observation actually comes from a different distribution with
p.d.f. $g$. That is, the p.d.f. of our observable data is actually

$$f(x) = (1 - \epsilon)(2\pi\sigma^2)^{-1/2} \exp\left( -\frac{1}{2\sigma^2}(x - \mu)^2 \right) + \epsilon g(x). \tag{10.7.1}$$


Contaminated Normal Distributions. A distribution whose p.d.f. has the form of 
Eq. (10.7.1) is called a contaminated normal, and the distribution with p.d.f. g is 
called the contaminating distribution. 


If the contaminating distribution in Eq. (10.7.1) has a high variance or has a
mean very different from $\mu$, there is a good chance that the observations we obtain
from the contaminating distribution will be far away from the other observations.
In order for an estimator to perform well for a large class of contaminated normal
distributions, the estimator will have to be somewhat insensitive to one (or a few)
observation(s) not close to the bulk of the data. Obviously, if $\epsilon > 1/2$, it becomes
difficult to tell which distribution is contaminating which. So we shall assume that
$\epsilon < 1/2$. A simple example of a contaminated normal distribution is one in which $g$
is the p.d.f. of a normal distribution with mean $\mu$ and variance $100\sigma^2$. In this case,
Eq. (10.7.1) becomes

$$f(x) = (1 - \epsilon)(2\pi\sigma^2)^{-1/2} \exp\left( -\frac{1}{2\sigma^2}(x - \mu)^2 \right)
+ \epsilon (200\pi\sigma^2)^{-1/2} \exp\left( -\frac{1}{200\sigma^2}(x - \mu)^2 \right). \tag{10.7.2}$$


Figure 10.7 shows a standard normal p.d.f. together with the p.d.f. of a contaminated
normal of the form of Eq. (10.7.2) with $\mu = 0$, $\sigma = 1$, and $\epsilon = 0.05$. The two
p.d.f.'s are quite similar, but we shall see shortly how much effect the contamination
can have on the problem of estimation.


Figure 10.8 Sample size times variances of sample median and sample mean
for a random sample from a contaminated normal distribution with the p.d.f. in
Eq. (10.7.2) with $\sigma = 1$ as a function of the amount of contamination $\epsilon$. The line for
the median uses the asymptotic result Eq. (10.7.3).


Two important properties of the distribution of an estimator of the median are its
mean and its variance. In the situation in which the data have the p.d.f. (10.7.2), both
the sample mean and the sample median have mean $\mu$. Next, we shall compare the
variances of these two estimators when the data are a random sample with the p.d.f.
(10.7.2). The variance of the average of a sample of size $n$ is $(1 + 99\epsilon)\sigma^2/n$. (You can
prove this in Exercise 7.) The variance of the sample median is a bit more difficult
to compute. However, using the large-sample properties that will be introduced on
page 676, we can see that the variance is approximately

$$\frac{1}{4 n f^2(\mu)} = \frac{\sigma^2}{n} \cdot \frac{50\pi}{(10 - 9\epsilon)^2}. \tag{10.7.3}$$


Figure 10.8 shows a comparison of $50\pi/(10 - 9\epsilon)^2$ and $(1 + 99\epsilon)$ for $0 < \epsilon < 0.5$.
Notice that the variance of the sample median is only slightly larger than the variance
of the sample mean for $\epsilon < 0.0058$, and it is substantially smaller for $\epsilon$ in the range
of 0.01 to 0.5. For example, if $\epsilon = 0.05$ (as in Fig. 10.7), the variance of the sample
median is only about 29 percent of the variance of the sample mean.
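
The comparison plotted in Fig. 10.8 is easy to tabulate. The sketch below is a minimal illustration with hypothetical function names: it evaluates $n$ times each variance for a few values of $\epsilon$, using the exact expression $(1 + 99\epsilon)\sigma^2$ for the sample mean and the large-sample approximation in Eq. (10.7.3) for the sample median.

```python
import math

def n_var_mean(eps, sigma=1.0):
    """n * Var(sample mean) under Eq. (10.7.2): (1 + 99 eps) sigma^2."""
    return (1.0 + 99.0 * eps) * sigma ** 2

def n_var_median(eps, sigma=1.0):
    """Large-sample n * Var(sample median) from Eq. (10.7.3)."""
    return sigma ** 2 * 50.0 * math.pi / (10.0 - 9.0 * eps) ** 2

for eps in (0.0, 0.01, 0.05, 0.5):
    ratio = n_var_median(eps) / n_var_mean(eps)
    print(f"eps={eps:4.2f}  mean={n_var_mean(eps):7.3f}  "
          f"median={n_var_median(eps):6.3f}  ratio={ratio:5.3f}")
```

With no contamination the median's value is $\pi/2$ against 1 for the mean, while at $\epsilon = 0.05$ the ratio is close to the 29 percent quoted above.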


Trimmed Means 


Suppose that $X_1, \ldots, X_n$ form a random sample from an unknown continuous
distribution for which the p.d.f. $f(x)$ is assumed to be symmetric with respect to an
unknown point $\theta$. For this discussion, we shall let $Y_1 < Y_2 < \cdots < Y_n$ denote the
order statistics of the sample. The sample mean $\overline{X}_n$ is simply the average of these $n$
order statistics. However, if we suspect that the p.d.f. $f(x)$ might have thicker tails
than a normal distribution has, then we may wish to estimate $\theta$ by using a weighted
average of the order statistics, which assigns less weight to the extreme observations
such as $Y_1$, $Y_2$, $Y_{n-1}$, and $Y_n$, and assigns more weight to the middle observations. The
sample median is a special example of a weighted average. When $n$ is odd, it assigns
zero weight to every observation except the middle one. When $n$ is even, it assigns
the weight 1/2 to each of the two middle observations and zero weight to all other
observations.

The following class of estimators also consists of weighted averages of the order
statistics.

Definition
10.7.2 


Definition 
10.7.3 


Definition 
10.7.4 


Trimmed Means. For each positive integer $k$ such that $k < n/2$, ignore the $k$ smallest
observations $Y_1, \ldots, Y_k$ and the $k$ largest observations $Y_n, Y_{n-1}, \ldots, Y_{n-k+1}$ in the
sample. The average of the remaining $n - 2k$ intermediate observations is called the
$k$th level trimmed mean.


Clearly, the $k$th level trimmed mean can be represented as a weighted average of the
order statistics having the form

$$\frac{1}{n - 2k} \sum_{i=k+1}^{n-k} Y_i. \tag{10.7.4}$$

The sample median is an example of a trimmed mean. When $n$ is odd, the sample
median is the $[(n - 1)/2]$th level trimmed mean. When $n$ is even, it is the $[(n - 2)/2]$th
level trimmed mean. In either case, the sample median is the $k$th level trimmed mean,
where $k = \lfloor (n - 1)/2 \rfloor$ is the largest integer less than or equal to $(n - 1)/2$.
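
A direct implementation of Eq. (10.7.4) is short. The sketch below is illustrative only: the function names and the small data set are made up, and the function also accepts $k = 0$ (the ordinary sample mean) for convenience, although the definition above takes $k$ to be a positive integer.

```python
def trimmed_mean(values, k):
    """kth level trimmed mean: drop the k smallest and k largest values."""
    n = len(values)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= k < n/2")
    ordered = sorted(values)            # order statistics Y_1 <= ... <= Y_n
    kept = ordered[k:n - k]             # Y_{k+1}, ..., Y_{n-k}
    return sum(kept) / (n - 2 * k)

def sample_median(values):
    """Sample median written as the floor((n - 1)/2)-th level trimmed mean."""
    return trimmed_mean(values, (len(values) - 1) // 2)

data = [2.1, 3.4, 1.9, 2.8, 40.0, 2.5]   # made-up sample with one outlier
print(trimmed_mean(data, 0))   # ordinary sample mean
print(trimmed_mean(data, 1))   # drops 1.9 and 40.0
print(sample_median(data))     # average of the two middle values
```

Even the first level of trimming removes the single outlying value here, which is why the trimmed mean and the median are close to each other while the untrimmed mean is not.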


Robust Estimation of Scale 


In addition to the median of a distribution, there are other parameters that might 
be worth estimating even when we are not willing to model our data as arising from 
a particular parametric family. For example, scale parameters might be valuable for 
giving an idea of how spread out a distribution is. The standard deviation, if it exists, 
is one such measure. The general class of scale parameters is defined here. 


Scale Parameters. An arbitrary parameter $\sigma$ is a scale parameter for the distribution
of $X$ if, for all $a > 0$ and all real $b$, the corresponding parameter for the distribution
of $aX + b$ is $a\sigma$.


Although the standard deviation is a scale parameter, there are many distributions 
(such as the Cauchy) for which the standard deviation does not exist. There are 
alternative measures of spread to the standard deviation that exist and are finite 
for all distributions. 

One scale parameter that exists for every distribution is the interquartile range
(IQR) as defined in Definition 4.3.2 on page 233. For example, if $F$ is the normal
distribution with mean $\mu$ and variance $\sigma^2$, then the IQR is $2\Phi^{-1}(0.75)\sigma = 1.349\sigma$ (see
Exercise 15). The IQR of the Cauchy distribution is 2 (see Example 4.3.9). It is not
difficult to show (see Exercise 11) that if the IQR of $X$ is $\sigma$ and if $a > 0$, then $aX + b$
has IQR equal to $a\sigma$. An estimator of the IQR is the sample IQR, the difference
between the 0.75 and 0.25 sample quantiles. (Sample quantiles are just quantiles of
the sample c.d.f.)

Another scale parameter that exists for every random variable $X$ is the median
absolute deviation.


Median Absolute Deviation. The median absolute deviation of a random variable X is 
the median of the distribution of |X — m|, where m is the median of X. 


If the distribution of $X$ is symmetric around its median, then the median absolute
deviation is one-half of the IQR. For asymmetric distributions, the median absolute
deviation is the half-length of the symmetric interval around the median that contains
50 percent of the distribution, while the IQR is the length of the interval around
the median that contains half of the distribution below the median and half of the
distribution above the median. For example, if $X$ has the $\chi^2$ distribution with five
degrees of freedom, the IQR is 3.95, while the median absolute deviation is 1.895, a
little less than one-half of the IQR. An estimator of the median absolute deviation is
the sample median absolute deviation. The sample median absolute deviation is the
sample median of the values $|X_i - M_n|$, where $M_n$ is the sample median of $X_1, \ldots, X_n$.


Definition
10.7.5

Two other scale parameters that are useful are the IQR divided by 1.349 and the
median absolute deviation divided by 0.6745. These parameters were chosen to have
the property that if the data come from a normal distribution, then these parameters equal
the standard deviation (see Exercise 15). Typical estimators of these parameters are
the sample IQR divided by 1.349 and the sample median absolute deviation divided
by 0.6745.
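
The two scaling constants and the rescaled sample median absolute deviation can be sketched as follows. This is an illustration rather than a prescribed method: the function name `sample_mad` and the data are made up, and `NormalDist` from Python's standard `statistics` module is used only to recover $\Phi^{-1}(0.75)$.

```python
from statistics import NormalDist, median

def sample_mad(values):
    """Sample median absolute deviation: median of |X_i - M_n|."""
    m = median(values)
    return median(abs(x - m) for x in values)

# Constants that rescale the IQR and the median absolute deviation so that
# they equal the standard deviation under a normal distribution.
z75 = NormalDist().inv_cdf(0.75)   # Phi^{-1}(0.75), about 0.6745
iqr_factor = 2.0 * z75             # about 1.349

data = [2.1, 3.4, 1.9, 2.8, 40.0, 2.5]   # made-up sample with one outlier
robust_sd = sample_mad(data) / 0.6745
print(round(z75, 4), round(iqr_factor, 3), round(robust_sd, 3))
```

The single outlying value 40.0 has almost no effect on `robust_sd`, whereas it would dominate the usual sample standard deviation.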


M-Estimators of the Median 


The sample mean is heavily influenced by one extreme observation. For example, if
one observation $x$ in a sample of size $n$ is replaced by $x + \Delta$, the sample mean changes
by $\Delta/n$. If $\Delta$ is large, this will be a big change. The sample median, on the other hand,
is influenced very little, or not at all, by a change in one observation. However, the
sample median is inefficient in that it makes use of very few of the observed values.
Trimmed means are one attempt to compromise between the sample median and the
sample mean by forming estimators that make use of more than just the one or two
observations in the middle of the sample while maintaining insensitivity to extreme
observations. There are other estimators that also attempt to effect this same type of
compromise. These other estimators are M.L.E.'s of $\theta$ under different assumptions
about the p.d.f. of the observations.

The sample mean is the M.L.E. of $\theta$ if we assume that $X_1, \ldots, X_n$ form a random
sample from a normal distribution with mean (and median) $\theta$ and arbitrary variance.
The sample median is also an M.L.E. It is the M.L.E. of $\theta$ if we assume that $X_1, \ldots, X_n$
form a random sample from one of the following distributions.


Laplace Distributions. Let $\sigma > 0$ and $\theta$ be real numbers. The distribution whose p.d.f. is

$$f(x|\theta, \sigma) = \frac{1}{2\sigma} e^{-|x - \theta|/\sigma} \tag{10.7.5}$$

is called the Laplace distribution with parameters $\theta$ and $\sigma$.


See Exercise 9 to prove that the M.L.E. of $\theta$ is indeed the sample median when the
sample comes from a Laplace distribution.

In order to see why the M.L.E.'s for the Laplace and normal distributions are
so different, we can examine the two equations that the M.L.E.'s solve for those two
cases. These equations say that the derivatives with respect to $\theta$ of the logarithms of
the respective likelihoods must equal 0. In both cases, the derivative of the logarithm
of the likelihood is the sum of $n$ terms, one for each observation. For the normal case,
the term corresponding to an observation $x_i$ is $(x_i - \theta)/\sigma^2$. For the Laplace case, the
term corresponding to an observation $x_i$ equals $1/\sigma$ if $\theta < x_i$, and it equals $-1/\sigma$ if
$\theta > x_i$. The derivative does not exist at $\theta = x_i$. We illustrate these two derivatives in
Fig. 10.9 for the cloud-seeding data introduced in Example 8.3.2. A change of size
$\Delta$ in a single observation will vertically shift the entire normal distribution line in
Fig. 10.9 by $\Delta/[n\sigma^2]$. The same-sized change in the same observation will only affect
the Laplace graph in Fig. 10.9 in the vicinity of the changed observation. The actual
values of the most extreme observations do not affect where the graph crosses 0.

It would be nice to have a compromise between these two types of behavior
without arbitrarily discarding a fixed amount of data. We would like the derivative
of the logarithm of the likelihood to be approximately proportional to $\sum (x_i - \theta)$
for $\theta$ near the middle of the data, where the summation is only over the middle
observations. This will allow the estimator to make use of more data than just the very
middle observation. Also, we would like the derivative to flatten out like the Laplace
case for $\theta$ near the extremes so that the actual values of the extreme observations do
not affect the estimate. A p.d.f. with these properties is the following:


Figure 10.9 Derivatives of the logarithms of the Laplace and normal likelihoods
(with $\sigma = 1$) using the cloud-seeding data.


$$g_k(x|\theta, \sigma) = c_k e^{h_k([x - \theta]/\sigma)}, \tag{10.7.6}$$

where $\sigma$ is a scale parameter,

$$
h_k(y) =
\begin{cases}
-0.5 y^2 & \text{if } -k < y < k, \\
0.5 k^2 - k|y| & \text{otherwise,}
\end{cases}
$$

and $c_k$ is a constant that makes the integral of $g_k$ equal to 1. The number $k$ must
be chosen somehow, usually to reflect some idea of how far from $\theta$ we think that
extreme observations are likely to be. The derivative of the logarithm of $g_k(x|\theta, \sigma)$
with respect to $\theta$ is linear in $\theta$ for $|\theta - x| < k\sigma$, but it flattens out like the derivative of
the logarithm of the Laplace p.d.f. does when $|\theta - x| > k\sigma$. Now, we see that $k$ can be
chosen to reflect how many multiples of $\sigma$ a data value can be away from $\theta$ before we
think that it starts to lose importance for estimating $\theta$. Typical choices are $1 < k < 2.5$.
If we suppose that $X_1, \ldots, X_n$ form a random sample from a distribution with p.d.f.
$g_k(x|\theta, \sigma)$, the M.L.E. of $\theta$ will be a compromise between the sample median and the
sample mean.


M-Estimators. The M.L.E. of $\theta$ under the assumption that the data have p.d.f. $g_k$ in
Eq. (10.7.6) is called an M-estimator.


M-estimators were proposed as robust estimators by Huber (1977). The name derives 
from the fact that they are found by maximizing a function that might not be the 
likelihood. 

Example
10.7.3


The M-estimator found by maximizing $\prod_{i=1}^n g_k(x_i|\theta, \sigma)$ cannot be obtained in
closed form, but there is a simple iterative algorithm for finding it if we can first
estimate $\sigma$. Typically, one replaces $\sigma$ by $\hat\sigma$ equal to one of the robust scale estimates
described earlier in this section. One popular choice is the sample median absolute
deviation divided by 0.6745. Treating $\prod_{i=1}^n g_k(x_i|\theta, \hat\sigma)$ as a function of $\theta$, we can take
the derivative of the logarithm and set it equal to 0 to try to find the maximum. The
derivative of the logarithm is $\sum_{i=1}^n \psi_k([x_i - \theta]/\hat\sigma)/\hat\sigma$, where

$$
\psi_k(y) =
\begin{cases}
-k & \text{if } y < -k, \\
y & \text{if } -k \le y \le k, \\
k & \text{if } y > k.
\end{cases}
$$


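Typically, one solves $\sum_{i=1}^n \psi_k([x_i - \theta]/\hat\sigma) = 0$ as follows: Rewrite the equation as
$\sum_{i=1}^n w_i(\theta)(x_i - \theta) = 0$, where $w_i(\theta)$ is defined as

$$
w_i(\theta) =
\begin{cases}
\dfrac{\psi_k([x_i - \theta]/\hat\sigma)}{x_i - \theta} & \text{if } x_i \ne \theta, \\
1 & \text{if } x_i = \theta.
\end{cases}
$$

Then $\theta = \sum_{i=1}^n w_i(\theta) x_i / \sum_{i=1}^n w_i(\theta)$ solves the equation. Clearly, we need to know $\theta$
before we can compute $w_i(\theta)$, but we can solve the equation iteratively using these
steps:


1. Pick a starting value $\theta_0$ such as the sample median and set $j = 0$.
2. Let

$$\theta_{j+1} = \frac{\sum_{i=1}^n w_i(\theta_j) x_i}{\sum_{i=1}^n w_i(\theta_j)}.$$

3. Increment $j$ to $j + 1$, and return to step 2.


This procedure will typically converge in a small number of iterations to the
M-estimate $\hat\theta$.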

The iterative procedure actually makes it clear why @ is robust and why it is 
a compromise between the sample mean and the sample median. Note that 6 is a 
weighted average of the values x1, ... , x,. The weight on x; is proportional to w;(@). 
If |x; — 6| <k6, then w;(6) = 1/6. If |x; — 6| > k6, then w;(6) =k/|x; — 6|, which 
decreases as x; becomes more extreme. If 6 is near the middle of the distribution (as 
we would hope it would be), then the observations near the middle of the distribution 
get more weight in the estimate, and those far away get less weight. 
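
The reweighting iteration can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `mad_scale` and `m_estimate`, the stopping tolerance, and the iteration cap are introduced here for illustration. The weights follow the description just given (1/sigma-hat within k sigma-hat of the current value, k/|x - theta| beyond it), so an observation exactly equal to the current value also receives weight 1/sigma-hat. The data are the 15 differences listed in Exercise 16.

```python
import statistics


def mad_scale(xs):
    """Sample median absolute deviation divided by 0.6745."""
    med = statistics.median(xs)
    mad = statistics.median([abs(x - med) for x in xs])
    return mad / 0.6745


def m_estimate(xs, k=1.5, tol=1e-10, max_iter=1000):
    """Iteratively reweighted average for the M-estimator (tol, max_iter are assumptions)."""
    sigma = mad_scale(xs)
    if sigma == 0:
        raise ValueError("scale estimate is zero; weights are undefined")
    theta = statistics.median(xs)  # step 1: start at the sample median
    for _ in range(max_iter):
        # Weight 1/sigma within k*sigma of theta, k/|x - theta| beyond it.
        weights = [
            1.0 / sigma if abs(x - theta) <= k * sigma else k / abs(x - theta)
            for x in xs
        ]
        # Step 2: the next iterate is the weighted average of the data.
        new_theta = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta  # step 3: repeat with the updated value
    return theta


darwin = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]
theta_hat = m_estimate(darwin)
print(round(theta_hat, 3))
```

Stopping when successive iterates differ by less than a small tolerance mirrors the "repeat until we get no change" rule; a looser tolerance would usually be enough in practice.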


Note: M-Estimators and Symmetric Distributions. At the start of this section, we
assumed that the unknown p.d.f. $f$ of the data was symmetric about an unknown value
$\theta$, which must be the median of the distribution. The M-estimator described above
can be calculated even if we do not assume that the data come from a symmetric
distribution. However, the M-estimator will not necessarily estimate the median of the
distribution if the distribution is not symmetric. Instead, the M-estimator estimates
the number $\gamma$ such that

$$E\left[\psi_k\left(\frac{X_i - \gamma}{\sigma}\right)\right] = 0. \tag{10.7.7}$$

If the distribution of $X_i$ is symmetric around $\theta$, then $\gamma = \theta$ will solve Eq. (10.7.7). If
the distribution of $X_i$ is not symmetric, then some number other than the median
might solve Eq. (10.7.7).


Example 10.7.3 Rain from Seeded Clouds. Using the seeded cloud data again, we shall find the value
of the M-estimator with $k = 1.5$. We start with the sample median of the log-rainfalls,
$\theta_0 = 5.396$. We also use $\hat{\sigma}$ equal to the median absolute deviation 0.7318 divided by
0.6745, that is, $\hat{\sigma} = 1.085$. The six smallest and three largest observations are not
within $1.5\hat{\sigma}$ of the sample median. These nine observations each get less weight
than the other 17 observations in the calculation of the next iteration. For example,
the smallest observation is 1.411, which gets weight $1.5/|1.411 - 5.396| = 0.3764$,
compared to weight 0.9217 for the 17 central observations. The weighted average
of the observations is then $\theta_1 = 5.315$. We repeat the weighting and averaging until
we get no change. After 10 more iterations, we get $\theta_{11} = 5.283$, which agrees with $\theta_{10}$. ◄


Note: Simultaneous M-Estimators Exist for the Median and Scale Parameters. It is 
possible to estimate the median and a scale parameter simultaneously using a method 
very similar to that described for M-estimators. That is, instead of just picking a value 
for $\hat{\sigma}$ in the M-estimator algorithm, we can construct a more complicated algorithm
that estimates both the median and a scale parameter. Readers interested in more 
examples of robust procedures can read Huber (1981) and Hampel et al. (1986). 


Comparison of the Estimators 


We have mentioned the desirability of using a robust estimator in a situation in which
it is suspected that the observations $X_1, \ldots, X_n$ may form a random sample from a
distribution for which the tails of the p.d.f. are thicker than the tails of the p.d.f. of
a normal distribution. The use of a robust estimator is also desirable when a few of
the observations in the sample appear to be unusually large or unusually small. In
this situation, a statistician might suspect that most of the observations in the sample
came from one normal distribution, whereas the few extreme observations may have
come from a different normal distribution with a much larger variance than the first
one. (This is the contaminated normal case.) The extreme observations, which are
called outliers, will substantially affect the value of $\bar{X}_n$ and make it an unreliable
estimator of $\theta$. Since the values of these outliers would be given less weight in a
robust estimator, the robust estimator will usually be a more reliable estimator than
$\bar{X}_n$.

It is acknowledged that a robust estimator will perform better than $\bar{X}_n$ in a
situation of the type just described. However, if $X_1, \ldots, X_n$ actually do form a
random sample from a normal distribution, then $\bar{X}_n$ will perform better than a robust
estimator. Since we are typically not certain which situation obtains in a particular
problem, it is important to know how much larger the M.S.E. of a robust estimator
will be than the M.S.E. of $\bar{X}_n$ when the actual distribution is normal. In other words,
it is important to know how much is lost if we use a robust estimator when the actual
distribution is normal. We shall now consider this question.

When $X_1, \ldots, X_n$ form a random sample from the normal distribution with mean
$\theta$ and variance $\sigma^2$, the probability distribution of $\bar{X}_n$ and the probability distribution
of each of the robust estimators described in this chapter will be symmetric with
respect to the value $\theta$. Therefore, the mean of each of these estimators will be $\theta$,
the M.S.E. of each estimator will be equal to its variance, and this M.S.E. will have
a certain constant value for each estimator regardless of the true value of $\theta$. The
values of several of these M.S.E.'s for a normal distribution when the sample size $n$
is 10 or 20 are presented in Table 10.38. The values in Table 10.38 are from Andrews
et al. (1972). They were computed using simulation methods that will be introduced
in Chapter 12. It should be noted that when $n = 10$, the trimmed mean for $k = 4$ and
the sample median are the same estimator.
Table 10.38 Comparison of M.S.E.'s for sample mean and several robust estimators.
The data have a normal distribution with variance $\sigma^2$. The M.S.E. is the tabulated
value times $\sigma^2/n$. The M-estimator uses $k = 1.5$ and $\hat{\sigma}$ equal to the sample median
absolute deviation divided by 0.6745.

Estimator                    n = 10    n = 20
Sample mean $\bar{X}_n$        1.00      1.00
Trimmed mean for k = 1       1.05      1.02
Trimmed mean for k = 2       1.12      1.06
Trimmed mean for k = 3       1.21      1.10
Trimmed mean for k = 4       1.37      1.14
Sample median                1.37      1.50
M-estimator                  1.05      1.05


Table 10.39 Comparison of M.S.E.'s for sample mean and several robust estimators.
The data have a Cauchy distribution. The M.S.E. is the tabulated value divided by $n$.
The M-estimator uses $k = 1.5$ and $\hat{\sigma}$ equal to the sample median absolute deviation
divided by 0.6745.

Estimator                    n = 10    n = 20
Sample mean $\bar{X}_n$        ∞         ∞
Trimmed mean for k = 1       27.22     23.98
Trimmed mean for k = 2       8.57      7.32
Trimmed mean for k = 3       3.86      4.57
Trimmed mean for k = 4       3.66      3.58
Sample median                3.66      2.88
M-estimator                  6.00      4.50


It can be seen from Table 10.38 that when the underlying distribution is actually
a normal distribution, the M.S.E.'s of the M-estimator and the trimmed means are
not much larger than the M.S.E. of $\bar{X}_n$. In fact, when $n = 20$, the M.S.E. of the
second-level trimmed mean ($k = 2$), in which four of the 20 observed values in the sample
are omitted, is only 1.06 times as large as the M.S.E. of $\bar{X}_n$. Even the M.S.E. of the
sample median is only 1.5 times that of $\bar{X}_n$. These values illustrate the price of using
a robust estimator when one is not needed.
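
Entries of the kind shown in Table 10.38 can be approximated by simulation. The following Python sketch is only an illustration of the idea: the helper names `trimmed_mean` and `scaled_mse`, the seed, and the number of replications are assumptions made here, and the output is a Monte Carlo estimate that will differ slightly from the tabulated values.

```python
import random
import statistics


def trimmed_mean(xs, k):
    """Average after dropping the k smallest and k largest observations."""
    ordered = sorted(xs)
    return statistics.fmean(ordered[k:len(ordered) - k])


def scaled_mse(estimator, n, reps, rng):
    """Monte Carlo estimate of n * M.S.E. for standard normal data (theta = 0)."""
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += estimator(sample) ** 2
    return n * total / reps


rng = random.Random(2024)  # fixed seed so the run is reproducible
estimators = {
    "sample mean": statistics.fmean,
    "trimmed mean, k = 2": lambda xs: trimmed_mean(xs, 2),
    "sample median": statistics.median,
}
results = {name: scaled_mse(est, 20, 20000, rng) for name, est in estimators.items()}
for name, value in results.items():
    print(f"{name}: {value:.2f}")
```

Because the data are simulated with variance 1, multiplying the estimated M.S.E. by n puts the results on the same scale as the table.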

We shall now consider the improvement in the M.S.E. that can be achieved by
using a robust estimator when the underlying distribution is not normal. If $X_1, \ldots, X_n$
form a random sample of size $n$ from a Cauchy distribution, then the M.S.E. of $\bar{X}_n$ is
infinite. The M.S.E.'s of robust estimators for a Cauchy distribution when the sample
size $n$ is 10 or 20 are given in Table 10.39. The values in Table 10.39 are from Andrews
et al. (1972).

Finally, the M.S.E.'s for two contaminated normal distributions are illustrated
in Table 10.40. The two distributions have p.d.f.'s as in Eq. (10.7.2) with $\epsilon = 0.05$ and
$\epsilon = 0.1$. The values in Table 10.40 were computed using simulation methods described
in Chapter 12.


Table 10.40 Comparison of M.S.E.'s for sample mean and several robust estimators.
The data consist of $n = 20$ observations from a contaminated normal distribution with
p.d.f. (10.7.2) using $\epsilon = 0.05$ and $\epsilon = 0.10$. The M.S.E. is the tabulated value divided
by $n$. The M-estimator uses $k = 1.5$ and $\hat{\sigma}$ equal to the sample median absolute
deviation divided by 0.6745.

Estimator                    ε = 0.05    ε = 0.1
Sample mean $\bar{X}_n$        5.95        10.90
Trimmed mean for k = 1       1.87        3.92
Trimmed mean for k = 2       1.32        2.01
Trimmed mean for k = 3       1.27        1.57
Trimmed mean for k = 4       1.29        1.50
Sample median                1.62        1.81
M-estimator                  1.27        1.58

It can be seen from Tables 10.39 and 10.40 that the M.S.E. of a robust estimator
can be substantially smaller than that of $\bar{X}_n$. When a trimmed mean or an M-estimator
is to be used as an estimator of $\theta$, it is evident that a specific value of $k$ must be chosen.
No general rule for choosing $k$ will be best under all conditions. If there is reason to
believe that the p.d.f. $f(x)$ is approximately normal, then $\theta$ might be estimated by
using a trimmed mean, which is obtained by omitting about 10 or 15 percent of the
observed values at each end of the ordered sample. Alternatively, an M-estimator
with $k = 2$ or 2.5 could be used. If the p.d.f. $f(x)$ might be far from normal or if
several of the observations might be outliers, then the sample median might be used
to estimate $\theta$, or one could use an M-estimator with $k = 1$ or 1.5.

We could also compare various scale estimators in a similar fashion. Such a
comparison is complicated by the fact that there are several choices of scale parameter to
estimate, such as standard deviation, IQR, and median absolute deviation. We shall
not present such a comparison here.


Large-Sample Properties of Sample Quantiles 


Earlier in this section, we made use of the sample median as well as the sample
0.25 and 0.75 quantiles to estimate the median and scale features of a distribution.
The distributions of these, and other, sample quantiles are difficult to derive exactly.
Approximations are available to the distributions of sample quantiles if the sample
sizes are large. It can be shown that if $X_1, \ldots, X_n$ form a large random sample from
a continuous distribution for which the p.d.f. is $f(x)$ and for which there is a unique
$p$ quantile $\theta_p$, then the distribution of the sample $p$ quantile will be approximately a
normal distribution. Specifically, it must be assumed that $f(\theta_p) > 0$.


Theorem 10.7.1 Asymptotic Distribution of Sample Quantile. Under the conditions above, let
$\hat{\theta}_{p,n}$ denote the sample $p$ quantile. Then, as $n \to \infty$, the c.d.f. of $n^{1/2}(\hat{\theta}_{p,n} - \theta_p)$ will
converge to the c.d.f. of the normal distribution with mean 0 and variance
$p(1 - p)/f^2(\theta_p)$.


In other words, when $n$ is large, the distribution of the sample $p$ quantile $\hat{\theta}_{p,n}$ will be
approximately the normal distribution with mean $\theta_p$ and variance $p(1 - p)/[n f^2(\theta_p)]$.

Also, suppose that $\hat{\theta}_{q,n}$ denotes the sample $q$ quantile for some $q > p$, and
suppose that $\theta_q$ is the unique $q$ quantile of the distribution of the data. Then the joint
distribution of $(\hat{\theta}_{p,n}, \hat{\theta}_{q,n})$ is approximately the bivariate normal distribution with
means $\theta_p$ and $\theta_q$, variances $p(1 - p)/[n f^2(\theta_p)]$ and $q(1 - q)/[n f^2(\theta_q)]$, and covariance
$p(1 - q)/[n f(\theta_p) f(\theta_q)]$. See Schervish (1995, section 7.2) for a rigorous derivation of
these results.
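
As a small numerical illustration of these formulas, the Python sketch below evaluates the approximate variance and covariance for the standard normal distribution. The function names are hypothetical and chosen here for illustration; `statistics.NormalDist` supplies the quantiles and density.

```python
import math
from statistics import NormalDist


def quantile_asymptotic_variance(p, density_at_quantile, n):
    """Approximate variance p(1 - p) / [n f(theta_p)^2] of the sample p quantile."""
    if density_at_quantile <= 0:
        raise ValueError("the theorem requires f(theta_p) > 0")
    return p * (1 - p) / (n * density_at_quantile ** 2)


def quantile_asymptotic_covariance(p, q, f_p, f_q, n):
    """Approximate covariance p(1 - q) / [n f(theta_p) f(theta_q)] for p < q."""
    return p * (1 - q) / (n * f_p * f_q)


std_normal = NormalDist()
n = 100

# Sample median of standard normal data: theta_0.5 = 0.
theta_med = std_normal.inv_cdf(0.5)
var_median = quantile_asymptotic_variance(0.5, std_normal.pdf(theta_med), n)

# Sample 0.25 and 0.75 quantiles, the ingredients of the sample IQR.
f_lo = std_normal.pdf(std_normal.inv_cdf(0.25))
f_hi = std_normal.pdf(std_normal.inv_cdf(0.75))
var_lo = quantile_asymptotic_variance(0.25, f_lo, n)
cov_lo_hi = quantile_asymptotic_covariance(0.25, 0.75, f_lo, f_hi, n)
print(var_median, var_lo, cov_lo_hi)
```

For the standard normal distribution the approximate variance of the sample median works out to $\pi/(2n)$, roughly 1.57 times the variance $1/n$ of the sample mean.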


Summary 


We have introduced a number of estimators of the median and scale parameters that 
are more robust than the sample average and sample standard deviation. To say that 
the new estimators are more robust, we mean that they perform well compared to 
the old estimators, in terms of M.S.E., regardless of which distribution (in some large 
class) gives rise to the data. The robust estimators of the median include trimmed 
means, the sample median, and M-estimators obtained by maximizing a function 
that is similar to a likelihood function. Robust estimators of scale include the sample 
interquartile range (IQR), the sample median absolute deviation, and multiples of
these that are designed to estimate the standard deviation when the data come from 
a normal distribution. 


Exercises 


1. Suppose that a sample comprises the 15 observed values
in Table 10.41. Calculate the values of (a) the sample
mean, (b) the trimmed means for k = 1, 2, 3, and 4, (c) the
sample median, and (d) the M-estimator with k = 1.5 and
$\hat{\sigma}$ equal to the sample median absolute deviation divided
by 0.6745.


Table 10.41 Data for Exercise 1 


23.0 21.5 63.0
22.5 2.1 22.1 
22.4 2.2 21.7 
21.7 22.2 22.9 
21.3 21.8 22.1 


2. Suppose that a sample comprises the 14 observed values
in Table 10.42. Calculate the values of (a) the sample
mean, (b) the trimmed means for k = 1, 2, 3, and 4, (c) the
sample median, and (d) the M-estimator with k = 1.5 and
$\hat{\sigma}$ equal to the sample median absolute deviation divided
by 0.6745.


Table 10.42 Data for Exercise 2 


1.24 0.36 0.23 

0.24 1.78 —2.00 
—0.11 0.69 0.24 

0.10 0.03 0.00 
—2.40 0.12 


3. Suppose that a random sample of n = 100 observations
is taken from the normal distribution with unknown
mean $\theta$ and known variance 1, and let $\hat{\theta}_{0.5,n}$ denote the
sample median. Determine (approximately) the value of
$\Pr(|\hat{\theta}_{0.5,n} - \theta| \le 0.1)$.


4. Suppose that a random sample of n = 100 observations
is taken from the Cauchy distribution centered at an
unknown point $\theta$, and let $\hat{\theta}_{0.5,n}$ denote the sample median.
Determine (approximately) the value of
$\Pr(|\hat{\theta}_{0.5,n} - \theta| \le 0.1)$.


5. Let $f(x)$ denote the p.d.f. of the contaminated normal
distribution given in Eq. (10.7.1) with $\epsilon = 1/2$, $\sigma^2 = 1$, and
$g$ being the p.d.f. of a normal distribution with mean $\mu$ and
variance 4. Suppose that 100 observations are selected at
random from a distribution for which the p.d.f. is $f(x)$.
Determine the M.S.E. of the sample mean and (approximately)
the M.S.E. of the sample median.


6. Use the data in Table 10.6 on page 640. We want an
estimate of the median of the logarithms of sulfur dioxide.
Find (a) the sample mean, (b) the trimmed means for
k = 1, 2, 3, and 4, (c) the sample median, and (d) the
M-estimator with k = 1.5 and $\hat{\sigma}$ equal to the sample median
absolute deviation divided by 0.6745.


7. Suppose that $X_1, \ldots, X_n$ are i.i.d. with a distribution
that has the p.d.f. in Eq. (10.7.2). Let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$.


a. Prove that $E(\bar{X}_n) = \mu$.
b. Prove that $\operatorname{Var}(\bar{X}_n) = (1 + 99\epsilon)\sigma^2/n$.


8. If Fig. 10.8 were extended all the way to $\epsilon = 1$, the
variance of the sample median would rise above the variance
of the sample mean. Indeed, the ratio of the two variances
would be the same at $\epsilon = 1$ as it is at $\epsilon = 0$. Explain why
this should be true.


9. Assume that $X_1, \ldots, X_n$ form a random sample from
the distribution with p.d.f. given in Eq. (10.7.5). Prove that
the M.L.E. of $\theta$ is the sample median. (Hint: Let $X$ have
c.d.f. equal to the sample c.d.f. of $X_1, \ldots, X_n$. Then apply
Theorem 4.5.3.)


10. Let $X_1, \ldots, X_n$ be i.i.d. with the p.d.f. in Eq. (10.7.5).
Assume that $\sigma$ is known. Let $\theta$ be between two of the
observed values $x_1, \ldots, x_n$. Prove that the derivative of
the logarithm of the likelihood at $\theta$ equals $1/\sigma$ times the
difference between the number of observations greater
than $\theta$ and the number of observations less than $\theta$.


11. Let $X$ be a random variable with a continuous distribution
such that the interquartile range (IQR) is $\sigma$. Prove
that the IQR of $aX + b$ is $a\sigma$ for all $a > 0$ and all $b$.


12. Let $X$ be a random variable with a continuous distribution
such that the median absolute deviation is $\sigma$. Prove
that the median absolute deviation of $aX + b$ is $a\sigma$ for all
$a > 0$ and all $b$.


13. Find the median absolute deviation of the Cauchy 
distribution. 


14. Let $X$ have the exponential distribution with parameter
$\lambda$. Prove that the median absolute deviation of $X$ is
smaller than one-half of the IQR. (You can do this without
actually calculating the median absolute deviation.)


15. Let $X$ have a normal distribution with standard deviation
$\sigma$.


a. Prove that the IQR is $2\Phi^{-1}(0.75)\sigma$.


b. Prove that the median absolute deviation is
$\Phi^{-1}(0.75)\sigma$.


16. Darwin (1876, p. 16) reported the results of an ex- 
periment in which he grew 15 pairs of Zea mays (corn) 
plants. Each pair consisted of a self-fertilized and a cross- 
fertilized plant that were grown in the same pot. The num- 
bers below are the differences between heights (in eighths 
of an inch) of the two plants in each pair (cross-fertilized 
minus self-fertilized). 


49, —67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, —48 


Find (a) the sample mean, (b) the trimmed means for
k = 1, 2, 3, and 4, (c) the sample median, and (d) the
M-estimator with k = 1.5 and $\hat{\sigma}$ equal to the sample median
absolute deviation divided by 0.6745.


17. Let $X_1, \ldots, X_n$ be a large random sample from a
distribution with p.d.f. $f$. Assume that $f$ is symmetric about
the median of the distribution. Find the large-sample
distribution of the sample IQR.


* 10.8 Sign and Rank Tests 


In this section, we describe some popular nonparametric tests for hypotheses about 
the median of a distribution or about the difference between two distributions. 


One-Sample Procedures 


Example 10.8.1 Calorie Counts in Hot Dogs. Consider the n = 20 calorie counts for beef hot dogs given
in Exercise 7 in Sec. 8.5. Suppose that we are interested in testing hypotheses about
the median calorie count, but we are not willing to assume that the calorie counts
follow a normal distribution or any other familiar distribution. Are there methods
that are appropriate when we are not willing to make assumptions about the form of
the distribution? ◄


Suppose that $X_1, \ldots, X_n$ form a random sample from an unknown distribution.
In Chapter 9, we considered the case in which the form of the unknown distribution
was known, but there were some specific parameters that were still unknown. For
example, the distribution might be a normal distribution with unknown mean and/or
variance. Now we shall assume only that the distribution is continuous. Since we shall
not assume that the distribution of the data has a mean, we cannot test hypotheses
about the mean of the distribution. However, every continuous distribution has
a median $\mu$ that satisfies $\Pr(X_i \le \mu) = 0.5$. The median is a popular measure of
location for general distributions, and we shall now present a test procedure for testing
hypotheses of the form

$$H_0: \mu \le \mu_0, \qquad H_1: \mu > \mu_0. \tag{10.8.1}$$

The test is based on the following simple fact: $\mu \le \mu_0$ if and only if $\Pr(X_i \le \mu_0) \ge 0.5$.
For $i = 1, \ldots, n$, let $Y_i = 1$ if $X_i \le \mu_0$, and let $Y_i = 0$ otherwise. Define $p = \Pr(Y_i = 1)$.
Then testing whether $\mu \le \mu_0$ is equivalent to testing whether $p \ge 0.5$. Since
$X_1, \ldots, X_n$ are independent, so too are $Y_1, \ldots, Y_n$. This makes $Y_1, \ldots, Y_n$
a random sample from the Bernoulli distribution with parameter $p$. We already
know how to test the null hypothesis that $p \ge 0.5$. (See Example 9.1.9.) We compute
$W = Y_1 + \cdots + Y_n$ and reject the null hypothesis if $W$ is too small. To make the test
have level of significance $\alpha_0$, choose $c$ so that

$$\sum_{w=0}^{c} \binom{n}{w} \left(\frac{1}{2}\right)^n \le \alpha_0 < \sum_{w=0}^{c+1} \binom{n}{w} \left(\frac{1}{2}\right)^n.$$

Then the test would reject $H_0$ if $W \le c$.
The test that we have just described is called the sign test because it is based
on the number of observations for which $X_i - \mu_0$ is negative. A similar test can be
constructed if we wish to test the hypotheses

$$H_0: \mu = \mu_0, \qquad H_1: \mu \ne \mu_0.$$

Once again, let $p = \Pr(X_i \le \mu_0)$. The null hypothesis $H_0$ is now equivalent to $p = 0.5$.
To perform the test at level of significance $\alpha_0$, we would choose a number $c$ such that

$$\sum_{w=0}^{c} \binom{n}{w} \left(\frac{1}{2}\right)^n \le \frac{\alpha_0}{2} < \sum_{w=0}^{c+1} \binom{n}{w} \left(\frac{1}{2}\right)^n.$$

We would then reject $H_0$ if either $W \le c$ or $W \ge n - c$. We use the symmetric rejection
region because the binomial distribution with parameters $n$ and 1/2 is symmetric
about $n/2$.


Example 10.8.2 Calorie Counts in Hot Dogs. Consider again the calorie counts for beef hot dogs in
Example 10.8.1. Let $\mu$ stand for the median of the distribution of calories in beef hot
dogs. Suppose that we are interested in testing the hypotheses $H_0: \mu = 150$ versus
$H_1: \mu \ne 150$. Since 9 of the 20 calorie counts are below 150, we have $W = 9$. The
two-sided p-value for this observation is 0.8238, so we would not reject the null hypothesis
at level $\alpha_0$ unless $\alpha_0 \ge 0.8238$. ◄


The power function of the sign test is easy to compute for each value of
$p = \Pr(X_i \le \mu_0)$. For example, for the one-sided test of the hypotheses (10.8.1), the
power is

$$\Pr(W \le c) = \sum_{w=0}^{c} \binom{n}{w} p^w (1 - p)^{n - w}.$$
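
The binomial sums that define the sign test are easy to evaluate directly. The Python sketch below is an illustration only: the function names are hypothetical, and the two-sided p-value is computed as twice the smaller tail probability under $p = 1/2$, which is an assumption made here that reproduces the value 0.8238 quoted for the hot dog data.

```python
from math import comb


def binom_cdf(c, n, p):
    """Pr(W <= c) when W has the binomial distribution with parameters n and p."""
    return sum(comb(n, w) * p ** w * (1 - p) ** (n - w) for w in range(0, c + 1))


def one_sided_critical_value(n, alpha0):
    """Largest c with Pr(W <= c | p = 1/2) <= alpha0, or -1 if no such c exists."""
    c = -1
    while c + 1 <= n and binom_cdf(c + 1, n, 0.5) <= alpha0:
        c += 1
    return c


def sign_test_power(c, n, p):
    """Power Pr(W <= c) of the one-sided sign test when Pr(X_i <= mu_0) = p."""
    return binom_cdf(c, n, p)


def two_sided_p_value(w, n):
    """Twice the smaller tail probability under p = 1/2, capped at 1."""
    lower = binom_cdf(w, n, 0.5)
    upper = 1.0 - binom_cdf(w - 1, n, 0.5)
    return min(1.0, 2.0 * min(lower, upper))


p_value = two_sided_p_value(9, 20)        # hot dog data: W = 9, n = 20
c = one_sided_critical_value(20, 0.05)    # one-sided test at level 0.05
power_at_02 = sign_test_power(c, 20, 0.2)
print(round(p_value, 4), c, round(power_at_02, 3))
```

Because $W$ is discrete, the attained size `sign_test_power(c, 20, 0.5)` is usually strictly smaller than the nominal level.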


Comparing Two Distributions 


Example 10.8.3 Comparing Copper Ores. Consider again the comparison of copper ores in
Example 9.6.5. Suppose that we are not comfortable assuming that the distributions of
copper ores are normal distributions. Can we still test hypotheses about whether the
distributions are the same or whether they have the same medians? ◄


Next, we shall consider a problem in which a random sample of $m$ observations
$X_1, \ldots, X_m$ is taken from a continuous distribution for which the c.d.f. $F(x)$ is
unknown, and an independent random sample of $n$ observations $Y_1, \ldots, Y_n$ is taken
from another continuous distribution for which the c.d.f. $G(x)$ is also unknown. We
desire to test the hypotheses

$$H_0: F = G, \qquad H_1: F \ne G. \tag{10.8.2}$$


One way to test the hypotheses (10.8.2) is to use the Kolmogorov-Smirnov test 
for two samples described in Sec. 10.6. Furthermore, if we are willing to assume 
that the two samples are actually drawn from normal distributions with the same 
unknown variance, then testing the hypotheses (10.8.2) is the same as testing whether 
two normal distributions have the same mean. Therefore, under this assumption, we 
could use a two-sample $t$ test as described in Sec. 9.6.

In this section we shall present another procedure for testing the hypotheses 
(10.8.2). This procedure, which was introduced separately by F. Wilcoxon and by 
H. B. Mann and D. R. Whitney in the 1940s, is known as the Wilcoxon-Mann-Whitney 
ranks test. 


The Wilcoxon-Mann-Whitney Ranks Test. In this procedure, we begin by arranging
the $m + n$ observations in the two samples in a single sequence from the smallest
value that appears in the two samples to the largest value that appears. Since all the
observations come from continuous distributions, it may be assumed that no two of
the $m + n$ observations have the same value. Thus, a total ordering of these $m + n$
values can be obtained. Each observation in this total ordering is then assigned a
rank from 1 to $m + n$ corresponding to its position in the ordering.

The Wilcoxon-Mann-Whitney ranks test is based on the property that if the
null hypothesis $H_0$ is true and the two samples are actually drawn from the same
distribution, then the observations $X_1, \ldots, X_m$ will tend to be dispersed throughout
the ordering of all $m + n$ observations, rather than be concentrated among the smaller
values or among the larger values. In fact, when $H_0$ is true, the ranks that are assigned
to the $m$ observations $X_1, \ldots, X_m$ will be the same as if they were a random sample
of $m$ ranks drawn at random without replacement from a box containing the $m + n$
ranks $1, 2, \ldots, m + n$.

Let $S$ denote the sum of the ranks that are assigned to the $m$ observations
$X_1, \ldots, X_m$. Since the average of the ranks $1, 2, \ldots, m + n$ is $(1/2)(m + n + 1)$, it
follows from the discussion just given that when $H_0$ is true,

$$E(S) = \frac{m(m + n + 1)}{2}. \tag{10.8.3}$$

Also, it can be shown that when $H_0$ is true,

$$\operatorname{Var}(S) = \frac{mn(m + n + 1)}{12}. \tag{10.8.4}$$

Furthermore, when the sample sizes $m$ and $n$ are large and $H_0$ is true, the distribution
of $S$ will be approximately the normal distribution for which the mean and the
variance are given by Eqs. (10.8.3) and (10.8.4). The Wilcoxon-Mann-Whitney ranks test
rejects $H_0$ if the value of $S$ deviates very far from its mean value given by Eq. (10.8.3).
In other words, the test specifies rejecting $H_0$ if $|S - (1/2)m(m + n + 1)| \ge c$, where
the constant $c$ is chosen appropriately. In particular, when the approximate normal
distribution of $S$ is used, the constant $c = [\operatorname{Var}(S)]^{1/2} \Phi^{-1}(1 - \alpha_0/2)$ makes the test
have level of significance $\alpha_0$.


Example 10.8.4 Comparing Copper Ores. Consider again the comparison of copper ores in
Example 10.8.3. Suppose that the $m = 8$ measurements in the first sample are

2.183, 2.431, 2.556, 2.629, 2.641, 2.715, 2.805, 2.840,

while the $n = 10$ measurements in the second sample are

2.120, 2.153, 2.213, 2.240, 2.245, 2.266, 2.281, 2.336, 2.558, 2.587.

The 18 values in the two samples are ordered from smallest to largest in Table 10.43.
Each observed value in the first sample is identified by the symbol x, and each
observed value in the second sample is identified by the symbol y. The sum $S$ of
the ranks of the 8 observed values in the first sample is found to be 104.

Table 10.43 Sorted data for Example 10.8.4

Rank   Observed value   Sample      Rank   Observed value   Sample
1      2.120            y           10     2.431            x
2      2.153            y           11     2.556            x
3      2.183            x           12     2.558            y
4      2.213            y           13     2.587            y
5      2.240            y           14     2.629            x
6      2.245            y           15     2.641            x
7      2.266            y           16     2.715            x
8      2.281            y           17     2.805            x
9      2.336            y           18     2.840            x

Suppose that we use the normal distribution approximation. Then if $H_0$ is true, $S$
has approximately the normal distribution with mean 76 and variance 126.67. The
standard deviation of $S$ is therefore $(126.67)^{1/2} = 11.25$. Hence, if $H_0$ is true, the
random variable $Z = (S - 76)/11.25$ will have approximately the standard normal
distribution. Since $S = 104$ in this example, it follows that $Z = 2.49$. The p-value
corresponding to this value of $Z$ is 0.0128. Hence, the null hypothesis would be
rejected at every level of significance $\alpha_0 \ge 0.0128$. ◄
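
The calculation in this example can be reproduced with a short Python sketch. The helper names `rank_sum` and `wmw_normal_test` are hypothetical, the code assumes there are no ties, and the p-value uses the two-sided normal approximation described above.

```python
from statistics import NormalDist


def rank_sum(xs, ys):
    """Sum S of the ranks of xs in the combined ordered sample (no ties assumed)."""
    combined = sorted(xs + ys)
    if len(set(combined)) != len(combined):
        raise ValueError("ties present; the ranks are not uniquely defined")
    rank = {value: i + 1 for i, value in enumerate(combined)}
    return sum(rank[x] for x in xs)


def wmw_normal_test(xs, ys):
    """Return S, its null mean and variance, Z, and the two-sided p-value."""
    m, n = len(xs), len(ys)
    s = rank_sum(xs, ys)
    mean = m * (m + n + 1) / 2          # Eq. (10.8.3)
    var = m * n * (m + n + 1) / 12      # Eq. (10.8.4)
    z = (s - mean) / var ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return s, mean, var, z, p_value


x = [2.183, 2.431, 2.556, 2.629, 2.641, 2.715, 2.805, 2.840]
y = [2.120, 2.153, 2.213, 2.240, 2.245, 2.266, 2.281, 2.336, 2.558, 2.587]
s, mean, var, z, p_value = wmw_normal_test(x, y)
print(s, mean, round(var, 2), round(z, 2), round(p_value, 4))
```

The code keeps $Z$ unrounded, so the printed p-value may differ from the value in the example in the last reported digit.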


For small values of m and n, the normal approximation to the distribution of S 
will not be appropriate. Tables of the exact distributions of S for small sample sizes 
are given in many published collections of statistical tables. Many statistical software 
packages also calculate the c.d.f. and quantiles of the exact distribution of S. 


Note: Tests for Paired Data. Versions of the sign test and ranks test for paired data 
are developed in Exercises 1 and 15. 


Ties 


The theory of the Wilcoxon-Mann-Whitney ranks test is based on the assumption
that all of the observed values of the $X_i$ and $Y_j$ will be distinct. Since the
measurements in an actual experiment may be made with only limited precision,
however, there may actually be observed values that appear more than once. For
example, suppose that a Wilcoxon-Mann-Whitney ranks test is to be performed, and
it is found that $X_i = Y_j$ for one or more pairs $(i, j)$. In this case, the ranks test should
be carried out twice. In the first test, for each pair with $X_i = Y_j$, it should be assumed
that $X_i < Y_j$. In the second test, assume that $X_i > Y_j$. If the tail areas found from
the two tests are roughly equal, then the ties are a relatively unimportant part of the
data. If, on the other hand, the tail areas are quite different, then the ties can seriously
affect the inferences that are to be made. In this case the data may be inconclusive.


Example 10.8.5 Calcium Supplements and Blood Pressure. Consider the data from Exercise 10 in
Sec. 9.6, which we used to illustrate the Kolmogorov-Smirnov test in Example 10.6.4.
The observed values −5 and −3 appear in both samples. First, we shall assign the
smaller ranks to those values in the group that received the calcium supplement (the
$X_i$'s) and then assign the smaller rank to the placebo group (the $Y_j$'s). For example,
in the combined sample, the −3 values are the fifth, sixth, and seventh smallest. In
the first test, we shall assign rank 5 to the $X_i$ that equals −3 and ranks 6 and 7 to
the two $Y_j$'s that equal −3. In the second test, we shall assign rank 7 to the $X_i$ that
equals −3 and ranks 5 and 6 to the $Y_j$'s. For the first test, the sum of the X ranks is
123, and in the second test, the sum of the X ranks is 126. In this problem, $m = 10$ and
$n = 11$, so the mean and variance of $S$ when the null hypothesis is true are 110 and
201.7, respectively. The two-sided tail areas corresponding to the two assignments
are 0.36 and 0.26. Neither of these would lead to rejecting the null hypothesis at level
$\alpha_0$ unless $\alpha_0 \ge 0.26$. ◄
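
The two tail areas quoted in this example follow from the normal approximation applied to the two rank sums. A minimal Python check is sketched below; the function name `two_sided_tail_area` is hypothetical, and only the values $m = 10$, $n = 11$, $S = 123$, and $S = 126$ are taken from the example.

```python
from statistics import NormalDist


def two_sided_tail_area(s, m, n):
    """Two-sided tail area for the rank sum s under the normal approximation."""
    mean = m * (m + n + 1) / 2
    var = m * n * (m + n + 1) / 12
    z = (s - mean) / var ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


m, n = 10, 11
area_x_first = two_sided_tail_area(123, m, n)   # tied X values given the smaller ranks
area_y_first = two_sided_tail_area(126, m, n)   # tied Y values given the smaller ranks
print(round(area_x_first, 2), round(area_y_first, 2))
```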


Other reasonable methods for handling ties have been proposed. When two or 
more values are the same, one simple method is to consider the successive ranks that 
are to be assigned to these values and then assign the average of these ranks to each 
of the tied values. When this method is used, the value of Var(S) must be corrected
because of the ties. 


Power of the Wilcoxon-Mann-Whitney Ranks Test


The Wilcoxon-Mann-Whitney ranks test rejects the null hypothesis that the two
distributions are the same when the sum $S$ of the X ranks is either too large or too
small. This would be a sensible thing to do if one thought that the most important
alternatives were those in which the $X_i$ values tended to be larger than the $Y_j$ values
or those in which the $X_i$ values tended to be smaller than the $Y_j$ values. However,
there are other situations in which $F \ne G$, but $S$ tends to be close to the mean in
Eq. (10.8.3). For example, suppose that all $X_1, \ldots, X_m$ have the uniform distribution
on the interval [0, 1] and $Y_1, \ldots, Y_n$ have the following p.d.f.:

$$g(y) = \begin{cases} 0.5 & \text{if } -1 < y < 0 \text{ or } 1 < y < 2, \\ 0 & \text{otherwise.} \end{cases}$$

Then it is not difficult to show that $E(S)$ is the same as Eq. (10.8.3) and
$\operatorname{Var}(S) = m^2 n/4$. In such a case, the power of the test (the probability of rejecting $H_0$) would
not be much larger than the level of significance $\alpha_0$. Indeed, if one were concerned
about alternatives of this sort, one would wish to reject $H_0$ if the X ranks were too
closely clustered regardless of whether they were large or small.
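
The mean and variance stated for this example can be verified with a short calculation. The symbol $B$ is introduced here only for this sketch; it denotes the number of $Y_j$ values that fall in the interval $(-1, 0)$.

```latex
% Every Y_j lies either below all of the X_i (in (-1, 0)) or above all of them (in (1, 2)).
% B, the number of Y_j in (-1, 0), has the binomial distribution with parameters n and 1/2.
% The X ranks are therefore B + 1, B + 2, ..., B + m in some order.
\begin{align*}
S &= \sum_{r=1}^{m} (B + r) = mB + \frac{m(m + 1)}{2}, \\
E(S) &= \frac{mn}{2} + \frac{m(m + 1)}{2} = \frac{m(m + n + 1)}{2}, \\
\operatorname{Var}(S) &= m^2 \operatorname{Var}(B) = \frac{m^2 n}{4}.
\end{align*}
```

Since the X ranks always form a block of $m$ consecutive integers under this alternative, a test that rejects when the X ranks are tightly clustered would detect it.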

The Wilcoxon-Mann-Whitney ranks test is designed to have high power when F 
and G have a special relationship to each other, defined next. 


Definition 10.8.1 Stochastically Larger. Let $X$ be a random variable with c.d.f. $F$, and let $Y$ be a random
variable with c.d.f. $G$. Let $F^{-1}$ and $G^{-1}$ denote the respective quantile functions. We
say that $F$ is stochastically larger than $G$ or, equivalently, that $X$ is stochastically larger
than $Y$ if $F^{-1}(p) \ge G^{-1}(p)$ for all $0 < p < 1$; that is, every quantile of $X$ is at least as
large as the corresponding quantile of $Y$.


It is easy to see that if $X_i$ is stochastically larger than $Y_j$, then the ranks of the $X_i$'s in
the combined sample will tend to be at least as large as the ranks of the $Y_j$'s. This will
make large values of $S$ more likely than small values. Similarly, if $Y_j$ is stochastically
larger than $X_i$, $S$ will tend to be small.
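
The quantile condition in the definition can be checked numerically on a grid of probabilities. The Python sketch below is illustrative only: `quantiles_dominate` is a hypothetical helper, and a grid check can refute the ordering but cannot prove it for every $p$. The three normal distributions are examples chosen here.

```python
from statistics import NormalDist


def quantiles_dominate(f_inv, g_inv, grid_size=999):
    """Check F^{-1}(p) >= G^{-1}(p) on an evenly spaced grid of p in (0, 1)."""
    ps = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return all(f_inv(p) >= g_inv(p) for p in ps)


shifted = NormalDist(mu=1.0, sigma=1.0)   # standard normal shifted up by 1
base = NormalDist(mu=0.0, sigma=1.0)
wide = NormalDist(mu=0.0, sigma=2.0)      # same median, larger spread

print(quantiles_dominate(shifted.inv_cdf, base.inv_cdf))  # shift: ordering holds on the grid
print(quantiles_dominate(wide.inv_cdf, base.inv_cdf))     # spread only: fails for p < 1/2
print(quantiles_dominate(base.inv_cdf, wide.inv_cdf))     # and fails for p > 1/2 the other way
```

The second and third cases illustrate a pair of distributions in which neither is stochastically larger than the other.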

When neither X; nor Y; is stochastically larger than the other, it is difficult to 
make any general claim about the distribution of S. For large sample sizes, a normal 
approximation still holds for the distribution of S$, even when F 4 G. However, the 
mean and variance of S depend on the two c.d.f’s F and G. For example, using the 
result in Exercise 11, one can show that 


E(S) =nm Pr(X,>Y,) + wena (10.8.5) 
Using this same approach, one can also show that 
Var(S) = nm[ Pr(X1 > ¥4) — (m +n-1) Pr(X, = ¥)* (10.8.6) 


+ (n — 1) Pr(X, > 4, X= Yo) + (m—1) Pr(X, = %, X= Y,)]. 


In principle, all of these probabilities could be computed for each specific choice 
of F and G. For particular choices of F and G, one could use simulation methods 
(see Chapter 12) to approximate the necessary probabilities. After computing or 
approximating these probabilities, one can then approximate the power of the level 
ay Wilcoxon-Mann- Whitney ranks test as follows: First, recall that the test rejects the 
null hypothesis that F = G if S < c, or S > cy, where 


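$$\begin{aligned}
c_1 &= \frac{m(m+n+1)}{2} - \Phi^{-1}\!\left(1 - \frac{\alpha_0}{2}\right)\left[\frac{mn(m+n+1)}{12}\right]^{1/2}, \\
c_2 &= \frac{m(m+n+1)}{2} + \Phi^{-1}\!\left(1 - \frac{\alpha_0}{2}\right)\left[\frac{mn(m+n+1)}{12}\right]^{1/2}.
\end{aligned}$$

Then the power of the test is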


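$$\Phi\!\left(\frac{c_1 - E(S)}{\operatorname{Var}(S)^{1/2}}\right) + 1 - \Phi\!\left(\frac{c_2 - E(S)}{\operatorname{Var}(S)^{1/2}}\right),$$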


where E(S) and Var(S) are given by Eqs. (10.8.5) and (10.8.6), respectively. 
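
The power approximation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `wmw_moments` and `wmw_power` are hypothetical, and the three probabilities `p1` = Pr(X_1 ≥ Y_1), `p2` = Pr(X_1 ≥ Y_1, X_1 ≥ Y_2), and `p3` = Pr(X_1 ≥ Y_1, X_2 ≥ Y_1) are assumed to have been computed or simulated beforehand for the chosen F and G.

```python
from statistics import NormalDist


def wmw_moments(m, n, p1, p2, p3):
    """Mean and variance of S from Eqs. (10.8.5) and (10.8.6).

    p1 = Pr(X1 >= Y1), p2 = Pr(X1 >= Y1, X1 >= Y2),
    p3 = Pr(X1 >= Y1, X2 >= Y1); all three are inputs supplied by the user.
    """
    mean = n * m * p1 + m * (m + 1) / 2
    var = n * m * (p1 - (m + n - 1) * p1 ** 2 + (n - 1) * p2 + (m - 1) * p3)
    return mean, var


def wmw_power(m, n, p1, p2, p3, alpha0=0.05):
    """Normal approximation to the power of the level alpha0 two-sided test."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha0 / 2)
    # Critical values c1 and c2 use the null mean and variance of S.
    null_mean = m * (m + n + 1) / 2
    null_sd = (m * n * (m + n + 1) / 12) ** 0.5
    c1 = null_mean - z * null_sd
    c2 = null_mean + z * null_sd
    mean, var = wmw_moments(m, n, p1, p2, p3)
    sd = var ** 0.5
    return std_normal.cdf((c1 - mean) / sd) + 1 - std_normal.cdf((c2 - mean) / sd)


# Sanity check: when F = G is continuous, p1 = 1/2 and p2 = p3 = 1/3,
# so the approximate power should be close to alpha0 itself.
print(wmw_power(25, 15, 0.5, 1 / 3, 1 / 3, alpha0=0.05))
```

When F = G, the moments returned by `wmw_moments` reduce to the null mean and variance of S, which is why the check above recovers the level of the test.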




Summary 




The sign test was introduced as a nonparametric test for hypotheses about the median 
of an unknown distribution. The Wilcoxon-Mann-Whitney ranks test was developed 
as another nonparametric test for hypotheses about the equality of two c.d.f.’s. The 
Wilcoxon-Mann-Whitney ranks test was designed to have high power when
one of the two distributions is stochastically larger than the other. 


Exercises 


1. Suppose that $(X_1, Y_1), \ldots, (X_n, Y_n)$ are i.i.d. pairs of
random variables with a continuous joint distribution. Let
$p = \Pr(X_i < Y_i)$, and suppose that we want to test the
hypotheses

$$H_0: p \leq 1/2, \qquad H_1: p > 1/2. \tag{10.8.7}$$


Describe a version of the sign test to use for testing these 
hypotheses. 


2. Consider again the data in Example 10.8.4. Test the 
hypotheses (10.8.2) by applying the Kolmogorov-Smirnov 
test for two samples. 


3. Consider again the data in Example 10.8.4. Test the
hypotheses (10.8.2) by assuming that the observations are
taken from two normal distributions with the same
variance, and apply a t test of the type described in Sec. 9.6.


4. In an experiment to compare the effectiveness of two 
drugs A and B in reducing blood glucose concentrations, 
drug A was administered to 25 patients, and drug B was 
administered to 15 patients. The reductions in blood glu- 
cose concentrations for the 25 patients who received drug 
A are given in Table 10.44. The reductions in concentra- 
tions for the 15 patients who received drug B are given 
in Table 10.45. Test the hypothesis that the two drugs are 
equally effective in reducing blood glucose concentrations 
by using the Wilcoxon-Mann-Whitney ranks test. 


Table 10.44 Data for patients who receive 
drug A in Exercise 4 


0.35 1.12 1.54 0.13 0.77 
0.16 1.20 0.40 1.38 0.39 
0.58 0.04 0.44 0.75 0.71 
1.64 0.49 0.90 0.83 0.28 
1.50 1.73 1.15 0.72 0.91 


Table 10.45 Data for patients who receive 
drug B in Exercise 4 


1.78 1.25 1.01 
1.82 1.95 1.81 
0.68 1.48 1.59 
0.89 0.86 1.63 
1.26 1.07 1.31 


5. Consider again the data in Exercise 4. Test the hypoth- 
esis that the two drugs are equally effective by applying 
the Kolmogorov-Smirnov test for two samples. 


6. Consider again the data in Exercise 4. Test the hypoth- 
esis that the two drugs are equally effective by assuming 
that the observations are taken from two normal distribu- 
tions with the same variance and applying a t test of the 
type described in Sec. 9.6. 


7. Suppose that $X_1, \ldots, X_m$ form a random sample of m
observations from a continuous distribution for which the
p.d.f. f(x) is unknown, and that $Y_1, \ldots, Y_n$ form an
independent random sample of n observations from another
continuous distribution for which the p.d.f. g(x) is also
unknown. Suppose also that $f(x) = g(x - \theta)$ for
$-\infty < x < \infty$, where the value of the parameter $\theta$ is unknown
($-\infty < \theta < \infty$). Let $F^{-1}$ be the quantile function of the
$X_i$’s, and let $G^{-1}$ be the quantile function of the $Y_j$’s. Show
that $F^{-1}(p) = \theta + G^{-1}(p)$ for all $0 < p < 1$.


8. Consider again the conditions of Exercise 7. Describe 
how to carry out a one-sided Wilcoxon-Mann-Whitney 
ranks test of the following hypotheses: 

$$H_0: \theta \leq 0, \qquad H_1: \theta > 0.$$


9. Consider again the conditions of Exercise 7. Describe 
how to carry out a two-sided Wilcoxon-Mann-Whitney 
ranks test of the following hypotheses for a specified value 
of $\theta_0$:

$$H_0: \theta = \theta_0, \qquad H_1: \theta \neq \theta_0.$$


10. Consider again the conditions of Exercise 9. Describe
how to use the Wilcoxon-Mann-Whitney ranks test to
determine a confidence interval for $\theta$ with confidence
coefficient $1 - \alpha_0$. Hint: For which values of $\theta_0$ would you accept
the null hypothesis $H_0: \theta = \theta_0$ at level of significance $\alpha_0$?


11. Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be the observations in
two samples, and suppose that no two of these observations
are equal. Consider the mn pairs

$$\begin{array}{ccc}
(X_1, Y_1) & \cdots & (X_1, Y_n), \\
(X_2, Y_1) & \cdots & (X_2, Y_n), \\
\vdots & & \vdots \\
(X_m, Y_1) & \cdots & (X_m, Y_n).
\end{array}$$

Let U denote the number of these pairs for which the value
of the X component is greater than the value of the Y
component. Show that

$$U = S - \frac{1}{2} m(m+1),$$

where S is the sum of the ranks assigned to $X_1, \ldots, X_m$,
as defined in this section.


12. Let $X_1, \ldots, X_m$ be i.i.d. with c.d.f. F independently of
$Y_1, \ldots, Y_n$, which are i.i.d. with c.d.f. G. Let S be as defined
in this section. Prove that Eq. (10.8.5) gives the mean of
S.


13. Under the conditions of Exercise 12, prove that Eq. 
(10.8.6) gives the variance of S. 


14. Under the conditions of Exercises 12 and 13, suppose 
further that F = G. Prove that Eqs. (10.8.5) and (10.8.6) 
agree with Eqs. (10.8.3) and (10.8.4), respectively. 


15. Consider again the conditions of Exercise 1. This time,
let $D_i = X_i - Y_i$. Wilcoxon (1945) developed the following
test of the hypotheses (10.8.7). Order the absolute values
$|D_1|, \ldots, |D_n|$ from smallest to largest, and assign ranks
from 1 to n to the values. Then $S_W$ is set equal to the
sum of all the ranks of those $|D_i|$ such that $D_i > 0$.
If the distribution of $D_i$ is symmetric around 0, then the
mean and variance of $S_W$ are

$$E(S_W) = \frac{n(n+1)}{4}, \tag{10.8.8}$$

$$\operatorname{Var}(S_W) = \frac{n(n+1)(2n+1)}{24}. \tag{10.8.9}$$

The test rejects $H_0$ if $S_W \geq c$, where c is chosen to make
the test have level of significance $\alpha_0$. This test is called
the Wilcoxon signed ranks test. If n is large, a normal
distribution approximation allows us to use
$c = E(S_W) + \Phi^{-1}(1 - \alpha_0) \operatorname{Var}(S_W)^{1/2}$.

a. Let $W_i = 1$ if the $|D_j|$ that gets rank i has $D_j > 0$,
and $W_i = 0$ if not. Show that $S_W = \sum_{i=1}^{n} i W_i$.

b. Prove that $E(S_W)$ is as stated in Eq. (10.8.8) under
the assumption that the distribution of $D_i$ is symmetric
around 0. Hint: You may wish to use Eq. (4.7.13).

c. Prove that $\operatorname{Var}(S_W)$ is as stated in Eq. (10.8.9) under
the assumption that the distribution of $D_i$ is symmetric
around 0. Hint: You may wish to use Eq. (4.7.14).


16. In an experiment to compare two different materials 
A and B that might be used for manufacturing the heels of 
men’s dress shoes, 15 men were selected and fitted with a 
new pair of shoes on which one heel was made of material 
A and one heel was made of material B. At the beginning 
of the experiment, each heel was 10 millimeters thick. Af- 
ter the shoes had been worn for one month, the remaining 
thickness of each heel was measured. The results are given 
in Table 10.46. Test the null hypothesis that material A is 
not more durable than material B against the alternative 
that material A is more durable than material B, by using 
(a) the sign test of Exercise 1, (b) the Wilcoxon signed-ranks
test of Exercise 15, and (c) the paired t test.


Table 10.46 Data for Exercise 16 


Pair Material A Material B 
1 6.6 7.4
2 7.0 5.4 
3 8.3 8.8 
4 8.2 8.0 
5 5.2 6.8 
6 9.3 9.1 
7 7.9 6.3 
8 8.5 7.5
9 7.8 7.0
10 7.5 6.6
11 6.1 4.4 
12 8.9 7.7 
13 6.1 4.2 
14 9.4 9.4 
15 9.1 9.1 
10.9 Supplementary Exercises


1. Describe how to use the sign test to form a coefficient
$1 - \alpha_0$ confidence interval for the median $\theta$ of an unknown
distribution. Use the data in Exercise 7 in Sec. 8.5 to
construct the observed coefficient 0.95 confidence interval.
Hint: For which values of $\theta_0$ would you fail to reject the
null hypothesis $H_0: \theta = \theta_0$ at level of significance $\alpha_0$?


2. Suppose that 400 persons are chosen at random from a
large population, and that each person in the sample
specifies which one of five breakfast cereals she most prefers.
For $i = 1, \ldots, 5$, let $p_i$ denote the proportion of the
population that prefers cereal i, and let $N_i$ denote the
number of persons in the sample who prefer cereal i. It is
desired to test the following hypotheses at the level of
significance 0.01:

$$\begin{aligned}
H_0 &: p_1 = p_2 = \cdots = p_5, \\
H_1 &: \text{The hypothesis } H_0 \text{ is not true.}
\end{aligned}$$

For what values of $\sum_{i=1}^{5} N_i^2$ would $H_0$ be rejected?


3. Consider a large population of families that have
exactly three children, and suppose that it is desired to test
the null hypothesis $H_0$ that the distribution of the number
of boys in each family is a binomial distribution with
parameters n = 3 and p = 1/2 against the general alternative
$H_1$ that $H_0$ is not true. Suppose also that in a random
sample of 128 families it is found that 26 families have no boys,
32 families have one boy, 40 families have two boys, and
30 families have three boys. At what levels of significance
should $H_0$ be rejected?


4. Consider again the conditions of Exercise 3, including
the observations in the random sample of 128 families, but
suppose now that it is desired to test the composite null
hypothesis $H_0$ that the distribution of the number of boys in
each family is a binomial distribution for which n = 3 and
the value of p is not specified, against the general
alternative $H_1$ that $H_0$ is not true. At what levels of significance
should $H_0$ be rejected?


5. In order to study the genetic history of three different 
large groups of Americans, a random sample of 50 persons 
is drawn from group 1, a random sample of 100 persons is 
drawn from group 2, and a random sample of 200 persons 
is drawn from group 3. The blood type of each person in 
the samples is classified as A, B, AB, or O, and the results 
are as given in Table 10.47. Test the hypothesis that the 
distribution of blood types is the same in all three groups 
at the level of significance 0.1. 


Table 10.47 Data for Exercises 5 and 6 


A B AB O Total 
Group 1 24 6 5 15 50 
Group 2 43 24 7 26 100 
Group 3 69 47 22 62 200 


6. Consider again the conditions of Exercise 5. Explain 
how to change the numbers in Table 10.47 in such a way 
that each row total and each column total remains
unchanged, but the value of the $\chi^2$ test statistic is increased.


7. Consider a $\chi^2$ test of independence that is to be applied
to the elements of a 2 × 2 contingency table. Show that the
quantity $(N_{ij} - \hat{E}_{ij})^2$ has the same value for each of the
four cells of the table.


8. Consider again the conditions of Exercise 7. Show that
the $\chi^2$ statistic Q can be written in the form

$$Q = \frac{n (N_{11} N_{22} - N_{12} N_{21})^2}{N_{1+} N_{2+} N_{+1} N_{+2}}.$$


9. Suppose that a $\chi^2$ test of independence at the level of
significance 0.01 is to be applied to the elements of a 2 × 2
contingency table containing 4n observations, and that the
data have the form given in Table 10.48. For what values
of a would the null hypothesis be rejected?


Table 10.48 Form of the data for Exercise 9

n + a | n − a
n − a | n + a


10. Suppose that a $\chi^2$ test of independence at the level
of significance 0.005 is to be applied to the elements of a
2 × 2 contingency table containing 2n observations, and
that for some $\alpha \in (0, 1)$ the data have the form given in
Table 10.49. For what values of $\alpha$ would the null hypothesis
be rejected?


Table 10.49 Form of the data for Exercise 10

αn | (1 − α)n
(1 − α)n | αn


11. In a study of the health effects of air pollution, it was
found that the proportion of the total population of city A
that suffered from respiratory diseases was larger than the
proportion for city B. Since city A was generally regarded
as being less polluted and more healthful than city B, this
result was considered surprising. Therefore, separate
investigations were made for the younger population (under
age 40) and for the older population (age 40 or older). It
was found that the proportion of the younger population
suffering from respiratory diseases was smaller for city A
than for city B, and also that the proportion of the older
population suffering from respiratory diseases was smaller
for city A than for city B. Discuss and explain these results.


12. Suppose that an achievement test in mathematics was 
given to students from two different high schools A and B. 
When the results of the test were tabulated, it was found 
that the average score for the freshmen at school A was 
higher than the average for the freshmen at school B, and 
that the same relationship existed for the sophomores, the 
juniors, and the seniors at the two schools. On the other 
hand, it was found also that the average score of all the 
students at school A was lower than that of all the students 
at school B. Discuss and explain these results. Give an 
example of how this could happen. 


13. A random sample of 100 hospital patients suffering 
from depression received a particular treatment over a 
period of three months. Prior to the beginning of the treat- 
ment, each patient was classified as being at one of five 
levels of depression, where level 1 represented the most 
severe level of depression and level 5 represented the 
mildest level. At the end of the treatment, each patient 
was again classified according to the same five levels of 
depression. The results are given in Table 10.50. Discuss 
the use of this table for determining whether the treatment 
has been helpful in alleviating depression. 


Table 10.50 Data for Exercise 13 


Level of depression 
after treatment 


Level of depression 
before treatment 1 2 3 4 5 


1 7 3 0 0 
2 1 27 14 0 
3 0 0 19 8 2 
4 0 1 2 12 0 
5 0 0 1 0 


14. Suppose that a random sample of three observations
is drawn from a distribution with the following p.d.f.:

$$f(x) = \begin{cases} \theta x^{\theta - 1} & \text{for } 0 < x < 1, \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$. Determine the p.d.f. of the sample median.


15. Suppose that a random sample of n observations is
drawn from a distribution for which the p.d.f. is as given
in Exercise 14. Determine the asymptotic distribution of
the sample median.


16. Suppose that a random sample of n observations is
drawn from a t distribution with $\alpha > 2$ degrees of
freedom. Show that the asymptotic distributions of both the
sample mean $\overline{X}_n$ and the sample median $\tilde{X}_n$ are normal,
and determine the positive integers $\alpha$ for which the
variance of the asymptotic distribution is smaller for $\overline{X}_n$ than
for $\tilde{X}_n$.


17. Suppose that $X_1, \ldots, X_n$ form a large random
sample from a distribution for which the p.d.f. is
$h(x|\theta) = \alpha f(x|\theta) + (1 - \alpha) g(x|\theta)$. Here $f(x|\theta)$ is the p.d.f. of the
normal distribution with unknown mean $\theta$ and variance
1, $g(x|\theta)$ is the p.d.f. of the normal distribution with the
same unknown mean $\theta$ and variance $\sigma^2$, and $0 < \alpha < 1$.
Let $\overline{X}_n$ and $\tilde{X}_n$ denote the sample mean and the sample
median, respectively.


a. For $\sigma^2 = 100$, determine the values of $\alpha$ for which the
M.S.E. of $\tilde{X}_n$ will be smaller than the M.S.E. of $\overline{X}_n$.


b. For $\alpha = 1/2$, determine the values of $\sigma^2$ for which the
M.S.E. of $\tilde{X}_n$ will be smaller than the M.S.E. of $\overline{X}_n$.


18. Suppose that $X_1, \ldots, X_n$ form a random sample from
a distribution with p.d.f. f(x), and let $Y_1 < Y_2 < \cdots < Y_n$
denote the order statistics of the sample. Prove that the
joint p.d.f. of $Y_1, \ldots, Y_n$ is as follows:

$$g(y_1, \ldots, y_n) = \begin{cases} n!\, f(y_1) \cdots f(y_n) & \text{for } y_1 < y_2 < \cdots < y_n, \\ 0 & \text{otherwise.} \end{cases}$$


19. Let $Y_1 < Y_2 < Y_3$ denote the order statistics of a
random sample of three observations from the uniform
distribution on the interval [0, 1]. Determine the conditional
distribution of $Y_2$ given that $Y_1 = y_1$ and $Y_3 = y_3$
($0 < y_1 < y_3 < 1$).


20. Suppose that a random sample of 20 observations is
drawn from an unknown continuous distribution, and let
$Y_1 < \cdots < Y_{20}$ denote the order statistics of the sample.
Also, let $\theta$ denote the 0.3 quantile of the distribution, and
suppose that it is desired to present a confidence interval
for $\theta$ that has the form $(Y_r, Y_{r+3})$. Determine the value
of r (r = 1, 2, ..., 17) for which this interval will have the
largest confidence coefficient $\gamma$, and determine the value
of $\gamma$.


21. Suppose that $X_1, \ldots, X_m$ form a random sample from
a continuous distribution for which the p.d.f. f(x) is
unknown; $Y_1, \ldots, Y_n$ form an independent random sample
from another continuous distribution for which the p.d.f.
g(x) also is unknown; and $f(x) = g(x - \theta)$ for
$-\infty < x < \infty$, where the value of the parameter $\theta$ is unknown
($-\infty < \theta < \infty$). Suppose that it is desired to carry out a
Wilcoxon-Mann-Whitney ranks test of the following hypotheses at
a specified level of significance $\alpha$ ($0 < \alpha < 1$):

$$H_0: \theta = \theta_0, \qquad H_1: \theta \neq \theta_0.$$

Assume that no two of the observations are equal, and
let $U_{\theta_0}$ denote the number of pairs $(X_i, Y_j)$ such that
$X_i - Y_j > \theta_0$, where $i = 1, \ldots, m$ and $j = 1, \ldots, n$. Show that
for large values of m and n, the hypothesis $H_0$ should not
be rejected if and only if

$$\frac{mn}{2} - \Phi^{-1}\!\left(1 - \frac{\alpha}{2}\right)\left[\frac{mn(m+n+1)}{12}\right]^{1/2} < U_{\theta_0} < \frac{mn}{2} + \Phi^{-1}\!\left(1 - \frac{\alpha}{2}\right)\left[\frac{mn(m+n+1)}{12}\right]^{1/2},$$

where $\Phi^{-1}$ is the quantile function of the standard normal
distribution. Hint: See Exercise 11 of Sec. 10.8.


22. Consider again the conditions of Exercise 21. Show
that a confidence interval for $\theta$ with confidence coefficient
$1 - \alpha$ can be obtained by the following procedure: Let k
be the largest integer less than or equal to

$$\frac{mn}{2} - \Phi^{-1}\!\left(1 - \frac{\alpha}{2}\right)\left[\frac{mn(m+n+1)}{12}\right]^{1/2}.$$

Also, let A be the kth smallest of the mn differences
$X_i - Y_j$, where $i = 1, \ldots, m$ and $j = 1, \ldots, n$, and let B be
the kth largest of these mn differences. Then the interval
$A < \theta < B$ is a confidence interval of the required type.


23. The sign test can be extended to a test of hypotheses
about an arbitrary quantile of a distribution rather than
just the median. Let $\theta_p$ be the p quantile of a distribution,
and suppose that $X_1, \ldots, X_n$ form an i.i.d. sample from
this distribution.


a. Let b be an arbitrary number. Explain how to
construct a version of the sign test for the hypotheses

$$H_0: \theta_p = b, \qquad H_1: \theta_p \neq b,$$

at level of significance $\alpha_0$. (Construct an equal-tailed
test if you wish.)

b. Show how to use this version of the sign test to form
a coefficient $1 - \alpha_0$ confidence interval for $\theta_p$.
11.1 The Method of Least Squares


When each observation from an experiment is a pair of numbers, it is often
important to try to predict one of the numbers from the other. Least squares is
a method for constructing a predictor of one of the variables from the other by
making use of a sample of observed pairs.


Fitting a Straight Line


Example 11.1.1 Blood Pressure. Suppose that each of 10 patients is treated with the same amount of
two different drugs that can affect blood pressure. To be specific, each patient is first
treated with a standard drug A, and their change in blood pressure is measured. After
the effect of the drug wears off, the patient is treated with an equal amount of a new
drug B, and their change in blood pressure is measured again. These changes in blood
pressure will be called the reaction of the patient to each drug. For $i = 1, \ldots, 10$, we
shall let $x_i$ denote the reaction, measured in appropriate units, of the ith patient to
drug A, and we shall let $y_i$ denote her reaction to drug B. The observed values of
the reactions are as given in Table 11.1. The 10 points $(x_i, y_i)$ for $i = 1, \ldots, 10$ are
plotted in Fig. 11.1. One purpose of the study is to try to predict a patient’s reaction
to drug B if their reaction to the standard drug A is already known. ◄


In Example 11.1.1, suppose that we are interested in describing the relationship
between the reaction y of a patient to drug B and her reaction x to drug A. In order
to obtain a simple expression for this relationship, we might wish to fit a straight line
to the 10 points plotted in Fig. 11.1. Although these 10 points obviously do not lie
exactly on a straight line, we might believe that the deviations from such a line are
caused by the fact that the observed change in the blood pressure of each patient is
affected not only by the two drugs but also by various other factors. In other words,
we might believe that if it were possible to control all of these other factors, the
observed points would actually lie on a straight line. We might believe further that
if we measured the reactions to the two drugs for a very large number of patients,
instead of for just 10 patients, we would then find that the observed points tend to
cluster along a straight line. Perhaps we might also wish to be able to predict the
reaction y of a future patient to the new drug B on the basis of her reaction x to
the standard drug A. One procedure for making such a prediction would be to fit a
straight line to the points in Fig. 11.1, and to use this line for predicting the value of
y corresponding to each value of x.


Table 11.1 Reactions to two drugs

i    x_i    y_i
1    1.9    0.7
2    0.8    −1.0
3    1.1    −0.2
4    0.1    −1.2
5    −0.1   −0.1
6    4.4    3.4
7    4.6    0.0
8    1.6    0.8
9    5.5    3.7
10   3.4    2.0


Figure 11.1 A plot of the observed values in Table 11.1.


It can be seen from Fig. 11.1 that if we did not have to consider the point (4.6, 0.0), 
which is obtained from the patient for whom i = 7 in Table 11.1, then the other nine 
points lie roughly along a straight line. One arbitrary line that fits reasonably well to 
these nine points is sketched in Fig. 11.2. However, if we wish to fit a straight line 
to all 10 points, it is not clear just how much the line in Fig. 11.2 should be adjusted 
in order to accommodate the anomalous point. We shall now describe a method for 
fitting such a line. 


Figure 11.2 A straight line fitted to nine of the points in Table 11.1.


Figure 11.3 Vertical deviations of the plotted points from a straight line.


The Least-Squares Line


Example 11.1.2 Blood Pressure. In Example 11.1.1, suppose that we are interested in fitting a straight
line to the points plotted in Fig. 11.1 in order to obtain a simple mathematical
relationship for expressing the reaction y of a patient to the new drug B as a function
of her reaction x to the standard drug A. In other words, our main objective is to
be able to predict closely a patient’s reaction y to drug B from her reaction x to
drug A. We are interested, therefore, in constructing a straight line such that, for
each observed reaction $x_i$, the corresponding value of y on the straight line will be
as close as possible to the actual observed reaction $y_i$. The vertical deviations of the
10 plotted points from the line drawn in Fig. 11.2 are sketched in Fig. 11.3. ◄


One method of constructing a straight line to fit the observed values is called the 
method of least squares, which chooses the line to minimize the sum of the squares of 
the vertical deviations of all the points from the line. We shall now study the method 
of least squares in more detail. 
Theorem 11.1.1 Least Squares. Let $(x_1, y_1), \ldots, (x_n, y_n)$ be a set of n points. The straight line that
minimizes the sum of the squares of the vertical deviations of all the points from the
line has the following slope and intercept:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \tag{11.1.1}$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$.

Proof Consider an arbitrary straight line $y = \beta_0 + \beta_1 x$, in which the values of the
constants $\beta_0$ and $\beta_1$ are to be determined. When $x = x_i$, the height of this line is
$\beta_0 + \beta_1 x_i$. Therefore, the vertical distance between the point $(x_i, y_i)$ and the line is
$|y_i - (\beta_0 + \beta_1 x_i)|$. Suppose that the line is to be fitted to n points. The sum of the
squares of the vertical distances at the n points is

$$Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2. \tag{11.1.2}$$

We shall minimize Q with respect to $\beta_0$ and $\beta_1$ by taking the partial derivatives and
setting them to 0. We have

$$\frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \tag{11.1.3}$$

and

$$\frac{\partial Q}{\partial \beta_1} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) x_i. \tag{11.1.4}$$

By setting each of these two partial derivatives equal to 0, we obtain the following
pair of equations:

$$\begin{aligned}
\beta_0 n + \beta_1 \sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} y_i, \\
\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 &= \sum_{i=1}^{n} x_i y_i.
\end{aligned} \tag{11.1.5}$$

The equations (11.1.5) are called the normal equations for $\beta_0$ and $\beta_1$. By
considering the second-order derivatives of Q, we can show that the values of $\beta_0$ and $\beta_1$
that satisfy the normal equations will be the values for which the sum of squares Q
in Eq. (11.1.2) is minimized. Solving (11.1.5) yields the values in (11.1.1).


Definition 11.1.1 Least-Squares Line. Let $\hat{\beta}_0$ and $\hat{\beta}_1$ be as defined in (11.1.1). The line defined by the
equation $y = \hat{\beta}_0 + \hat{\beta}_1 x$ is called the least-squares line.


For the values given in Table 11.1, n = 10, and it is found from Eq. (11.1.1)
that $\hat{\beta}_0 = -0.786$ and $\hat{\beta}_1 = 0.685$. Hence, the equation of the least-squares line is
$y = -0.786 + 0.685x$. This line is sketched in Fig. 11.4.

Virtually all statistical computer software will compute the least-squares
regression line. Even some handheld calculators will do the calculation.
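
As an illustration of how Eq. (11.1.1) might be evaluated directly, the following minimal Python sketch applies it to the data of Table 11.1. The function name `least_squares_line` is a hypothetical helper introduced here, not something defined in the text; the computed coefficients should agree with the values quoted above to three decimal places.

```python
# Data from Table 11.1: reactions to drug A (x) and to drug B (y).
x = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 4.6, 1.6, 5.5, 3.4]
y = [0.7, -1.0, -0.2, -1.2, -0.1, 3.4, 0.0, 0.8, 3.7, 2.0]


def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line, Eq. (11.1.1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope in Eq. (11.1.1).
    s_xy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


beta0_hat, beta1_hat = least_squares_line(x, y)
print(round(beta0_hat, 3), round(beta1_hat, 3))
```

In practice one would normally rely on a statistical package for this calculation; the explicit sums are shown only to mirror the formula.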


Figure 11.4 The least-squares straight line.


Fitting a Polynomial by the Method of Least Squares


Suppose now that instead of simply fitting a straight line to n plotted points, we wish
to fit a polynomial of degree k ($k \geq 2$). Such a polynomial will have the following
form:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k. \tag{11.1.6}$$

The method of least squares specifies that the constants $\beta_0, \ldots, \beta_k$ should be chosen
so that the sum Q of the squares of the vertical deviations of the points from the curve
is a minimum. In other words, these constants should be chosen so as to minimize
the following expression for Q:

$$Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i + \cdots + \beta_k x_i^k)]^2. \tag{11.1.7}$$

If we calculate the k + 1 partial derivatives $\partial Q/\partial \beta_0, \ldots, \partial Q/\partial \beta_k$, and we set
each of these derivatives equal to 0, we obtain the following k + 1 linear equations
involving the k + 1 unknown values $\beta_0, \ldots, \beta_k$:

$$\begin{aligned}
\beta_0 n + \beta_1 \sum_{i=1}^{n} x_i + \cdots + \beta_k \sum_{i=1}^{n} x_i^k &= \sum_{i=1}^{n} y_i, \\
\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 + \cdots + \beta_k \sum_{i=1}^{n} x_i^{k+1} &= \sum_{i=1}^{n} x_i y_i, \\
&\;\;\vdots \\
\beta_0 \sum_{i=1}^{n} x_i^k + \beta_1 \sum_{i=1}^{n} x_i^{k+1} + \cdots + \beta_k \sum_{i=1}^{n} x_i^{2k} &= \sum_{i=1}^{n} x_i^k y_i.
\end{aligned} \tag{11.1.8}$$

As before, these equations are called the normal equations. If the normal
equations have a unique solution, that solution provides the minimum value for Q. A
necessary and sufficient condition for a unique solution is that the determinant of
the (k + 1) × (k + 1) matrix formed by the coefficients of $\beta_0, \ldots, \beta_k$ in Eq. (11.1.8)
is not zero. We shall now assume that this is the case. If we denote the solution as
$(\hat{\beta}_0, \ldots, \hat{\beta}_k)$, then the least-squares polynomial is $y = \hat{\beta}_0 + \hat{\beta}_1 x + \cdots + \hat{\beta}_k x^k$.
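
The normal equations (11.1.8) can be assembled and solved numerically. The sketch below does this in plain Python for k = 2 using the data of Table 11.1. The helper names `polynomial_normal_equations` and `solve_linear_system` are hypothetical, and the small Gaussian-elimination routine is only one possible way to solve the system; a numerical linear algebra library would usually be preferred.

```python
def polynomial_normal_equations(x, y, k):
    """Coefficient matrix and right-hand side of Eq. (11.1.8)."""
    # Entry (j, l) is the sum of x_i^(j+l); entry j of b is the sum of x_i^j * y_i.
    a = [[sum(xi ** (j + l) for xi in x) for l in range(k + 1)]
         for j in range(k + 1)]
    b = [sum(xi ** j * yi for xi, yi in zip(x, y)) for j in range(k + 1)]
    return a, b


def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations do not have a unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


# Data from Table 11.1.
x = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 4.6, 1.6, 5.5, 3.4]
y = [0.7, -1.0, -0.2, -1.2, -0.1, 3.4, 0.0, 0.8, 3.7, 2.0]

a, b = polynomial_normal_equations(x, y, k=2)
beta_hat = solve_linear_system(a, b)  # [beta0_hat, beta1_hat, beta2_hat]
print(beta_hat)
```

The routine raises an error when a pivot is numerically zero, which corresponds to the case in which the determinant condition stated above fails.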
Figure 11.5 The least-squares parabola.


Example 11.1.3 Fitting a Parabola. Suppose that we wish to fit a polynomial of the form
$y = \beta_0 + \beta_1 x + \beta_2 x^2$ (which represents a parabola) to the 10 points given in Table 11.1. In this
example, it is found that the normal equations (11.1.8) are as follows:

$$\begin{aligned}
10\beta_0 + 23.3\beta_1 + 90.37\beta_2 &= 8.1, \\
23.3\beta_0 + 90.37\beta_1 + 401.0\beta_2 &= 43.59, \\
90.37\beta_0 + 401.0\beta_1 + 1892.7\beta_2 &= 204.55.
\end{aligned} \tag{11.1.9}$$

The unique values of $\beta_0$, $\beta_1$, and $\beta_2$ that satisfy these three equations are $\hat{\beta}_0 = -0.744$,
$\hat{\beta}_1 = 0.616$, and $\hat{\beta}_2 = 0.013$. Hence, the least-squares parabola is

$$y = -0.744 + 0.616x + 0.013x^2. \tag{11.1.10}$$

This curve is sketched in Fig. 11.5 together with the least-squares straight line.
Because the coefficient of $x^2$ in Eq. (11.1.10) is so small, the least-squares parabola
and the least-squares straight line are very close together over the range of values
included in Fig. 11.5. ◄


Example 11.1.4 Gasoline Mileage. Heavenrich and Hellman (1999) report several variables measured
on 173 different cars. Among those variables are gasoline mileage (in miles per
gallon) and engine horsepower. A plot of miles per gallon versus horsepower is shown
in Fig. 11.6 together with a parabola fit by least squares. Even without the curve
drawn in Fig. 11.6, it is clear that a straight line would not provide an adequate fit to
the relationship between these two variables. Some sort of curved relationship must
be fit. The least-squares parabola curves up for the largest values of horsepower,
which is somewhat counterintuitive. Indeed, this might be an example in which it
would pay to use some prior information to impose a constraint on the fitted curve.
Alternatively, we could replace gasoline mileage by a curved function of miles per
gallon and use this curved function as the y variable. ◄


Figure 11.6 Plot of miles per gallon versus engine horsepower for 173 cars in
Example 11.1.4. The least-squares parabola is also drawn in the plot.


Fitting a Linear Function of Several Variables 


We shall now consider an extension of the example discussed at the beginning of this
section, in which we were interested in representing a patient’s reaction to a new drug
B as a linear function of her reaction to drug A. Suppose that we wish to represent
a patient’s reaction to drug B as a linear function involving not only her reaction to
drug A but also some other relevant variables. For example, we may wish to represent
the patient’s reaction y to drug B as a linear function involving her reaction $x_1$ to drug
A, her heart rate $x_2$, and blood pressure $x_3$ before she receives any drugs, and other
relevant variables $x_4, \ldots, x_k$.

Suppose that for each patient i ($i = 1, \ldots, n$) we measure her reaction $y_i$ to drug
B, her reaction $x_{i1}$ to drug A, and also her values $x_{i2}, \ldots, x_{ik}$ for the other variables.
Suppose also that in order to fit these observed values for the n patients, we wish to
consider a linear function having the form

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k. \tag{11.1.11}$$

In this case, also, the values of $\beta_0, \ldots, \beta_k$ can be determined by the method of least
squares. For each given set of observed values $x_{i1}, \ldots, x_{ik}$, we again consider the
difference between the observed reaction $y_i$ and the value $\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$
of the linear function given in Eq. (11.1.11). As before, it is required to minimize the
sum Q of the squares of these differences. Here,

$$Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})]^2. \tag{11.1.12}$$

We minimize this the same way that we minimized (11.1.7), namely, by setting the
partial derivatives of Q with respect to each $\beta_j$ equal to 0 for $j = 0, \ldots, k$. In this
case, the k + 1 normal equations have the following form:

$$\begin{aligned}
\beta_0 n + \beta_1 \sum_{i=1}^{n} x_{i1} + \cdots + \beta_k \sum_{i=1}^{n} x_{ik} &= \sum_{i=1}^{n} y_i, \\
\beta_0 \sum_{i=1}^{n} x_{i1} + \beta_1 \sum_{i=1}^{n} x_{i1}^2 + \cdots + \beta_k \sum_{i=1}^{n} x_{i1} x_{ik} &= \sum_{i=1}^{n} x_{i1} y_i, \\
&\;\;\vdots \\
\beta_0 \sum_{i=1}^{n} x_{ik} + \beta_1 \sum_{i=1}^{n} x_{ik} x_{i1} + \cdots + \beta_k \sum_{i=1}^{n} x_{ik}^2 &= \sum_{i=1}^{n} x_{ik} y_i.
\end{aligned} \tag{11.1.13}$$

If the normal equations have a unique solution, we shall denote that solution
$(\hat{\beta}_0, \ldots, \hat{\beta}_k)$, and the least-squares linear function will then be
$y = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k$. As before, a necessary and sufficient condition for a unique solution is that the
determinant of the (k + 1) × (k + 1) matrix formed by the coefficients of $\beta_0, \ldots, \beta_k$
in Eq. (11.1.13) is not zero.
Table 11.2 Reactions to two drugs and heart rate

 i    x_{i1}   x_{i2}    y_i
 1     1.9       66      0.7
 2     0.8       62     -1.0
 3     1.1       64     -0.2
 4     0.1       61     -1.2
 5    -0.1       63     -0.1
 6     4.4       70      3.4
 7     4.6       68      0.0
 8     1.6       62      0.8
 9     5.5       68      3.7
10     3.4       66      2.0


Example 11.1.5

Fitting a Linear Function of Two Variables. Suppose that we expand Table 11.1 to
include the values given in the third column in Table 11.2. Here, for each patient
$i$ ($i = 1, \ldots, 10$), $x_{i1}$ denotes her reaction to the standard drug A, $x_{i2}$ denotes her
heart rate, and $y_i$ denotes her reaction to the new drug B. Suppose also that we wish
to fit a linear function to these values having the form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.

In this example, it is found that the normal equations (11.1.13) are

$$
\begin{aligned}
10\beta_0 + 23.3\beta_1 + 650\beta_2 &= 8.1, \\
23.3\beta_0 + 90.37\beta_1 + 1563.6\beta_2 &= 43.59, \\
650\beta_0 + 1563.6\beta_1 + 42{,}334\beta_2 &= 563.1.
\end{aligned}
\qquad (11.1.14)
$$

The unique values of $\beta_0$, $\beta_1$, and $\beta_2$ that satisfy these three equations are
$\hat{\beta}_0 = -11.4527$, $\hat{\beta}_1 = 0.4503$, and $\hat{\beta}_2 = 0.1725$. Hence, the least-squares linear function is

$$y = -11.4527 + 0.4503 x_1 + 0.1725 x_2. \qquad (11.1.15)$$
◄
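The normal equations (11.1.14) form an ordinary 3 x 3 linear system, so the estimates in Eq. (11.1.15) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the variable names `A`, `b`, and `beta_hat` are ours and are not part of the text.

```python
import numpy as np

# Coefficient matrix and right-hand side of the normal equations (11.1.14).
A = np.array([
    [10.0, 23.3, 650.0],
    [23.3, 90.37, 1563.6],
    [650.0, 1563.6, 42334.0],
])
b = np.array([8.1, 43.59, 563.1])

# A nonzero determinant means the normal equations have a unique solution.
det = np.linalg.det(A)

# Solve A @ beta = b for (beta0_hat, beta1_hat, beta2_hat).
beta_hat = np.linalg.solve(A, b)

print("determinant is nonzero:", abs(det) > 1e-9)
print("estimates:", np.round(beta_hat, 4))
```

Solving the system directly is adequate for a small, well-conditioned example like this one; for larger problems, software usually works from the data rather than from the sums in the normal equations.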


It should be noted that the problem of fitting a polynomial of degree $k$ involving
only one variable, as specified by Eq. (11.1.6), can be regarded as a special case of
the problem of fitting a linear function involving several variables, as specified by
Eq. (11.1.11). To make Eq. (11.1.11) applicable to the problem of fitting a polynomial
having the form given in Eq. (11.1.6), we define the $k$ variables $x_1, \ldots, x_k$ simply as
$x_1 = x$, $x_2 = x^2$, $\ldots$, $x_k = x^k$.

A polynomial involving more than one variable can also be represented in the 
form of Eq. (11.1.11). For example, suppose that the values of four variables r, s, t, 
and y are observed for several different patients, and we wish to fit to these observed 
values a function having the following form: 


$$y = \beta_0 + \beta_1 r + \beta_2 r^2 + \beta_3 rs + \beta_4 s^2 + \beta_5 t^3 + \beta_6 rst. \qquad (11.1.16)$$


We can regard the function in Eq. (11.1.16) as a linear function having the form given
in Eq. (11.1.11) with $k = 6$ if we define the six variables $x_1, \ldots, x_6$ as follows: $x_1 = r$,
$x_2 = r^2$, $x_3 = rs$, $x_4 = s^2$, $x_5 = t^3$, and $x_6 = rst$.
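To illustrate the remark that polynomial fitting is a special case of fitting a linear function of several variables, the sketch below builds the derived variables $x_1 = x, \ldots, x_k = x^k$ as columns of a matrix and solves the resulting linear least-squares problem. It assumes NumPy; the data are the first and last columns of Table 11.2, and the degree `k = 2` is an illustrative choice of ours.

```python
import numpy as np

# Reactions to drug A (x) and drug B (y) for the 10 patients in Table 11.2.
x = np.array([1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 4.6, 1.6, 5.5, 3.4])
y = np.array([0.7, -1.0, -0.2, -1.2, -0.1, 3.4, 0.0, 0.8, 3.7, 2.0])

k = 2  # degree of the polynomial (illustrative choice)

# Columns 1, x, x^2, ..., x^k: the constant term plus the k derived variables.
X = np.column_stack([x**j for j in range(k + 1)])

# Least-squares solution of the linear problem in the derived variables.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# np.polyfit fits the same polynomial but returns the highest power first.
beta_polyfit = np.polyfit(x, y, k)[::-1]

print(np.round(beta_hat, 4))
```

Both routes give the same coefficients, and the residuals are orthogonal to every column of `X`, which is exactly what the normal equations (11.1.13) require.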


Summary 
The method of least squares allows the calculation of a predictor for one variable ($y$)
based on one or more other variables ($x_1, \ldots, x_k$) of the form $\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$.
The coefficients $\beta_0, \ldots, \beta_k$ are chosen so that the sum of squared differences between
observed values of $y$ and observed values of $\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$ is as small as
possible. Algebraic formulas for the coefficients are given for the case $k = 1$, but
most statistical computer software will calculate the coefficients more easily.


Exercises 


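1. Prove that $\sum_{i=1}^{n} (c_1 x_i + c_2)^2 = c_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(c_1 \bar{x} + c_2)^2$.

2. Show that the value of $\hat{\beta}_1$ in Eq. (11.1.1) can be rewritten in each of the
following three forms:

a. $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

b. $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x}) y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

c. $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$


3. Show that the least-squares line $y = \hat{\beta}_0 + \hat{\beta}_1 x$ passes
through the point $(\bar{x}, \bar{y})$.


4. For $i = 1, \ldots, n$, let $\hat{y}_i = \beta_0 + \beta_1 x_i$. Show that $\hat{\beta}_0$ and
$\hat{\beta}_1$, as given by Eq. (11.1.1), are the unique values of $\beta_0$
and $\beta_1$ such that

$$\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0 \quad \text{and} \quad \sum_{i=1}^{n} x_i (y_i - \hat{y}_i) = 0.$$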


5. Fit a straight line to the observed values given in 
Table 11.1 so that the sum of the squares of the horizontal 
deviations of the points from the line is a minimum. Sketch 
on the same graph both this line and the least-squares line 
given in Fig. 11.4. 


6. Suppose that both the least-squares line and the least- 
squares parabola were fitted to the same set of points. 
Explain why the sum of the squares of the deviations of 
the points from the parabola cannot be larger than the 
sum of the squares of the deviations of the points from 
the straight line. 


7. Suppose that eight specimens of a certain type of alloy
were produced at different temperatures, and the durability
of each specimen was then observed. The observed
values are given in Table 11.3, where $x_i$ denotes the temperature
(in coded units) at which specimen $i$ was produced
and $y_i$ denotes the durability (in coded units) of
that specimen.


Table 11.3 Data for Exercise 7

i    x_i    y_i
1    0.5    40
2    1.0    41
3    1.5    43
4    2.0    42
5    2.5    44
6    3.0    42
7    3.5    43
8    4.0    42


a. Fit a straight line of the form $y = \beta_0 + \beta_1 x$ to these
values by the method of least squares.


b. Fit a parabola of the form $y = \beta_0 + \beta_1 x + \beta_2 x^2$ to
these values by the method of least squares.


c. Sketch on the same graph the eight data points, the 
line found in part (a), and the parabola found in 
part (b). 


8. Let $(x_i, y_i)$, for $i = 1, \ldots, k + 1$, denote $k + 1$ given
points in the $xy$-plane such that no two of these points have
the same $x$-coordinate. Show that there is a unique polynomial
having the form $y = \beta_0 + \beta_1 x + \cdots + \beta_k x^k$ that passes
through these $k + 1$ points.


9. The resilience $y$ of a certain type of plastic is to be
represented as a linear function of both the temperature
$x_1$ at which the plastic is baked and the number of minutes
$x_2$ for which it is baked. Suppose that 10 pieces of
plastic are prepared by using different values of $x_1$ and
$x_2$, and the observed values in appropriate units are as
given in Table 11.4. Fit a function having the form
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ to these observed values by the method
of least squares.


10. Consider again the observed values presented in Table
11.4. Fit a function having the form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_2^2$
to these values by the method of least squares.


11. Consider again the observed values presented in Table
11.4, and consider also the two functions that were fitted
to these values in Exercises 9 and 10. Which of these two
functions fits the observed values better?


Table 11.4 Data for Exercise 9

 i    x_{i1}   x_{i2}   y_i
 1     100       1      113
 2     100       2      118
 3     110       1      127
 4     110       2      132
 5     120       1      136
 6     120       2      144
 7     120       3      138
 8     130       1      146
 9     130       2      156
10     130       3      149
11.2 Regression 
In Sec. 11.1, we introduced the method of least squares. This method computes
coefficients for a linear function to predict one variable $y$ based on other variables
$x_1, \ldots, x_k$. In this section, we assume that the $y$ values are observed values of a
collection of random variables. In this case, there is a statistical model in which the
method of least squares turns out to produce the maximum likelihood estimates
of the parameters of the model.
Regression Functions 

Example 11.2.1

Pressure and the Boiling Point of Water. Forbes (1857) reports the results from experiments
that were trying to obtain a method for estimating altitude. A formula
is available for altitude in terms of barometric pressure, but it was difficult to carry
a barometer to high altitudes in Forbes' day. However, it might be easy for travelers
to carry a thermometer and measure the boiling point of water. Table 11.5
contains the measured barometric pressures and boiling points of water from 17 experiments.
We can use the method of least squares to fit a linear relationship between
boiling point and pressure. Let $y_i$ be the pressure for one of Forbes' observations,
and let $x_i$ be the corresponding boiling point for $i = 1, \ldots, 17$. Using the data in
Table 11.5, we can compute the least-squares line. The intercept and slope are, respectively,
$\hat{\beta}_0 = -81.06$ and $\hat{\beta}_1 = 0.5229$. Of course, we do not expect that the line
$y = -81.06 + 0.5229x$ precisely gives the relationship between boiling point $x$ and
pressure $y$. If we learn the boiling point $x$ of water and want to compute the conditional
distribution of the unknown pressure $Y$, is there a statistical model that allows
us to say what the (conditional) distribution of pressure is given that the boiling point
is $x$? ◄
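The intercept and slope quoted in Example 11.2.1 can be reproduced from Table 11.5 with the formulas of Sec. 11.1. The sketch below assumes NumPy; the array names `boiling` and `pressure` are ours.

```python
import numpy as np

# Table 11.5: boiling point (degrees Fahrenheit) and pressure (inches of mercury).
boiling = np.array([194.5, 194.3, 197.9, 198.4, 199.4, 199.9, 200.9, 201.1, 201.4,
                    201.3, 203.6, 204.6, 209.5, 208.6, 210.7, 211.9, 212.2])
pressure = np.array([20.79, 20.79, 22.40, 22.67, 23.15, 23.35, 23.89, 23.99, 24.02,
                     24.01, 25.14, 26.57, 28.49, 27.76, 29.04, 29.88, 30.06])

x_bar, y_bar = boiling.mean(), pressure.mean()

# Slope and intercept of the least-squares line (x = boiling point, y = pressure).
beta1_hat = np.sum((boiling - x_bar) * (pressure - y_bar)) / np.sum((boiling - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(round(beta0_hat, 2), round(beta1_hat, 4))
```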


In this section, we shall describe a statistical model for problems such as the one
in Example 11.2.1. Fitting this statistical model will make use of the method of least
squares. We shall study problems in which we are interested in learning about the
conditional distribution of some random variable $Y$ for given values of some other
variables $X_1, \ldots, X_k$. The variables $X_1, \ldots, X_k$ may be random variables whose
values are to be observed in an experiment along with the values of $Y$, or they may be
control variables whose values are to be chosen by the experimenter. In general, some
of these variables might be random variables, and some might be control variables. In
any case, we can study the conditional distribution of $Y$ given $X_1, \ldots, X_k$. We begin
with some terminology.


Table 11.5 Boiling point of water in degrees Fahrenheit
and atmospheric pressure in inches of mercury
from Forbes' experiments. These data are
taken from Weisberg (1985, p. 3).

Boiling Point    Pressure
194.5            20.79
194.3            20.79
197.9            22.40
198.4            22.67
199.4            23.15
199.9            23.35
200.9            23.89
201.1            23.99
201.4            24.02
201.3            24.01
203.6            25.14
204.6            26.57
209.5            28.49
208.6            27.76
210.7            29.04
211.9            29.88
212.2            30.06


Response/Predictor/Regression. The variables $X_1, \ldots, X_k$ are called predictors, and
the random variable $Y$ is called the response. The conditional expectation of $Y$
for given values $x_1, \ldots, x_k$ of $X_1, \ldots, X_k$ is called the regression function of $Y$ on
$X_1, \ldots, X_k$, or simply the regression of $Y$ on $X_1, \ldots, X_k$.

The regression of $Y$ on $X_1, \ldots, X_k$ is a function of the values $x_1, \ldots, x_k$ of $X_1, \ldots, X_k$.
In symbols, this function is $E(Y \mid x_1, \ldots, x_k)$.

In this chapter, we shall assume that the regression function $E(Y \mid x_1, \ldots, x_k)$ is
a linear function having the following form:

$$E(Y \mid x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k. \qquad (11.2.1)$$


The coefficients $\beta_0, \ldots, \beta_k$ in Eq. (11.2.1) are called regression coefficients. We shall
suppose that these regression coefficients are unknown. Therefore, they are to be
regarded as parameters whose values are to be estimated. We shall suppose also
that $n$ vectors of observations are obtained. For $i = 1, \ldots, n$, we shall assume that
the $i$th vector $(x_{i1}, \ldots, x_{ik}, y_i)$ consists of a set of controlled or observed values of
$X_1, \ldots, X_k$ and the corresponding observed value of $Y$.
One set of estimators of the regression coefficients $\beta_0, \ldots, \beta_k$ that can be calculated
from these observations is the set of values $\hat{\beta}_0, \ldots, \hat{\beta}_k$ that are obtained by
the method of least squares, as described in Sec. 11.1. These estimators are called the
least-squares estimators of $\beta_0, \ldots, \beta_k$. We shall now specify some further assumptions
about the conditional distribution of $Y$ given $X_1, \ldots, X_k$ in order to be able to
determine in greater detail the properties of these least-squares estimators.


Simple Linear Regression


We shall consider first a problem in which we wish to study the regression of $Y$ on just
a single variable $X$. We shall assume that for each value $X = x$, the random variable $Y$
can be represented in the form $Y = \beta_0 + \beta_1 x + \varepsilon$, where $\varepsilon$ is a random variable that has
the normal distribution with mean 0 and variance $\sigma^2$. It follows from this assumption
that the conditional distribution of $Y$ given $X = x$ is the normal distribution with
mean $\beta_0 + \beta_1 x$ and variance $\sigma^2$.

A problem of this type is called a problem of simple linear regression. Here the
term simple refers to the fact that we are considering the regression of $Y$ on just a
single variable $X$, rather than on more than one variable; the term linear refers to
the fact that the regression function $E(Y \mid x) = \beta_0 + \beta_1 x$ is a linear function of the
parameters $\beta_0$ and $\beta_1$. For example, a problem in which $E(Y \mid x)$ is a polynomial, like
the right side of Eq. (11.1.6), would also be a linear regression problem, but not
simple.

Throughout this section (and the next two sections), we shall consider the problem
in which we shall observe $n$ pairs $(x_1, Y_1), \ldots, (x_n, Y_n)$. We shall make the following
five assumptions. Each of these assumptions has a natural generalization to
the case in which there is more than one predictor, but we shall postpone discussion
of that case until Sec. 11.5.


Assumption 11.2.1
Predictor is known. Either the values $x_1, \ldots, x_n$ are known ahead of time or they are
the observed values of random variables $X_1, \ldots, X_n$ on whose values we condition
before computing the joint distribution of $(Y_1, \ldots, Y_n)$.


Assumption 11.2.2
Normality. For $i = 1, \ldots, n$, the conditional distribution of $Y_i$ given the values
$x_1, \ldots, x_n$ is a normal distribution.


Assumption 11.2.3
Linear Mean. There are parameters $\beta_0$ and $\beta_1$ such that the conditional mean of $Y_i$
given the values $x_1, \ldots, x_n$ has the form $\beta_0 + \beta_1 x_i$ for $i = 1, \ldots, n$.


Assumption 11.2.4
Common Variance. There is a parameter $\sigma^2$ such that the conditional variance of $Y_i$
given the values $x_1, \ldots, x_n$ is $\sigma^2$ for $i = 1, \ldots, n$. This assumption is often called
homoscedasticity. Random variables with different variances are called heteroscedastic.


Assumption 11.2.5
Independence. The random variables $Y_1, \ldots, Y_n$ are independent given the observed
$x_1, \ldots, x_n$.


A brief word is in order about Assumption 11.2.1. In Example 11.1.1, we saw that
the reaction $x_i$ of patient $i$ to standard drug A is observed as part of the experiment
along with the reaction $y_i$ to drug B. Hence, the predictors are not known in advance.
In this case, all probability statements that we make in this example are conditional on
$(x_1, \ldots, x_n)$. In other examples, one might be trying to predict an economic variable
using the year in which it was measured. In such cases, such as Example 11.5.1, which
we will see later, the values of at least some of the predictors are truly known in
advance.


Assumptions 11.2.1-11.2.5 specify the conditional joint distribution of $Y_1, \ldots, Y_n$
given the vector $x = (x_1, \ldots, x_n)$ and the parameters $\beta_0$, $\beta_1$, and $\sigma^2$. In particular, the
conditional joint p.d.f. of $Y_1, \ldots, Y_n$ is

$$f_n(y \mid x, \beta_0, \beta_1, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \right]. \qquad (11.2.2)$$

We can now find the M.L.E.'s of $\beta_0$, $\beta_1$, and $\sigma^2$.


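Theorem 11.2.1
Simple Linear Regression M.L.E.'s. Assume Assumptions 11.2.1-11.2.5. The M.L.E.'s
of $\beta_0$ and $\beta_1$ are the least-squares estimates, and the M.L.E. of $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2. \qquad (11.2.3)$$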


Proof For each observed vector $y = (y_1, \ldots, y_n)$, the p.d.f. (11.2.2) will be the likelihood
function of the parameters $\beta_0$, $\beta_1$, and $\sigma^2$. In Eq. (11.2.2), $\beta_0$ and $\beta_1$ appear
only in the sum of squares

$$Q = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2,$$

which in turn appears in the exponent multiplied by $-1/[2\sigma^2]$. Regardless of the
value of $\sigma^2$, the exponent is maximized over $\beta_0$ and $\beta_1$ by minimizing $Q$. It follows
that the M.L.E.'s can be found in sequence by first minimizing $Q$ over $\beta_0$ and $\beta_1$, then
inserting the values $\hat{\beta}_0$ and $\hat{\beta}_1$ that provide the minimum of $Q$, and finally maximizing
the result over $\sigma^2$. The reader will note that $Q$ is the same as the sum of squares in
Eq. (11.1.2), which is minimized by the method of least squares. Thus, the M.L.E.'s
of the regression coefficients $\beta_0$ and $\beta_1$ are precisely the same as the least-squares
estimates. The exact form of these estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ was given in Eq. (11.1.1).

To find the M.L.E. of $\sigma^2$, perform the second and third steps described in the
preceding paragraph, namely, first replace $\beta_0$ and $\beta_1$ in Eq. (11.2.2) by their M.L.E.'s
$\hat{\beta}_0$ and $\hat{\beta}_1$ and then maximize the resulting expression with respect to $\sigma^2$. The details
are left to Exercise 1 at the end of this section, and the result is (11.2.3). ∎
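The two-stage argument in the proof can be followed numerically on the data of Table 11.5: first minimize $Q$ to obtain the least-squares estimates, then divide the minimized $Q$ by $n$ as in Eq. (11.2.3). The sketch below assumes NumPy; the helper `log_likelihood` is our own name for the logarithm of Eq. (11.2.2) with $\beta_0$ and $\beta_1$ replaced by their M.L.E.'s.

```python
import numpy as np

# Table 11.5: boiling point (x) and pressure (y) from Forbes' experiments.
boiling = np.array([194.5, 194.3, 197.9, 198.4, 199.4, 199.9, 200.9, 201.1, 201.4,
                    201.3, 203.6, 204.6, 209.5, 208.6, 210.7, 211.9, 212.2])
pressure = np.array([20.79, 20.79, 22.40, 22.67, 23.15, 23.35, 23.89, 23.99, 24.02,
                     24.01, 25.14, 26.57, 28.49, 27.76, 29.04, 29.88, 30.06])
n = len(boiling)

# Step 1: least-squares estimates, which minimize Q over beta0 and beta1.
x_bar, y_bar = boiling.mean(), pressure.mean()
beta1_hat = np.sum((boiling - x_bar) * (pressure - y_bar)) / np.sum((boiling - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Step 2: the minimized value of Q (the residual sum of squares).
q_min = np.sum((pressure - beta0_hat - beta1_hat * boiling) ** 2)

# Step 3: the M.L.E. of sigma^2 from Eq. (11.2.3); note the divisor n.
sigma2_hat = q_min / n


def log_likelihood(sigma2):
    """Log of Eq. (11.2.2) evaluated at beta0_hat, beta1_hat and the given sigma^2."""
    return -0.5 * n * np.log(2 * np.pi * sigma2) - q_min / (2 * sigma2)


print(round(sigma2_hat, 4))
```

Evaluating `log_likelihood` slightly above and below `sigma2_hat` gives smaller values, which is consistent with (though not a proof of) the maximization asked for in Exercise 1.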


The Distribution of the Least-Squares Estimators 


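We shall now present the joint distribution of the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ when they
are regarded as functions of the random variables $Y_1, \ldots, Y_n$ for given values of
$x_1, \ldots, x_n$. Specifically, the estimators are

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x},$$

where $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$.

It is convenient, both for this section and the next, to introduce the symbol

$$s_x = \left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{1/2}. \qquad (11.2.4)$$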
Theorem 11.2.2
Distributions of Least-Squares Estimators. Under Assumptions 11.2.1-11.2.5, the distribution
of $\hat{\beta}_1$ is the normal distribution with mean $\beta_1$ and variance $\sigma^2 / s_x^2$. The
distribution of $\hat{\beta}_0$ is the normal distribution with mean $\beta_0$ and variance

$$\sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{s_x^2} \right). \qquad (11.2.5)$$

Finally, the covariance of $\hat{\beta}_1$ and $\hat{\beta}_0$ is

$$\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) = -\frac{\bar{x}\sigma^2}{s_x^2}. \qquad (11.2.6)$$

(All of the distributional statements in this theorem are conditional on $X_i = x_i$ for
$i = 1, \ldots, n$ if $X_1, \ldots, X_n$ are random variables.)


Proof To determine the distribution of $\hat{\beta}_1$, it is convenient to write $\hat{\beta}_1$ as follows (see
Exercise 2 at the end of Sec. 11.1):

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) Y_i}{s_x^2}. \qquad (11.2.7)$$

It can be seen from Eq. (11.2.7) that $\hat{\beta}_1$ is a linear function of $Y_1, \ldots, Y_n$. Because the
random variables $Y_1, \ldots, Y_n$ are independent and each has a normal distribution, it
follows that $\hat{\beta}_1$ will also have a normal distribution. Furthermore, the mean of this
distribution will be

$$E(\hat{\beta}_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) E(Y_i)}{s_x^2}.$$

Because $E(Y_i) = \beta_0 + \beta_1 x_i$ for $i = 1, \ldots, n$, it can now be found (see Exercise 2 at
the end of this section) that

$$E(\hat{\beta}_1) = \beta_1. \qquad (11.2.8)$$

Furthermore, because the random variables $Y_1, \ldots, Y_n$ are independent and each
has variance $\sigma^2$, it follows from Eq. (11.2.7) that

$$\mathrm{Var}(\hat{\beta}_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \mathrm{Var}(Y_i)}{s_x^4} = \frac{\sigma^2}{s_x^2}. \qquad (11.2.9)$$

Next, consider the distribution of $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$. Because both $\bar{Y}$ and $\hat{\beta}_1$ are linear
functions of $Y_1, \ldots, Y_n$, it follows that $\hat{\beta}_0$ is also a linear function of $Y_1, \ldots, Y_n$. Hence,
$\hat{\beta}_0$ will have a normal distribution. The mean of $\hat{\beta}_0$ can be determined from the
relation $E(\hat{\beta}_0) = E(\bar{Y}) - \bar{x} E(\hat{\beta}_1)$. It can be shown (see Exercise 3) that $E(\hat{\beta}_0) = \beta_0$.
Furthermore, it can be shown (see Exercise 4) that $\mathrm{Var}(\hat{\beta}_0)$ is given by (11.2.5). Finally,
it can be shown (see Exercise 5) that the value of the covariance between $\hat{\beta}_0$ and $\hat{\beta}_1$
is given by (11.2.6). ∎


A simple corollary to Theorem 11.2.2 is that $\hat{\beta}_0$ and $\hat{\beta}_1$ are, respectively, unbiased
estimators of the corresponding parameters $\beta_0$ and $\beta_1$.

To complete the description of the joint distribution of $\hat{\beta}_0$ and $\hat{\beta}_1$, it will be shown
in Sec. 11.3 that this joint distribution is the bivariate normal distribution for which
the means, variances, and covariance are as stated in Theorem 11.2.2.
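The means, variances, and covariance in Theorem 11.2.2 can be checked by simulation. The sketch below assumes NumPy and uses a hypothetical design ($x_i = 1, \ldots, 10$) and hypothetical parameter values ($\beta_0 = 2$, $\beta_1 = 0.5$, $\sigma = 1$) chosen only for illustration; none of these numbers come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design and parameters, chosen only for illustration.
x = np.arange(1.0, 11.0)            # x_1, ..., x_n with n = 10
beta0, beta1, sigma = 2.0, 0.5, 1.0
n = len(x)
x_bar = x.mean()
sx2 = np.sum((x - x_bar) ** 2)      # s_x^2 from Eq. (11.2.4)

# Simulate many independent data sets Y_1, ..., Y_n from the model.
reps = 50_000
Y = beta0 + beta1 * x + sigma * rng.standard_normal((reps, n))

# Least-squares estimators for every simulated data set, using Eq. (11.2.7).
b1 = Y @ (x - x_bar) / sx2
b0 = Y.mean(axis=1) - b1 * x_bar

# Theoretical values from Theorem 11.2.2.
var_b1 = sigma**2 / sx2
var_b0 = sigma**2 * (1 / n + x_bar**2 / sx2)
cov_b0_b1 = -x_bar * sigma**2 / sx2

# Simulated counterpart of the covariance in Eq. (11.2.6).
cov_sim = np.cov(b0, b1)[0, 1]

print("means:      ", b0.mean(), b1.mean())
print("variances:  ", b0.var(), var_b0, b1.var(), var_b1)
print("covariance: ", cov_sim, cov_b0_b1)
```

The simulated moments should agree with the theoretical ones up to Monte Carlo error, which shrinks as `reps` grows.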


Example 11.2.2

Pressure and the Boiling Point of Water. In Example 11.2.1, we found the least-squares
line for predicting pressure from boiling point of water. Suppose that we use the linear
regression model just described as a model for the data in this experiment. That is,
let $Y_i$ be the pressure for one of Forbes' observations, and let $x_i$ be the corresponding
boiling point for $i = 1, \ldots, 17$. We model the $Y_i$ as being independent with means
$\beta_0 + \beta_1 x_i$ and variance $\sigma^2$. The average temperature is $\bar{x} = 202.95$ and $s_x^2 = 530.78$
with $n = 17$. From these values, we can now compute the variances and covariances of
the least-squares estimators using the formulas derived in this section. For example,

$$
\begin{aligned}
\mathrm{Var}(\hat{\beta}_1) &= \frac{\sigma^2}{530.78} = 0.00188\sigma^2, \\
\mathrm{Var}(\hat{\beta}_0) &= \sigma^2 \left( \frac{1}{17} + \frac{202.95^2}{530.78} \right) = 77.66\sigma^2, \\
\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) &= -\frac{202.95\sigma^2}{530.78} = -0.382\sigma^2.
\end{aligned}
$$

It is easy to see that we expect to get a much more precise estimate of $\beta_1$ than of $\beta_0$. ◄


The statement at the end of Example 11.2.2 about getting more precise estimates
of $\beta_1$ than of $\beta_0$ is a bit deceptive. We must multiply $\beta_1$ by a number on the order of
200 before it is on the same scale as $\beta_0$. Hence, it might make more sense to compare
the variance of $200\hat{\beta}_1$ to the variance of $\hat{\beta}_0$. In general, we can find the variance of
any linear combination of the least-squares estimators.


Example 11.2.3

The Variance of a Linear Combination. Very often, we need to compute the variance
of a linear combination of the least-squares estimators. One example is prediction,
as we shall see later in this section. Suppose that we wish to compute the variance of
$T = c_0 \hat{\beta}_0 + c_1 \hat{\beta}_1 + c_*$. The variance of $T$ can be found by substituting the values of
$\mathrm{Var}(\hat{\beta}_0)$, $\mathrm{Var}(\hat{\beta}_1)$, and $\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1)$ given in Eqs. (11.2.5), (11.2.9), and (11.2.6) in the
following relation:

$$\mathrm{Var}(T) = c_0^2 \, \mathrm{Var}(\hat{\beta}_0) + c_1^2 \, \mathrm{Var}(\hat{\beta}_1) + 2 c_0 c_1 \, \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1).$$

When these substitutions have been made, the result can be written in the following
form:

$$\mathrm{Var}(T) = \sigma^2 \left( \frac{c_0^2}{n} + \frac{(c_0 \bar{x} - c_1)^2}{s_x^2} \right). \qquad (11.2.10)$$

For the specific case of Example 11.2.2, we have $c_0 = 0$ and $c_1 = 200$, so the variance
of $200\hat{\beta}_1$ is $200^2 \sigma^2 / s_x^2 = 75.36\sigma^2$. This is pretty close to the variance of $\hat{\beta}_0$, namely,
$77.66\sigma^2$. ◄


Prediction


Example 11.2.4

Predicting Pressure from the Boiling Point of Water. In Example 11.2.1, Forbes was
trying to find a way to use the boiling point of water to estimate the barometric
pressure. Suppose that a traveler measures the boiling point of water to be 201.5
degrees. What estimate of barometric pressure should they give and how much
uncertainty is there about this estimate? ◄
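Before we turn to the prediction problem, the variance computations of Examples 11.2.2 and 11.2.3 can be reproduced in a few lines. The sketch below is plain Python; the function name `var_linear_combination` is ours, and it returns the multiplier of $\sigma^2$ in Eq. (11.2.10) rather than the variance itself, because $\sigma^2$ is unknown.

```python
def var_linear_combination(c0, c1, n, x_bar, sx2):
    """Return Var(T) / sigma^2 for T = c0*beta0_hat + c1*beta1_hat + c_star, Eq. (11.2.10)."""
    return c0**2 / n + (c0 * x_bar - c1) ** 2 / sx2


# Summary values quoted in Example 11.2.2.
n, x_bar, sx2 = 17, 202.95, 530.78

var_b0 = var_linear_combination(1, 0, n, x_bar, sx2)        # Var(beta0_hat) / sigma^2
var_b1 = var_linear_combination(0, 1, n, x_bar, sx2)        # Var(beta1_hat) / sigma^2
var_200_b1 = var_linear_combination(0, 200, n, x_bar, sx2)  # Var(200 * beta1_hat) / sigma^2
cov_b0_b1 = -x_bar / sx2                                    # Cov / sigma^2, Eq. (11.2.6)


def var_direct(c0, c1):
    """Same quantity from the expansion with variances and the covariance."""
    return c0**2 * var_b0 + c1**2 * var_b1 + 2 * c0 * c1 * cov_b0_b1


print(round(var_b0, 2), round(var_b1, 5), round(cov_b0_b1, 3), round(var_200_b1, 2))
```

The additive constant $c_*$ does not appear in the function because it does not affect the variance.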
Suppose that $n$ pairs of observations $(x_1, Y_1), \ldots, (x_n, Y_n)$ are to be obtained in a
problem of simple linear regression, and on the basis of these $n$ pairs, it is necessary
to predict the value of an independent observation $Y$ that will be obtained when a
certain specified value $x$ is assigned to the control variable. Since the observation $Y$
will have the normal distribution with mean $\beta_0 + \beta_1 x$ and variance $\sigma^2$, it is natural to
use the value $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x$ as the predicted value of $Y$. We shall now determine the
M.S.E. $E[(\hat{Y} - Y)^2]$ of this prediction, where both $\hat{Y}$ and $Y$ are random variables.


Theorem 11.2.3
M.S.E. of Prediction. In the prediction problem just described,

$$E[(\hat{Y} - Y)^2] = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{s_x^2} \right]. \qquad (11.2.11)$$


Proof In this problem, $E(\hat{Y}) = E(Y) = \beta_0 + \beta_1 x$. Thus, if we let $\mu = \beta_0 + \beta_1 x$, then

$$
\begin{aligned}
E[(\hat{Y} - Y)^2] &= E\{[(\hat{Y} - \mu) - (Y - \mu)]^2\} \\
&= \mathrm{Var}(\hat{Y}) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(\hat{Y}, Y).
\end{aligned}
\qquad (11.2.12)
$$

However, the random variables $\hat{Y}$ and $Y$ are independent, because $\hat{Y}$ is a function
of the first $n$ pairs of observations and $Y$ is an independent observation. Therefore,
$\mathrm{Cov}(\hat{Y}, Y) = 0$, and it follows that

$$E[(\hat{Y} - Y)^2] = \mathrm{Var}(\hat{Y}) + \mathrm{Var}(Y). \qquad (11.2.13)$$

Finally, because $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x$, the value of $\mathrm{Var}(\hat{Y})$ is given by Eq. (11.2.10)
with $c_0 = 1$ and $c_1 = x$. Also $\mathrm{Var}(Y) = \sigma^2$. Substituting these into Eq. (11.2.13) gives
(11.2.11). ∎


Example 11.2.5

Predicting Pressure from the Boiling Point of Water. In Example 11.2.4, we wanted to
predict barometric pressure when the boiling point of water is 201.5 degrees. The
least-squares line is $y = -81.06 + 0.5229x$, and $\hat{\sigma}^2 = 0.0478$. Fig. 11.7 shows the data
plotted together with the least-squares regression line and the location of the point
on the line that has $x = 201.5$. The M.S.E. of the prediction of pressure $Y$ is obtained
from Eq. (11.2.11):

$$E[(\hat{Y} - Y)^2] = \sigma^2 \left[ 1 + \frac{1}{17} + \frac{(201.5 - 202.95)^2}{530.78} \right] = 1.0628\sigma^2,$$

and the observed value of the prediction is $\hat{Y} = -81.06 + 0.5229 \times 201.5 = 24.30$. The
calculation of $\hat{Y}$ is illustrated in Fig. 11.7. The M.S.E. $1.0628\sigma^2$ can be interpreted as
follows: If we knew the values of $\beta_0$ and $\beta_1$ and tried to predict $Y$, the M.S.E. would
be $\mathrm{Var}(Y) = \sigma^2$. Having to estimate $\beta_0$ and $\beta_1$ only costs us an additional $0.0628\sigma^2$ in
M.S.E. ◄
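The arithmetic of Example 11.2.5 is easy to reproduce. The sketch below is plain Python and uses the rounded summary values quoted in the text; the function name `prediction_mse_factor` and the extrapolation point `x_far = 220.0` are ours. As with Eq. (11.2.11), the function returns the multiplier of the unknown $\sigma^2$.

```python
def prediction_mse_factor(x, n, x_bar, sx2):
    """Return E[(Y_hat - Y)^2] / sigma^2 from Eq. (11.2.11)."""
    return 1 + 1 / n + (x - x_bar) ** 2 / sx2


# Rounded values quoted in Examples 11.2.2 and 11.2.5.
beta0_hat, beta1_hat = -81.06, 0.5229
n, x_bar, sx2 = 17, 202.95, 530.78

x_new = 201.5
y_pred = beta0_hat + beta1_hat * x_new                      # predicted pressure
factor = prediction_mse_factor(x_new, n, x_bar, sx2)        # M.S.E. / sigma^2

# The factor is smallest at x = x_bar and grows as x moves away from it.
factor_at_mean = prediction_mse_factor(x_bar, n, x_bar, sx2)
x_far = 220.0                                               # hypothetical point beyond the data
factor_far = prediction_mse_factor(x_far, n, x_bar, sx2)

print(round(y_pred, 2), round(factor, 4), round(factor_at_mean, 4), round(factor_far, 4))
```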


Note: M.S.E. of Prediction Increases as $x$ Moves Away from Observed Data. The
M.S.E. in Eq. (11.2.11) increases as $x$ moves away from $\bar{x}$, and it is smallest when
$x = \bar{x}$. This indicates that it is harder to predict $Y$ when $x$ is not near the center of
the observed values $x_1, \ldots, x_n$. Indeed, if $x$ is larger than the largest observed $x_i$ or
smaller than the smallest one, it is quite difficult to predict $Y$ with much precision.
Such predictions outside the range of the observed data are called extrapolations.


Figure 11.7 Plot of pressure versus boiling point with regression line for Example 11.2.5.
Dotted line illustrates prediction of pressure when boiling point is 201.5.
[Figure not reproduced. Horizontal axis: boiling point; vertical axis: pressure.]
Design of the Experiment


Consider a problem of simple linear regression in which the variable $X$ is a control
variable whose values $x_1, \ldots, x_n$ can be chosen by the experimenter. We shall discuss
methods for choosing these values so as to obtain good estimators of the regression
coefficients $\beta_0$ and $\beta_1$.

Suppose first that the values $x_1, \ldots, x_n$ are to be chosen so as to minimize the
M.S.E. of the least-squares estimator $\hat{\beta}_0$. Since $\hat{\beta}_0$ is an unbiased estimator of $\beta_0$, the
M.S.E. of $\hat{\beta}_0$ is equal to $\mathrm{Var}(\hat{\beta}_0)$, as given in Eq. (11.2.5). It follows from Eq. (11.2.5)
that $\mathrm{Var}(\hat{\beta}_0) \ge \sigma^2/n$ for all values $x_1, \ldots, x_n$, and there will be equality in this relation
if and only if $\bar{x} = 0$. Hence, $\mathrm{Var}(\hat{\beta}_0)$ will attain its minimum value $\sigma^2/n$ whenever
$\bar{x} = 0$. Of course, this will be impossible in any application in which $X$ is constrained
to be positive.

Suppose next that the values $x_1, \ldots, x_n$ are to be chosen so as to minimize the
M.S.E. of the estimator $\hat{\beta}_1$. Again, the M.S.E. of $\hat{\beta}_1$ will be equal to $\mathrm{Var}(\hat{\beta}_1)$, as given
in Eq. (11.2.9). It can be seen from Eq. (11.2.9) that $\mathrm{Var}(\hat{\beta}_1)$ will be minimized by
choosing the values $x_1, \ldots, x_n$ so that the value of $s_x^2$ is maximized. If the values
$x_1, \ldots, x_n$ must be chosen from some bounded interval $(a, b)$ of the real line, and if
$n$ is an even integer, then the value of $s_x^2$ will be maximized by choosing $x_i = a$ for
exactly $n/2$ values and choosing $x_i = b$ for the other $n/2$ values. If $n$ is an odd integer,
all the values should again be chosen at the endpoints $a$ and $b$, but one endpoint must
now receive one more observation than the other endpoint.

It follows from this discussion that if the experiment is to be designed so as to
minimize both the M.S.E. of $\hat{\beta}_0$ and the M.S.E. of $\hat{\beta}_1$, then the values $x_1, \ldots, x_n$ should
be chosen so that exactly, or approximately, $n/2$ values are equal to some number $c$
that is as large as is feasible in the given experiment, and the remaining values are
equal to $-c$. In this way, the value of $\bar{x}$ will be exactly, or approximately, equal to 0,
and the value of $s_x^2$ will be as large as possible.

Finally, suppose that the linear combination $\theta = c_0 \beta_0 + c_1 \beta_1 + c_*$ is to be estimated,
where $c_0 \ne 0$, and that the experiment is to be designed so as to minimize
the M.S.E. of $\hat{\theta}$, that is, to minimize $\mathrm{Var}(\hat{\theta})$. For example, if $Y$ is a future observation
with corresponding predictor $x$, then we could set $c_0 = 1$, $c_1 = x$, and $c_* = 0$ in order
to make $\theta = E(Y \mid x)$. In Example 11.2.3, we computed $\mathrm{Var}(T)$, where $T = \hat{\theta}$, as the
sum of two nonnegative terms in Eq. (11.2.10). The second term is the only one that
depends on the values of $x_1, \ldots, x_n$, and it equals 0 (its smallest possible value) if
and only if $\bar{x} = c_1/c_0$. In this case, $\mathrm{Var}(\hat{\theta})$ will attain its minimum value $c_0^2 \sigma^2/n$.

In practice, an experienced statistician would not usually choose all the values
$x_1, \ldots, x_n$ at a single point or at just the two endpoints of the interval $(a, b)$, as the
optimal designs that we have just derived would dictate. The reason is that when all
$n$ observations are taken at just one or two values of $X$, the experiment provides
no possibility of checking the assumption that the regression of $Y$ on $X$ is a linear
function. In order to check this assumption without unduly increasing the M.S.E.
of the least-squares estimators, many of the values $x_1, \ldots, x_n$ should be chosen at
the endpoints $a$ and $b$, but at least some of the values should be chosen at a few
interior points of the interval. Linearity can then be checked by visual inspection of
the plotted points and the fitting of a polynomial of degree two or higher.
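The effect of the design on $\mathrm{Var}(\hat{\beta}_1) = \sigma^2 / s_x^2$ can be seen by comparing $s_x^2$ for a few candidate designs. The sketch below assumes NumPy; the interval endpoints `a = 0`, `b = 1`, the sample size `n = 10`, and the particular "compromise" design are hypothetical choices of ours, made only to illustrate the discussion above.

```python
import numpy as np


def sx2(x):
    """Return s_x^2 = sum of squared deviations of the design points from their mean."""
    x = np.asarray(x, dtype=float)
    return np.sum((x - x.mean()) ** 2)


a, b, n = 0.0, 1.0, 10   # hypothetical interval and sample size

# Half of the observations at each endpoint: the design that maximizes s_x^2.
endpoints = np.array([a] * (n // 2) + [b] * (n // 2))

# Most observations at the endpoints, two at the midpoint to allow a linearity check.
compromise = np.array([a] * 4 + [(a + b) / 2] * 2 + [b] * 4)

# Equally spaced points across the interval.
evenly_spaced = np.linspace(a, b, n)

for name, design in [("endpoints", endpoints),
                     ("compromise", compromise),
                     ("evenly spaced", evenly_spaced)]:
    # Var(beta1_hat) / sigma^2 is the reciprocal of s_x^2, by Eq. (11.2.9).
    print(f"{name:14s} s_x^2 = {sx2(design):.4f}   Var(beta1_hat)/sigma^2 = {1 / sx2(design):.4f}")
```

In this illustration the compromise design gives up only part of the precision of the endpoint design while still placing observations in the interior of the interval.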


Summary

We considered the following statistical model. The values $x_1, \ldots, x_n$ are assumed
known. The random variables $Y_1, \ldots, Y_n$ are independent with $Y_i$ having the normal
distribution with mean $\beta_0 + \beta_1 x_i$ and variance $\sigma^2$. Here, $\beta_0$, $\beta_1$, and $\sigma^2$ are unknown
parameters. These are the assumptions of the simple linear regression model. Under
this model, the joint distribution of the least-squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ is a bivariate
normal distribution with $\hat{\beta}_i$ having mean $\beta_i$ for $i = 0, 1$. The variances are given
in Eqs. (11.2.5) and (11.2.9). The covariance is given in Eq. (11.2.6). If we consider
predicting a future $Y$ value with corresponding predictor $x$, we might use the prediction
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x$. In this case, $\hat{Y} - Y$ has the normal distribution with mean 0 and
variance given by Eq. (11.2.11).


Exercises


1. Show that the M.L.E. of $\sigma^2$ is given by Eq. (11.2.3).

2. Show that $E(\hat{\beta}_1) = \beta_1$.

3. Show that $E(\hat{\beta}_0) = \beta_0$.

4. Show that $\mathrm{Var}(\hat{\beta}_0)$ is as given in Eq. (11.2.5).


5. Show that $\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1)$ is as given in Eq. (11.2.6). Hint:
Use the result in Exercise 8 in Sec. 4.6.


6. Show that in a problem of simple linear regression, the
estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ will be independent if $\bar{x} = 0$.


7. Consider a problem of simple linear regression in which
a patient's reaction $Y$ to a new drug B is to be related to
his reaction $X$ to a standard drug A. Suppose that the 10
pairs of observed values given in Table 11.1 are obtained.


a. Determine the values of the M.L.E.'s $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\sigma}^2$.

b. Determine the values of $\mathrm{Var}(\hat{\beta}_0)$ and $\mathrm{Var}(\hat{\beta}_1)$.

c. Determine the value of the correlation of $\hat{\beta}_0$ and $\hat{\beta}_1$.


8. Consider again the conditions of Exercise 7, and suppose
that it is desired to estimate the value of $\theta = 3\beta_0 - 2\beta_1 + 5$.
Determine an unbiased estimator $\hat{\theta}$ of $\theta$ and find its
M.S.E.


9. Consider again the conditions of Exercise 7, and let
$\theta = 3\beta_0 + c_1 \beta_1$, where $c_1$ is a constant. Determine an unbiased
estimator $\hat{\theta}$ of $\theta$. For what value of $c_1$ will the M.S.E. of $\hat{\theta}$
be smallest?


10. Consider again the conditions of Exercise 7. If a particular
patient's reaction to drug A has the value $x = 2$,
what is the predicted value of his reaction to drug B, and
what is the M.S.E. of this prediction?


11. Consider again the conditions of Exercise 7. For what
value $x$ of a patient's reaction to drug A can his reaction
to drug B be predicted with the smallest M.S.E.?


12. Consider a problem of simple linear regression in
which the durability $Y$ of a certain type of alloy is to be
related to the temperature $X$ at which it was produced.
Suppose that the eight pairs of observed values given
in Table 11.3 are obtained. Determine the values of the
M.L.E.'s $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\sigma}^2$, and also the values of $\mathrm{Var}(\hat{\beta}_0)$ and
$\mathrm{Var}(\hat{\beta}_1)$.


13. For the conditions of Exercise 12, determine the value
of the correlation of $\hat{\beta}_0$ and $\hat{\beta}_1$.


14. Consider again the conditions of Exercise 12, and suppose
that it is desired to estimate the value of $\theta = 5 - 4\beta_0 + \beta_1$.
Find an unbiased estimator $\hat{\theta}$ of $\theta$. Determine
the value of $\hat{\theta}$ and the M.S.E. of $\hat{\theta}$.


15. Consider again the conditions of Exercise 12, and let
$\theta = c_1 \beta_1 - \beta_0$, where $c_1$ is a constant. Determine an unbiased
estimator $\hat{\theta}$ of $\theta$. For what value of $c_1$ will the M.S.E.
of $\hat{\theta}$ be smallest?


16. Consider again the conditions of Exercise 12. If a specimen
of the alloy is to be produced at the temperature
$x = 3.25$, what is the predicted value of the durability of
the specimen, and what is the M.S.E. of this prediction?


17. Consider again the conditions of Exercise 12. For what
value of the temperature $x$ can the durability of a specimen
of the alloy be predicted with the smallest M.S.E.?


18. Moore and McCabe (1999, p. 174) report prices paid 
for several species of seafood in 1970 and 1980. These 
values are in Table 11.6. If we were interested in trying 
to predict 1980 seafood prices from 1970 prices, a linear 
regression model might be used. 


a. Find the least-squares regression coefficients for pre- 
dicting 1980 prices from 1970 prices. 


b. If an additional species sold for 21.4 in 1970, what 
would you predict for the 1980 selling price? 
c. What is the M.S.E. for predicting the 1980 price of a
species that sold for 21.4 in 1970?


Table 11.6 Fish prices in 1970 and 1980 for Exercise 18

1970     1980       1970     1980
13.1     27.3       26.7     80.1
15.3     42.4       47.5    150.7
25.8     38.7        6.6     20.3
 1.8      4.5       94.7    189.7
 4.9     23         61.1    131.3
55.4    166.3      135.6    404.2
39.3    109.7       47.6    149


19. In the 1880s, Francis Galton studied the inheritance 
of physical characteristics. Galton found that the sons of 
tall men tended to be taller than average, but shorter than 
their fathers. Similarly, sons of short men tended to be 
shorter than average, but taller than their fathers. Thus, 
the average heights of the sons were closer to the mean 
height of the population, regardless of whether the fathers 
were taller or shorter than average. From these observa- 
tions, one might conclude that the variability of height de- 
creases over successive generations, both tall persons and 
short persons tend to be eliminated, and the population 
“regresses” toward some average height. This conclusion 
is an example of the regression fallacy. In this problem you 
will prove that the regression fallacy arises in the bivari- 
ate normal distribution even when both coordinates have 
the same variance. In particular, assume that the vector
$(X_1, X_2)$ has the bivariate normal distribution with common
mean $\mu$, common variance $\sigma^2$, and positive correlation
$\rho < 1$. Prove that $E(X_2 | x_1)$ is closer to $\mu$ than $x_1$ is to
$\mu$ for every value $x_1$. (This occurs despite the fact that $X_1$
and $X_2$ have the same mean and the same variance.)


11.3 Statistical Inference in Simple Linear Regression 


Many of the inference procedures introduced in Chapters 8 and 9 that were used for 
samples from a normal distribution can be extended to the simple linear regression
model. The theorems that allowed us to conclude that various statistics had t 
distributions will continue to apply in the regression case. 


Joint Distribution of the Estimators 


Example 11.3.1

Pressure and the Boiling Point of Water. Consider the traveler in Example 11.2.4, who is
interested in the barometric pressure when the boiling point of water is 201.5 degrees.
Suppose that this traveler would like to know whether the pressure is 24.5. For
example, the traveler might wish to test the null hypothesis $H_0: \beta_0 + 201.5\beta_1 = 24.5$.
Alternatively, the traveler might desire an interval estimate of $\beta_0 + 201.5\beta_1$. Such
inferences are possible once we find the joint distribution of the estimators of all of
the parameters ($\beta_0$, $\beta_1$, and $\sigma^2$) of the regression model. ◄


It was stated after the proof of Theorem 11.2.2 that, in a problem of simple linear
regression, the joint distribution of the M.L.E.'s $\hat\beta_0$ and $\hat\beta_1$ is the bivariate normal
distribution for which the means, the variances, and the covariance are specified
in Theorem 11.2.2. In this section, we shall prove this fact. We shall also consider
the M.L.E. $\hat\sigma^2$, which was presented in Eq. (11.2.3), and we shall derive the joint
distribution of $\hat\beta_0$, $\hat\beta_1$, and $\hat\sigma^2$. In particular, we shall show that the estimator $\hat\sigma^2$ is
independent of $\hat\beta_0$ and $\hat\beta_1$.

We continue to make Assumptions 11.2.1-11.2.5. The derivation of the joint
distribution of $\hat\beta_0$, $\hat\beta_1$, and $\hat\sigma^2$, which we shall present, is based on the properties of
orthogonal matrices, as described in Sec. 8.3.

We shall continue to use the definition of $s_x$ in Eq. (11.2.4). Also, let
$a_1 = (a_{11}, \ldots, a_{1n})$ and $a_2 = (a_{21}, \ldots, a_{2n})$ be n-dimensional vectors, which are defined
as follows:

\[ a_{1j} = \frac{1}{n^{1/2}} \quad \text{for } j = 1, \ldots, n, \qquad (11.3.1) \]

and

\[ a_{2j} = \frac{1}{s_x}(x_j - \bar{x}) \quad \text{for } j = 1, \ldots, n. \qquad (11.3.2) \]

It is easily verified that $\sum_{j=1}^n a_{1j}^2 = 1$, $\sum_{j=1}^n a_{2j}^2 = 1$, and $\sum_{j=1}^n a_{1j} a_{2j} = 0$.
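The three identities can be checked numerically. The following is a minimal sketch, assuming that $s_x$ in Eq. (11.2.4) is the square root of $\sum_j (x_j - \bar{x})^2$; the design points and the helper names `first_two_rows` and `dot` are hypothetical and serve only as an illustration.

```python
import math

def first_two_rows(x):
    """Return the vectors a1 and a2 of Eqs. (11.3.1) and (11.3.2) for design points x."""
    n = len(x)
    x_bar = sum(x) / n
    # Assumption: s_x is the square root of the sum of squared deviations of the x values.
    s_x = math.sqrt(sum((xj - x_bar) ** 2 for xj in x))
    a1 = [1.0 / math.sqrt(n)] * n
    a2 = [(xj - x_bar) / s_x for xj in x]
    return a1, a2

def dot(u, v):
    """Sum of the products of corresponding coordinates."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical design points, chosen only for illustration.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
a1, a2 = first_two_rows(x)

# Up to rounding error these print 1, 1, and 0.
print(dot(a1, a1), dot(a2, a2), dot(a1, a2))
```

The last identity holds because the deviations $x_j - \bar{x}$ sum to 0, so any set of design points that are not all equal would serve equally well.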

Because the vectors $a_1$ and $a_2$ have these properties, it is possible to construct
an $n \times n$ orthogonal matrix A such that the coordinates of $a_1$ form the first row of A,
and the coordinates of $a_2$ form the second row of A. (To see how this is done, consult a
linear algebra text, such as Cullen, 1972, p. 162, for the Gram-Schmidt method.) We
shall assume that such a matrix A has been constructed:

\[ A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ a_{21} & \cdots & a_{2n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}. \]


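We shall now define a new random vector Z by the relation Z = AY, where

\[ Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} \quad \text{and} \quad Z = \begin{pmatrix} Z_1 \\ \vdots \\ Z_n \end{pmatrix}. \]

The joint distribution of $Z_1, \ldots, Z_n$ can be found from the following theorem, which
is an extension of Theorem 8.3.4.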


Theorem 11.3.1

Suppose that the random variables $Y_1, \ldots, Y_n$ are independent, and each has a
normal distribution with the same variance $\sigma^2$. If A is an orthogonal $n \times n$ matrix
and Z = AY, then the random variables $Z_1, \ldots, Z_n$ also are independent, and each
has a normal distribution with variance $\sigma^2$.


Proof Let $E(Y_i) = \mu_i$ for $i = 1, \ldots, n$ (it is not assumed in the theorem that $Y_1, \ldots, Y_n$
have the same mean), and let

\[ \mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix}. \]

Also, let $X = (1/\sigma)(Y - \mu)$. Since it is assumed that the coordinates of the random
vector Y are independent, the coordinates of the random vector X will also
be independent. Furthermore, each coordinate of X will have the standard normal
distribution. Therefore, it follows from Theorem 8.3.4 that the coordinates of the
n-dimensional random vector AX will also be independent, and each will have the
standard normal distribution.

But

\[ AX = \frac{1}{\sigma} A(Y - \mu) = \frac{1}{\sigma} Z - \frac{1}{\sigma} A\mu. \]

Hence,

\[ Z = \sigma AX + A\mu. \qquad (11.3.3) \]

Since the coordinates of the random vector AX are independent, and each has the
standard normal distribution, the coordinates of the random vector $\sigma AX$ will
also be independent, and each will have the normal distribution with mean 0 and
variance $\sigma^2$. When the vector $A\mu$ is added to the random vector $\sigma AX$, the mean of
each coordinate will be shifted, but the coordinates will remain independent, and the
variance of each coordinate will be unchanged. It now follows from Eq. (11.3.3) that
the coordinates of the random vector Z will be independent, and each will have a
normal distribution with variance $\sigma^2$. ∎

In a problem of simple linear regression, the observations $Y_1, \ldots, Y_n$ satisfy the
conditions of Theorem 11.3.1. Therefore, the coordinates of the random vector Z =
AY will be independent, and each will have a normal distribution with variance $\sigma^2$.
We can use these facts to find the joint distribution of $(\hat\beta_0, \hat\beta_1, \hat\sigma^2)$.


Theorem 11.3.2

In the simple linear regression problem described above, the joint distribution of
$(\hat\beta_0, \hat\beta_1)$ is the bivariate normal distribution for which the means, variances, and
covariance are as stated in Theorem 11.2.2. Also, if $n \ge 3$, $\hat\sigma^2$ is independent of $(\hat\beta_0, \hat\beta_1)$,
and $n\hat\sigma^2/\sigma^2$ has the $\chi^2$ distribution with n - 2 degrees of freedom.


Proof The first two coordinates $Z_1$ and $Z_2$ of the random vector Z can easily be
derived. The first coordinate is

\[ Z_1 = \frac{1}{n^{1/2}} \sum_{j=1}^n Y_j = n^{1/2}\,\bar{Y}. \qquad (11.3.4) \]

Since $\hat\beta_0 = \bar{Y} - \bar{x}\hat\beta_1$, we may also write

\[ Z_1 = n^{1/2}(\hat\beta_0 + \bar{x}\hat\beta_1). \qquad (11.3.5) \]

The second coordinate is

\[ Z_2 = \frac{1}{s_x} \sum_{j=1}^n (x_j - \bar{x}) Y_j. \qquad (11.3.6) \]

By Eq. (11.2.7), we may also write

\[ Z_2 = s_x \hat\beta_1. \qquad (11.3.7) \]

Together, Eqs. (11.3.5) and (11.3.7) imply that

\[ \hat\beta_0 = \frac{Z_1}{n^{1/2}} - \frac{\bar{x}}{s_x} Z_2, \qquad \hat\beta_1 = \frac{Z_2}{s_x}. \qquad (11.3.8) \]

Since $Z_1$ and $Z_2$ are independent normal random variables, they have a bivariate
normal joint distribution. Eqs. (11.3.8) express $\hat\beta_0$ and $\hat\beta_1$ as linear combinations of
$Z_1$ and $Z_2$. These linear combinations satisfy the conditions of Exercise 10 of Sec. 5.10,
which says in turn that $\hat\beta_0$ and $\hat\beta_1$ have a bivariate normal distribution. We already
calculated the means, variances, and covariance in Theorem 11.2.2.

Now let the random variable $S^2$ be defined as follows:

\[ S^2 = \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 x_i)^2. \qquad (11.3.9) \]

(It is easy to see that the M.L.E. of $\sigma^2$, as given in Eq. (11.2.3), is $\hat\sigma^2 = S^2/n$.) We shall
show that $S^2$ and the random vector $(\hat\beta_0, \hat\beta_1)$ are independent. Since $\hat\beta_0 = \bar{Y} - \bar{x}\hat\beta_1$,
we may rewrite $S^2$ as follows:

\[ \begin{aligned}
S^2 &= \sum_{i=1}^n [Y_i - \bar{Y} - \hat\beta_1(x_i - \bar{x})]^2 \\
    &= \sum_{i=1}^n (Y_i - \bar{Y})^2 - 2\hat\beta_1 \sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y}) + \hat\beta_1^2 s_x^2.
\end{aligned} \]

It now follows from Eq. (11.1.1) that

\[ S^2 = \sum_{i=1}^n Y_i^2 - n\bar{Y}^2 - s_x^2 \hat\beta_1^2. \qquad (11.3.10) \]

Since Z = AY, where A is an orthogonal matrix, we know from Theorem 8.3.4
that $\sum_{i=1}^n Y_i^2 = \sum_{i=1}^n Z_i^2$. By using this fact, we can now obtain the following relation
from Eqs. (11.3.4), (11.3.7), and (11.3.10):

\[ S^2 = \sum_{i=1}^n Z_i^2 - Z_1^2 - Z_2^2 = \sum_{i=3}^n Z_i^2. \]

The random variables $Z_1, \ldots, Z_n$ are independent, and we have now shown that
$S^2$ is equal to the sum of the squares of only $Z_3, \ldots, Z_n$. It follows, therefore, that
$S^2$ and the random vector $(Z_1, Z_2)$ are independent. But $\hat\beta_0$ and $\hat\beta_1$ are functions of
$Z_1$ and $Z_2$ only, as seen in Eq. (11.3.8). Hence, $S^2$ and the random vector $(\hat\beta_0, \hat\beta_1)$ are
independent.
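The decomposition used in this argument can be illustrated numerically. The sketch below builds one possible orthogonal matrix A by applying the Gram-Schmidt method to the standard basis vectors (an illustrative choice; the text only assumes that some such A exists), and then compares $Z_1$, $Z_2$, and $\sum_{i \ge 3} Z_i^2$ with $n^{1/2}\bar{y}$, $s_x\hat\beta_1$, and $S^2$. The data and the function names are hypothetical, and $s_x$ is again taken to be the square root of $\sum_j (x_j - \bar{x})^2$.

```python
import math

def dot(u, v):
    """Sum of the products of corresponding coordinates."""
    return sum(ui * vi for ui, vi in zip(u, v))

def orthogonal_matrix(x):
    """Rows 1 and 2 are a1 and a2; the remaining rows come from Gram-Schmidt."""
    n = len(x)
    x_bar = sum(x) / n
    s_x = math.sqrt(sum((xj - x_bar) ** 2 for xj in x))
    rows = [[1.0 / math.sqrt(n)] * n, [(xj - x_bar) / s_x for xj in x]]
    # Orthogonalize the standard basis vectors against the rows found so far.
    for k in range(n):
        if len(rows) == n:
            break
        v = [1.0 if j == k else 0.0 for j in range(n)]
        for r in rows:
            c = dot(v, r)
            v = [vj - c * rj for vj, rj in zip(v, r)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-8:  # skip a vector that already lies in the span of the rows
            rows.append([vj / norm for vj in v])
    return rows

# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.1, 2.9, 5.2, 7.8, 12.5]
n = len(x)

A = orthogonal_matrix(x)
z = [dot(row, y) for row in A]  # z = A y

# Least-squares estimates and the residual sum of squares S^2 of Eq. (11.3.9).
x_bar, y_bar = sum(x) / n, sum(y) / n
s_x2 = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_x2
b0 = y_bar - b1 * x_bar
S2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

print(z[0], math.sqrt(n) * y_bar)       # Eq. (11.3.4)
print(z[1], math.sqrt(s_x2) * b1)       # Eq. (11.3.7)
print(S2, sum(zi ** 2 for zi in z[2:]))  # S^2 as the sum of squares of z_3, ..., z_n
```

Each printed pair should agree up to rounding error. Rows 3 through n of A are not unique, but the sum of squares of the last n - 2 coordinates of z does not depend on how they are chosen.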

We shall now derive the distribution of $S^2$. For $i = 3, \ldots, n$, we have
$Z_i = \sum_{j=1}^n a_{ij} Y_j$. Hence,

\[ \begin{aligned}
E(Z_i) &= \sum_{j=1}^n a_{ij} E(Y_j) = \sum_{j=1}^n a_{ij}(\beta_0 + \beta_1 x_j) \\
       &= \sum_{j=1}^n a_{ij}[\beta_0 + \beta_1\bar{x} + \beta_1(x_j - \bar{x})] \\
       &= (\beta_0 + \beta_1\bar{x}) \sum_{j=1}^n a_{ij} + \beta_1 \sum_{j=1}^n a_{ij}(x_j - \bar{x}).
\end{aligned} \qquad (11.3.11) \]

Since the matrix A is orthogonal, the sum of the products of the corresponding terms
in any two different rows must be 0. In particular, for $i = 3, \ldots, n$,

\[ \sum_{j=1}^n a_{ij} a_{1j} = 0 \quad \text{and} \quad \sum_{j=1}^n a_{ij} a_{2j} = 0. \]

It now follows from the expressions for $a_{1j}$ and $a_{2j}$ given in Eqs. (11.3.1) and (11.3.2)
that for $i = 3, \ldots, n$,

\[ \sum_{j=1}^n a_{ij} = 0 \quad \text{and} \quad \sum_{j=1}^n a_{ij}(x_j - \bar{x}) = 0. \]

When these values are substituted into Eq. (11.3.11), it is found that $E(Z_i) = 0$ for
$i = 3, \ldots, n$.

We now know that the n - 2 random variables $Z_3, \ldots, Z_n$ are independent,
and that each has the normal distribution with mean 0 and variance $\sigma^2$. Since
$S^2 = \sum_{i=3}^n Z_i^2$, it follows that the random variable $S^2/\sigma^2$ has the $\chi^2$ distribution with n - 2
degrees of freedom.

Finally, we know that $\hat\sigma^2 = S^2/n$, and hence $\hat\sigma^2$ is independent of the estimators
$\hat\beta_0$ and $\hat\beta_1$, and the distribution of $n\hat\sigma^2/\sigma^2$ is the $\chi^2$ distribution with n - 2 degrees of
freedom. ∎


Tests of Hypotheses about the Regression Coefficients 


It will be convenient, for the remainder of the discussion of simple linear regression,
to let

\[ \sigma' = \left( \frac{S^2}{n-2} \right)^{1/2}. \qquad (11.3.12) \]

This random variable will appear in all of the test statistics and confidence intervals
that we derive. It is analogous to the random variable with the same name that
appears in Eqs. (8.4.3) and (8.4.5) and played a similar role in tests and confidence
intervals for the mean of a single normal distribution.

We proved earlier that the joint distribution of $(\hat\beta_0, \hat\beta_1)$ is bivariate normal. This
implies that every linear combination $c_0\hat\beta_0 + c_1\hat\beta_1$ has a normal distribution. We shall
use this fact to simplify the discussion of inference about regression coefficients. We
shall begin by deriving tests of hypotheses concerning a general linear combination
$c_0\beta_0 + c_1\beta_1$ of the regression parameters. Then, specific cases will be introduced by
choosing special values for $c_0$ and $c_1$. For example, $c_0 = 1$ and $c_1 = 0$ makes the linear
combination $\beta_0$, while $c_0 = 0$ and $c_1 = 1$ leads to $\beta_1$.

Tests of Hypotheses about a Linear Combination of $\beta_0$ and $\beta_1$ Let $c_0$, $c_1$, and $c_*$
be specified numbers, where at least one of $c_0$ and $c_1$ is nonzero, and suppose that we
are interested in testing the following hypotheses:

\[ \begin{aligned}
H_0 &: c_0\beta_0 + c_1\beta_1 = c_*, \\
H_1 &: c_0\beta_0 + c_1\beta_1 \ne c_*.
\end{aligned} \qquad (11.3.13) \]

We shall derive a test of these hypotheses based on the random variables $c_0\hat\beta_0 + c_1\hat\beta_1$
and $\sigma'$.

Theorem 11.3.3

For each $0 < \alpha_0 < 1$, a level $\alpha_0$ test of the hypotheses (11.3.13) is to reject $H_0$ if
$|U_{01}| \ge T_{n-2}^{-1}(1 - \alpha_0/2)$, where

\[ U_{01} = \left[ \frac{c_0^2}{n} + \frac{(c_0\bar{x} - c_1)^2}{s_x^2} \right]^{-1/2} \left( \frac{c_0\hat\beta_0 + c_1\hat\beta_1 - c_*}{\sigma'} \right), \qquad (11.3.14) \]

and $T_{n-2}^{-1}$ is the quantile function of the t distribution with n - 2 degrees of freedom.


Proof In general, the mean of $c_0\hat\beta_0 + c_1\hat\beta_1$ is $c_0\beta_0 + c_1\beta_1$, and its variance was found
in Eq. (11.2.10). Therefore, when $H_0$ is true, the following random variable $W_{01}$ has
the standard normal distribution:

\[ W_{01} = \left[ \frac{c_0^2}{n} + \frac{(c_0\bar{x} - c_1)^2}{s_x^2} \right]^{-1/2} \left( \frac{c_0\hat\beta_0 + c_1\hat\beta_1 - c_*}{\sigma} \right). \]

Because the value of $\sigma$ is unknown, a test of the hypotheses (11.3.13) cannot be based
simply on the random variable $W_{01}$. However, the random variable $S^2/\sigma^2$ has the $\chi^2$
distribution with n - 2 degrees of freedom for all possible values of the parameters
$\beta_0$, $\beta_1$, and $\sigma^2$. Moreover, because $(\hat\beta_0, \hat\beta_1)$ is independent of $S^2$, it follows that $W_{01}$
and $S^2$ are also independent. Hence, when the hypothesis $H_0$ is true, the random
variable

\[ \frac{W_{01}}{\left[ \left( \frac{1}{n-2} \right) \left( \frac{S^2}{\sigma^2} \right) \right]^{1/2}} \qquad (11.3.15) \]

has the t distribution with n - 2 degrees of freedom. It is straightforward to show
that the expression in (11.3.15) also equals $U_{01}$ in Eq. (11.3.14), which is a function of
the observable data alone. It follows that the test specified in the theorem is a level
$\alpha_0$ test of the hypotheses (11.3.13). ∎

The test procedure in Theorem 11.3.3 is also the likelihood ratio test procedure
for the hypotheses (11.3.13), but the proof will not be given here.


Tests of One-Sided Hypotheses The same derivation just finished can also be used
to form tests of hypotheses such as

\[ \begin{aligned}
H_0 &: c_0\beta_0 + c_1\beta_1 \le c_*, \\
H_1 &: c_0\beta_0 + c_1\beta_1 > c_*,
\end{aligned} \qquad (11.3.16) \]

or

\[ \begin{aligned}
H_0 &: c_0\beta_0 + c_1\beta_1 \ge c_*, \\
H_1 &: c_0\beta_0 + c_1\beta_1 < c_*.
\end{aligned} \qquad (11.3.17) \]

The proof of the following result is similar to the proof of Theorem 11.3.3 and will
not be given here.

Theorem 11.3.4

A level $\alpha_0$ test of (11.3.16) is to reject $H_0$ if $U_{01} \ge T_{n-2}^{-1}(1 - \alpha_0)$. A level $\alpha_0$ test of
(11.3.17) is to reject $H_0$ if $U_{01} \le -T_{n-2}^{-1}(1 - \alpha_0)$.

The only part of the proof of Theorem 11.3.4 that differs significantly from the
corresponding part of Theorem 11.3.3 is the proof that the tests actually have level
of significance $\alpha_0$. The proof of this is similar to the proof of Theorem 9.5.1 and is left
to the reader in Exercise 23.

We shall next present examples of how to test several common hypotheses
concerning $\beta_0$ and $\beta_1$ by making use of the fact that, when $H_0$ is true, $U_{01}$ in
Eq. (11.3.14) has the t distribution with n - 2 degrees of freedom. These examples will
correspond to setting $c_0$, $c_1$, and $c_*$ equal to specific values.


Tests of Hypotheses about $\beta_0$ Let $\beta_0^*$ be a specified number ($-\infty < \beta_0^* < \infty$),
and suppose that it is desired to test the following hypotheses about the regression
coefficient $\beta_0$:

\[ \begin{aligned}
H_0 &: \beta_0 = \beta_0^*, \\
H_1 &: \beta_0 \ne \beta_0^*.
\end{aligned} \qquad (11.3.18) \]

These hypotheses are the same as those in Eq. (11.3.13) if we make the substitutions
$c_0 = 1$, $c_1 = 0$, and $c_* = \beta_0^*$. If we substitute these values into the formula for $U_{01}$ in
Eq. (11.3.14), we obtain the following random variable $U_0$,

\[ U_0 = \frac{\hat\beta_0 - \beta_0^*}{\sigma' \left[ \frac{1}{n} + \frac{\bar{x}^2}{s_x^2} \right]^{1/2}}, \qquad (11.3.19) \]

which then has the t distribution with n - 2 degrees of freedom if $H_0$ is true.

Suppose that in a problem of simple linear regression, we are interested in testing
the null hypothesis that the regression line $y = \beta_0 + \beta_1 x$ passes through the origin
against the alternative hypothesis that the line does not pass through the origin. These
hypotheses can be stated in the following form:

\[ \begin{aligned}
H_0 &: \beta_0 = 0, \\
H_1 &: \beta_0 \ne 0.
\end{aligned} \qquad (11.3.20) \]

Here the hypothesized value $\beta_0^*$ is 0.

Let $u_0$ denote the value of $U_0$ calculated from a given set of observed values $(x_i, y_i)$
for $i = 1, \ldots, n$. Then the tail area (p-value) corresponding to this value is the
two-sided tail area

\[ \Pr(U_0 \ge |u_0|) + \Pr(U_0 \le -|u_0|). \]

For example, suppose that n = 20 and the calculated value of $U_0$ is 2.1. It is found
from a table of the t distribution with 18 degrees of freedom that the corresponding
tail area is 0.05. Hence, at each level of significance $\alpha_0 < 0.05$, the null hypothesis $H_0$
would not be rejected. At every level of significance $\alpha_0 \ge 0.05$, $H_0$ would be rejected.
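When no table is at hand, the two-sided tail area can be approximated numerically. The sketch below integrates the density of the t distribution with Simpson's rule, which is our substitute for the table lookup described above and not a method used in the text. The function names and the number of integration steps are assumptions made for illustration.

```python
import math

def t_density(t, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def two_sided_p_value(u, df, steps=2000):
    """Approximate Pr(U >= |u|) + Pr(U <= -|u|) with Simpson's rule on [0, |u|]."""
    b = abs(u)
    if b == 0.0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    central = 2 * total * h / 3  # probability of (-|u|, |u|), by symmetry
    return 1.0 - central

# The case discussed above: n = 20, so 18 degrees of freedom, and u0 = 2.1.
p = two_sided_p_value(2.1, 18)
print(p)
```

The printed value should be close to the tail area 0.05 quoted in the example.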


Tests of Hypotheses about $\beta_1$ Let $\beta_1^*$ be a specified number ($-\infty < \beta_1^* < \infty$),
and suppose that it is desired to test the following hypotheses about the regression
coefficient $\beta_1$:

\[ \begin{aligned}
H_0 &: \beta_1 = \beta_1^*, \\
H_1 &: \beta_1 \ne \beta_1^*.
\end{aligned} \qquad (11.3.21) \]

These hypotheses are the same as those in Eq. (11.3.13) if we make the substitutions
$c_0 = 0$, $c_1 = 1$, and $c_* = \beta_1^*$. If we substitute these values into the formula for $U_{01}$ in
Eq. (11.3.14), we obtain the following random variable $U_1$,

\[ U_1 = s_x \frac{\hat\beta_1 - \beta_1^*}{\sigma'}, \qquad (11.3.22) \]

which then has the t distribution with n - 2 degrees of freedom if $H_0$ is true.

Suppose that in a problem of simple linear regression we are interested in testing
the hypothesis that the variable Y is actually unrelated to the variable X. Under
Assumptions 11.2.1-11.2.5, this hypothesis is equivalent to the hypothesis that the
regression function E(Y|x) is constant and not actually a function of x. Since it is
assumed that the regression function has the form $E(Y|x) = \beta_0 + \beta_1 x$, this hypothesis
is in turn equivalent to the hypothesis that $\beta_1 = 0$. Thus, the problem is one of testing
the following hypotheses:

\[ \begin{aligned}
H_0 &: \beta_1 = 0, \\
H_1 &: \beta_1 \ne 0.
\end{aligned} \]

Here the hypothesized value $\beta_1^*$ is 0.

Let $u_1$ denote the value of $U_1$ calculated from a given set of observed values $(x_i, y_i)$
for $i = 1, \ldots, n$. Then the p-value corresponding to these data is

\[ \Pr(U_1 \ge |u_1|) + \Pr(U_1 \le -|u_1|). \]

Example 11.3.2

Gasoline Mileage. Consider the two variables gasoline mileage and engine horsepower
in Example 11.1.4. This time, let Y be 1 over gasoline mileage, that is, gallons
per mile. Also, let X be engine horsepower. A plot of the observed $(x_i, y_i)$
pairs is given in Fig. 11.8 together with the fitted least-squares regression line. Notice
how much straighter the relationship is between the two variables in Fig. 11.8
than between the two variables in Fig. 11.6. The least-squares estimates for a simple
linear regression of gallons per mile on engine horsepower are $\hat\beta_0 = 0.01537$ and
$\hat\beta_1 = 1.396 \times 10^{-4}$. Also, $\sigma' = 7.181 \times 10^{-3}$, $\bar{x} = 183.97$, and $s_x = 1036.9$. Suppose that
we wanted to test the null hypothesis $H_0: \beta_1 \le 0$ against the alternative $H_1: \beta_1 > 0$.
The observed value of the statistic $U_1$ in Eq. (11.3.22) is

\[ 1036.9 \times \frac{1.396 \times 10^{-4} - 0}{7.181 \times 10^{-3}} = 20.15, \]

which is larger than the $1 - 10^{-16}$ quantile of the t distribution with 171 degrees of
freedom. So, we would reject $H_0$ at every level $\alpha_0 \ge 10^{-16}$. ◄

Figure 11.8 Plot of gallons per mile versus engine horsepower for 173 cars in
Example 11.3.2. The least-squares regression line is drawn on the plot.

Tests of Hypotheses about the Mean of a Future Observation Suppose that we
are interested in testing the hypothesis that the regression line $y = \beta_0 + \beta_1 x$ passes
through a particular point $(x^*, y^*)$, where $x^* \ne 0$. In other words, suppose that we
are interested in testing the following hypotheses:

\[ \begin{aligned}
H_0 &: \beta_0 + \beta_1 x^* = y^*, \\
H_1 &: \beta_0 + \beta_1 x^* \ne y^*.
\end{aligned} \]

These hypotheses have the same form as the hypotheses (11.3.13) with $c_0 = 1$, $c_1 = x^*$,
and $c_* = y^*$. Hence, they can be tested by carrying out a t test with n - 2 degrees of
freedom that is based on the statistic $U_{01}$.

Example 11.3.3

Pressure and the Boiling Point of Water. In Example 11.3.1, the traveler was interested
in testing the null hypothesis $H_0: \beta_0 + 201.5\beta_1 = 24.5$ versus $H_1: \beta_0 + 201.5\beta_1 \ne 24.5$.
We shall make use of the statistic $U_{01}$ in Eq. (11.3.14) with $c_0 = 1$ and $c_1 = 201.5$.
Based on the data in Table 11.5, we have already computed the least-squares
estimates $\hat\beta_0 = -81.06$ and $\hat\beta_1 = 0.5229$. We can also compute n = 17, $s_x^2 = 530.78$,
$\bar{x} = 202.95$, and $\sigma' = 0.2328$. Then

\[ U_{01} = \left[ \frac{1}{17} + \frac{(202.95 - 201.5)^2}{530.78} \right]^{-1/2} \frac{-81.06 + 201.5 \times 0.5229 - 24.5}{0.2328} = -0.2204. \]

If $H_0$ is true, then $U_{01}$ has the t distribution with n - 2 = 15 degrees of freedom. The
p-value corresponding to the observed value -0.2204 is 0.8285. The null hypothesis
would be rejected at level $\alpha_0$ only if $\alpha_0 \ge 0.8285$. ◄


Confidence Intervals 


A confidence interval for $\beta_0$, $\beta_1$, or any linear combination of the two can be obtained
from the corresponding test procedure.

Theorem 11.3.5

Let $c_0$ and $c_1$ be scalar constants that are not both 0. The open interval between the
two random variables

\[ c_0\hat\beta_0 + c_1\hat\beta_1 \pm \sigma' \left[ \frac{c_0^2}{n} + \frac{(c_0\bar{x} - c_1)^2}{s_x^2} \right]^{1/2} T_{n-2}^{-1}\left( 1 - \frac{\alpha_0}{2} \right) \qquad (11.3.23) \]

is a coefficient $1 - \alpha_0$ confidence interval for $c_0\beta_0 + c_1\beta_1$.

Proof Consider the general hypotheses (11.3.13). Theorem 9.1.1 tells us that the set
of all values of $c_*$ for which the null hypothesis $H_0$ would not be rejected at the level of
significance $\alpha_0$ forms a confidence interval for $c_0\beta_0 + c_1\beta_1$ with confidence coefficient
$1 - \alpha_0$. It is straightforward to check that $c_*$ is between the two random variables in
(11.3.23) if and only if $|U_{01}| < T_{n-2}^{-1}(1 - \alpha_0/2)$, which specifies when the level $\alpha_0$ test would
not reject $H_0$ according to Theorem 11.3.3. ∎

Example 11.3.4

Gasoline Mileage. In Example 11.3.2, we rejected the null hypothesis that $\beta_1 \le 0$, but
we might wish to form an interval estimate of $\beta_1$. Apply Theorem 11.3.5 with $c_0 = 0$
and $c_1 = 1$. The endpoints of a coefficient $1 - \alpha_0$ confidence interval are then

\[ \hat\beta_1 \pm \frac{\sigma'}{s_x} T_{n-2}^{-1}\left( 1 - \frac{\alpha_0}{2} \right). \]

For example, suppose that we desire a coefficient 0.8 confidence interval for $\beta_1$. We
find $T_{171}^{-1}(0.9) = 1.287$ using computer software (or we could have interpolated in the
table in the back of the text). The remaining values needed to compute the endpoints
are given in Example 11.3.2, and the observed interval is $(1.307 \times 10^{-4}, 1.485 \times 10^{-4})$. ◄


Other special cases of Theorem 11.3.5 are when $c_0 = 1$ and $c_1 = 0$, which provides
a confidence interval for $\beta_0$, and when $c_0 = 1$ and $c_1 = x$, which provides a confidence
interval for the mean of Y when X = x. The second of these can also be described as
the height $\beta_0 + \beta_1 x$ of the regression line at a given point x. The corresponding
confidence interval has the endpoints

\[ \hat\beta_0 + \hat\beta_1 x \pm T_{n-2}^{-1}\left( 1 - \frac{\alpha_0}{2} \right) \sigma' \left[ \frac{1}{n} + \frac{(x - \bar{x})^2}{s_x^2} \right]^{1/2}. \qquad (11.3.24) \]

Prediction Intervals On page 703, we discussed predicting a new Y value (independent
of the observed data) when we knew the corresponding value of x. Suppose that
we want an interval that should contain Y with some specified probability $1 - \alpha_0$. We
can construct such an interval by considering the joint distribution of Y, $\hat{Y} = \hat\beta_0 + \hat\beta_1 x$,
and $S^2$.

Theorem 11.3.6

In the simple linear regression problem, let Y be a new observation with predictor
x such that Y is independent of $Y_1, \ldots, Y_n$. Let $\hat{Y} = \hat\beta_0 + \hat\beta_1 x$. Then the probability
that Y is between the following two random variables is $1 - \alpha_0$:

\[ \hat{Y} \pm T_{n-2}^{-1}\left( 1 - \frac{\alpha_0}{2} \right) \sigma' \left[ 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{s_x^2} \right]^{1/2}. \qquad (11.3.25) \]

Proof Since Y is independent of the observed data, we have that Y, $\hat{Y}$, and $S^2$ are all
independent. Hence, the following two random variables are independent:

\[ Z = \frac{Y - \hat{Y}}{\sigma \left[ 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{s_x^2} \right]^{1/2}} \quad \text{and} \quad W = \frac{S^2}{\sigma^2}. \]

Since Y and $\hat{Y}$ are independent and normally distributed, Z has a normal distribution.
Since $E(Y) = E(\hat{Y})$, the mean of Z is 0. It follows from Eq. (11.2.13) that the variance
of Z is 1. It follows from Theorem 11.3.2 that W has the $\chi^2$ distribution with n - 2
degrees of freedom. It follows that $Z/(W/[n-2])^{1/2}$ has the t distribution with n - 2
degrees of freedom. It is easy to see that $Z/(W/[n-2])^{1/2}$ is the same as

\[ U_x = \frac{Y - \hat{Y}}{\sigma' \left[ 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{s_x^2} \right]^{1/2}}. \qquad (11.3.26) \]

It follows that $\Pr(|U_x| < T_{n-2}^{-1}(1 - \alpha_0/2)) = 1 - \alpha_0$. It is then straightforward to show
that Y is between the two random variables in (11.3.25) if and only if
$|U_x| < T_{n-2}^{-1}(1 - \alpha_0/2)$. ∎


Prediction Interval. The random interval whose endpoints are given by (11.3.25) is
called a coefficient $1 - \alpha_0$ prediction interval for Y.

Prior to observing the data, when $\sigma'$, $\hat\beta_0$, $\hat\beta_1$, and Y are all still random variables,
the endpoints in (11.3.25) have the property that the probability is $1 - \alpha_0$ that Y will
be between the endpoints, and hence in the interval. After the data are observed,
the interpretation of the interval whose endpoints are in (11.3.25) is similar to the
interpretation of a confidence interval, but with the added complication that Y is still
a random variable.


Example 11.3.5

Gasoline Mileage. Suppose that we wish to predict the gasoline mileage for a car with
a particular engine horsepower x in Example 11.3.2. In particular, let x = 100, and
we shall use $\alpha_0 = 0.1$ to form a prediction interval as above. Using the values computed
in Example 11.3.2 and Eq. (11.3.25), we obtain the interval (0.01737, 0.04127)
for predicting Y gallons per mile. Since Y is in this interval if and only if 1/Y is
between 1/0.01737 = 57.56 and 1/0.04127 = 24.23, we can claim that the following
interval is the observed value of a 90 percent prediction interval for miles per gallon:
(24.23, 57.56). ◄
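The prediction interval in this example can be reproduced, up to rounding, from the summary values in Example 11.3.2. In the sketch below the quantile 1.654 for the t distribution with 171 degrees of freedom at probability 0.95 is an assumed software value that does not appear in the text, and the function name is hypothetical.

```python
import math

def prediction_interval(x, b0, b1, sigma_prime, n, x_bar, s_x, t_quantile):
    """Endpoints of (11.3.25); t_quantile stands for the (1 - alpha0/2) quantile
    of the t distribution with n - 2 degrees of freedom."""
    y_hat = b0 + b1 * x
    half_width = (t_quantile * sigma_prime
                  * math.sqrt(1 + 1 / n + (x - x_bar) ** 2 / s_x ** 2))
    return y_hat - half_width, y_hat + half_width

# Rounded summary values from Example 11.3.2; 1.654 is an assumed quantile.
lower, upper = prediction_interval(100.0, 0.01537, 1.396e-4, 7.181e-3,
                                   173, 183.97, 1036.9, 1.654)

# Taking reciprocals reverses the order of the endpoints.
mpg_interval = (1 / upper, 1 / lower)
print(lower, upper)
print(mpg_interval)
```

Because the inputs are rounded, the endpoints can differ from the reported interval in the last printed digit. The extra 1 inside the square root, compared with (11.3.24), accounts for the variance of the new observation itself, which is why the prediction interval is wider than the confidence interval for the mean at the same x.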


The Analysis of Residuals 


Whenever a statistical analysis is carried out, it is important to verify that the ob- 
served data appear to satisfy the assumptions on which the analysis is based. For 
example, in the statistical analysis of a problem of simple linear regression, we have 
assumed that the regression of Y on X is a linear function and that the observations 
$Y_1, \ldots, Y_n$ are independent. The M.L.E.'s of $\beta_0$ and $\beta_1$ and the tests of hypotheses
about $\beta_0$ and $\beta_1$ were developed on the basis of these assumptions, but the data were
not examined to find out whether or not these assumptions were reasonable. 

One way to make a quick and informal check of these assumptions is to examine 


the discrepancies between the observed values y,,..., y, and the fitted regression 
line. 

Residuals/Fitted Values. For $i = 1, \ldots, n$, the observed values of $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ are
called the fitted values. For $i = 1, \ldots, n$, the observed values of $e_i = y_i - \hat{y}_i$ are called
the residuals.

Specifically, suppose that the n points $(x_i, e_i)$, for $i = 1, \ldots, n$, are plotted in the
xe-plane. It must be true (see Exercise 4 at the end of Sec. 11.1) that $\sum_{i=1}^n e_i = 0$
and $\sum_{i=1}^n x_i e_i = 0$. However, subject to these restrictions, the positive and negative
residuals should be scattered randomly among the points $(x_i, e_i)$. If the positive
residuals $e_i$ tend to be concentrated at either the extreme values of $x_i$ or the central
values of $x_i$, then either the assumption that the regression of Y on X is a linear
function or the assumption that the observations $Y_1, \ldots, Y_n$ are independent may be
violated. In fact, if the plot of the points $(x_i, e_i)$ exhibits any type of regular pattern,
the assumptions may be violated.


Example 11.3.6

Pressure and the Boiling Point of Water. The residuals from a least-squares fit to the
data in Example 11.2.2 can be computed using the coefficients reported in Example 11.2.5:
$\hat\beta_0 = -81.06$ and $\hat\beta_1 = 0.5229$. Table 11.7 contains the original data together
with the fitted values $\hat{y}_i = -81.06 + 0.5229x_i$ and the residuals $e_i = y_i - \hat{y}_i$ for all i.
A plot of the residuals versus boiling point is shown in Fig. 11.9. This plot has two
striking features. One is the exceptionally large positive residual corresponding to
$x_i = 204.6$ at the top of the plot. Observations with such large residuals are sometimes
called outliers. Perhaps either the $x_i$ or $y_i$ value corresponding to this observation was
recorded incorrectly or this observation was taken under conditions different from
those of the other observations. Or perhaps that particular $y_i$ value just happened to
be very far from its mean. The other striking feature of the plot is that, aside from
the outlier, the other residuals seem to form a U-shaped pattern. This sort of pattern
suggests that the relationship between the two variables might be better described
by a curve rather than a straight line.

Table 11.7 Data from Table 11.5 together with fitted values, residuals
from least-squares fit, and logarithm of pressure

  x_i     y_i    \hat{y}_i = -81.06 + 0.5229 x_i   e_i = y_i - \hat{y}_i   log(y_i)
 194.5   20.79              20.64                        0.1512             3.034
 194.3   20.79              20.53                        0.2557             3.034
 197.9   22.40              22.42                       -0.0167             3.109
 198.4   22.67              22.68                       -0.0081             3.121
 199.4   23.15              23.20                       -0.0510             3.142
 199.9   23.35              23.46                       -0.1125             3.151
 200.9   23.89              23.99                       -0.0954             3.173
 201.1   23.99              24.09                       -0.0999             3.178
 201.4   24.02              24.25                       -0.2268             3.179
 201.3   24.01              24.19                       -0.1845             3.178
 203.6   25.14              25.40                       -0.2572             3.224
 204.6   26.57              25.92                        0.6499             3.280
 209.5   28.49              28.48                        0.0078             3.350
 208.6   27.76              28.01                       -0.2516             3.324
 210.7   29.04              29.11                       -0.0697             3.369
 211.9   29.88              29.74                        0.1428             3.397
 212.2   30.06              29.89                        0.1660             3.403

Techniques for dealing with the two features that we noticed in Fig. 11.9 can
be found in books devoted to regression methodology such as Belsley, Kuh, and
Welsch (1980), Cook and Weisberg (1982), Draper and Smith (1998), and Weisberg
(1985). One possible technique to deal with the curved look of the residual plot is to
transform one or both of the two variables Y and X before performing the regression.
Indeed, Forbes (1857) suspected that the logarithm of pressure would be linearly
related to boiling point. Table 11.7 also contains the logarithms of pressure. If we
perform a regression of the logarithm of pressure on the boiling point, we obtain
the least-squares estimates $\hat\beta_0 = -0.9709$ and $\hat\beta_1 = 0.0206$. The observed value of
$\sigma'$ is $8.730 \times 10^{-3}$. Residuals from this fit can be computed as
$\log(y_i) - (-0.9709 + 0.0206x_i)$, and they are plotted in Fig. 11.10. The one large residual still appears in
Fig. 11.10, but the curved shape of the remaining residuals has vanished. To see what
effect that one observation has on the regression, we can fit the regression using only
the other 16 observations. In this case, the estimated coefficients are $\hat\beta_0 = -0.9518$
and $\hat\beta_1 = 0.0205$ with $\sigma' = 2.616 \times 10^{-3}$. The coefficients don't change much, but the
estimated standard deviation drops to less than one-third of its previous value. ◄

Figure 11.9 Plot of residuals versus boiling point for Example 11.3.6.

Figure 11.10 Plot of residuals from regression of log-pressure versus boiling point
for Example 11.3.6.
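The fits discussed in Example 11.3.6 can be retraced from the $(x_i, y_i)$ columns of Table 11.7. The sketch below is a minimal least-squares routine written for this illustration (the helper name `least_squares` is ours); it fits pressure and the natural logarithm of pressure on boiling point, and then refits the log-pressure model without observation 12. Because the tabled data are rounded, the computed values should match the reported ones only approximately.

```python
import math

# (boiling point, pressure) pairs copied from Table 11.7.
data = [(194.5, 20.79), (194.3, 20.79), (197.9, 22.40), (198.4, 22.67),
        (199.4, 23.15), (199.9, 23.35), (200.9, 23.89), (201.1, 23.99),
        (201.4, 24.02), (201.3, 24.01), (203.6, 25.14), (204.6, 26.57),
        (209.5, 28.49), (208.6, 27.76), (210.7, 29.04), (211.9, 29.88),
        (212.2, 30.06)]

def least_squares(x, y):
    """Return (b0, b1, sigma_prime, residuals) for a simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x2 = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_x2
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # sigma' = (S^2 / (n - 2))^(1/2), as in Eq. (11.3.12).
    sigma_prime = math.sqrt(sum(e * e for e in resid) / (n - 2))
    return b0, b1, sigma_prime, resid

x = [p[0] for p in data]
pressure = [p[1] for p in data]
log_pressure = [math.log(p) for p in pressure]

# Pressure on boiling point, then log-pressure on boiling point.
b0, b1, sp, e = least_squares(x, pressure)
c0, c1, sp_log, e_log = least_squares(x, log_pressure)

# Drop observation 12 (index 11) and refit the log-pressure model.
x16 = x[:11] + x[12:]
log16 = log_pressure[:11] + log_pressure[12:]
d0, d1, sp16, _ = least_squares(x16, log16)

print(b0, b1, sp)
print(c0, c1, sp_log)
print(d0, d1, sp16)
print(max(range(len(e)), key=lambda i: e[i]) + 1)  # observation with the largest residual
```

The residuals from either fit sum to 0 and are orthogonal to the $x_i$ values, as noted before the example, and the largest residual belongs to the observation with boiling point 204.6 in both fits.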


Note: Both Models Cannot Be Correct in Example 11.3.6. It cannot be the case 
that both the mean of pressure and the mean of the logarithm of pressure are linear 
functions of boiling point. When the residual plot in Fig. 11.9 revealed a curved shape, 
we began to suspect that the mean of pressure was not a linear function of boiling 
point. In this case, the probabilistic calculations performed in Examples 11.2.2, 11.2.5, 
and 11.3.3 become suspect as well. 


Note: What to Do with Outliers. The data point with X = 204.6 in Example 11.3.6
makes it difficult to interpret the results of the regression analysis. Forbes (1857)
labels this point “Evidently a mistake.” Generally, when such data points appear
in our data sets, we should try to verify whether they were collected under the
same conditions as the remaining data. Sometimes the process by which the data
are collected changes during the experiment. If the removal of the outlier makes a
noticeable difference to the analysis, then that observation must be dealt with. If it
is not possible to show that the observation should be removed based on how it was
collected, it might be that the distribution of the $Y_i$ values is different from a normal
distribution. It might be that the distribution has higher probability of producing
extremely large deviations from the mean. In this case, one might have to resort to
robust regression procedures similar to the robust procedures described in Sec. 10.7.
Interested readers should consult Hampel et al. (1986) or Rousseeuw and Leroy
(1987).


Normal Quantile Plots Another plot that is helpful in assessing the assumptions of 
the regression model is the normal quantile plot, sometimes called a normal scores 
plot or a normal Q-Q plot. Assume that the residuals are reasonable estimates of
$\varepsilon_i = Y_i - (\beta_0 + \beta_1 x_i)$. Each $\varepsilon_i$ has the normal distribution with mean 0 and variance
$\sigma^2$ according to the linear regression model. The normal quantile plot compares
quantiles of a normal distribution with the ordered values of the residuals. We 
expect about 25 percent of the residuals to be below the 0.25 quantile of the normal 
distribution. We expect about 80 percent of the residuals to be below the 0.8 quantile 
of the normal distribution, and so forth. We can see how closely these expectations 
are met by plotting the ordered residuals against quantiles of the normal distribution. 

Let $r_1 \le r_2 \le \cdots \le r_n$ be the residuals ordered from smallest to largest. The points
that we plot are $(\Phi^{-1}(i/[n+1]), r_i)$ for $i = 1, \ldots, n$, where $\Phi^{-1}$ is the standard
normal quantile function. The numbers $\Phi^{-1}(i/[n+1])$ for $i = 1, \ldots, n$ are n quantiles
of the standard normal distribution that divide the standard normal distribution 
into intervals of equal probability, including the intervals below the first quantile 
and above the last one. If the plotted points lie roughly along the line y = x, then 
roughly 25 percent of the residuals lie below the 0.25 quantile of the standard normal 
distribution, and roughly 80 percent of the residuals lie below the 0.8 quantile, and 
so on. If the points lie on a different line y = ax + b, then we could multiply the first 
coordinate of each point by a and add b to the first coordinate. This would make the 
new points lie on the line y = x, and the first coordinate of each point is now a quantile 
of the normal distribution with mean b and variance a*. So roughly 25 percent of 
the residuals lie below the 0.25 quantile of the normal distribution with mean b and 
variance a”, and so on. So, we examine the normal quantile plot to see how close the 
points are to lying on a straight line. We don’t care which line it is, because we only 
care whether the data look like they come from some normal distribution. We fit the 
regression model to help decide which normal distribution. 
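
The plotting positions described above can be computed directly. The following is a minimal sketch using only the Python standard library; the function name `normal_quantile_points` and the residual values are hypothetical and serve only to illustrate the pairing of the $i$th ordered residual with $\Phi^{-1}(i/[n+1])$.

```python
from statistics import NormalDist


def normal_quantile_points(residuals):
    """Return (standard normal quantile, ordered residual) pairs for a normal quantile plot."""
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # The i-th smallest residual is paired with the i/(n+1) quantile.
    return [(std_normal.inv_cdf(i / (n + 1)), r)
            for i, r in enumerate(ordered, start=1)]


# Hypothetical residuals, for illustration only.
residuals = [0.3, -1.2, 0.8, -0.1, 1.5, -0.6, 0.2]
points = normal_quantile_points(residuals)
for q, r in points:
    print(f"{q:8.4f} {r:6.2f}")
```

Plotting the second coordinate against the first gives the normal quantile plot; one would then judge by eye how close the points are to a straight line.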


Pressure and the Boiling Point of Water. As an illustration of the normal quantile 
plot, we deleted the troublesome observation (number 12) from the data set of 
Example 11.3.6 and fit the model in which the logarithm of pressure is regressed 
on the boiling point. The resulting normal quantile plot is shown in Fig. 11.11. The 
points in Fig. 11.11 lie roughly on a line, although it is not difficult to detect some 
curvature in the plot. It is usually the case that the extreme residuals (lowest and 
highest) do not line up well with the others, so one normally pays closest attention 
to the middle of the plot. Extreme observations that fall very far from the pattern 
of the others suggest a more serious problem. Outliers will typically show up in this 
way as well as in the other residual plots.


If we know the order in which the observations were taken, there are some 
additional plots that can help reveal whether there is some dependence between 
the observations. We will introduce these plots when we discuss multiple regression 
later in this chapter. Readers desiring a deeper understanding of graphics associated 
with linear regression should read Cook and Weisberg (1994). 


Figure 11.11 Normal quantile plot for regression of log-pressure on boiling point
with observation number 12 removed.
Inference about Both $\beta_0$ and $\beta_1$ Simultaneously


Tests of Hypotheses about Both $\beta_0$ and $\beta_1$ Suppose next that $\beta_0^*$ and $\beta_1^*$ are given
numbers and that we are interested in testing the following hypotheses about the
values of $\beta_0$ and $\beta_1$:

\[
\begin{aligned}
H_0 &: \beta_0 = \beta_0^* \text{ and } \beta_1 = \beta_1^*,\\
H_1 &: \text{The hypothesis } H_0 \text{ is not true.}
\end{aligned}
\tag{11.3.27}
\]

These hypotheses are not a special case of (11.3.13); hence, we shall not be able to test
these hypotheses using $U_{01}$ from Eq. (11.3.14). Instead, we shall derive the likelihood
ratio test procedure for the hypotheses (11.3.27).

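The likelihood function $f_n(y|x, \beta_0, \beta_1, \sigma^2)$ is given by Eq. (11.2.2). We know from
Sec. 11.2 that the likelihood function attains its maximum value when $\beta_0$, $\beta_1$, and $\sigma^2$
are equal to the M.L.E.'s $\hat\beta_0$, $\hat\beta_1$, and $\hat\sigma^2$, as given by Eq. (11.1.1) and Eq. (11.2.3).

When the null hypothesis $H_0$ is true, the values of $\beta_0$ and $\beta_1$ must be $\beta_0^*$ and $\beta_1^*$,
respectively. For these values of $\beta_0$ and $\beta_1$, the maximum value of $f_n(y|x, \beta_0^*, \beta_1^*, \sigma^2)$
over all the possible values of $\sigma^2$ will be attained when $\sigma^2$ has the following value $\hat\sigma_0^2$:

\[
\hat\sigma_0^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0^* - \beta_1^* x_i)^2.
\]

Now consider the statistic

\[
\Lambda(y|x) = \frac{\sup_{\sigma^2} f_n(y|x, \beta_0^*, \beta_1^*, \sigma^2)}
                    {\sup_{\beta_0, \beta_1, \sigma^2} f_n(y|x, \beta_0, \beta_1, \sigma^2)}.
\]

By using the results that have just been described, it can be shown that

\[
\Lambda(y|x) = \left(\frac{\hat\sigma^2}{\hat\sigma_0^2}\right)^{n/2}
= \left[\frac{\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}
             {\sum_{i=1}^{n} (y_i - \beta_0^* - \beta_1^* x_i)^2}\right]^{n/2}.
\tag{11.3.28}
\]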
The denominator of the final expression in Eq. (11.3.28) can be rewritten as follows:

\[
\sum_{i=1}^{n} (y_i - \beta_0^* - \beta_1^* x_i)^2
= \sum_{i=1}^{n} \left[(y_i - \hat\beta_0 - \hat\beta_1 x_i) + (\hat\beta_0 - \beta_0^*) + (\hat\beta_1 - \beta_1^*) x_i\right]^2.
\tag{11.3.29}
\]

To simplify this expression further, let the statistic $S^2$ be defined by Eq. (11.3.9), and
let the statistic $Q^2$ be defined as follows:

\[
Q^2 = n(\hat\beta_0 - \beta_0^*)^2 + \left(\sum_{i=1}^{n} x_i^2\right)(\hat\beta_1 - \beta_1^*)^2
      + 2n\bar{x}(\hat\beta_0 - \beta_0^*)(\hat\beta_1 - \beta_1^*).
\tag{11.3.30}
\]

We shall now expand the right side of Eq. (11.3.29) and use the following relations,
which were established in Exercise 4 of Sec. 11.1:

\[
\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
\quad\text{and}\quad
\sum_{i=1}^{n} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0.
\]

We then obtain the relation

\[
\sum_{i=1}^{n} (y_i - \beta_0^* - \beta_1^* x_i)^2 = S^2 + Q^2.
\]
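
The expansion behind this relation can be spelled out in a few lines. As shorthand introduced only for this sketch (it is not notation from the text), write $e_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$, $a = \hat\beta_0 - \beta_0^*$, and $b = \hat\beta_1 - \beta_1^*$. Then

\[
\begin{aligned}
\sum_{i=1}^{n} (e_i + a + b x_i)^2
&= \sum_{i=1}^{n} e_i^2 + 2a \sum_{i=1}^{n} e_i + 2b \sum_{i=1}^{n} x_i e_i + \sum_{i=1}^{n} (a + b x_i)^2 \\
&= S^2 + 0 + 0 + n a^2 + 2 n \bar{x}\, a b + b^2 \sum_{i=1}^{n} x_i^2 \\
&= S^2 + Q^2 .
\end{aligned}
\]

The two cross terms vanish by the relations from Exercise 4 of Sec. 11.1, and the remaining terms are exactly $Q^2$ as defined in Eq. (11.3.30).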
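It now follows from Eq. (11.3.28) that

\[
\Lambda(y|x) = \left(\frac{S^2}{S^2 + Q^2}\right)^{n/2} = \left(1 + \frac{Q^2}{S^2}\right)^{-n/2}.
\tag{11.3.31}
\]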

The likelihood ratio test procedure specifies rejecting $H_0$ when $\Lambda(y|x) \le k$. It can
be seen from Eq. (11.3.31) that this procedure is equivalent to rejecting $H_0$ when
$Q^2/S^2 \ge k'$, where $k'$ is a suitable constant. To put this procedure in a more standard
form, we shall let the statistic $U^2$ be defined as follows:

\[
U^2 = \frac{\frac{1}{2} Q^2}{\sigma'^2}.
\tag{11.3.32}
\]

Then the likelihood ratio test procedure specifies rejecting $H_0$ when $U^2 \ge \gamma$, where
$\gamma$ is a suitable constant.

We shall now determine the distribution of the statistic $U^2$ when the hypothesis
$H_0$ is true. It can be shown (see Exercises 7 and 8) that when $H_0$ is true, the random
variable $Q^2/\sigma^2$ has the $\chi^2$ distribution with two degrees of freedom. Also, because
the random variable $S^2$ and the random vector $(\hat\beta_0, \hat\beta_1)$ are independent, and because
$Q^2$ is a function of $\hat\beta_0$ and $\hat\beta_1$, it follows that the random variables $Q^2$ and $S^2$ are
independent. Finally, we know that $S^2/\sigma^2$ has the $\chi^2$ distribution with $n - 2$ degrees of
freedom. Since $\sigma'^2 = S^2/(n-2)$, the statistic $U^2 = [Q^2/(2\sigma^2)]/[S^2/((n-2)\sigma^2)]$ is the ratio
of two independent $\chi^2$ random variables, each divided by its degrees of freedom.
Therefore, when $H_0$ is true, the statistic $U^2$ defined by Eq. (11.3.32) will have
the $F$ distribution with 2 and $n - 2$ degrees of freedom. Since the null hypothesis $H_0$ is
rejected if $U^2 \ge \gamma$, the value of $\gamma$ corresponding to a specified level of significance $\alpha_0$
($0 < \alpha_0 < 1$) will be the $1 - \alpha_0$ quantile of this $F$ distribution, namely, $F_{2,n-2}^{-1}(1 - \alpha_0)$.
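
The test just derived can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from the text: the function name `joint_f_test` and the data are hypothetical, and the critical value uses the fact that an $F$ distribution with 2 numerator degrees of freedom and $m$ denominator degrees of freedom has c.d.f. $1 - (1 + 2f/m)^{-m/2}$, so its quantile function has a closed form.

```python
def joint_f_test(x, y, beta0_star, beta1_star, alpha0=0.05):
    """Likelihood ratio (F) test of H0: beta0 = beta0_star and beta1 = beta1_star."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sx2 = sum((xi - xbar) ** 2 for xi in x)
    # Least-squares estimates of the slope and intercept.
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sx2
    b0 = ybar - b1 * xbar
    # S^2: sum of squared residuals around the fitted line.
    s2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    # Q^2 as in Eq. (11.3.30).
    d0, d1 = b0 - beta0_star, b1 - beta1_star
    q2 = n * d0 ** 2 + sum(xi ** 2 for xi in x) * d1 ** 2 + 2 * n * xbar * d0 * d1
    # U^2 = (Q^2 / 2) / sigma'^2 with sigma'^2 = S^2 / (n - 2).
    u2 = (q2 / 2) / (s2 / (n - 2))
    # Closed-form 1 - alpha0 quantile of the F distribution with 2 and m d.f.
    m = n - 2
    critical = (m / 2) * (alpha0 ** (-2 / m) - 1)
    return {"S2": s2, "Q2": q2, "U2": u2, "critical": critical,
            "reject": u2 >= critical}


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.1]
result = joint_f_test(x, y, beta0_star=0.0, beta1_star=1.0)
print(result)
```

The returned `S2` and `Q2` also make it easy to check numerically that the sum of squares around the hypothesized line equals $S^2 + Q^2$.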


Joint Confidence Set Next, consider the problem of constructing a confidence set
for the pair of unknown regression coefficients $\beta_0$ and $\beta_1$. Such a confidence set can
be obtained from the statistic $U^2$ defined by Eq. (11.3.32), which was used to test the
hypotheses (11.3.27). Specifically, let $F_{2,n-2}^{-1}(1 - \alpha_0)$ be the $1 - \alpha_0$ quantile of the $F$
distribution with 2 and $n - 2$ degrees of freedom. Then the set of all pairs of values
of $\beta_0^*$ and $\beta_1^*$ such that $U^2 < F_{2,n-2}^{-1}(1 - \alpha_0)$ will form a confidence set for the pair
$(\beta_0, \beta_1)$ with confidence coefficient $1 - \alpha_0$. It can be shown (see Exercise 16) that
this confidence set will contain all the points $(\beta_0, \beta_1)$ inside a certain ellipse in the
$\beta_0\beta_1$-plane. In other words, this confidence set will actually be a confidence ellipse.

The confidence ellipse that has just been derived for $\beta_0$ and $\beta_1$ can be used to
construct a confidence set for the entire regression line $y = \beta_0 + \beta_1 x$. Corresponding
to each point $(\beta_0, \beta_1)$ inside the ellipse, we can draw a straight line $y = \beta_0 + \beta_1 x$ in
the $xy$-plane. The collection of all these straight lines corresponding to all points
$(\beta_0, \beta_1)$ inside the ellipse will be a confidence set with confidence coefficient $1 - \alpha_0$ for
the actual regression line. A rather lengthy and detailed analysis, which will not be
presented here [see Scheffé (1959, section 3.5)], shows that the upper and lower limits
of this confidence set are the curves defined by the following relations:

\[
y = \hat\beta_0 + \hat\beta_1 x \pm \left[2F_{2,n-2}^{-1}(1 - \alpha_0)\right]^{1/2} \sigma'
    \left[\frac{1}{n} + \frac{(x - \bar{x})^2}{s_x^2}\right]^{1/2}.
\tag{11.3.33}
\]

In other words, with confidence coefficient $1 - \alpha_0$, the actual regression line
$y = \beta_0 + \beta_1 x$ will lie between the curve obtained by using the plus sign in (11.3.33) and
the curve obtained by using the minus sign in (11.3.33). The region between these
curves is often called a confidence band or confidence belt for the regression line.

In similar fashion, the confidence ellipse can be used to construct simultaneous
confidence intervals for every linear combination of $\beta_0$ and $\beta_1$. The coefficient $1 - \alpha_0$
interval for $c_0\beta_0 + c_1\beta_1$ has the endpoints

\[
c_0\hat\beta_0 + c_1\hat\beta_1 \pm \sigma'
\left[\frac{c_0^2}{n} + \frac{(c_0\bar{x} - c_1)^2}{s_x^2}\right]^{1/2}
\left[2F_{2,n-2}^{-1}(1 - \alpha_0)\right]^{1/2}.
\tag{11.3.34}
\]

This differs from the individual confidence interval given in Eq. (11.3.23) solely in
the replacement of the $1 - \alpha_0/2$ quantile of the $t$ distribution with $n - 2$ degrees of
freedom by the square root of 2 times the $1 - \alpha_0$ quantile of the $F$ distribution with 2
and $n - 2$ degrees of freedom. The simultaneous intervals are
wider than the individual intervals because they satisfy a more restrictive requirement.
The probability (prior to observing the data) is $1 - \alpha_0$ that all of the intervals
of the form (11.3.34) simultaneously contain their corresponding parameters. Each
interval of the form (11.3.23) contains its corresponding parameter with probability
$1 - \alpha_0$, but the probability that two or more of them simultaneously contain their
corresponding parameters is less than $1 - \alpha_0$.


Alternative Tests and Confidence Sets The hypotheses (11.3.27) are a special case
of (9.1.26), and they can be tested by the same method outlined immediately after
(9.1.26). The resulting test leads to an alternative confidence set for the pair $(\beta_0, \beta_1)$.
The alternative level $\alpha_0$ test of (11.3.27) merely combines the two level $\alpha_0/2$ tests of
(11.3.20) and (11.3.21). To be specific, the alternative level $\alpha_0$ test $\delta$ of (11.3.27) is to
reject $H_0$ if either

\[
|U_0| \ge T_{n-2}^{-1}\left(1 - \frac{\alpha_0}{4}\right)
\quad\text{or}\quad
|U_1| \ge T_{n-2}^{-1}\left(1 - \frac{\alpha_0}{4}\right)
\quad\text{or both},
\tag{11.3.35}
\]

where $U_0$ and $U_1$ are, respectively, the statistics in (11.3.19) and (11.3.22) that would
be used for testing (11.3.20) and (11.3.21).
Figure 11.12 Elliptical and rectangular joint coefficient 0.95
confidence sets for $(\beta_0, \beta_1)$ in Example 11.3.8.


The corresponding joint confidence set for $(\beta_0, \beta_1)$ is the set of all $(\beta_0^*, \beta_1^*)$ pairs
such that both $|U_0|$ and $|U_1|$ are strictly less than $T_{n-2}^{-1}(1 - \alpha_0/4)$. This alternative
confidence set will be rectangular in shape rather than elliptical. This confidence
rectangle also provides simultaneous confidence intervals for all linear combinations
of the form $c_0\beta_0 + c_1\beta_1$. The formulas for the endpoints are not so pretty as (11.3.34).
Let $C$ be the joint confidence rectangle. Then the confidence interval for $c_0\beta_0 + c_1\beta_1$
is the following:

\[
\left( \inf_{(\beta_0^*, \beta_1^*) \in C} c_0\beta_0^* + c_1\beta_1^*, \;
       \sup_{(\beta_0^*, \beta_1^*) \in C} c_0\beta_0^* + c_1\beta_1^* \right).
\tag{11.3.36}
\]

The sup and inf will each occur at one of the four corners of the rectangle, so one
need only compute four values of $c_0\beta_0^* + c_1\beta_1^*$ to determine the interval. Some special
cases are worked out in Exercise 24.


Example
11.3.8


Pressure and the Boiling Point of Water. In Examples 11.2.1 and 11.2.2, we computed
the least-squares estimates and the variances and covariance of the estimates. Figure 11.12
shows both the elliptical and the rectangular coefficient 0.95 joint confidence
sets for the pair $(\beta_0, \beta_1)$. If all that we wanted were confidence intervals for
the two parameters, we could extract those from both confidence sets. For the elliptical
region, (11.3.34) gives the intervals (−1.0149, −0.8886) and (0.020207, 0.020830)
for $\beta_0$ and $\beta_1$, respectively. Notice that the endpoints of these intervals are, respectively,
the minimum and maximum values of $\beta_0$ and $\beta_1$ in the elliptical joint confidence
set in Fig. 11.12. Similarly, the joint confidence intervals from the rectangular joint
confidence set are, respectively, (−1.0097, −0.8938) and (0.020233, 0.020804), whose
endpoints are also the minimum and maximum values of $\beta_0$ and $\beta_1$ in the rectangular
joint confidence set in Fig. 11.12.

Finally, suppose that, in addition to confidence intervals for the two parameters
$\beta_0$ and $\beta_1$, we also want a confidence band for the regression function, namely, the
mean log-pressure at all temperatures $x$. This mean is of the form $c_0\beta_0 + c_1\beta_1$ with
$c_0 = 1$ and $c_1 = x$. The confidence bands are plotted in Fig. 11.13 based both on the
elliptical and rectangular joint confidence sets. For example, at $x = 201.5$, we get the
intervals (3.1809, 3.1846) and (3.0672, 3.2983) from the elliptical and rectangular sets,
respectively.
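
The rectangular-set interval at $x = 201.5$ can be checked from the four corners of the rectangle, as Eq. (11.3.36) suggests. The sketch below is an illustration only: the function name `rectangle_interval` is hypothetical, and the corner coordinates are the rounded interval endpoints quoted in this example, so the result should agree with the quoted interval only up to that rounding.

```python
from itertools import product


def rectangle_interval(c0, c1, beta0_range, beta1_range):
    """Interval for c0*beta0 + c1*beta1 over a confidence rectangle, via its four corners."""
    values = [c0 * b0 + c1 * b1
              for b0, b1 in product(beta0_range, beta1_range)]
    return min(values), max(values)


# Rounded sides of the rectangular joint confidence set quoted in the example.
lo, hi = rectangle_interval(1.0, 201.5,
                            (-1.0097, -0.8938),
                            (0.020233, 0.020804))
print(lo, hi)  # close to the interval (3.0672, 3.2983) quoted in the example
```

Because a linear function of $(\beta_0^*, \beta_1^*)$ over a rectangle attains its extremes at corners, no optimization routine is needed.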


Figure 11.13 Coefficient 0.95 confidence bands for the regression
function in Example 11.3.8. Bands are computed based both
on the elliptical and on the rectangular joint confidence sets.


The joint confidence intervals for the two individual parameters are slightly
shorter when computed from the rectangular confidence set compared to the elliptical
set. But the confidence band for the regression function (Fig. 11.13) is much wider
when computed from the rectangular set compared to the elliptical set.


In Example 11.3.8, if one were interested solely in simultaneous confidence
intervals for the three parameters $\beta_0$, $\beta_1$, and $\beta_0 + 201.5\beta_1$, instead of the entire
regression function, one could obtain shorter intervals from a generalization of
the rectangular joint confidence set. The generalization is based on the Bonferroni
inequality from Theorem 1.5.8.


Theorem
11.3.7


Suppose that we are interested in forming simultaneous confidence intervals for
several parameters $\theta_1, \ldots, \theta_n$. For each $i$, let $(A_i, B_i)$ be a coefficient $1 - \alpha_i$ confidence
interval for $\theta_i$. Then the probability that all $n$ confidence intervals simultaneously
cover their corresponding parameters is at least $1 - \sum_{i=1}^{n} \alpha_i$.


Proof For each $i = 1, \ldots, n$, define the event $E_i = \{A_i < \theta_i < B_i\}$. Because $(A_i, B_i)$
is a coefficient $1 - \alpha_i$ confidence interval for $\theta_i$, we have $\Pr(E_i^c) \le \alpha_i$ for every $i$,
and the probability that all $n$ intervals simultaneously cover their corresponding
parameters is $\Pr\left(\bigcap_{i=1}^{n} E_i\right)$. By the Bonferroni inequality, this last probability is at
least $1 - \sum_{i=1}^{n} \alpha_i$.


Theorem 9.1.5 gives the corresponding result for a test of the joint hypotheses

\[
H_0: \theta_i = \theta_i^* \text{ for all } i, \qquad H_1: \text{not } H_0.
\tag{11.3.37}
\]

If we want simultaneous coefficient $1 - \alpha_0$ confidence intervals for three parameters,
let $\alpha_i = \alpha_0/3$.


Example
11.3.9


Pressure and the Boiling Point of Water. Suppose that we are interested solely in
simultaneous coefficient 0.95 confidence intervals for the three parameters $\beta_0$, $\beta_1$,
and $\beta_0 + 201.5\beta_1$ in Example 11.3.8. Then we can use coefficient $1 - 0.05/3 = 0.9833$
confidence intervals for each parameter. The necessary quantile of the $t$ distribution
is $T_{n-2}^{-1}(0.9917) = 2.7178$. The three intervals for $\beta_0$, $\beta_1$, and $\beta_0 + 201.5\beta_1$ are
(−1.0146, −0.8889), (0.020296, 0.020828), and (3.1809, 3.1845), respectively. Notice
that these are all shorter than the corresponding intervals based on the elliptical
joint confidence set. The first two of these intervals are longer than the corresponding
intervals from the rectangular joint confidence set in Example 11.3.8, but the
third interval is much shorter than the corresponding interval based on that same
rectangular set.
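
The bookkeeping in Theorem 11.3.7 is easy to check numerically. The sketch below is an illustration under simplifying assumptions that are not part of the text: the function names are hypothetical, and the simulation uses $k$ independent standard normal estimators with known variance rather than the regression setting. It computes the per-interval coefficient $1 - \alpha_0/k$ and the two-sided quantile probability $1 - \alpha_0/(2k)$, then estimates the joint coverage by simulation.

```python
import random
from statistics import NormalDist


def bonferroni_levels(alpha0, k):
    """Per-interval coefficient 1 - alpha0/k and two-sided quantile probability 1 - alpha0/(2k)."""
    return 1 - alpha0 / k, 1 - alpha0 / (2 * k)


def simulated_joint_coverage(k=3, alpha0=0.05, trials=20000, seed=0):
    """Fraction of trials in which all k intervals cover (independent N(0, 1) estimators)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha0 / (2 * k))
    covered = 0
    for _ in range(trials):
        # Each estimator is N(theta_i, 1) with theta_i = 0; its interval is estimate +/- z,
        # which covers theta_i exactly when |estimate| < z.
        if all(abs(rng.gauss(0.0, 1.0)) < z for _ in range(k)):
            covered += 1
    return covered / trials


coefficient, quantile_prob = bonferroni_levels(0.05, 3)
print(round(coefficient, 4), round(quantile_prob, 4))
coverage = simulated_joint_coverage()
print(coverage)  # the theorem guarantees a true joint coverage of at least 0.95
```

With $k = 3$ and $\alpha_0 = 0.05$, the two levels match the values 0.9833 and 0.9917 used in Example 11.3.9.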


Finally, there is a way to construct a narrower confidence band for the entire re- 
gression function based on the Bonferroni inequality, but we leave the details to 
Exercise 25. 

So, which confidence intervals should one use? Also, which test of (11.3.27) 
should one use? None of the tests that we have constructed are uniformly most pow- 
erful. Some are more powerful at some alternatives, while others are more powerful 
at other alternatives. The test corresponding to the rectangular joint confidence set is
more powerful than the elliptical test if either $\beta_0$ or $\beta_1$ is a little larger or smaller than
its hypothesized value while the other parameter is close to its hypothesized value.
The elliptical test is more powerful than the rectangular test if both $\beta_0$ and $\beta_1$ are
a little different from their hypothesized values, even if neither is far enough away
to cause the rectangular test to reject. Without any specification of which alternatives
are most important to detect, one might choose the elliptical test. On the other
hand, if one's sole need is for a few confidence intervals and not a confidence band
for the entire regression function, the intervals based on the Bonferroni inequality
will generally be shorter. The different tests and confidence intervals differ solely by 
which quantiles are used in their construction. The larger the quantile, the longer the 
confidence interval. Table 11.8 gives the quantiles needed for the intervals based on 
the elliptical joint confidence set (which do not depend on how many intervals one 
constructs) and the quantiles needed for various numbers of intervals based on the 
Bonferroni inequality. One can see that the Bonferroni intervals will generally be 
shorter if one wants only three or fewer. 
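
The last column of Table 11.8 can be reproduced without statistical tables, because the quantile function of an $F$ distribution with 2 and $m$ degrees of freedom has the closed form $(m/2)[(1-p)^{-2/m} - 1]$. The following is a minimal sketch; the function name `elliptical_multiplier` is hypothetical.

```python
import math


def elliptical_multiplier(n, alpha0):
    """Return [2 * F^{-1}_{2,n-2}(1 - alpha0)]^{1/2}, the multiplier in Eq. (11.3.34)."""
    m = n - 2
    # Closed-form quantile of the F distribution with 2 and m degrees of freedom.
    f_quantile = (m / 2) * (alpha0 ** (-2 / m) - 1)
    return math.sqrt(2 * f_quantile)


for n in (5, 10, 15, 20, 60, 120):
    print(n,
          round(elliptical_multiplier(n, 0.05), 2),
          round(elliptical_multiplier(n, 0.01), 2))
```

Comparing these values with Bonferroni quantiles $T_{n-2}^{-1}(1 - \alpha_0/[2k])$ for the intended number $k$ of intervals shows which construction gives the shorter intervals.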


Summary
For constants $c_0$ and $c_1$ that are not both 0, we saw that

\[
\left[\frac{c_0^2}{n} + \frac{(c_0\bar{x} - c_1)^2}{s_x^2}\right]^{-1/2}
\frac{c_0\hat\beta_0 + c_1\hat\beta_1 - (c_0\beta_0 + c_1\beta_1)}{\sigma'}
\tag{11.3.38}
\]

has the $t$ distribution with $n - 2$ degrees of freedom under the assumptions of simple
linear regression. We can use the random variable in (11.3.38) to test hypotheses
about or to construct confidence intervals for $\beta_0$, $\beta_1$, and linear combinations of the
two. We also learned how to form a prediction interval for a future observation $Y$
when the corresponding value for $X$ is known.

Tests about both $\beta_0$ and $\beta_1$ simultaneously are based on the statistic $U^2$ in
Eq. (11.3.32), which has the $F$ distribution with 2 and $n - 2$ degrees of freedom
when the null hypothesis $H_0$ in Eq. (11.3.27) is true. A confidence band for the entire
regression line $y = \beta_0 + \beta_1 x$ (a collection of confidence intervals, one for each $x$,
such that all of the intervals simultaneously cover the true values of $\beta_0 + \beta_1 x$ with
probability $1 - \alpha_0$) is given by Eq. (11.3.33). The intervals in the confidence band are
slightly wider than the individual confidence intervals for each separate $x$.


Table 11.8 Comparison of the quantiles needed to compute $k$
simultaneous joint confidence intervals based on the
Bonferroni inequality and based on the elliptical joint
confidence set

                      $T_{n-2}^{-1}(1 - \alpha_0/[2k])$
$\alpha_0$    $n$       k=1    k=2    k=3    k=4    $[2F_{2,n-2}^{-1}(1-\alpha_0)]^{1/2}$

0.05       5      3.18   4.18   4.86   5.39   4.37
          10      2.31   2.75   3.02   3.21   2.99
          15      2.16   2.53   2.75   2.90   2.76
          20      2.10   2.45   2.64   2.77   2.67
          60      2.00   2.30   2.47   2.58   2.51
         120      1.98   2.27   2.43   2.54   2.48
           ∞      1.96   2.24   2.40   2.50   2.45

0.01       5      5.84   7.45   8.58   9.46   7.85
          10      3.36   3.83   4.12   4.33   4.16
          15      3.01   3.37   3.58   3.73   3.66
          20      2.88   3.20   3.38   3.51   3.47
          60      2.66   2.92   3.06   3.16   3.16
         120      2.62   2.86   3.00   3.09   3.10
           ∞      2.58   2.81   2.94   3.03   3.04


It is good practice to plot residuals from a regression against the predictor X. 
Such plots can reveal evidence of departures from the assumptions that underlie
the distribution theory developed in this section. In particular, one should look for 
patterns and unusual points in the plot of residuals. Plots of residuals against X 
help reveal departures from the assumed form of the mean of Y. Plots of sorted 
residuals against normal quantiles help reveal departures from the assumption that 
the distribution of each Y; is normal. 


Exercises 


1. Suppose that in a problem of simple linear regres- 
sion, the 10 pairs of observed values of x; and y; given 
in Table 11.9 are obtained. Test the following hypotheses 


at the level of significance 0.05: 


$H_0: \beta_0 = 0.7$,
$H_1: \beta_0 \ne 0.7$.


2. For the data presented in Table 11.9, test at the level 
of significance 0.05 the hypothesis that the regression line 
passes through the origin in the xy-plane. 


3. For the data presented in Table 11.9, test at the level 
of significance 0.05 the hypothesis that the slope of the 
regression line is 1. 


Table 11.9 Data for Exercise 1 


 i     $x_i$    $y_i$       i     $x_i$    $y_i$

 1     0.3     0.4       6     1.0     0.8
 2     1.4     0.9       7     2.0     0.7
 3     1.0     0.4       8    −1.0    −0.4
 4    −0.3    −0.3       9    −0.7    −0.2
 5    −0.2     0.3      10     0.7     0.7
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4. For the data presented in Table 11.9, test at the level of 
significance 0.05 the hypothesis that the regression line is 
horizontal. 


5. For the data presented in Table 11.9, test the following 
hypotheses at the level of significance 0.10: 

$H_0: \beta_1 = 5\beta_0$,
$H_1: \beta_1 \ne 5\beta_0$.

6. For the data presented in Table 11.9, test the hypothesis 


that when x = 1, the height of the regression line is y = 1 
at the level of significance 0.01. 


7. In a problem of simple linear regression, let $D = \hat\beta_0 + \hat\beta_1\bar{x}$.
Show that the random variables $\hat\beta_1$ and $D$ are uncorrelated,
and explain why $\hat\beta_1$ and $D$ must therefore be
independent.


8. Let the random variable $D$ be defined as in Exercise 7,
and let the random variable $Q^2$ be defined by
Eq. (11.3.30).


a. Show that

\[
\frac{Q^2}{\sigma^2} = \frac{(\hat\beta_1 - \beta_1^*)^2}{\mathrm{Var}(\hat\beta_1)}
                     + \frac{(D - \beta_0^* - \beta_1^*\bar{x})^2}{\mathrm{Var}(D)}.
\]

b. Explain why the random variable $Q^2/\sigma^2$ will have
the $\chi^2$ distribution with two degrees of freedom
when the hypothesis $H_0$ in (11.3.27) is true.


9. For the data presented in Table 11.9, test the following 
hypotheses at the level of significance 0.05: 


$H_0$: $\beta_0 = 0$ and $\beta_1 = 1$,
$H_1$: At least one of the values $\beta_0 = 0$ and
$\beta_1 = 1$ is incorrect.


10. For the data presented in Table 11.9, construct a confidence
interval for $\beta_0$ with confidence coefficient 0.95.


11. For the data presented in Table 11.9, construct a confidence
interval for $\beta_1$ with confidence coefficient 0.95.


12. For the data presented in Table 11.9, construct a confidence
interval for $5\beta_0 - \beta_1 + 4$ with confidence coefficient
0.90.


13. For the data presented in Table 11.9, construct a con- 
fidence interval with confidence coefficient 0.99 for the 
height of the regression line at the point x = 1. 


14. For the data presented in Table 11.9, construct a con- 
fidence interval with confidence coefficient 0.99 for the 
height of the regression line at the point x = 0.42. 


15. Suppose that in a problem of simple linear regression,
a confidence interval with confidence coefficient $1 - \alpha_0$
($0 < \alpha_0 < 1$) is constructed for the height of the regression
line at a given value of $x$. Show that the length of this
confidence interval is shortest when $x = \bar{x}$.


16. Let the statistic $U^2$ be as defined by Eq. (11.3.32), and
let $\gamma$ be a fixed positive constant. Show that for all observed
values $(x_i, y_i)$, for $i = 1, \ldots, n$, the set of points $(\beta_0^*, \beta_1^*)$
such that $U^2 < \gamma$ is the interior of an ellipse in the
$\beta_0^*\beta_1^*$-plane.


17. For the data presented in Table 11.9, construct a confidence
ellipse for $\beta_0$ and $\beta_1$ with confidence coefficient
0.95.


18. 


a. For the data presented in Table 11.9, sketch a con- 
fidence band in the xy-plane for the regression line 
with confidence coefficient 0.95. 


b. On the same graph, sketch the curves which specify 
the limits at each point x of a confidence interval 
with confidence coefficient 0.95 for the value of the 
regression line at the point x. 


19. Determine a value of $c$ such that in a problem of simple
linear regression, the statistic $c \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$
will be an unbiased estimator of $\sigma^2$.


20. Suppose that a simple linear regression of miles per
gallon ($Y$) on car weight ($X$) has been performed with $n = 32$
observations. Suppose that the least-squares estimates
are $\hat\beta_0 = 68.17$ and $\hat\beta_1 = -1.112$, with $\sigma' = 4.281$. Other
useful statistics are $\bar{x} = 30.91$ and $\sum_{i=1}^{n} (x_i - \bar{x})^2 = 2054.8$.


a. Suppose that we want to predict miles per gallon 
Y for a new observation with weight X = 24. What 
would be our prediction? 


b. For the prediction in part (a), find a 95 percent pre- 
diction interval for the unobserved Y value. 


21. Use the data in Table 11.6 on page 707. You should 
perform the least-squares regression requested in Exer- 
cise 18 in Sec. 11.2 before starting this exercise. 


a. Plot the residuals from the least-squares regression 
against the 1970 price. Do you see a pattern? 


b. Transform both prices to their natural logarithms 
and repeat the least-squares regression. Now plot 
the residuals against logarithm of 1970 price. Does 
this plot look any better than the one in part (a)? 


22. Perform a least-squares regression of the logarithm of 
the 1980 fish price on the logarithm of the 1970 fish price, 
using the raw data in Table 11.6 on page 707. 


a. Test the null hypothesis that the slope $\beta_1$ is less than
2.0 at level $\alpha_0 = 0.01$.
b. Find a 90 percent confidence interval for the slope $\beta_1$.


c. Find a 90 percent prediction interval for the 1980 
price of a species that cost 21.4 in 1970. (Note that 
21.4 is the 1970 price, not the logarithm of the 1970 
price.) 


23. Prove that the first test in Theorem 11.3.4 does indeed
have level $\alpha_0$. Hint: Use an argument similar to that used
to prove part (ii) of Theorem 9.5.1.


24. Find explicit formulas (no sup or inf) for the endpoints
of the interval in Eq. (11.3.36) for the following special
cases:


a. $c_0 = 1$ and $c_1 = x > 0$.
b. $c_0 = 1$ and $c_1 = x < 0$.


Hint: In both cases the endpoints are of the form
$\hat\beta_0 + \hat\beta_1 x$ plus or minus linear functions of $x$ that depend on
the lengths of the sides of the rectangular joint confidence
set.


25. In this problem, we will construct a narrower confidence
band for a regression function using Theorem
11.3.7. Let $\hat\beta_0$ and $\hat\beta_1$ be the least-squares estimators, and
let $\sigma'$ be the estimator of $\sigma$ used in this section. Let $x_0 < x_1$
be two possible values of the predictor $X$.

a. Find formulas for the simultaneous coefficient $1 - \alpha_0$
confidence intervals for $\beta_0 + \beta_1 x_0$ and $\beta_0 + \beta_1 x_1$.


b. For each real number $x$, find the formula for the
unique $\alpha$ such that $x = \alpha x_0 + (1 - \alpha)x_1$. Call that
value $\alpha(x)$.


c. Call the intervals found in part (a) $(A_0, B_0)$ and
$(A_1, B_1)$, respectively. Define the event

\[
C = \{A_0 < \beta_0 + \beta_1 x_0 < B_0 \text{ and } A_1 < \beta_0 + \beta_1 x_1 < B_1\}.
\]

For each real $x$, define $L(x)$ and $U(x)$ to be, respectively,
the smallest and largest of the following four
numbers:

\[
\begin{aligned}
&\alpha(x)A_0 + [1 - \alpha(x)]A_1, \qquad \alpha(x)B_0 + [1 - \alpha(x)]A_1,\\
&\alpha(x)A_0 + [1 - \alpha(x)]B_1, \qquad \alpha(x)B_0 + [1 - \alpha(x)]B_1.
\end{aligned}
\]

If the event $C$ occurs, prove that, for every real $x$,
$L(x) < \beta_0 + \beta_1 x < U(x)$.


* 11.4 Bayesian Inference in Simple Linear Regression 


In Sec. 8.6, we introduced an improper prior distribution for the mean $\mu$ and
precision $\tau$ of a normal distribution. This prior simplified several calculations
associated with the posterior distribution of the parameters. The prior also made
some of the resulting inferences bear striking resemblance to inferences based on
the sampling distributions of statistics. Something very similar occurs in the simple
linear regression setting.


Improper Priors for Regression Parameters 


Example
11.4.1


Gasoline Mileage. Once again, consider Example 11.3.2 on page 714. Suppose that
we are interested in saying something about how far we think $\beta_1$ is from 0 and how
strongly we believe that. For example, suppose that we would like to be able to say
how likely it is that $|\beta_1|$ is at most $c$ for arbitrary values of $c$. To do this requires us
to compute a distribution for $\beta_1$. The posterior distribution of $\beta_1$ given the observed
data would serve this purpose.


We shall continue to assume that we will observe pairs of variables $(X_i, Y_i)$ for
$i = 1, \ldots, n$. We shall also assume that the conditional distribution of $Y_1, \ldots, Y_n$, given
$X_1 = x_1, \ldots, X_n = x_n$ and parameters $\beta_0$, $\beta_1$, and $\sigma^2$, is that the $Y_i$ are independent
with $Y_i$ having the normal distribution with mean $\beta_0 + \beta_1 x_i$ and variance $\sigma^2$. Let
$\tau = 1/\sigma^2$ be the precision, as we did in Sec. 8.6. If we let the parameters have an
improper prior with "p.d.f." $\xi(\beta_0, \beta_1, \tau) = 1/\tau$, then it is not difficult to find the
posterior distribution of the parameters.


Theorem
11.4.1


Suppose that $Y_1, \ldots, Y_n$ are independent given $x_1, \ldots, x_n$ and $\beta_0$, $\beta_1$, and $\tau$, with
$Y_i$ having the normal distribution with mean $\beta_0 + \beta_1 x_i$ and precision $\tau$. Let the
prior distribution be improper with "p.d.f." $\xi(\beta_0, \beta_1, \tau) = 1/\tau$. Then the posterior
distribution of $\beta_0$, $\beta_1$, and $\tau$ is as follows. Conditional on $\tau$, the joint distribution of
$\beta_0$ and $\beta_1$ is the bivariate normal distribution with correlation $-n\bar{x}/\left(n\sum_{i=1}^{n} x_i^2\right)^{1/2}$
Table 11.10 Posterior means and variances for simple
linear regression with improper prior

Parameter    Mean             Variance
$\beta_0$        $\hat\beta_0$        $(1/n + \bar{x}^2/s_x^2)/\tau$
$\beta_1$        $\hat\beta_1$        $(s_x^2\tau)^{-1}$


Table 11.11 Relation between Eq. (5.10.2) and
Theorem 11.4.1

(5.10.2)       Theorem 11.4.1
$\rho$            $-n\bar{x}/\left(n\sum_{i=1}^{n} x_i^2\right)^{1/2}$
$\sigma_1^2$      $(1/n + \bar{x}^2/s_x^2)/\tau$
$\sigma_2^2$      $(s_x^2\tau)^{-1}$
$x_1$             $\beta_0$
$\mu_1$           $\hat\beta_0$
$x_2$             $\beta_1$
$\mu_2$           $\hat\beta_1$

and means and variances as given in Table 11.10. The posterior distribution of $\tau$ is
the gamma distribution with parameters $(n - 2)/2$ and $S^2/2$, where $S^2$ is defined in
Eq. (11.3.9). The marginal posterior distribution of

\[
\left[\frac{c_0^2}{n} + \frac{(c_0\bar{x} - c_1)^2}{s_x^2}\right]^{-1/2}
\frac{c_0\beta_0 + c_1\beta_1 - [c_0\hat\beta_0 + c_1\hat\beta_1]}{\sigma'}
\tag{11.4.1}
\]

is the $t$ distribution with $n - 2$ degrees of freedom if $c_0$ and $c_1$ are not both 0.


Proof The posterior p.d.f. is proportional to the product of the prior p.d.f. and
the likelihood function. The likelihood is the conditional p.d.f. of the data
$Y = (Y_1, \ldots, Y_n)$ given the parameters (and $x = (x_1, \ldots, x_n)$), namely,

\[
f_n(y|x, \beta_0, \beta_1, \tau) = \left(\frac{\tau}{2\pi}\right)^{n/2}
\exp\left(-\frac{\tau}{2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2\right).
\tag{11.4.2}
\]

To show that the posterior distribution is as stated in the theorem, it suffices to prove
that $1/\tau$ times (11.4.2) is proportional (as a function of $\beta_0$, $\beta_1$, and $\tau$) to the proposed
posterior p.d.f.

The proposed posterior p.d.f. of $\tau$ is proportional (as a function of $\tau$) to

\[
\tau^{(n-2)/2-1} e^{-S^2\tau/2}.
\tag{11.4.3}
\]

The proposed conditional posterior p.d.f. of $(\beta_0, \beta_1)$ given $\tau$ is the bivariate normal
p.d.f. in Eq. (5.10.2) on page 338 with the substitutions in Table 11.11.
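The key to simplifying the substitutions in Eq. (5.10.2) is to note that

\[
1 - \rho^2 = 1 - \frac{n\bar{x}^2}{\sum_{i=1}^{n} x_i^2} = \frac{s_x^2}{\sum_{i=1}^{n} x_i^2}
\quad\text{and}\quad
\frac{\rho}{\sigma_1\sigma_2} = -\frac{n\tau\bar{x}s_x^2}{\sum_{i=1}^{n} x_i^2}.
\]

The substitutions in Table 11.11 show that the proposed conditional posterior for
$(\beta_0, \beta_1)$ given $\tau$ is proportional to

\[
\tau \exp\left(-\frac{\tau}{2}\left[n(\beta_0 - \hat\beta_0)^2 + 2n\bar{x}(\beta_0 - \hat\beta_0)(\beta_1 - \hat\beta_1)
+ \left(\sum_{i=1}^{n} x_i^2\right)(\beta_1 - \hat\beta_1)^2\right]\right).
\tag{11.4.4}
\]

The product of (11.4.3) and (11.4.4) is the proposed joint posterior p.d.f., and it is
proportional to

\[
\tau^{n/2-1} \exp\left(-\frac{\tau}{2}\left[S^2 + n(\beta_0 - \hat\beta_0)^2 + 2n\bar{x}(\beta_0 - \hat\beta_0)(\beta_1 - \hat\beta_1)
+ \left(\sum_{i=1}^{n} x_i^2\right)(\beta_1 - \hat\beta_1)^2\right]\right).
\tag{11.4.5}
\]
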
We shall now show that $1/\tau$ times the right side of Eq. (11.4.2) is proportional
to (11.4.5). The summation in the exponent of Eq. (11.4.2) is exactly the same as the
summation in Eq. (11.3.29) if we remove the asterisks from (11.3.29). In Sec. 11.3,
we rewrote (11.3.29) as

\[
S^2 + n(\hat\beta_0 - \beta_0)^2 + \left(\sum_{i=1}^{n} x_i^2\right)(\hat\beta_1 - \beta_1)^2
+ 2n\bar{x}(\hat\beta_0 - \beta_0)(\hat\beta_1 - \beta_1),
\tag{11.4.6}
\]

where the asterisks have been removed from (11.4.6). Notice that (11.4.6) is the same
as the factor in the exponent of (11.4.5) that is multiplied by $-\tau/2$. Also, notice that
$1/\tau$ times the factor multiplying the exponential in (11.4.2) is proportional to $\tau^{n/2-1}$. It follows
that $1/\tau$ times (11.4.2) is proportional to (11.4.5).

Finally, we prove that the random variable in (11.4.1) has the $t$ distribution
with $n - 2$ degrees of freedom. Since $(\beta_0, \beta_1)$ has a bivariate normal distribution
conditional on $\tau$, it follows that $c_0\beta_0 + c_1\beta_1$ has a normal distribution conditional
on $\tau$. Its mean is $c_0\hat\beta_0 + c_1\hat\beta_1$. Its variance (given $\tau$) is obtained from Eq. (5.10.9) and
Table 11.10 (after some tedious algebra) as $v/\tau$ where

\[
v = \frac{c_0^2}{n} + \frac{(c_0\bar{x} - c_1)^2}{s_x^2}.
\]

Define the random variable

\[
Z = \left(\frac{\tau}{v}\right)^{1/2} \left(c_0\beta_0 + c_1\beta_1 - [c_0\hat\beta_0 + c_1\hat\beta_1]\right),
\]

and notice that $Z$ has the standard normal distribution given $\tau$ and hence is independent
of $\tau$. The distribution of $W = S^2\tau$ is the gamma distribution with parameters
$(n - 2)/2$ and $1/2$, which is also the $\chi^2$ distribution with $n - 2$ degrees of freedom. It
follows from the definition of the $t$ distribution that $Z/(W/[n - 2])^{1/2}$ has the $t$ distribution
with $n - 2$ degrees of freedom. Since $\sigma'^2 = S^2/(n - 2)$, it is straightforward
to verify that $Z/(W/[n - 2])^{1/2}$ is the same as the random variable in (11.4.1).
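
The structure of the posterior in Theorem 11.4.1 lends itself to a quick simulation check. The sketch below is an illustration only: the function name `sample_beta1_pivot` and the values of $S^2$ and $s_x^2$ are hypothetical. It draws $\tau$ from its gamma posterior, then draws $\beta_1 - \hat\beta_1$ from its conditional normal distribution, and forms the pivot $s_x(\beta_1 - \hat\beta_1)/\sigma'$, which is (11.4.1) with $c_0 = 0$ and $c_1 = 1$. With $n = 16$, about 90 percent of the draws should fall within $\pm 1.761$, the 0.95 quantile of the $t$ distribution with 14 degrees of freedom.

```python
import math
import random


def sample_beta1_pivot(n, s2, sx2, draws=50000, seed=1):
    """Posterior draws of s_x * (beta_1 - beta1_hat) / sigma' under the improper prior."""
    rng = random.Random(seed)
    sigma_prime = math.sqrt(s2 / (n - 2))
    sx = math.sqrt(sx2)
    pivots = []
    for _ in range(draws):
        # tau has a gamma posterior with shape (n-2)/2 and rate S^2/2;
        # random.gammavariate takes a scale parameter, so pass 2/S^2.
        tau = rng.gammavariate((n - 2) / 2, 2 / s2)
        # Given tau, beta_1 - beta1_hat is normal with mean 0 and variance 1/(s_x^2 * tau).
        deviation = rng.gauss(0.0, 1.0 / math.sqrt(sx2 * tau))
        pivots.append(sx * deviation / sigma_prime)
    return pivots


# Hypothetical values of S^2 and s_x^2; the pivot's distribution does not depend on them.
pivots = sample_beta1_pivot(n=16, s2=1.0, sx2=4.0)
fraction = sum(abs(t) < 1.761 for t in pivots) / len(pivots)
print(round(fraction, 3))  # should be near 0.90
```

Changing `s2` or `sx2` rescales the draws of $\tau$ and $\beta_1$ but leaves the distribution of the pivot unchanged, which is what the theorem asserts.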
Example
11.4.2


Pressure and the Boiling Point of Water. At the end of Example 11.3.6, we estimated the
coefficients of the regression of log-pressure on the boiling point using only 16 of the
17 observations in Forbes' original data. We obtained $\hat\beta_0 = -0.9518$ and $\hat\beta_1 = 0.0205$
with $\sigma' = 2.616 \times 10^{-3}$. With one observation removed, we have $n = 16$, $\bar{x} = 202.85$,
and $s_x^2 = 527.9$. We can now apply Theorem 11.4.1 to make an inference based on the
posterior distributions of the parameters. For example, suppose that we are interested
in an interval estimate of $\beta_1$. Letting $c_0 = 0$ and $c_1 = 1$ in (11.4.1), we find that the
posterior distribution of

\[
\frac{s_x}{\sigma'}(\beta_1 - \hat\beta_1) = 449.2(\beta_1 - 0.0205)
\tag{11.4.7}
\]

is the $t$ distribution with 14 degrees of freedom. If we want our interval to contain
a portion of the posterior distribution with probability $1 - \alpha_0$, then we can note
that the posterior probability is $1 - \alpha_0$ that $|449.2(\beta_1 - 0.0205)| < T_{14}^{-1}(1 - \alpha_0/2)$.
For example, if $\alpha_0 = 0.1$, then $T_{14}^{-1}(1 - 0.1/2) = 1.761$. The interval estimate is then
$0.0205 \pm 1.761/449.2 = (0.0166, 0.0244)$.


The reader should note that the random variable in Eq. (11.4.7) is the same as $U_1$ in Eq. (11.3.22) when $\beta_1^* = \beta_1$. This implies that a coefficient $1 - \alpha_0$ confidence interval for $\beta_1$ will be the same as an interval containing posterior probability $1 - \alpha_0$ when we use the improper prior in Theorem 11.4.1. Indeed, the random variable in (11.4.1) is the same as $U_{01}$ in Eq. (11.3.14) for all $c_0$ and $c_1$ so long as $c_0\beta_0 + c_1\beta_1 = c_*$. This implies that coefficient $1 - \alpha_0$ confidence intervals for all linear combinations of the regression parameters will also contain probability $1 - \alpha_0$ of the posterior distribution when the improper prior in Theorem 11.4.1 is used. The reader can prove these claims in Exercises 1 and 2 in this section.


Note: There is a Conjugate Family of Proper Prior Distributions. The posterior 
distribution of the parameters given in Theorem 11.4.1 has the following form: $\tau$ has a gamma distribution, and, conditional on $\tau$, $(\beta_0, \beta_1)$ has a bivariate normal distribution with variances and covariances that are multiples of $1/\tau$. The collection
of distributions of the form just described is a conjugate family of prior distributions 
for the parameters of simple linear regression. Readers interested in the details of 
using such priors can consult a text like Broemeling (1985). 


Prediction Intervals 


On page 716, we showed how to form intervals for predicting future observations. In the Bayesian framework, we can also form intervals for predicting future observations. Let $Y$ be a future observation with corresponding predictor $x$. Then $Z_1 = \tau^{1/2}(Y - \beta_0 - \beta_1 x)$ has the standard normal distribution conditional on the parameters and the data; hence, it is independent of the parameters and the data. Let $\hat{Y} = \hat\beta_0 + \hat\beta_1 x$ as we did on page 716. It can be shown that the conditional distribution of $Z_2 = \tau^{1/2}(\beta_0 + \beta_1 x - \hat{Y})$ given $\tau$ and the data is the normal distribution with mean 0 and variance

$$\frac{1}{n} + \frac{(x - \bar{x})^2}{s_x^2},$$

and hence it is independent of $\tau$ and the data. (See Exercise 3.) Since $Z_1$ is independent of all of the parameters, it is independent of $Z_2$ also. It follows that the conditional distribution of $Z_1 + Z_2 = \tau^{1/2}(Y - \hat{Y})$, given $\tau$ and the data, is the normal distribution with mean 0 and variance

$$1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{s_x^2}.$$

As in the proof of Theorem 11.4.1, $S^2\tau$ has the $\chi^2$ distribution with $n - 2$ degrees of freedom and is independent of $Z_1 + Z_2$. It follows from the definition of the t distribution that the random variable

$$U_x = \frac{Y - \hat{Y}}{\sigma'\left[1 + \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{s_x^2}\right]^{1/2}}$$

has the t distribution with $n - 2$ degrees of freedom given the data. Hence, the conditional probability, given the data, is $1 - \alpha_0$ that $Y$ is in the interval with endpoints

$$\hat{Y} \pm T_{n-2}^{-1}\left(1 - \frac{\alpha_0}{2}\right)\sigma'\left[1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{s_x^2}\right]^{1/2}. \tag{11.4.8}$$

Notice that the $U_x$ defined above is identical to the $U_x$ defined in Eq. (11.3.26). Also, the interval (11.4.8) is the same as the one given in Eq. (11.3.25). The interpretation of the prediction interval based on the posterior distribution is somewhat simpler than the interpretation given after (11.3.25) because the probability is conditional on all of the known quantities (that is, the data). The probability only concerns the distribution of the unknown quantity $Y$ conditional on the data.

Example 11.4.3

Pressure and the Boiling Point of Water. Suppose that we are interested in predicting pressure when the boiling point of water is 208 degrees. We shall find an interval such that the posterior probability is 0.9 that the pressure will be in the interval. That is, we shall use Eq. (11.4.8) with $\alpha_0 = 0.1$ and $x = 208$. We can find $T_{14}^{-1}(0.95) = 1.761$ from the table of the t distribution in this book. The rest of the necessary values are given in Example 11.4.2. In particular, with $Y$ standing for log-pressure, $\hat{Y} = -0.9518 + 0.0205 \times 208 = 3.3122$, and

$$\sigma'\left[1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{s_x^2}\right]^{1/2} = 2.616 \times 10^{-3}\left[1 + \frac{1}{16} + \frac{(208 - 202.85)^2}{527.9}\right]^{1/2} = 2.759 \times 10^{-3}.$$

So our interval for log-pressure has endpoints $3.3122 \pm 1.761 \times 2.759 \times 10^{-3}$, which are 3.307 and 3.317. The interval for pressure itself is then

$$(e^{3.307}, e^{3.317}) = (27.31, 27.58).$$

The reason that we can convert the interval for log-pressure into the interval for pressure so simply is that $3.307 < Y < 3.317$ if and only if $27.31 < e^Y < 27.58$. So, the posterior probability of the first set of inequalities is the same as the posterior probability of the second set of inequalities. ◄
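The arithmetic in this example can be checked with a short sketch. It evaluates the endpoints in (11.4.8) from the summary values quoted in Examples 11.4.2 and 11.4.3; the function name `prediction_interval` and its argument names are illustrative choices, and the t quantile 1.761 is simply the table value quoted in the text rather than something the code computes.

```python
import math

def prediction_interval(y_hat, sigma_prime, n, x, x_bar, sx2, t_quantile):
    """Endpoints of interval (11.4.8) for a future observation at predictor x."""
    # Scale factor sigma' * [1 + 1/n + (x - x_bar)^2 / s_x^2]^(1/2).
    scale = sigma_prime * math.sqrt(1.0 + 1.0 / n + (x - x_bar) ** 2 / sx2)
    half_width = t_quantile * scale
    return y_hat - half_width, y_hat + half_width

# Summary values quoted in Examples 11.4.2 and 11.4.3 (n = 16 observations).
beta0_hat, beta1_hat = -0.9518, 0.0205
x_new = 208.0
y_hat = beta0_hat + beta1_hat * x_new
t_quantile = 1.761  # 0.95 quantile of t with 14 d.f., taken from the text

lo, hi = prediction_interval(y_hat, 2.616e-3, 16, x_new, 202.85, 527.9, t_quantile)

# The exponential is increasing, so the endpoints map directly to pressure.
pressure_lo, pressure_hi = math.exp(lo), math.exp(hi)
print(round(lo, 3), round(hi, 3))
print(round(pressure_lo, 2), round(pressure_hi, 2))
```

Because the transformation from log-pressure to pressure is monotone, no further adjustment of the probability is needed when the endpoints are exponentiated.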


Tests of Hypotheses 


On page 607, we began a discussion of tests based on the posterior distribution. If the cost of type I error is $w_0$ and the cost of type II error is $w_1$, we found that the Bayes test was to reject the null hypothesis if the posterior probability of the null hypothesis is less than $w_1/(w_0 + w_1)$. Suppose that we use the improper prior and that the null hypothesis is $H_0: c_0\beta_0 + c_1\beta_1 = c_*$. Since the posterior distribution of $c_0\beta_0 + c_1\beta_1$ is a continuous distribution, it is clear that the posterior probability of the null hypothesis is 0. For this reason, we shall begin by considering Bayes tests only for one-sided hypotheses. Suppose that the hypotheses of interest are

$$\begin{aligned} H_0 &: c_0\beta_0 + c_1\beta_1 \le c_*, \\ H_1 &: c_0\beta_0 + c_1\beta_1 > c_*. \end{aligned} \tag{11.4.9}$$

The other direction can be handled in a similar fashion. Let $\alpha_0 = w_1/(w_0 + w_1)$. The posterior probability that the null hypothesis is true is the posterior probability that $c_0\beta_0 + c_1\beta_1 \le c_*$. We have already derived the posterior distribution of $c_0\beta_0 + c_1\beta_1$ in Theorem 11.4.1. So, we can compute

$$\begin{aligned}
\Pr(c_0\beta_0 + c_1\beta_1 \le c_*)
&= \Pr\left(\left[\frac{c_0^2}{n} + \frac{(c_0\bar{x} - c_1)^2}{s_x^2}\right]^{-1/2}\frac{c_0\beta_0 + c_1\beta_1 - [c_0\hat\beta_0 + c_1\hat\beta_1]}{\sigma'} \le \left[\frac{c_0^2}{n} + \frac{(c_0\bar{x} - c_1)^2}{s_x^2}\right]^{-1/2}\frac{c_* - [c_0\hat\beta_0 + c_1\hat\beta_1]}{\sigma'}\right) \\
&= T_{n-2}\left(\left[\frac{c_0^2}{n} + \frac{(c_0\bar{x} - c_1)^2}{s_x^2}\right]^{-1/2}\frac{c_* - [c_0\hat\beta_0 + c_1\hat\beta_1]}{\sigma'}\right) \\
&= T_{n-2}(-U_{01}),
\end{aligned}$$

where $T_{n-2}$ denotes the c.d.f. of the t distribution with $n - 2$ degrees of freedom and $U_{01}$ is the random variable defined in Eq. (11.3.14). It is simple to see that $T_{n-2}(-U_{01}) \le \alpha_0$ if and only if $U_{01} \ge T_{n-2}^{-1}(1 - \alpha_0)$. Hence, the Bayes test of the hypotheses (11.4.9) is the same as the level $\alpha_0$ test of these same hypotheses that was derived after Eq. (11.3.16). Hence, all of the one-sided tests that we learned how to perform in Sec. 11.3 are also Bayes tests when the improper prior is used.

On page 610, we began a discussion of how to deal with two-sided alternatives when the posterior distribution of the parameter was continuous. The same approach can be used in linear regression problems. We shall illustrate with an example.

Example 11.4.4

Gasoline Mileage. In Example 11.4.1, we wanted to make use of the posterior distribution of the slope parameter $\beta_1$ from Example 11.3.2 in order to be able to say how likely we believe it is that $\beta_1$ is close to 0. We can draw a plot of the posterior c.d.f. of $|\beta_1|$ by making use of Theorem 11.4.1. The posterior distribution of $s_x(\beta_1 - \hat\beta_1)/\sigma'$ is the t distribution with $n - 2$ degrees of freedom. In Example 11.3.2, we computed $s_x = 1036.9$, $\sigma' = 7.181 \times 10^{-3}$, $\hat\beta_1 = 1.396 \times 10^{-4}$, and $n = 173$. It follows that, for all positive $c$,


$$\begin{aligned}
\Pr(|\beta_1| \le c) = \Pr(-c \le \beta_1 \le c)
&= T_{171}\left(\frac{1036.9}{7.181 \times 10^{-3}}\left(c - 1.396 \times 10^{-4}\right)\right) \\
&\quad - T_{171}\left(\frac{1036.9}{7.181 \times 10^{-3}}\left(-c - 1.396 \times 10^{-4}\right)\right),
\end{aligned}$$

where $T_{171}$ is the c.d.f. of the t distribution with 171 degrees of freedom. Figure 11.14 contains a plot of the posterior c.d.f. of $|\beta_1|$. We can see that the probability is essentially 1 that $|\beta_1| < 1.6 \times 10^{-4}$, but it is also essentially 1 that $|\beta_1| > 1.2 \times 10^{-4}$. These numbers may look small. However, remember that $\beta_1$ must get multiplied by horsepower, which is typically a number in the 50-300 range. So, even if $\beta_1$ is as small as $1.2 \times 10^{-4}$, the difference between gallons per mile at 100 and 200 horsepower will be 0.012, which is a sizeable difference in gallons per mile. We can also translate this result into miles per gallon. Suppose that $\beta_1 = 1.2 \times 10^{-4}$, and suppose that $\beta_0$ equals its conditional mean given that $\beta_1 = 1.2 \times 10^{-4}$. This conditional mean can be computed using the method of Exercise 7, and it equals 0.01897. Then the miles per gallon for a 200 horsepower car is 23.27, and the miles per gallon for a 100 horsepower car is 32.23. ◄

Figure 11.14 Plot of posterior c.d.f. of $|\beta_1|$ in Example 11.4.4.


Summary


We have used improper prior distributions for the parameters of the simple linear regression model, and we have found the posterior distributions of the parameters after observing $n$ data points. The posterior distributions of the intercept and slope parameters are t distributions with $n - 2$ degrees of freedom that have been shifted and rescaled. These posterior distributions show striking similarities to the sampling distributions of the least-squares estimators. Indeed, posterior probability intervals for the parameters are exactly the same as confidence intervals, prediction intervals for future observations are the same as those based on the sampling distributions, and level $\alpha_0$ tests of one-sided null and alternative hypotheses reject the null hypotheses when the posterior probability of the null hypothesis is less than $\alpha_0$. The only significant lack of connection between posterior calculations and those based on sampling distributions is the testing of hypotheses in which the alternative is two-sided.


Exercises


1. Assume the usual conditions for simple linear regression. Assume that we use the improper prior discussed in this section. Let $(a, b)$ be the observed value of a coefficient $1 - \alpha_0$ confidence interval for $\beta_1$ constructed as in Sec. 11.3. Prove that the posterior probability is $1 - \alpha_0$ that $a < \beta_1 < b$.


2. Assume the usual conditions for simple linear regression. Assume that we use the improper prior discussed in this section. Let $(a, b)$ be the observed value of a coefficient $1 - \alpha_0$ confidence interval for $c_0\beta_0 + c_1\beta_1$ constructed as in Sec. 11.3. Prove that the posterior probability is $1 - \alpha_0$ that $a < c_0\beta_0 + c_1\beta_1 < b$.


3. Assume a simple linear regression model with the improper prior. Show that, conditional on $\tau$, the posterior distribution of $\tau^{1/2}(\beta_0 + \beta_1 x - \hat{Y})$ is the normal distribution with mean 0 and variance

$$\frac{1}{n} + \frac{(x - \bar{x})^2}{s_x^2}.$$


4. We wish to fit a simple linear regression model to the data in Table 11.9 on page 727. Use an improper prior distribution.
a. Find the posterior distribution of the parameters.
b. Find a bounded interval that contains 90 percent of the posterior distribution of $\beta_1$.
c. Find the probability that $\beta_0$ is between 0 and 2.


5. Use the data in Table 11.9, and suppose that we wish to fit a simple linear regression model to the data. Use the improper prior.
a. Find the posterior distribution of the slope parameter $\beta_1$.
b. Find the posterior distribution of $\beta_0 + \beta_1$, the mean of a future observation $Y$ corresponding to $x = 1$.
c. Draw a graph of the posterior c.d.f. of $|\beta_1 - 0.7|$.


6. Use the data in Table 11.6 on page 707. Assume that we wish to fit a simple linear regression model for predicting logarithm of 1980 price from logarithm of 1970 price.
a. Find the posterior distribution of the slope parameter $\beta_1$.
b. Find the posterior probability that $\beta_1 \le 2$.
c. Find a 95 percent prediction interval for the 1980 price of a species that cost 21.4 in 1970.


7. In a simple linear regression problem with the usual improper prior, prove that the conditional mean of $\beta_0$ given $\beta_1$ is $\hat\beta_0 - \bar{x}(\beta_1 - \hat\beta_1)$. Hint: Use the fact that $(\beta_0, \beta_1)$ has a bivariate normal distribution as described in Theorem 11.4.1, and then use Eq. (5.10.6) to find the conditional mean.


11.5 The General Linear Model and Multiple Regression


The simple linear regression model can be extended to allow the mean of $Y$ to be a function of several predictor variables. Much of the resulting distribution theory is very similar to the simple regression case.


The General Linear Model 


Example 11.5.1

Unemployment in the 1950s. The data in Table 11.12 provide the unemployment rates during the 10 years from 1950 to 1959 together with an index of industrial production from the Federal Reserve Board. It might make sense to think that unemployment is related to industrial production. Other factors also play a role, and those other factors most likely changed over the course of the decade. As a surrogate for these other factors, some function of the year could be included as a predictor. Figure 11.15 shows plots of unemployment against each of the two predictor variables. It is not clear from the plots precisely how unemployment varies with the two predictors, but there appear to be some relationships. In this section, we shall show how to fit a regression model with more than one predictor to these and other data. ◄


In this section, we shall study regression problems in which the observations $Y_1, \ldots, Y_n$ satisfy assumptions like Assumptions 11.2.1-11.2.5 that were made in Sections 11.2 and 11.3. In particular, we shall again assume that each observation $Y_i$ has a normal distribution, that the observations $Y_1, \ldots, Y_n$ are independent, and that the observations $Y_1, \ldots, Y_n$ have the same variance $\sigma^2$. Instead of a single predictor being associated with each $Y_i$, we assume that a $p$-dimensional vector $z_i = (z_{i0}, \ldots, z_{ip-1})$ is associated with each $Y_i$. The assumptions that we make can now be restated in this framework.


Assumption 11.5.1

Predictor is known. Either the vectors $z_1, \ldots, z_n$ are known ahead of time, or they are the observed values of random vectors $Z_1, \ldots, Z_n$ on whose values we condition before computing the joint distribution of $(Y_1, \ldots, Y_n)$.


Assumption 11.5.2

Normality. For $i = 1, \ldots, n$, the conditional distribution of $Y_i$ given the vectors $z_1, \ldots, z_n$ is a normal distribution.


Assumption 11.5.3

Linear Mean. There is a vector of parameters $\beta = (\beta_0, \ldots, \beta_{p-1})$ such that the conditional mean of $Y_i$ given the values $z_1, \ldots, z_n$ has the form

$$z_{i0}\beta_0 + z_{i1}\beta_1 + \cdots + z_{ip-1}\beta_{p-1} \tag{11.5.1}$$

for $i = 1, \ldots, n$.


Assumption 11.5.4

Common Variance. There is a parameter $\sigma^2$ such that the conditional variance of $Y_i$ given the values $z_1, \ldots, z_n$ is $\sigma^2$ for $i = 1, \ldots, n$.


Assumption 11.5.5

Independence. The random variables $Y_1, \ldots, Y_n$ are independent given the observed $z_1, \ldots, z_n$.


Table 11.12 Unemployment data for Example 11.5.1

Unemployment    Index of production    Year
3.1             113                    1950
1.9             123                    1951
1.7             127                    1952
1.6             138                    1953
3.2             130                    1954
2.7             146                    1955
2.6             151                    1956
2.9             152                    1957
4.7             141                    1958
3.8             159                    1959


Figure 11.15 Plots of unemployment against the two predictor variables for Example 11.5.1. (Left panel: unemployment against index of production. Right panel: unemployment against year.)


The generalization that we introduce here is that the mean of each observation $Y_i$ is a linear combination of $p$ unknown parameters $\beta_0, \ldots, \beta_{p-1}$ as in (11.5.1). Each value $z_{ij}$ either may be fixed by the experimenter before the experiment is started or may be observed in the experiment along with the value of $Y_i$. In the latter case, Eq. (11.5.1) gives the conditional mean of $Y_i$ given the observed $z_{ij}$ values.


Definition 11.5.1

General Linear Model. The statistical model in which the observations $Y_1, \ldots, Y_n$ satisfy Assumptions 11.5.1-11.5.5 is called the general linear model.


In Definition 11.5.1, the term linear refers to the fact that the expectation of each observation $Y_i$ is a linear function of the unknown parameters $\beta_0, \ldots, \beta_{p-1}$.

Many different types of regression problems are examples of general linear models. For example, in a problem of simple linear regression, $E(Y_i) = \beta_0 + \beta_1 x_i$ for $i = 1, \ldots, n$. This expectation can be represented in the form given in Eq. (11.5.1), with $p = 2$, by letting $z_{i0} = 1$ and $z_{i1} = x_i$ for $i = 1, \ldots, n$. Similarly, if the regression of $Y$ on $X$ is a polynomial of degree $k$, then, for $i = 1, \ldots, n$,

$$E(Y_i) = \beta_0 + \beta_1 x_i + \cdots + \beta_k x_i^k. \tag{11.5.2}$$

In this case, $p = k + 1$ and $E(Y_i)$ can be represented in the form given in Eq. (11.5.1) by letting $z_{ij} = x_i^j$ for $j = 0, \ldots, k$.
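As a small illustration of this representation, the following sketch builds the rows $(z_{i0}, \ldots, z_{ik})$ for a polynomial regression of degree $k$. The function name `polynomial_design_matrix` and the sample predictor values are hypothetical, chosen only to show the construction.

```python
def polynomial_design_matrix(x_values, degree):
    """Return rows (z_i0, ..., z_ik) with z_ij = x_i ** j for j = 0, ..., k."""
    return [[x ** j for j in range(degree + 1)] for x in x_values]

# Hypothetical predictor values; degree 2 gives p = k + 1 = 3 columns.
Z = polynomial_design_matrix([1.0, 2.0, 3.0], degree=2)
for row in Z:
    print(row)

# Degree 1 recovers simple linear regression: z_i0 = 1 and z_i1 = x_i.
Z_simple = polynomial_design_matrix([1.0, 2.0, 3.0], degree=1)
```

The model stays linear in the parameters even though the columns are nonlinear functions of $x_i$, which is why polynomial regression fits within Eq. (11.5.1).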

As a final example, consider a problem in which the regression of $Y$ on $k$ variables $X_1, \ldots, X_k$ is a function like that given in Eq. (11.2.1). A problem of this type is called a problem of multiple linear regression because we are considering the regression of $Y$ on $k$ variables $X_1, \ldots, X_k$, rather than on just a single variable $X$, and we are assuming also that this regression is a linear function of the parameters $\beta_0, \ldots, \beta_k$. In a problem of multiple linear regression, we obtain $n$ vectors of observations $(x_{i1}, \ldots, x_{ik}, Y_i)$, for $i = 1, \ldots, n$. Here $x_{ij}$ is the observed value of the variable $X_j$ for the $i$th observation. Then $E(Y_i)$ is given by the relation

$$E(Y_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}. \tag{11.5.3}$$

This expectation can also be represented in the form given in Eq. (11.5.1), with $p = k + 1$, by letting $z_{i0} = 1$ and $z_{ij} = x_{ij}$ for $j = 1, \ldots, k$.


Example 11.5.2

Unemployment in the 1950s. In Example 11.5.1, we can let $Y$ stand for the unemployment rate, while $X_1$ stands for the index of production and $X_2$ stands for the year. ◄


Our discussion has indicated that the general linear model is general enough to 
include problems of simple and multiple linear regression, problems in which the 
regression function is a polynomial, problems in which the regression function has 
the form given in Eq. (11.1.16), and many other problems. 

Some books devoted to regression and other linear models are Cook and Weis- 
berg (1999), Draper and Smith (1998), Graybill and Iyer (1994), and Weisberg (1985). 


Maximum Likelihood Estimators 


We shall now describe a procedure for determining the M.L.E.'s of $\beta_0, \ldots, \beta_{p-1}$ in the general linear model. Since $E(Y_i)$ is given by Eq. (11.5.1) for $i = 1, \ldots, n$, the likelihood function after observing values $y_1, \ldots, y_n$ will have the following form:

$$\frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - z_{i0}\beta_0 - \cdots - z_{ip-1}\beta_{p-1})^2\right]. \tag{11.5.4}$$

Since the M.L.E.'s are the values that maximize the likelihood function (11.5.4), it can be seen that the estimates $\hat\beta_0, \ldots, \hat\beta_{p-1}$ will be the values of $\beta_0, \ldots, \beta_{p-1}$ for which the following sum of squares $Q$ is minimized:

$$Q = \sum_{i=1}^n (y_i - z_{i0}\beta_0 - \cdots - z_{ip-1}\beta_{p-1})^2. \tag{11.5.5}$$

Since $Q$ is the sum of the squares of the deviations of the observed values from the linear function given in Eq. (11.5.1), it follows that the M.L.E.'s $\hat\beta_0, \ldots, \hat\beta_{p-1}$ will be the same as the least-squares estimates.

To determine the values of $\hat\beta_0, \ldots, \hat\beta_{p-1}$, we can calculate the $p$ partial derivatives $\partial Q/\partial\beta_j$ for $j = 0, \ldots, p - 1$ and can set each of these derivatives equal to 0. The resulting $p$ equations, which are called the normal equations, will form a set of $p$ linear equations in $\beta_0, \ldots, \beta_{p-1}$. We shall assume that the $p \times p$ matrix formed by the coefficients of $\beta_0, \ldots, \beta_{p-1}$ in the normal equations is nonsingular. Then these equations will have a unique solution $\hat\beta_0, \ldots, \hat\beta_{p-1}$, and $\hat\beta_0, \ldots, \hat\beta_{p-1}$ will be both the M.L.E.'s and the least-squares estimates of $\beta_0, \ldots, \beta_{p-1}$.

For a problem of polynomial regression in which E(Y;) is given by Eq. (11.5.2), 
the normal equations were presented as the relations (11.1.8). For a problem of mul- 
tiple linear regression in which E(Y;) is given by Eq. (11.5.3), the normal equations 
were presented as the relations (11.1.13). 

If we substitute $\hat\beta_i$ for $\beta_i$ for $i = 0, \ldots, p - 1$ in the formula for $Q$ in Eq. (11.5.5), we obtain

$$S^2 = \sum_{i=1}^n (y_i - z_{i0}\hat\beta_0 - \cdots - z_{ip-1}\hat\beta_{p-1})^2. \tag{11.5.6}$$

Eq. (11.5.6) is the natural generalization of Eq. (11.3.9) to the multiple regression case. It can be shown using the same method outlined in the proof of Theorem 11.2.1 that the M.L.E. of $\sigma^2$ in the general linear model is

$$\hat\sigma^2 = \frac{S^2}{n}. \tag{11.5.7}$$

The details are left to Exercise 1 at the end of this section. In analogy to Eq. (11.3.12), we define the useful quantity

$$\sigma' = \left(\frac{S^2}{n - p}\right)^{1/2}. \tag{11.5.8}$$

This makes $\sigma'^2$ an unbiased estimator of $\sigma^2$. (See Exercise 2.)


Explicit Form of the Estimators 


In order to derive the explicit form and the properties of the estimators $\hat\beta_0, \ldots, \hat\beta_{p-1}$, it is convenient to use the notation and techniques of vectors and matrices. We shall let the $n \times p$ matrix $Z$ be defined as follows:

$$Z = \begin{bmatrix} z_{10} & \cdots & z_{1p-1} \\ \vdots & & \vdots \\ z_{n0} & \cdots & z_{np-1} \end{bmatrix}. \tag{11.5.9}$$

This matrix $Z$ distinguishes one regression problem from another, because the entries in $Z$ determine the particular linear combinations of the unknown parameters $\beta_0, \ldots, \beta_{p-1}$ that are relevant in a given problem.


Definition 11.5.2

Design Matrix. The matrix $Z$ in Eq. (11.5.9) for a general linear model is called the design matrix of the model.


The name "design matrix" comes from the case in which the $z_{ij}$ are chosen by the experimenter to achieve a well-designed experiment. It should be kept in mind, however, that some or all of the entries in $Z$ may be simply the observed values of certain variables, and may not actually be controlled by the experimenter.

We shall also let $y$ be the $n \times 1$ vector of observed values of $Y_1, \ldots, Y_n$, $\beta$ be the $p \times 1$ vector of parameters, and $\hat\beta$ be the $p \times 1$ vector of estimates. These vectors may be represented as follows:

$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_{p-1} \end{bmatrix}, \quad \text{and} \quad \hat\beta = \begin{bmatrix} \hat\beta_0 \\ \vdots \\ \hat\beta_{p-1} \end{bmatrix}.$$

The transpose of a vector or matrix $v$ will be denoted by $v'$.


Theorem 11.5.1

General Linear Model Estimators. The least squares estimator (and M.L.E.) of $\beta$ is

$$\hat\beta = (Z'Z)^{-1}Z'Y. \tag{11.5.10}$$

Proof The sum of squares $Q$ given in Eq. (11.5.5) can be written in the following form:

$$Q = (y - Z\beta)'(y - Z\beta).$$

Since $Q$ is a quadratic function of the coordinates of $\beta$, it is straightforward to take the partial derivatives of $Q$ with respect to these coordinates and set them equal to 0. For example, the partial derivative with respect to $\beta_0$ is

$$\frac{\partial Q}{\partial\beta_0} = -2\sum_{i=1}^n z_{i0}y_i + 2\sum_{j=0}^{p-1}\beta_j\sum_{i=1}^n z_{i0}z_{ij}. \tag{11.5.11}$$

Each of the other partial derivatives yields an equation similar to (11.5.11). Set the right-hand sides of each of these $p$ equations to 0, and arrange them into the following matrix equation:

$$Z'Z\hat\beta = Z'y. \tag{11.5.12}$$

Because it is assumed that the $p \times p$ matrix $Z'Z$ is nonsingular, the vector of estimates $\hat\beta$ will be the unique solution of Eq. (11.5.12). In order for $Z'Z$ to be nonsingular, the number of observations $n$ must be at least $p$, and there must be at least $p$ linearly independent rows in the matrix $Z$. When this assumption is satisfied, it follows from Eq. (11.5.12) that $\hat\beta = (Z'Z)^{-1}Z'y$. Thus, if we replace the vector $y$ of observed values by the vector $Y$ of random variables, the form for the vector of estimators $\hat\beta$ will be (11.5.10). ∎


Virtually every statistical computer package will calculate least-squares estimates for a multiple linear regression. Even some handheld calculators will perform multiple linear regression. The matrix $(Z'Z)^{-1}$ is useful for more than just computing $\hat\beta$ in Eq. (11.5.10), as we shall see later in this section. Not every piece of regression software makes it easy to access this matrix.

It follows from Eq. (11.5.10) that each of the estimators $\hat\beta_0, \ldots, \hat\beta_{p-1}$ will be a linear combination of the coordinates $Y_1, \ldots, Y_n$ of the vector $Y$. Since each of these coordinates has a normal distribution and they are independent, it follows that each estimator $\hat\beta_j$ will also have a normal distribution. Indeed, the entire vector $\hat\beta$ has a joint normal distribution (called a multivariate normal distribution), which is a generalization of the bivariate normal distribution to more than two coordinates. We shall not discuss the multivariate normal distribution in detail in this text, but we shall merely point out one feature that it has in common with the bivariate normal distribution: If a vector $\hat\beta$ has a multivariate normal distribution, then every linear combination of the coordinates of $\hat\beta$ has a normal distribution. Indeed, every collection of linear combinations of the coordinates of $\hat\beta$ has a multivariate normal distribution.


Example 11.5.3

Unemployment in the 1950s. The matrix $Z$ in Example 11.5.1 has three columns. The first column is the number 1 ten times. The second column is the second column of Table 11.12. In order to avoid some numerical problems, we shall let the third column of $Z$ be the third column of Table 11.12 minus 1949. The vector $y$ is the first column of Table 11.12. We can then compute the matrix $(Z'Z)^{-1}$ and the vector $Z'y$:

$$(Z'Z)^{-1} = \begin{bmatrix} 38.35 & -0.3323 & 1.383 \\ -0.3323 & 2.915 \times 10^{-3} & -0.01272 \\ 1.383 & -0.01272 & 0.06762 \end{bmatrix}, \quad Z'y = \begin{bmatrix} 28.2 \\ 3931 \\ 172.3 \end{bmatrix}.$$

We can then use Eq. (11.5.10) to compute

$$\hat\beta = \begin{bmatrix} 13.45 \\ -0.1033 \\ 0.6594 \end{bmatrix}.$$

We shall examine the residuals later in this section. ◄
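The computation in this example can be reproduced with a short sketch that forms $Z'Z$ and $Z'y$ and solves the normal equations (11.5.12) directly. The helper names `matmul` and `solve` are illustrative, and Gaussian elimination is used here in place of an explicit matrix inverse purely as an implementation choice. The 1950 unemployment rate is taken to be 3.1, an assumption that reproduces the first entry 28.2 of $Z'y$ reported above.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

# Data from Table 11.12 (first unemployment value assumed to be 3.1).
unemployment = [3.1, 1.9, 1.7, 1.6, 3.2, 2.7, 2.6, 2.9, 4.7, 3.8]
production = [113, 123, 127, 138, 130, 146, 151, 152, 141, 159]
years = list(range(1950, 1960))

# Design matrix: constant, index of production, year minus 1949.
Z = [[1.0, float(p), float(year - 1949)] for p, year in zip(production, years)]
Zt = [list(col) for col in zip(*Z)]

ZtZ = matmul(Zt, Z)
Zty = [sum(z * y for z, y in zip(col, unemployment)) for col in Zt]

# Solve Z'Z beta_hat = Z'y, which is Eq. (11.5.12).
beta_hat = solve(ZtZ, Zty)
print([round(v, 1) for v in Zty])
print([round(b, 4) for b in beta_hat])
```

Solving the linear system rather than inverting $Z'Z$ is usually preferable numerically, although the inverse itself is still needed later for variances and covariances of the estimators.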


Mean Vector and Covariance Matrix 


We shall now derive the means, variances, and covariances of $\hat\beta_0, \ldots, \hat\beta_{p-1}$. Suppose that $Y$ is an $n$-dimensional random vector with coordinates $Y_1, \ldots, Y_n$. Thus,

$$Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}. \tag{11.5.13}$$

The expectation $E(Y)$ of this random vector is defined to be the $n$-dimensional vector whose coordinates are the expectations of the individual coordinates of $Y$. Hence,

$$E(Y) = \begin{bmatrix} E(Y_1) \\ \vdots \\ E(Y_n) \end{bmatrix}.$$


Definition 11.5.3

Mean Vector/Covariance Matrix. If $Y$ is a random vector, then the vector $E(Y)$ is called the mean vector of $Y$. The covariance matrix of $Y$ is defined to be the $n \times n$ matrix such that, for $i = 1, \ldots, n$ and $j = 1, \ldots, n$, the element in the $i$th row and $j$th column is $\mathrm{Cov}(Y_i, Y_j)$. We shall let $\mathrm{Cov}(Y)$ denote this covariance matrix.


For example, if $\mathrm{Cov}(Y_i, Y_j) = \sigma_{ij}$ for all $i$ and $j$, then

$$\mathrm{Cov}(Y) = \begin{bmatrix} \sigma_{11} & \cdots & \sigma_{1n} \\ \vdots & & \vdots \\ \sigma_{n1} & \cdots & \sigma_{nn} \end{bmatrix}.$$

For $i = 1, \ldots, n$, $\mathrm{Var}(Y_i) = \mathrm{Cov}(Y_i, Y_i) = \sigma_{ii}$. Therefore, the $n$ diagonal elements of the matrix $\mathrm{Cov}(Y)$ are the variances of $Y_1, \ldots, Y_n$. Furthermore, since $\mathrm{Cov}(Y_i, Y_j) = \mathrm{Cov}(Y_j, Y_i)$, then $\sigma_{ij} = \sigma_{ji}$. Therefore, the matrix $\mathrm{Cov}(Y)$ must be symmetric.

The mean vector and the covariance matrix of the random vector $Y$ in the general linear model can easily be determined. It follows from Eq. (11.5.1) that

$$E(Y) = Z\beta. \tag{11.5.14}$$

Also, the coordinates $Y_1, \ldots, Y_n$ of $Y$ are independent, and the variance of each of these coordinates is $\sigma^2$. Therefore,

$$\mathrm{Cov}(Y) = \sigma^2 I, \tag{11.5.15}$$

where $I$ is the $n \times n$ identity matrix.

The following result helps us find the mean vector and covariance matrix of $\hat\beta$.

Theorem 11.5.2

Suppose that $Y$ is an $n$-dimensional random vector as specified by Eq. (11.5.13), for which the mean vector $E(Y)$ and the covariance matrix $\mathrm{Cov}(Y)$ exist. Suppose also that $A$ is a $p \times n$ matrix whose elements are constants, and that $W$ is a $p$-dimensional random vector defined by the relation $W = AY$. Then $E(W) = AE(Y)$ and $\mathrm{Cov}(W) = A\,\mathrm{Cov}(Y)A'$.


Proof Let the elements of matrix $A$ be denoted as follows:

$$A = \begin{bmatrix} a_{01} & \cdots & a_{0n} \\ \vdots & & \vdots \\ a_{p-1\,1} & \cdots & a_{p-1\,n} \end{bmatrix}.$$

Then the $i$th coordinate of the vector $E(W)$ is

$$E(W_i) = E\left(\sum_{j=1}^n a_{ij}Y_j\right) = \sum_{j=1}^n a_{ij}E(Y_j). \tag{11.5.16}$$

It can be seen that the final summation in Eq. (11.5.16) is the $i$th coordinate of the vector $AE(Y)$. Hence, $E(W) = AE(Y)$.

Next, for $i = 0, \ldots, p - 1$ and $j = 0, \ldots, p - 1$, the element in the $i$th row and $j$th column of the $p \times p$ matrix $\mathrm{Cov}(W)$ is

$$\mathrm{Cov}(W_i, W_j) = \mathrm{Cov}\left(\sum_{r=1}^n a_{ir}Y_r, \sum_{s=1}^n a_{js}Y_s\right).$$

Therefore, by Exercise 8 of Sec. 4.6,

$$\mathrm{Cov}(W_i, W_j) = \sum_{r=1}^n\sum_{s=1}^n a_{ir}a_{js}\,\mathrm{Cov}(Y_r, Y_s). \tag{11.5.17}$$

Using the formula for matrix multiplication, one finds that the right side of Eq. (11.5.17) is the element in the $i$th row and $j$th column of the $p \times p$ matrix $A\,\mathrm{Cov}(Y)A'$. Hence, $\mathrm{Cov}(W) = A\,\mathrm{Cov}(Y)A'$. ∎
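A small numerical sketch may make the two identities concrete. The matrix $A$, the mean vector, and the value of $\sigma^2$ below are hypothetical, and the helpers `matmul` and `transpose` are illustrative names. The sketch only evaluates $AE(Y)$ and $A\,\mathrm{Cov}(Y)A'$ for a case with independent coordinates of common variance, as in Eq. (11.5.15).

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

# Hypothetical 2 x 3 matrix of constants (p = 2, n = 3).
A = [[1.0, 1.0, 1.0],
     [1.0, 0.0, -1.0]]

# Hypothetical mean vector and covariance matrix sigma^2 * I with sigma^2 = 2.
mean_Y = [[1.0], [2.0], [3.0]]
sigma2 = 2.0
cov_Y = [[sigma2 if i == j else 0.0 for j in range(3)] for i in range(3)]

mean_W = matmul(A, mean_Y)                       # E(W) = A E(Y)
cov_W = matmul(matmul(A, cov_Y), transpose(A))   # Cov(W) = A Cov(Y) A'
print(mean_W)
print(cov_W)
```

In this hypothetical case the two rows of $A$ are orthogonal, so the coordinates of $W$ are uncorrelated even though both are built from the same $Y$.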


The means, the variances, and the covariances of the estimators $\hat\beta_0, \ldots, \hat\beta_{p-1}$ can be obtained by applying Theorem 11.5.2.


Theorem 11.5.3

In the general linear model, $E(\hat\beta) = \beta$, and $\mathrm{Cov}(\hat\beta) = \sigma^2(Z'Z)^{-1}$.


Proof Eq. (11.5.10) says that $\hat\beta$ can be represented in the form $\hat\beta = AY$, where $A = (Z'Z)^{-1}Z'$. Therefore, it follows from Theorem 11.5.2 and Eq. (11.5.14) that

$$E(\hat\beta) = (Z'Z)^{-1}Z'E(Y) = (Z'Z)^{-1}Z'Z\beta = \beta.$$

Also, it follows from Theorem 11.5.2 and Eq. (11.5.15) that

$$\begin{aligned}
\mathrm{Cov}(\hat\beta) &= (Z'Z)^{-1}Z'\,\mathrm{Cov}(Y)\,Z(Z'Z)^{-1} \\
&= (Z'Z)^{-1}Z'(\sigma^2 I)Z(Z'Z)^{-1} \\
&= \sigma^2(Z'Z)^{-1}.
\end{aligned}$$
∎


Thus, $E(\hat\beta_j) = \beta_j$ for $j = 0, \ldots, p - 1$, and for $j = 0, \ldots, p - 1$, $\mathrm{Var}(\hat\beta_j)$ equals $\sigma^2$ times the $j$th diagonal entry of the matrix $(Z'Z)^{-1}$. Also, for $i \ne j$, $\mathrm{Cov}(\hat\beta_i, \hat\beta_j)$ will be equal to $\sigma^2$ times the entry in the $i$th row and $j$th column of the matrix $(Z'Z)^{-1}$.


Example 11.5.4

Dishwasher Shipments. The United States Department of Commerce collects data on factory shipments of durable goods as well as other economic indicators. Table 11.13 contains the numbers of factory shipments of dishwashers (in thousands) and private residential investment in billions of 1972 dollars for the years 1960 through 1985. Figure 11.16 shows plots of dishwasher shipments against year and private residential investment. Let $Y$ stand for dishwasher shipments. We could fit a model in which the mean of $Y$ is given by Eq. (11.5.3) with $k = 2$. The matrix $Z$ would have three columns and 26 rows. The first column would consist entirely of the number 1. The second column would have time, expressed as the year minus 1960 for numerical stability. The third column would have private residential investment. We can then compute

$$(Z'Z)^{-1} = \begin{bmatrix} 1.152 & 0.01279 & -0.02660 \\ 0.01279 & 0.001136 & -0.0005636 \\ -0.02660 & -0.0005636 & 0.0007026 \end{bmatrix}.$$

The correlation between $\hat\beta_1$ and $\hat\beta_2$ can be computed as

$$\frac{\mathrm{Cov}(\hat\beta_1, \hat\beta_2)}{(\mathrm{Var}(\hat\beta_1)\mathrm{Var}(\hat\beta_2))^{1/2}} = \frac{-0.0005636\sigma^2}{(0.001136\sigma^2 \times 0.0007026\sigma^2)^{1/2}} = -0.6309.$$

Notice that the correlation does not depend on the unknown value of $\sigma^2$, but only on the design matrix. Also notice that the correlation is negative and sizeable. If one of the coefficients is overestimated, the other one will tend to be underestimated. ◄
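The correlation can be recomputed from the printed entries of $(Z'Z)^{-1}$ with a few lines. The function name `correlation_from_inverse` is an illustrative choice; the only inputs are the matrix entries reported in this example, and the factor $\sigma^2$ is omitted because it cancels between numerator and denominator.

```python
import math

def correlation_from_inverse(inv, i, j):
    """Correlation of estimators i and j from the entries of (Z'Z)^{-1}.

    The common factor sigma^2 cancels, so it never needs to be known.
    """
    return inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])

# Entries of (Z'Z)^{-1} as printed in the dishwasher example.
ztz_inv = [
    [1.152, 0.01279, -0.02660],
    [0.01279, 0.001136, -0.0005636],
    [-0.02660, -0.0005636, 0.0007026],
]

rho_12 = correlation_from_inverse(ztz_inv, 1, 2)
print(round(rho_12, 4))
```

The same function applies to any pair of coefficients, for example the intercept and the coefficient of time, by changing the two indices.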
Figure 11.16 Plots of dishwasher shipments against year (left) and private residential investment (right).


Table 11.13 Dishwasher shipments and residential investment from 1960-1985

Year  Dishwasher shipments (thousands)  Private residential investment (billions of 1972 dollars)
1960 555 34.2 
1961 620 34.3 
1962 720 37.7 
1963 880 42.5 
1964 1050 43.1 
1965 1290 42.7 
1966 1528 38.2 
1967 1586 37.1 
1968 1960 43.1 
1969 2118 43.6 
1970 2116 41.0 
1971 2477 53.7 
1972 3199 63.8 
1973 3702 62.3 
1974 3320 48.2 
1975 2702 42.2 
1976 3140 51.2 
1977 3356 60.7 
1978 3558 62.4 
1979 3488 59.1 
1980 2738 47.1 
1981 2484 44.7 
1982 2170 37.8 
1983 3092 52.7 
1984 3491 60.3 
1985 3536 61.4 
A 
e e 
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The Joint Distribution of the Estimators 


Let the random variable $S^2$ be defined as in Eq. (11.5.6). The sum of squares $S^2$ can
also be represented in the following form:

$$S^2 = (\mathbf{Y} - \mathbf{Z}\hat{\boldsymbol\beta})'(\mathbf{Y} - \mathbf{Z}\hat{\boldsymbol\beta}). \tag{11.5.18}$$

The method in the proof of Theorem 11.3.2 can be extended by making use of
methods that are beyond the scope of this book in order to prove the following two
facts. First, $S^2/\sigma^2$ has the $\chi^2$ distribution with $n - p$ degrees of freedom. Second, $S^2$
and the random vector $\hat{\boldsymbol\beta}$ are independent.

From Eq. (11.5.7), we see that $\hat\sigma^2 = S^2/n$. Hence, the random variable $n\hat\sigma^2/\sigma^2$
has the $\chi^2$ distribution with $n - p$ degrees of freedom, and the estimators $\hat\sigma^2$ and $\hat{\boldsymbol\beta}$
are independent.

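The following result summarizes what we have proven and stated without proof
concerning the joint distribution of $\hat{\boldsymbol\beta}$ and $\hat\sigma^2$.

Let the entries in the symmetric $p \times p$ matrix $(\mathbf{Z}'\mathbf{Z})^{-1}$ be denoted as follows:

$$(\mathbf{Z}'\mathbf{Z})^{-1} = \begin{pmatrix} \zeta_{00} & \cdots & \zeta_{0,p-1} \\ \vdots & \ddots & \vdots \\ \zeta_{p-1,0} & \cdots & \zeta_{p-1,p-1} \end{pmatrix}. \tag{11.5.19}$$

For $j = 0, \ldots, p - 1$, the estimator $\hat\beta_j$ has the normal distribution with mean $\beta_j$ and
variance $\zeta_{jj}\sigma^2$. Furthermore, for $i \ne j$, we have $\operatorname{Cov}(\hat\beta_i, \hat\beta_j) = \zeta_{ij}\sigma^2$. Also, the entire
vector $\hat{\boldsymbol\beta}$ has a multivariate normal distribution. Finally, $\hat\sigma^2$ is independent of $\hat{\boldsymbol\beta}$, and
$n\hat\sigma^2/\sigma^2$ has the $\chi^2$ distribution with $n - p$ degrees of freedom.

Note that $\hat{\boldsymbol\beta}$ is also independent of $\sigma'^2$ from Eq. (11.5.8).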


Testing Hypotheses 


Suppose that it is desired to test the hypothesis that one of the regression coefficients
$\beta_j$ has a particular value $\beta_j^*$. In other words, suppose that the following hypotheses
are to be tested:

$$H_0: \beta_j = \beta_j^*, \qquad H_1: \beta_j \ne \beta_j^*. \tag{11.5.20}$$

Since $\operatorname{Var}(\hat\beta_j) = \zeta_{jj}\sigma^2$, it follows that when $H_0$ is true, the following random variable
$W_j$ will have the standard normal distribution:

$$W_j = \frac{\hat\beta_j - \beta_j^*}{\zeta_{jj}^{1/2}\,\sigma}.$$
Furthermore, since the random variable $S^2/\sigma^2$ has the $\chi^2$ distribution with $n - p$
degrees of freedom, and since $S^2$ and $\hat\beta_j$ are independent, it follows that when $H_0$ is
true, the following random variable $U_j$ will have the $t$ distribution with $n - p$ degrees
of freedom:

$$U_j = \frac{W_j}{\left[\dfrac{1}{n-p}\,\dfrac{S^2}{\sigma^2}\right]^{1/2}} = \frac{\hat\beta_j - \beta_j^*}{\zeta_{jj}^{1/2}\,\sigma'}. \tag{11.5.21}$$

The level $\alpha_0$ test of the hypotheses (11.5.20) specifies that the null hypothesis $H_0$
should be rejected if $|U_j| \ge T_{n-p}^{-1}(1 - \alpha_0/2)$, where $T_{n-p}^{-1}$ is the quantile function of
the $t$ distribution with $n - p$ degrees of freedom. Furthermore, if $u$ is the value of $U_j$
observed in a given problem, the corresponding p-value is

$$\Pr(U_j \ge |u|) + \Pr(U_j \le -|u|). \tag{11.5.22}$$


Tests for one-sided hypotheses can be derived in a similar fashion. 


Dishwasher Shipments. In Example 11.5.4, the least-squares estimates for the model
are $\hat\beta_0 = -1314$, $\hat\beta_1 = 66.91$, and $\hat\beta_2 = 58.86$. The observed value of $\sigma'$ is 352.9. Now
suppose that we are interested in testing the hypotheses

$$H_0: \beta_1 = 0, \qquad H_1: \beta_1 \ne 0,$$

where $\beta_1$ is the coefficient of time in the multiple linear regression model. Using the
matrix $(\mathbf{Z}'\mathbf{Z})^{-1}$ found in Example 11.5.4, we can calculate

$$U_1 = \frac{66.91 - 0}{(0.001136)^{1/2} \times 352.9} = 5.625.$$

The degrees of freedom are $26 - 3 = 23$, and 5.625 is larger than every quantile listed
in the table of the $t$ distribution in this book. Using a computer program, we find that
the p-value is about $1 \times 10^{-5}$. ◀


Unemployment in the 1950s. In Example 11.5.3, we regressed unemployment on a
Federal Reserve Board index of production and time. The least-squares estimates
are $\hat\beta_0 = 13.45$, $\hat\beta_1 = -0.1033$, and $\hat\beta_2 = 0.6594$. The observed value of $\sigma'$ is 0.4011.
Now suppose that we wish to test the hypotheses

$$H_0: \beta_2 \le 0.4, \qquad H_1: \beta_2 > 0.4.$$

To test these hypotheses, we reject $H_0$ if $U_2$ is too large. We calculate $U_2$ using the
matrix $(\mathbf{Z}'\mathbf{Z})^{-1}$ computed in Example 11.5.3:

$$U_2 = \frac{0.6594 - 0.4}{(0.06762)^{1/2} \times 0.4011} = 2.487.$$

The degrees of freedom are $10 - 3 = 7$, and 2.487 falls between the 0.975 and 0.99
quantiles of the $t$ distribution with seven degrees of freedom. The p-value is actually
0.0209, so we would reject $H_0$ at every level $\alpha_0 \ge 0.0209$. ◀
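The two tests above can be reproduced with a short Python sketch. It uses only the standard library, so the tail probability of the $t$ distribution is obtained by numerical integration of the density (Simpson's rule) rather than by a statistical package; that choice, and the helper names `t_pdf`, `t_upper_tail`, and `t_statistic`, are assumptions made for this illustration and are not part of the text.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(x, df, steps=20000):
    """Pr(T >= x) for x >= 0, by Simpson's rule on [0, x] (steps must be even)."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 - total * h / 3

def t_statistic(beta_hat, beta_star, zeta_jj, sigma_prime):
    """The statistic U_j of Eq. (11.5.21)."""
    return (beta_hat - beta_star) / (math.sqrt(zeta_jj) * sigma_prime)

# Dishwasher shipments: two-sided test of beta_1 = 0 with 23 degrees of freedom.
u1 = t_statistic(66.91, 0.0, 0.001136, 352.9)
p_two_sided = 2 * t_upper_tail(abs(u1), 23)   # expression (11.5.22), by symmetry

# Unemployment: one-sided test of beta_2 <= 0.4 with 7 degrees of freedom.
u2 = t_statistic(0.6594, 0.4, 0.06762, 0.4011)
p_one_sided = t_upper_tail(u2, 7)

print(round(u1, 3), p_two_sided)
print(round(u2, 3), round(p_one_sided, 4))
```

The inputs are the rounded values quoted in the examples, so the printed statistics agree with the text only up to rounding.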


Problems of testing hypotheses that specify the values of two coefficients $\beta_i$ and
$\beta_j$ are discussed in Exercises 17 to 21 at the end of this section. Problems of testing
hypotheses about linear combinations of $\beta_0, \ldots, \beta_{p-1}$ are the subject of Exercise 26.
Some computer programs make it easy to test hypotheses about individual $\beta_j$'s.
Indeed, most software automatically supplies the value of the test statistic $U_j$ for
testing the following hypotheses for each $j$ ($j = 0, \ldots, k$):

$$H_0: \beta_j = 0, \qquad H_1: \beta_j \ne 0. \tag{11.5.23}$$

Some programs also compute the corresponding p-values that are found from the
expression (11.5.22).


Power of the Test If the null hypothesis in (11.5.20) is false, then the statistic $U_j$
has the noncentral $t$ distribution with $n - p$ degrees of freedom and noncentrality
parameter $\psi = (\beta_j - \beta_j^*)/(\zeta_{jj}^{1/2}\sigma)$. Plots such as those in Figures 9.12 and 9.14 or
computer programs can be used to calculate the power of the $t$ test for specific
parameter values.
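As a rough illustration of what such a computer program might do, the following Python sketch estimates the power by simulation instead of evaluating the noncentral $t$ distribution directly. It relies on the representation of $U_j$ as $(Z + \psi)/(W/[n-p])^{1/2}$ with $Z$ standard normal and $W$ an independent $\chi^2$ variable. The function name `simulate_power`, the number of simulations, the seed, and the critical value 2.069 (the 0.975 quantile of the $t$ distribution with 23 degrees of freedom, taken from a standard table) are all assumptions of this sketch.

```python
import math
import random

def simulate_power(psi, df, critical_value, n_sims=20000, seed=0):
    """Monte Carlo estimate of Pr(|U| >= critical_value) when U is noncentral t.

    U is simulated as (Z + psi) / sqrt(W / df), where Z is standard normal and
    W is chi-square with df degrees of freedom, independent of Z.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        z = rng.gauss(0.0, 1.0)
        # A chi-square variable with df degrees of freedom is gamma(df/2, scale 2).
        w = rng.gammavariate(df / 2, 2.0)
        u = (z + psi) / math.sqrt(w / df)
        if abs(u) >= critical_value:
            rejections += 1
    return rejections / n_sims

# Level 0.05 two-sided test with 23 degrees of freedom.
power_at_null = simulate_power(0.0, 23, 2.069)   # should be near 0.05
power_at_psi3 = simulate_power(3.0, 23, 2.069)
print(power_at_null, power_at_psi3)
```

When $\psi = 0$ the estimate should be close to the level of the test, which gives a simple check on the simulation.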


Prediction 


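Let $\mathbf{z}' = (z_0, \ldots, z_{p-1})$ be a vector of predictors for a future observation $Y$. We wish
to predict $Y$ using $\hat Y = \mathbf{z}'\hat{\boldsymbol\beta}$, and we want to know the M.S.E. We shall assume that $Y$
is independent of the observed data. This makes $Y$ and $\hat Y$ independent. We can write

$$\hat Y = \mathbf{z}'\hat{\boldsymbol\beta} = \mathbf{z}'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y},$$

so that $\hat Y$ is a linear combination of the original data $\mathbf{Y}$. Since the coordinates of $\mathbf{Y}$ are
independent normal random variables, Theorem 11.3.1 tells us that $\hat Y$ has a normal
distribution. The mean of $\hat Y$ is easily seen to be

$$E(\hat Y) = \mathbf{z}'E(\hat{\boldsymbol\beta}) = \mathbf{z}'\boldsymbol\beta.$$

The variance of $\hat Y$ is obtained from Theorem 11.5.2:

$$\operatorname{Var}(\hat Y) = \mathbf{z}'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\operatorname{Cov}(\mathbf{Y})\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z} = \mathbf{z}'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}\,\sigma^2.$$

Since $Y$ has the normal distribution with mean $\mathbf{z}'\boldsymbol\beta$ and variance $\sigma^2$ and is independent
of $\hat Y$, it follows that $Y - \hat Y$ has the normal distribution with mean 0 and variance

$$\operatorname{Var}(Y - \hat Y) = \operatorname{Var}(Y) + \operatorname{Var}(\hat Y) = \sigma^2[1 + \mathbf{z}'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}]. \tag{11.5.24}$$

Since $Y - \hat Y$ has mean 0, Eq. (11.5.24) is also the M.S.E. for using $\hat Y$ to predict $Y$.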
We can also form a prediction interval for $Y$ just as we did in (11.3.25). As we
did there, define

$$Z = \frac{Y - \hat Y}{\sigma[1 + \mathbf{z}'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}]^{1/2}}, \qquad W = \frac{S^2}{\sigma^2}.$$

Then $Z$ has the standard normal distribution independent of $W$, which has the $\chi^2$
distribution with $n - p$ degrees of freedom. Hence,

$$\frac{Z}{(W/[n-p])^{1/2}} = \frac{Y - \hat Y}{\sigma'[1 + \mathbf{z}'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}]^{1/2}}$$

has the $t$ distribution with $n - p$ degrees of freedom. It follows that the interval with
the following endpoints has probability $1 - \alpha_0$ of containing $Y$, prior to observing the
data:

$$\hat Y \pm T_{n-p}^{-1}\!\left(1 - \frac{\alpha_0}{2}\right)\sigma'[1 + \mathbf{z}'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}]^{1/2}. \tag{11.5.25}$$


Example 11.5.7. Predicting Dishwasher Shipments. In Example 11.5.4, the least-squares estimates for
the model are $\hat\beta_0 = -1314$, $\hat\beta_1 = 66.91$, and $\hat\beta_2 = 58.86$. The observed value of $\sigma'$
is 352.9. Now suppose that we are interested in predicting dishwasher shipments
for 1986. We happen to know that in 1986 private residential investment was 67.2
billion. In order to predict dishwasher shipments for 1986, we first form the vector
of predictors $\mathbf{z}' = (1, 26, 67.2)$. Then we compute $\hat Y = \mathbf{z}'\hat{\boldsymbol\beta} = 4381$ and

$$\sigma'[1 + \mathbf{z}'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}]^{1/2} = 352.9[1 + 0.2136]^{1/2} = 388.8.$$

We can now compute a prediction interval for 1986 dishwasher shipments. For example,
with $\alpha_0 = 0.1$, we get a 90 percent prediction interval using $T_{23}^{-1}(0.95) = 1.714$,

$$(4381 - 1.714 \times 388.8,\; 4381 + 1.714 \times 388.8) = (3715, 5047).$$

This is quite a wide range due to the large value of $\sigma'$. The actual value for dishwasher
sales in 1986 was 3915, which is quite far from $\hat Y$, but still within the interval. ◀
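The arithmetic in this example can be retraced with a short Python sketch. The inputs are the rounded estimates and the rounded entries of $(\mathbf{Z}'\mathbf{Z})^{-1}$ printed in Example 11.5.4, so the results match the text only up to rounding; the names `quad_form`, `y_hat`, and `interval` are hypothetical and introduced only for this illustration.

```python
import math

# Rounded values taken from the text.
ZTZ_INV = [
    [1.152, 0.01279, -0.02660],
    [0.01279, 0.001136, -0.0005636],
    [-0.02660, -0.0005636, 0.0007026],
]
beta_hat = [-1314.0, 66.91, 58.86]
sigma_prime = 352.9
z = [1.0, 26.0, 67.2]      # intercept, year minus 1960, investment for 1986
t_quantile = 1.714         # 0.95 quantile of t with 23 d.f., as quoted in the text

def quad_form(v, m):
    """Compute v' m v for a vector v and a square matrix m."""
    n = len(v)
    return sum(v[i] * m[i][j] * v[j] for i in range(n) for j in range(n))

y_hat = sum(b * zj for b, zj in zip(beta_hat, z))
q = quad_form(z, ZTZ_INV)                          # z'(Z'Z)^{-1} z
half_width = t_quantile * sigma_prime * math.sqrt(1.0 + q)
interval = (y_hat - half_width, y_hat + half_width)

print(round(y_hat), round(q, 4), [round(e) for e in interval])
```

The quadratic form evaluates to roughly 0.213 with these rounded inputs, slightly different from the 0.2136 in the text, which was presumably computed from unrounded matrix entries.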


Multiple $R^2$


In a problem of multiple linear regression, we are typically interested in determining
how well the variables $X_1, \ldots, X_k$ explain the observed variation in the random
variable $Y$. The variation among the $n$ observed values $y_1, \ldots, y_n$ of $Y$ can be
measured by the value of $\sum_{i=1}^n (y_i - \bar y)^2$, which is the sum of the squares of the
deviations of $y_1, \ldots, y_n$ from the average $\bar y$. Similarly, after the regression of $Y$ on
$X_1, \ldots, X_k$ has been fitted from the data, the variation among the $n$ observed values
of $Y$ that is still present can be measured by the sum of the squares of the deviations of
$y_1, \ldots, y_n$ from the fitted regression. This sum of squares will be equal to the value
of $S^2$ in Eq. (11.5.6) calculated from the observed values, i.e., $S^2 = \sum_{i=1}^n (y_i - \hat y_i)^2$,
where $\hat y_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik}$.

It now follows that the proportion of the variation among the observed values
$y_1, \ldots, y_n$ that remains unexplained by the fitted regression is

$$\frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2}.$$

In turn, the proportion of the variation among the observed values $y_1, \ldots, y_n$ that is
explained by the fitted regression is given by the following value $R^2$:

$$R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2}. \tag{11.5.26}$$


Example 11.5.8. Unemployment in the 1950s. For the data in Example 11.5.1, we can compute $\bar y = 2.82$,
and then $\sum_{i=1}^{10} (y_i - \bar y)^2 = 8.376$. The value of $S^2$ is $(10 - 3) \times \sigma'^2 = 1.126$, so
$R^2 = 1 - 1.126/8.376 = 0.8656$. ◀
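Eq. (11.5.26) translates directly into code. The Python sketch below first applies the definition to a small made-up data set (the values in `y_toy` and `fit_toy` are hypothetical, chosen only so the arithmetic is easy to follow) and then repeats the calculation of the example from the quoted summary numbers. The function name `r_squared` is likewise an assumption of this sketch.

```python
def r_squared(y, fitted):
    """R^2 from Eq. (11.5.26): one minus the unexplained proportion of variation."""
    y_bar = sum(y) / len(y)
    total = sum((yi - y_bar) ** 2 for yi in y)                 # variation around the mean
    resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # S^2
    return 1.0 - resid / total

# Hypothetical observed and fitted values (not from the text).
y_toy = [1.0, 2.0, 3.0, 4.0]
fit_toy = [1.1, 1.9, 3.2, 3.8]
print(r_squared(y_toy, fit_toy))

# Unemployment example, using the summary values quoted in the text.
s2 = (10 - 3) * 0.4011 ** 2     # S^2 = (n - p) * sigma'^2
r2_example = 1.0 - s2 / 8.376
print(round(s2, 3), r2_example)
```

Because $\sigma'$ is quoted to four digits, `r2_example` agrees with the value 0.8656 in the text only to about three decimal places.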


The value of $R^2$ must lie in the interval $0 \le R^2 \le 1$. When $R^2 = 0$, the least-squares
estimates have the values $\hat\beta_0 = \bar y$ and $\hat\beta_1 = \cdots = \hat\beta_k = 0$. In this case, the fitted
regression function is just the constant function $y = \bar y$. When $R^2$ is close to 1, the
variation of the observed values of $Y$ around the fitted regression function is much
smaller than their variation around $\bar y$.

Figure 11.17 Plots of residuals against the two predictor variables for Example 11.5.9. Top row: using all data for 1950-1959. Bottom row: using only 1951-1959 data.


Analysis of Residuals 


In Sec. 11.3, we described some plots for assessing whether or not the assumptions of 
the simple linear regression model seem to be met. These same plots, together with 
some others, are also useful in the general linear model. Recall that, in general, the 
residuals are the values 


$$e_i = y_i - \hat y_i = y_i - z_{i0}\hat\beta_0 - \cdots - z_{i,p-1}\hat\beta_{p-1}.$$


Example 11.5.9. Unemployment in the 1950s. In this example, $p = 3$ with $z_{i0} = 1$ for all $i$. We have
plotted the residuals against the two predictor variables in the top row of Fig. 11.17 
to begin looking for violations of the assumptions. The residual from the first year 
(1950) is very high, and the remaining residuals appear to lie near a line with positive 
slope in each plot. This suggests that the first observation does not follow the same 
pattern as the others. We also performed the regression without the 1950 data point. 
The residual plots using the new least-squares estimates fit from the 1951-1959 data 
are in the bottom row of Fig. 11.17. The residuals for 1951-1959 no longer lie on 
a sloped line. Also, Fig. 11.18 shows normal quantile plots both before and after 
deleting the 1950 observation. The right plot is much straighter. Of course, such a 
graphical analysis does not show that the 1950 observation should be deleted. We 
should check to see if something might have occurred in 1950 that would make a 
drastic change to the relationship between unemployment and time (such as the start
of the war in Korea). ◀


Another plot that is useful in multiple regression cases is a plot of residuals
against fitted values, $\hat y_i$ for $i = 1, \ldots, n$. (See Exercise 27 to see why this plot is not
used in simple linear regression.) This plot helps to reveal dependence between the
mean and variance of $Y$. (Recall that $\hat y_i$ is an estimate of the mean of $Y_i$.) If the residuals
are more spread out at one end or the other of this plot, it suggests that the
variance of $Y$ changes as the mean changes, which violates the assumption that all
observations have the same variance. The left plot in Fig. 11.19 is a plot of residuals
against fitted values for the unemployment data. It appears that the residuals corresponding
to low fitted values are more spread out than those corresponding to high
fitted values. Methods for responding to such features in a residual plot can be found
in texts on regression methodology such as Draper and Smith (1998) and Cook and
Weisberg (1999).

Figure 11.18 Normal quantile plots of residuals for Example 11.5.9. The left plot is from the regression using all 10 observations. The right plot uses only 1951-1959.

Figure 11.19 Residual plots for Example 11.5.9. Left: plot of residuals against fitted values. Right: plot of pairs of consecutive residuals. Both plots use 1951-1959 data only.

If the time of each measurement is available, as in Examples 11.5.1 and 11.5.4, 
it makes sense to plot residuals against time to see if there is any time dependence 
not captured by the model. Since time was one of the predictors in each of these 
examples, we will plot residuals against time when we plot residuals against the 
predictors. In addition to plotting the residuals against time, we can also plot the 
nearby residuals against each other to see if small ones tend to occur together and/or 
if large ones tend to occur together. Let $v_1, \ldots, v_n$ be the residuals ordered by time.
We can plot the $n - 1$ points $(v_1, v_2), (v_2, v_3), \ldots, (v_{n-1}, v_n)$. If these plotted points
follow a pattern, it suggests that there is dependence between observations that are 
close together in time, called serial dependence. This would violate the assumption 
that the observations are independent. The right plot in Fig. 11.19 is the plot of 
consecutive pairs of residuals for the unemployment data. The points in this plot 
cluster in opposite corners, suggesting serial dependence, although the small sample 
size makes it difficult to be certain. 
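The construction of the consecutive pairs is easy to sketch in Python. The residuals below are hypothetical values made up for illustration, and the lag-one sample correlation is an extra numerical summary that the text does not use; it is included only as one possible way to quantify what the plot shows.

```python
import math

def consecutive_pairs(residuals):
    """Return the n - 1 points (v_1, v_2), (v_2, v_3), ..., (v_{n-1}, v_n)."""
    return list(zip(residuals[:-1], residuals[1:]))

def lag1_correlation(residuals):
    """Sample correlation of the consecutive pairs (an illustrative summary)."""
    pairs = consecutive_pairs(residuals)
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical residuals ordered by time (not from the text).
v = [0.5, 0.4, 0.1, -0.2, -0.5, -0.3, 0.0, 0.3]
pairs = consecutive_pairs(v)
print(pairs)
print(lag1_correlation(v))
```

A value well away from zero would point in the same direction as a visible pattern in the plot, although with so few points neither is conclusive.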


Example 11.5.10. Dishwasher Shipments. Consider, again, the data from Example 11.5.4. Plots of residuals
against the two predictors, in the top row of Fig. 11.20, reveal a serious problem.
There is a curve in the plot of residuals against the year. The residuals are highest in
the middle years and lower in the early and late years. This suggests that perhaps the
relationship between shipments and time is not linear. The plot of pairs of consecutive
residuals also suggests some time dependence. This could be a result of the same
problem that caused the curve in the plot of residuals against time, or it could indicate
that successive observations are dependent. It is possible that deviations from
the overall trend in dishwasher sales might persist for more than one year. For example,
a boom or bust in sales one year might carry over to part of the next year. The
normal quantile plot (not shown) is fairly straight.

Figure 11.20 Residual plots for Example 11.5.10. Top row: residuals against predictors. Lower left: residuals against fitted values. Lower right: pairs of successive residuals.

Figure 11.21 Residual plots for regression of dishwasher shipments on a quadratic function of time. Left: plot of residuals against time. Right: plot of pairs of consecutive residuals.

In order to try to determine whether there is serial dependence or a nonlinear 
relationship (or both) in these data, we fit another model in which the mean of Y is 
a linear function of private residential investment but a quadratic function of time. 
That is, let $X_1$ stand for the year (minus 1960), let $X_2$ stand for private residential
investment, and let $X_3 = X_1^2$. Then

$$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3.$$

The least-squares estimates from this model are $\hat\beta_0 = -1445$, $\hat\beta_1 = 206.1$, $\hat\beta_2 = 48.5$,
and $\hat\beta_3 = -5.23$. The observed value of $\sigma'$ is 235.7. The plots of residuals against time
and of consecutive pairs of residuals are in Fig. 11.21. The plot of residuals against
time is better than before, but the pairs of consecutive residuals still lie close to a line.
This suggests that we need to take serial dependence into account. One book that
describes methods for dealing with serial dependence (commonly called time series
analysis) is Box, Jenkins, and Reinsel (1994). ◀
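A fit of this kind can be sketched in plain Python by forming the normal equations $\mathbf{Z}'\mathbf{Z}\hat{\boldsymbol\beta} = \mathbf{Z}'\mathbf{Y}$ and solving them by Gaussian elimination. The sketch assumes that the values in Table 11.13 have been transcribed correctly and that the year is coded as year minus 1960; the helpers `solve` and `least_squares` are hypothetical names, and a production analysis would normally use a numerical library instead.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def least_squares(z, y):
    """Least-squares estimates obtained from the normal equations Z'Z b = Z'y."""
    p = len(z[0])
    ztz = [[sum(row[i] * row[j] for row in z) for j in range(p)] for i in range(p)]
    zty = [sum(row[i] * yi for row, yi in zip(z, y)) for i in range(p)]
    return solve(ztz, zty)

# Table 11.13: shipments (thousands) and investment (billions of 1972 dollars), 1960-1985.
shipments = [555, 620, 720, 880, 1050, 1290, 1528, 1586, 1960, 2118, 2116, 2477,
             3199, 3702, 3320, 2702, 3140, 3356, 3558, 3488, 2738, 2484, 2170,
             3092, 3491, 3536]
investment = [34.2, 34.3, 37.7, 42.5, 43.1, 42.7, 38.2, 37.1, 43.1, 43.6, 41.0,
              53.7, 63.8, 62.3, 48.2, 42.2, 51.2, 60.7, 62.4, 59.1, 47.1, 44.7,
              37.8, 52.7, 60.3, 61.4]

# Design matrix with columns 1, X1 (year minus 1960), X2 (investment), X3 = X1^2.
z_quad = [[1.0, float(t), x2, float(t * t)] for t, x2 in enumerate(investment)]
beta_quad = least_squares(z_quad, shipments)
residuals = [y - sum(b * zc for b, zc in zip(beta_quad, row))
             for row, y in zip(z_quad, shipments)]
print([round(b, 2) for b in beta_quad])
```

The printed coefficients can be compared with the estimates quoted in the example, and `residuals` holds the values that would be plotted against time and in consecutive pairs.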
Summary


In the general linear model, we assume that the mean of each observation $Y_i$ can be
expressed as $z_{i0}\beta_0 + \cdots + z_{i,p-1}\beta_{p-1}$, where $\beta_0, \ldots, \beta_{p-1}$ are unknown parameters
and $z_{i0}, \ldots, z_{i,p-1}$ are the observed values of predictors. These predictors can be
control variables, other variables that are measured along with $Y_i$, or functions of
such variables. Least-squares estimators of the parameters are denoted $\hat\beta_0, \ldots, \hat\beta_{p-1}$,
and they can be calculated according to Eq. (11.5.10) or by using a computer. The
variance of each $Y_i$ is assumed to be the same value $\sigma^2$. Every linear combination
of the least-squares estimators has a normal distribution and is independent of the
unbiased estimator $\sigma'^2$ of $\sigma^2$ given in Eq. (11.5.8).

For testing hypotheses about a single $\beta_j$, the statistic $U_j$ in Eq. (11.5.21) has the
$t$ distribution with $n - p$ degrees of freedom given that the null hypothesis is true.
For predicting a future $Y$ value, we can form prediction intervals using the endpoints
given by (11.5.25). We should always plot the residuals $y_i - \hat y_i$ against the predictors,
fitted values $\hat y_i$, and time (if available) to check on the assumptions of the linear
regression model. Patterns in these plots can suggest violations of the assumption
about the form of the mean of $Y_i$ and/or the constant variance assumption. We should
also make a normal quantile plot. Deviations from a straight line in this plot suggest
that the $Y_i$ values might not have a normal distribution, although violations of the
assumptions about the mean and variance can also cause patterns in this plot. If
observation time is available, we should also plot pairs of consecutive residuals to
look for serial dependence.


Exercises 


1. Show that the M.L.E. of $\sigma^2$ in the general linear model
is given by Eq. (11.5.7).


2. Prove that $\sigma'^2$, defined in Eq. (11.5.8), is an unbiased
estimator of $\sigma^2$. You may assume that $S^2/\sigma^2$ has a $\chi^2$
distribution with $n - p$ degrees of freedom.


3. Consider a regression problem in which, for each value
$x$ of a certain variable $X$, the random variable $Y$ has the
normal distribution with mean $\beta x$ and variance $\sigma^2$, where
the values of $\beta$ and $\sigma^2$ are unknown. Suppose that $n$ independent
pairs of observations $(x_i, Y_i)$ are obtained. Show
that the M.L.E. of $\beta$ is

$$\hat\beta = \frac{\sum_{i=1}^n x_i Y_i}{\sum_{i=1}^n x_i^2}.$$


4. For the conditions of Exercise 3, show that $E(\hat\beta) = \beta$
and $\operatorname{Var}(\hat\beta) = \sigma^2 / \left(\sum_{i=1}^n x_i^2\right)$.


5. Suppose that when a small amount $x$ of an insulin
preparation is injected into a rabbit, the percentage decrease
$Y$ in blood sugar has the normal distribution with
mean $\beta x$ and variance $\sigma^2$, where the values of $\beta$ and $\sigma^2$
are unknown. Suppose that when independent observations
are made on 10 different rabbits, the observed values
of $x_i$ and $Y_i$ for $i = 1, \ldots, 10$ are as given in Table 11.14.
Determine the values of the M.L.E.'s $\hat\beta$ and $\hat\sigma^2$, and the
value of $\operatorname{Var}(\hat\beta)$.


Table 11.14 Data for Exercise 5

 i   x_i   y_i       i   x_i   y_i
 1   0.6             6   2.2   19
 2   1.0             7   2.8    9
 3   1.7             8   3.5   14
 4   1.7   11        9   3.5   22
 5   2.2   10       10   4.2   22


6. For the conditions of Exercise 5 and the data in Table
11.14, carry out a test of the following hypotheses:

$$H_0: \beta = 10, \qquad H_1: \beta \ne 10.$$


7. Consider a regression problem in which a patient's reaction
$Y$ to a new drug B is to be related to his reaction
$X$ to a standard drug A. Suppose that for each value $x$
of $X$, the regression function is a polynomial of the form
$E(Y) = \beta_0 + \beta_1 x + \beta_2 x^2$. Suppose also that 10 pairs of observed
values are as shown in Table 11.1 on page 690. Under
the standard assumptions of the general linear model,
determine the values of the M.L.E.'s $\hat\beta_0$, $\hat\beta_1$, $\hat\beta_2$, and $\hat\sigma^2$.


8. For the conditions of Exercise 7 and the data in Table
11.1, determine the values of $\operatorname{Var}(\hat\beta_0)$, $\operatorname{Var}(\hat\beta_1)$, $\operatorname{Var}(\hat\beta_2)$,
$\operatorname{Cov}(\hat\beta_0, \hat\beta_1)$, $\operatorname{Cov}(\hat\beta_0, \hat\beta_2)$, and $\operatorname{Cov}(\hat\beta_1, \hat\beta_2)$.


9. For the conditions of Exercise 7 and the data in Table
11.1, carry out a test of the following hypotheses:

$$H_0: \beta_2 = 0, \qquad H_1: \beta_2 \ne 0.$$


10. For the conditions of Exercise 7 and the data in
Table 11.1, carry out a test of the following hypotheses:

$$H_0: \beta_1 = 4, \qquad H_1: \beta_1 \ne 4.$$


11. For the conditions of Exercise 7 and the data given
in Table 11.1, determine the value of $R^2$, as defined by
Eq. (11.5.26).


12. Consider a problem of multiple linear regression in
which a patient's reaction $Y$ to a new drug B is to be related
to her reaction $X_1$ to a standard drug A and her heart rate
$X_2$. Suppose that, for all values $X_1 = x_1$ and $X_2 = x_2$, the
regression function has the form $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$,
and the values of 10 sets of observations $(x_{i1}, x_{i2}, Y_i)$
are given in Table 11.2 on page 696. Under the standard
assumptions of multiple linear regression, determine the
values of the M.L.E.'s $\hat\beta_0$, $\hat\beta_1$, $\hat\beta_2$, and $\hat\sigma^2$.


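13. For the conditions of Exercise 12 and the data in
Table 11.2, determine the values of $\operatorname{Var}(\hat\beta_0)$, $\operatorname{Var}(\hat\beta_1)$,
$\operatorname{Var}(\hat\beta_2)$, $\operatorname{Cov}(\hat\beta_0, \hat\beta_1)$, $\operatorname{Cov}(\hat\beta_0, \hat\beta_2)$, and $\operatorname{Cov}(\hat\beta_1, \hat\beta_2)$.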


14. For the conditions of Exercise 12 and the data in
Table 11.2, carry out a test of the following hypotheses:

$$H_0: \beta_1 = 0, \qquad H_1: \beta_1 \ne 0.$$


15. For the conditions of Exercise 12 and the data in
Table 11.2, carry out a test of the following hypotheses:

$$H_0: \beta_2 = -1, \qquad H_1: \beta_2 \ne -1.$$

16. For the conditions of Exercise 12 and the data in
Table 11.2, determine the value of $R^2$, as defined by
Eq. (11.5.26).


17. Consider the general linear model in which the observations
$Y_1, \ldots, Y_n$ are independent and have normal
distributions with the same variance $\sigma^2$ and in which $E(Y_i)$
is given by Eq. (11.5.1). Let the matrix $(\mathbf{Z}'\mathbf{Z})^{-1}$ be defined
by Eq. (11.5.19). For all values of $i$ and $j$ such that $i \ne j$,
let the random variable $A_{ij}$ be defined as follows:

$$A_{ij} = \hat\beta_i - \frac{\zeta_{ij}}{\zeta_{jj}}\hat\beta_j.$$

Show that $\operatorname{Cov}(\hat\beta_j, A_{ij}) = 0$, and explain why $\hat\beta_j$ and $A_{ij}$
are therefore independent.

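18. For the conditions of Exercise 17, show that $\operatorname{Var}(A_{ij}) = [\zeta_{ii} - (\zeta_{ij}^2/\zeta_{jj})]\sigma^2$.
Also show that the following random
variable $W$ has the $\chi^2$ distribution with two degrees of
freedom:

$$W = \frac{\zeta_{jj}(\hat\beta_i - \beta_i)^2 + \zeta_{ii}(\hat\beta_j - \beta_j)^2 - 2\zeta_{ij}(\hat\beta_i - \beta_i)(\hat\beta_j - \beta_j)}{(\zeta_{ii}\zeta_{jj} - \zeta_{ij}^2)\sigma^2}.$$

Hint: Show that

$$W = \frac{(\hat\beta_j - \beta_j)^2}{\zeta_{jj}\sigma^2} + \frac{[A_{ij} - E(A_{ij})]^2}{\operatorname{Var}(A_{ij})}.$$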


19. Consider again the conditions of Exercises 17 and
18, and let the random variable $\sigma'$ be as defined by Eq.
(11.5.8).

a. Show that the random variable $\sigma^2 W/(2\sigma'^2)$ has the
$F$ distribution with two and $n - p$ degrees of freedom.

b. For every two given numbers $\beta_i^*$ and $\beta_j^*$, describe how
to carry out a test of the following hypotheses:

$H_0$: $\beta_i = \beta_i^*$ and $\beta_j = \beta_j^*$,
$H_1$: The hypothesis $H_0$ is not true.


20. For the conditions of Exercise 7 and the data in
Table 11.1, carry out a test of the following hypotheses:

$H_0$: $\beta_1 = \beta_2 = 0$,
$H_1$: The hypothesis $H_0$ is not true.


21. For the conditions of Exercise 12 and the data in
Table 11.2, carry out a test of the following hypotheses:

$H_0$: $\beta_1 = 1$ and $\beta_2 = 0$,
$H_1$: The hypothesis $H_0$ is not true.

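22. Consider a problem of simple linear regression as described
in Sec. 11.2, and let $R^2$ be defined by Eq. (11.5.26)
of this section. Show that

$$R^2 = \frac{\left[\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)\right]^2}{\left[\sum_{i=1}^n (x_i - \bar x)^2\right]\left[\sum_{i=1}^n (y_i - \bar y)^2\right]}.$$


23. Suppose that $\mathbf{X}$ and $\mathbf{Y}$ are $n$-dimensional random vectors
for which the mean vectors $E(\mathbf{X})$ and $E(\mathbf{Y})$ exist.
Show that $E(\mathbf{X} + \mathbf{Y}) = E(\mathbf{X}) + E(\mathbf{Y})$.


24. Suppose that $\mathbf{X}$ and $\mathbf{Y}$ are independent $n$-dimensional
random vectors for which the covariance matrices $\operatorname{Cov}(\mathbf{X})$
and $\operatorname{Cov}(\mathbf{Y})$ exist. Show that $\operatorname{Cov}(\mathbf{X} + \mathbf{Y}) = \operatorname{Cov}(\mathbf{X}) + \operatorname{Cov}(\mathbf{Y})$.
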
25. Suppose that $\mathbf{Y}$ is a three-dimensional random vector
with coordinates $Y_1$, $Y_2$, and $Y_3$, and suppose that the
covariance matrix of $\mathbf{Y}$ is as follows:

$$\operatorname{Cov}(\mathbf{Y}) = \begin{pmatrix} 9 & -3 & 0 \\ -3 & 4 & 0 \\ 0 & 0 & 5 \end{pmatrix}.$$

Determine the value of $\operatorname{Var}(3Y_1 + Y_2 - 2Y_3 + 8)$.


26. In a general linear model setting with $p$ predictors, we
wish to test the following hypotheses:

$$H_0: \sum_{j=0}^{p-1} c_j\beta_j = c_*, \qquad H_1: \sum_{j=0}^{p-1} c_j\beta_j \ne c_*. \tag{11.5.27}$$

a. Show that $\sum_{j=0}^{p-1} c_j\hat\beta_j$ has a normal distribution and
find its mean and variance. (You may wish to use
Theorems 11.3.1 and 11.5.2.)

b. Let $\mathbf{c}' = (c_0, \ldots, c_{p-1})$. If $H_0$ is true, show that

$$U = \frac{\sum_{j=0}^{p-1} c_j\hat\beta_j - c_*}{\sigma'(\mathbf{c}'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{c})^{1/2}}$$

has the $t$ distribution with $n - p$ degrees of freedom.

c. Explain how to test the hypotheses in (11.5.27) at
level of significance $\alpha_0$.


27. In a simple linear regression problem, the plot of
residuals against fitted values would look the same as the
plot of residuals against the predictor $X$ (or a mirror image
of it), except for the labeling of the horizontal axis.
Explain why this is true.


28. Consider a multiple linear regression problem with
design matrix $\mathbf{Z}$ and observations $\mathbf{Y}$. Let $\mathbf{Z}_1$ be the matrix
remaining when at least one column is removed from $\mathbf{Z}$.
Then $\mathbf{Z}_1$ is the design matrix for a linear regression problem
with fewer predictors and the same data $\mathbf{Y}$. Prove that
the value of $R^2$ calculated in the problem using design matrix
$\mathbf{Z}$ is at least as large as the value of $R^2$ calculated in
the problem using design matrix $\mathbf{Z}_1$.


29. Calculate the value of $R^2$ for the dishwasher shipment
data (Example 11.5.4) using the model in which the mean
of $Y_i$ is a linear function of both year and private residential
investment.


30. Consider again the conditions of Exercise 26. Suppose
that the null hypothesis in (11.5.27) is false. Find the distribution
of the statistic $U$ defined in that exercise.


11.6 Analysis of Variance 


In Sec. 9.6, we studied methods for comparing the means of two normal distribu- 
tions. In this section, we shall consider experiments in which we need to compare 
the means of two or more normal distributions. The theory behind the methods de- 
veloped here is based entirely on results from the general linear model in Sec. 11.5. 


The One-Way Layout 


Example 11.6.1. Calories in Hot Dogs. Moore and McCabe (1999) describe data gathered by Consumer
Reports (June 1986, pp. 364-67). The data comprise (among other things) calorie
contents from 63 brands of hot dogs. (See Table 11.15.) The hot dogs come in four
varieties: beef, "meat" (don't ask), poultry, and "specialty." (Specialty hot dogs
include stuffing such as cheese or chili.) It is interesting to know whether, and to
what extent, the different varieties differ in their calorie contents. Data structures of
the sort in this example, consisting of several groups of similar random variables, are
the subject of this section. ◀


In this section and in the remainder of this chapter, we shall study a topic
known as the analysis of variance, abbreviated ANOVA. Problems of ANOVA
are actually problems of multiple regression in which the design matrix $\mathbf{Z}$ has a
very special form. In other words, the study of ANOVA can be placed within the
framework of the general linear model (Definition 11.5.1), if we continue to make
the basic assumptions for such a model: The observations that are obtained are
independent and normally distributed; all these observations have the same variance
$\sigma^2$; and the mean of each observation can be represented as a linear combination of
certain unknown parameters. The theory and methodology of ANOVA were mainly
developed by R. A. Fisher during the 1920s.

Table 11.15 Calorie counts in four types of hot dogs for Example 11.6.2

Type        Calorie Count

Beef        186, 181, 176, 149, 184, 190, 158, 139, 175, 148, 152, 111,
            141, 153, 190, 157, 131, 149, 135, 132

Meat        173, 191, 182, 190, 172, 147, 146, 139, 175, 136, 179, 153,
            107, 195, 135, 140, 138

Poultry     129, 132, 102, 106, 94, 102, 87, 99, 107, 113, 135, 142, 86,
            143, 152, 146, 144

Specialty   155, 170, 114, 191, 162, 146, 140, 187, 180

We shall begin our study of ANOVA by considering a problem known as the
one-way layout. In this problem, it is assumed that random samples from $p$ different
normal distributions are available, each of these distributions has the same variance
$\sigma^2$, and the means of the $p$ distributions are to be compared on the basis of the
observed values in the samples. This problem was considered for two populations
($p = 2$) in Sec. 9.6, and the results to be presented here for an arbitrary value of
$p$ will generalize those presented in Sec. 9.6. Specifically, we shall now make the
following assumption: For $i = 1, \ldots, p$, the random variables $Y_{i1}, \ldots, Y_{in_i}$ form a
random sample of $n_i$ observations from the normal distribution with mean $\mu_i$ and
variance $\sigma^2$, and the values of $\mu_1, \ldots, \mu_p$ and $\sigma^2$ are unknown.

In this problem, the sample sizes $n_1, \ldots, n_p$ are not necessarily the same. We
shall let $n = \sum_{i=1}^p n_i$ denote the total number of observations in the $p$ samples, and
we shall assume that all $n$ observations are independent.


Example 11.6.2. Calories in Hot Dogs. In Example 11.6.1, the sample sizes are $n_1 = 20$ (beef), $n_2 = 17$
(meat), $n_3 = 17$ (poultry), and $n_4 = 9$ (specialty). In this case, we let $\mu_1$ stand for the
mean calorie count for brands of beef hot dogs, while $\mu_2$, $\mu_3$, and $\mu_4$ will stand for the
mean calorie count for brands of meat, poultry, and specialty hot dogs, respectively.
All calorie counts are assumed to be independent normal random variables with
variance $\sigma^2$. These data will be analyzed after we develop the ANOVA methodology. ◀


It follows from the assumptions we have just made that for $j = 1, \ldots, n_i$ and
$i = 1, \ldots, p$, we have $E(Y_{ij}) = \mu_i$ and $\operatorname{Var}(Y_{ij}) = \sigma^2$. Since the expectation $E(Y_{ij})$ of
each observation is equal to one of the $p$ parameters $\mu_1, \ldots, \mu_p$, it is obvious that
each of these expectations can be regarded as a linear combination of $\mu_1, \ldots, \mu_p$.
Furthermore, we can regard the $n$ observations $Y_{ij}$ as the elements of a single long
$n$-dimensional vector $\mathbf{Y}$, which can be written as follows:

$$\mathbf{Y} = \begin{pmatrix} Y_{11} \\ \vdots \\ Y_{1n_1} \\ \vdots \\ Y_{p1} \\ \vdots \\ Y_{pn_p} \end{pmatrix}. \tag{11.6.1}$$

This one-way layout therefore satisfies the conditions of the general linear model.
In order to make the one-way layout look exactly like the general linear model, we
could define parameters $\beta_i = \mu_{i+1}$ for $i = 0, \ldots, p - 1$. Then the $n \times p$ design matrix,
$\mathbf{Z}$, has one column for each population. The column corresponding to population 1
has $n_1$ 1's followed by $n_2 + \cdots + n_p$ 0's. The column corresponding to population 2
has $n_1$ 0's followed by $n_2$ 1's followed by $n_3 + \cdots + n_p$ 0's, and so on. For example,
using the hot dog data in Example 11.6.2, the $\mathbf{Z}$ matrix would be

$$\mathbf{Z} = \left(\begin{array}{cccc}
1 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
\vdots & \vdots & \vdots & \vdots \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
\vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 1
\end{array}\right)
\begin{array}{l}
\\ \text{20 rows} \\ \\ \\ \text{17 rows} \\ \\ \\ \text{17 rows} \\ \\ \\ \text{9 rows} \\ \\
\end{array} \tag{11.6.2}$$

We shall not use the general linear model notation any further in the development
of ANOVA, because the parameters $\mu_1, \ldots, \mu_p$ are more natural.
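The block structure of the design matrix in (11.6.2) can be generated mechanically from the sample sizes. The following Python sketch does this for the hot dog data; the function name `one_way_design` is a hypothetical helper introduced only for this illustration.

```python
def one_way_design(sizes):
    """Build the n x p indicator design matrix of a one-way layout.

    Row blocks follow the order of the samples: the rows for sample i have
    a 1 in column i and 0 elsewhere.
    """
    p = len(sizes)
    rows = []
    for group, n_i in enumerate(sizes):
        for _ in range(n_i):
            rows.append([1 if col == group else 0 for col in range(p)])
    return rows

Z = one_way_design([20, 17, 17, 9])     # beef, meat, poultry, specialty
column_sums = [sum(row[j] for row in Z) for j in range(4)]
print(len(Z), column_sums)
```

Each column sum equals the corresponding sample size, and distinct columns never share a 1, so $\mathbf{Z}'\mathbf{Z}$ is the diagonal matrix with entries $n_1, \ldots, n_p$.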
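For $i = 1, \ldots, p$, we shall let $\bar Y_{i+}$ denote the sample mean of the $n_i$ observations
in the $i$th sample. Thus,

$$\bar Y_{i+} = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}. \tag{11.6.3}$$

Similar logic to that used in the proof of Theorem 11.2.1 can be used to show that
$\bar Y_{i+}$ is the M.L.E., or least-squares estimator, of $\mu_i$ for $i = 1, \ldots, p$. Also, the M.L.E.
of $\sigma^2$ is

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^p \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_{i+})^2. \tag{11.6.4}$$

The details are left to Exercise 1.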
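For the hot dog data of Table 11.15, the estimates in Eqs. (11.6.3) and (11.6.4) can be computed with a few lines of Python. The sketch assumes the table values were transcribed correctly; the names `calories`, `means`, and `sigma2_mle` are introduced here for illustration only.

```python
# Calorie counts from Table 11.15.
calories = {
    "Beef": [186, 181, 176, 149, 184, 190, 158, 139, 175, 148, 152, 111,
             141, 153, 190, 157, 131, 149, 135, 132],
    "Meat": [173, 191, 182, 190, 172, 147, 146, 139, 175, 136, 179, 153,
             107, 195, 135, 140, 138],
    "Poultry": [129, 132, 102, 106, 94, 102, 87, 99, 107, 113, 135, 142, 86,
                143, 152, 146, 144],
    "Specialty": [155, 170, 114, 191, 162, 146, 140, 187, 180],
}

# Sample means, Eq. (11.6.3): the M.L.E.'s of mu_1, ..., mu_p.
means = {kind: sum(ys) / len(ys) for kind, ys in calories.items()}

# M.L.E. of sigma^2, Eq. (11.6.4): pooled squared deviations divided by n.
n = sum(len(ys) for ys in calories.values())
s2_resid = sum((y - means[kind]) ** 2
               for kind, ys in calories.items() for y in ys)
sigma2_mle = s2_resid / n

for kind, m in means.items():
    print(kind, round(m, 2))
print(n, round(sigma2_mle, 1))
```

The divisor in `sigma2_mle` is $n$, as in Eq. (11.6.4), rather than $n - p$.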


Partitioning a Sum of Squares


Example 11.6.3. Calories in Hot Dogs. In Examples 11.6.1 and 11.6.2, we notice that the calorie counts
within each type differ quite a bit from each other. We need to be able to quantify
both the variation within type and the variation between types if we are going to try
to address the question of whether or not different types of hot dogs have the same
calorie counts. ◀


In a one-way layout, we are often interested in testing the hypothesis that the $p$
distributions from which the samples were drawn are actually the same; that is, we
desire to test the following hypotheses:

$$H_0: \mu_1 = \cdots = \mu_p, \qquad H_1: \text{The hypothesis } H_0 \text{ is not true.} \tag{11.6.5}$$

For instance, in Example 11.6.2, the null hypothesis $H_0$ in (11.6.5) would be that the
mean calorie counts for all four types of hot dogs are the same, but it would not
specify what the common value is. The alternative hypothesis $H_1$ would be that at
least two of the means differ, but it would not specify which means differ nor would
it specify by how much the means differ.

Before we develop an appropriate test procedure, we shall carry out some
preparatory algebraic manipulations. First, define

$$\bar Y_{++} = \frac{1}{n}\sum_{i=1}^p \sum_{j=1}^{n_i} Y_{ij} = \frac{1}{n}\sum_{i=1}^p n_i \bar Y_{i+},$$

which is the overall average of all $n$ observations. We shall partition the sum of squares

$$S^2_{\text{Tot}} = \sum_{i=1}^p \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_{++})^2 \tag{11.6.6}$$

into two smaller sums of squares, each of which will be associated with a certain type
of variation among the $n$ observations. Note that $S^2_{\text{Tot}}/n$ would be the M.L.E. of $\sigma^2$ if
we believed that all of the observations came from a single normal distribution rather
than from $p$ different normal distributions. This means that we can interpret $S^2_{\text{Tot}}$ as
an overall measure of variation between the $n$ observations. One of the smaller sums
of squares into which we shall partition $S^2_{\text{Tot}}$ will measure the variation between the $p$
different samples, and the other sum of squares will measure the variation between
the observations within each of the samples. The test of the hypotheses (11.6.5) that
we shall develop will be based on the ratio of these two measures of variation. For
this reason, the name analysis of variance has been used to describe this problem and
other related problems.


Partitioning the Sum of Squares. Let S be as defined in Eq. (11.6.6). Then 


2 2 2 
STot = SResid + SBetw? (11.6.7) 


where 
pn Dp 


2 a AD, 2 = = 2 
oped = » > ij —Y¥i,)", and Spey = y nj(Vj4—Y44)". 
= _=| = 


Furthermore, Spats /o* has the x? distribution with n — p degrees of freedom and is 


independent of ss eg 
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Table 11.16 General form of ANOVA table for one-way layout 
Source of Degrees of 
variation freedom Sum of squares Mean square 
Between samples p-1 a So eal (P —1) 
Residuals n—p S2 si d Sed q/( — p) 
Total n—-1 Sy 


Proof  If we consider only the $n_i$ observations in sample $i$, then the sum of squares
for those values can be written as follows:

$$\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{++})^2 = \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i+})^2 + n_i (\bar{Y}_{i+} - \bar{Y}_{++})^2. \tag{11.6.8}$$

It follows from Theorem 8.3.1 that the sum forming the first term on the right side
of Eq. (11.6.8), when divided by $\sigma^2$, has the $\chi^2$ distribution with $n_i - 1$ degrees of
freedom and that it is independent of $\bar{Y}_{i+}$. Since $\bar{Y}_{++}$ is a function of
$\bar{Y}_{1+}, \ldots, \bar{Y}_{p+}$, all of which are
independent of the first term on the right side of Eq. (11.6.8), it follows that the two
terms on the right side of Eq. (11.6.8) are independent.
If we now sum each of the terms in Eq. (11.6.8) over the values of $i$, we obtain
Eq. (11.6.7). Since all the observations in the $p$ samples are independent, the two
terms on the right side of Eq. (11.6.7) are independent. Also, $S^2_{\mathrm{Resid}}/\sigma^2$ is the sum
of $p$ independent random variables, with the $i$th one having the $\chi^2$ distribution with
$n_i - 1$ degrees of freedom. Hence, $S^2_{\mathrm{Resid}}/\sigma^2$ will itself have the $\chi^2$ distribution with
$\sum_{i=1}^{p} (n_i - 1) = n - p$ degrees of freedom. ■

As we noted earlier, $S^2_{\mathrm{Tot}}$ can be regarded as the total variation of the observations
around their overall mean. Similarly, $S^2_{\mathrm{Resid}}$ can be regarded as the total variation of
the observations around their particular sample means, or the total residual variation
within the samples. Also, $S^2_{\mathrm{Betw}}$ can be regarded as the total variation of the sample
means around the overall mean, or the variation between the sample means. Thus,
the total variation $S^2_{\mathrm{Tot}}$ has been partitioned into two independent components,
$S^2_{\mathrm{Resid}}$ and $S^2_{\mathrm{Betw}}$, which represent different types of variations. This partitioning is often
summarized in a table, which is called the ANOVA table for the one-way layout and
is presented here as Table 11.16.

The numbers in the “Mean square” column of Table 11.16 are just the sums of 
squares divided by the degrees of freedom. They are used for testing the hypotheses 
(11.6.5). The degrees of freedom in the “Between samples” and “Total” rows will 
turn out to be degrees of freedom for random variables with $\chi^2$ distributions if the
null hypothesis in (11.6.5) is true. We shall see why this is true after we develop an 
appropriate test of the hypotheses (11.6.5). 
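
As a concrete illustration of Theorem 11.6.1 and Table 11.16, the following minimal Python
sketch computes the three sums of squares for a small made-up data set and checks
numerically that the partition in Eq. (11.6.7) holds. The three samples and the function
name `one_way_sums_of_squares` are hypothetical and are not taken from the text.

```python
def one_way_sums_of_squares(samples):
    """Return the entries of Table 11.16 for a list of samples (lists of numbers)."""
    p = len(samples)                                   # number of samples
    n = sum(len(s) for s in samples)                   # total number of observations
    sample_means = [sum(s) / len(s) for s in samples]  # the sample means Y-bar_{i+}
    overall_mean = sum(sum(s) for s in samples) / n    # the overall mean Y-bar_{++}

    # Variation of the observations around their own sample means.
    s2_resid = sum((y - m) ** 2 for s, m in zip(samples, sample_means) for y in s)
    # Variation of the sample means around the overall mean, weighted by n_i.
    s2_betw = sum(len(s) * (m - overall_mean) ** 2 for s, m in zip(samples, sample_means))
    # Variation of all observations around the overall mean.
    s2_tot = sum((y - overall_mean) ** 2 for s in samples for y in s)

    return {
        "df_betw": p - 1, "df_resid": n - p, "df_tot": n - 1,
        "S2_betw": s2_betw, "S2_resid": s2_resid, "S2_tot": s2_tot,
        "MS_betw": s2_betw / (p - 1), "MS_resid": s2_resid / (n - p),
    }


# Hypothetical data: p = 3 samples of sizes 3, 4, and 2.
hypothetical_samples = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0, 8.0], [5.0, 7.0]]
table = one_way_sums_of_squares(hypothetical_samples)

# Eq. (11.6.7): total = residual + between, up to rounding error.
assert abs(table["S2_tot"] - (table["S2_resid"] + table["S2_betw"])) < 1e-9
print(table)
```

The total sum of squares is computed directly from its definition rather than as the sum of
the other two, so the assertion is a genuine numerical check of the identity.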


Note: The Residual Mean Square Is the Same as the Unbiased Estimator of $\sigma^2$ in
the Regression Setting. We began this section by expressing the one-way layout as a
multiple linear regression problem with data vector $Y$ and design matrix $Z$. Compare
the M.L.E. of $\sigma^2$, $\hat{\sigma}^2$ in Eq. (11.6.4), to the residual mean square in Table 11.16 to see
that the two differ only in the constant in the denominator. The M.L.E. is $S^2_{\mathrm{Resid}}/n$,
while the residual mean square is $S^2_{\mathrm{Resid}}/(n - p)$. Recall that this last ratio was called
$\sigma'^2$ in Sec. 11.5, and is an unbiased estimator of $\sigma^2$. (Prove this last fact in Exercise 8.)


Example 11.6.4  Calories in Hot Dogs. The four sample averages in Example 11.6.2 are

$$\bar{Y}_{1+} = 156.85, \quad \bar{Y}_{2+} = 158.71, \quad \bar{Y}_{3+} = 118.76, \quad \bar{Y}_{4+} = 160.56.$$

The overall average is $\bar{Y}_{++} = 147.60$. We can now form the ANOVA table in
Table 11.17. We shall test the hypotheses (11.6.5) after we develop an appropriate
test statistic. ◄


Table 11.17  ANOVA table for Example 11.6.4

Source of variation   Degrees of freedom   Sum of squares   Mean square
Between samples       3                    19,454           6485
Residuals             59                   32,995           559.2
Total                 62                   52,449


Testing Hypotheses


In order to test the hypotheses (11.6.5), we need a test statistic that will tend to be
larger if $H_1$ is true than it is if $H_0$ is true. We also need to know the distribution of the
test statistic when $H_0$ is true.


Theorem 11.6.2  Suppose that $H_0$ in (11.6.5) is true. Then

$$U^2 = \frac{S^2_{\mathrm{Betw}}/(p - 1)}{S^2_{\mathrm{Resid}}/(n - p)} \tag{11.6.9}$$

has the $F$ distribution with $p - 1$ and $n - p$ degrees of freedom.


Proof  If all $p$ samples of observations have the same mean, it can be shown (see
Exercise 2) that $S^2_{\mathrm{Betw}}/\sigma^2$ has the $\chi^2$ distribution with $p - 1$ degrees of freedom.
We have already seen that $S^2_{\mathrm{Betw}}$ is independent of $S^2_{\mathrm{Resid}}$, and $S^2_{\mathrm{Resid}}/\sigma^2$ has the $\chi^2$
distribution with $n - p$ degrees of freedom. It therefore follows that when $H_0$ is true,
$U^2$ has the distribution stated in the theorem. ■


When the null hypothesis $H_0$ is not true, so that at least two of the $\mu_i$ values are
different, then the expectation of the numerator of $U^2$ will be larger than it would
be if $H_0$ were true. (See Exercise 11.) The distribution of the denominator of $U^2$
remains the same regardless of whether or not $H_0$ is true. A sensible level $\alpha_0$ test of
the hypotheses (11.6.5) would then be to reject $H_0$ if $U^2 \ge F^{-1}_{p-1,n-p}(1 - \alpha_0)$, where
$F^{-1}_{p-1,n-p}$ is the quantile function for the $F$ distribution with $p - 1$ and $n - p$ degrees
of freedom. A partial table of $F$ distribution quantiles is given in the back of this book.
It can be shown that this test is also the level $\alpha_0$ likelihood ratio test procedure. (See
Exercise 12.)
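
To make the test concrete, here is a minimal Python sketch that forms $U^2$ from the mean
squares in Table 11.17 and approximates the corresponding p-value. The function name
`f_upper_tail`, the use of Simpson's rule, and the level `alpha0 = 0.01` are assumptions made
for illustration only; in practice one would normally rely on statistical software. The
sketch uses the standard fact that if $U^2$ has the $F$ distribution with $d_1$ and $d_2$ degrees
of freedom, then $d_1 U^2/(d_1 U^2 + d_2)$ has the beta distribution with parameters $d_1/2$ and $d_2/2$.

```python
import math


def f_upper_tail(u2, d1, d2, steps=2000):
    """Approximate Pr(F > u2) for the F distribution with d1 and d2 degrees of freedom.

    Assumes u2 > 0, d2 >= 2, and an even number of steps.
    """
    a, b = d1 / 2.0, d2 / 2.0
    x = d1 * u2 / (d1 * u2 + d2)  # value on the beta scale that corresponds to u2
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def beta_pdf(t):
        return math.exp(-log_beta) * t ** (a - 1.0) * (1.0 - t) ** (b - 1.0)

    # Simpson's rule for the integral of the beta p.d.f. over [x, 1].
    h = (1.0 - x) / steps
    total = beta_pdf(x) + beta_pdf(1.0)
    for k in range(1, steps):
        total += (4.0 if k % 2 == 1 else 2.0) * beta_pdf(x + k * h)
    return total * h / 3.0


# Mean squares from Table 11.17, with p - 1 = 3 and n - p = 59 degrees of freedom.
u2 = 6485 / 559.2
p_value = f_upper_tail(u2, 3, 59)

alpha0 = 0.01  # hypothetical level, chosen only for illustration
print(round(u2, 2), p_value, p_value < alpha0)
```

Rejecting $H_0$ when the p-value is smaller than $\alpha_0$ is equivalent to comparing $U^2$ with
the $1 - \alpha_0$ quantile, which avoids having to invert the $F$ c.d.f. in this sketch.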
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Example 
11.6.5  Calories in Hot Dogs. Suppose that we desire to test the null hypothesis that all four
types of hot dogs have the same mean calorie count against the alternative hypothesis
that at least two types have different means. The statistic $U^2$ in Eq. (11.6.9) has
the $F$ distribution with 3 and 59 degrees of freedom if the null hypothesis is true.
The observed value of $U^2$ is the ratio of the between samples mean square to the
residual mean square from Table 11.17, namely, $6485/559.2 = 11.60$. The p-value
corresponding to this value is $4.5 \times 10^{-6}$, so the null hypothesis would be rejected at
most standard levels. ◄


Power of the Test  If the null hypothesis in (11.6.5) is false, then the statistic $U^2$ in
Eq. (11.6.9) has a distribution known as noncentral $F$. For more details on the power
function, consult a more advanced text such as Scheffé (1959, chapter 2). We shall
not discuss the power of ANOVA tests any further.


Analysis of Residuals


Since the one-way layout is a special case of the general linear model, we make the
assumptions of the general linear model when we perform the one-way ANOVA
calculations. We should also compute residuals and plot them to see if the assumptions
appear reasonable. The residuals are the values $e_{ij} = Y_{ij} - \bar{Y}_{i+}$, for $j = 1, \ldots, n_i$ and
$i = 1, \ldots, p$.


Example 11.6.6  Calories in Hot Dogs. Figure 11.22 contains a plot of residuals against the categorical
variable “hot dog type.” Figure 11.23 contains the plot of residuals against normal
quantiles. The points in the normal quantile plot are labeled by the hot dog type.
Several disturbing features appear in these plots. First, there are three residuals with
large negative values. Second, each of the first three samples appears to contain two
distinct subsets, one with low residuals and one with high residuals. There is a gap
between the two subsets in each sample. This suggests that there is another variable
that we haven’t discussed yet but which distinguishes these two subgroups. If we go
back to the reported data (in the original Consumer Reports article), we find that the
weight of each package and the number of hot dogs per package are also reported.
The ratio of these two numbers is the weight of an average hot dog. Figure 11.24
contains a plot of residuals against average hot dog weight. Notice that most of the
large residuals come from the larger (heavier) hot dogs and the smaller residuals tend
to come from the smaller (lighter) hot dogs. Perhaps a better analysis would have set
$Y$ equal to calories per ounce rather than calories per hot dog. ◄

[Figure 11.22  Plot of residuals against hot dog type (beef, meat, poultry, specialty).]

[Figure 11.23  Plot of residuals against normal quantiles. The points are labeled by the
hot dog type.]

[Figure 11.24  Plot of residuals against average hot dog weight. The points are labeled
by the hot dog type.]


Summary


The one-way layout can be considered as a general linear model, and we can use
the methods of Sec. 11.5 to fit the model. However, the hypotheses of most interest
in the one-way layout are (11.6.5). These hypotheses concern more than one linear
combination of the regression coefficients, and they are not a special case of the
hypotheses that we learned how to test in Sec. 11.5. To test these new hypotheses,
we developed the analysis of variance (ANOVA) and the ANOVA table. The test
statistic is $U^2$ in Eq. (11.6.9), which has the $F$ distribution with $p - 1$ and $n - p$
degrees of freedom if $H_0$ is true. The level $\alpha_0$ test of $H_0$ is to reject $H_0$ if $U^2$ is greater
than the $1 - \alpha_0$ quantile of the appropriate $F$ distribution.


Exercises


1. In a one-way layout, show that $\bar{Y}_{i+}$ is the least-squares
estimator of $\mu_i$ by showing that the $i$th coordinate of the
vector $(Z'Z)^{-1}Z'Y$ is $\bar{Y}_{i+}$ for $i = 1, \ldots, p$.


2. Assume that $H_0$ in (11.6.5) is true; that is, all observations
have the same mean $\mu$. Prove that $S^2_{\mathrm{Betw}}/\sigma^2$ has the
$\chi^2$ distribution with $p - 1$ degrees of freedom. Hint: Let

$$X = \begin{pmatrix} n_1^{1/2} (\bar{Y}_{1+} - \mu)/\sigma \\ \vdots \\ n_p^{1/2} (\bar{Y}_{p+} - \mu)/\sigma \end{pmatrix};$$

then use the same method that was used in Sec. 8.3 to find
the distribution of the sample variance. You may use the
following fact without proving it:

Let $u = ((n_1/n)^{1/2}, \ldots, (n_p/n)^{1/2})$. Then there exists
an orthogonal matrix $A$ whose first row is $u$.
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3. Show that 

$$\sum_{i=1}^{p} n_i (\bar{Y}_{i+} - \bar{Y}_{++})^2 = \sum_{i=1}^{p} n_i \bar{Y}_{i+}^2 - n \bar{Y}_{++}^2.$$


4. Specimens of milk from a number of dairies in three 
different districts were analyzed, and the concentration 
of the radioactive isotope strontium-90 was measured in 
each specimen. Suppose that specimens were obtained 
from four dairies in the first district, from six dairies in the 
second district, and from three dairies in the third district, 
and that the results measured in picocuries per liter were 
as follows: 


District 1: 6.4, 5.8, 6.5, 7.7, 
District 2: 7.1, 9.9, 11.2, 10.5, 6.5, 8.8, 
District 3: 9.5, 9.0, 12.1. 


a. Assuming that the variance of the concentration of 
strontium-90 is the same for the dairies in all three 
districts, determine the M.L.E. of the mean concen- 
tration in each of the districts and the M.L.E. of the 
common variance. 


b. Test the hypothesis that the three districts have iden- 
tical concentrations of strontium-90. 


5. A random sample of 10 students was selected from the 
senior class at each of four large high schools, and the score 
of each of these 40 students on a certain mathematics ex- 
amination was observed. Suppose that for the 10 students 
from each school, the sample mean and the sample vari- 
ance of the scores were as shown in Table 11.18. Test the 
hypothesis that the senior classes at all four high schools 
would perform equally well on this examination. Discuss 
carefully the assumptions that you are making in carrying 
out this test. 


Table 11.18  Data for Exercise 5

School   Sample mean   Sample variance
1        105.7         30.3
2        102.0         54.4
3         93.5         25.0
4        110.8         36.4


6. Suppose that a random sample of size $n$ is taken from
the normal distribution with mean $\mu$ and variance $\sigma^2$.
Before the sample is observed, the random variables are
divided into $p$ groups of sizes $n_1, \ldots, n_p$, where $n_i \ge 2$
for $i = 1, \ldots, p$ and $\sum_{i=1}^{p} n_i = n$. For $i = 1, \ldots, p$, let $Q_i$
denote the sum of the squares of the deviations of the
$n_i$ observations in the $i$th group from the sample mean
of those $n_i$ observations. Find the distribution of the sum
$Q_1 + \cdots + Q_p$ and the distribution of the ratio $Q_1/Q_p$.


7. Verify that the t test presented in Sec. 9.6 for comparing
the means of two normal distributions is the same as the 
test presented in this section for the one-way layout with 
p =2 by verifying that if U is defined by Eq. (9.6.3), then 
U? is equal to the expression given in Eq. (11.6.9). 


8. Show that in a one-way layout the following statistic is
an unbiased estimator of $\sigma^2$:

$$\frac{1}{n - p} \sum_{i=1}^{p} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i+})^2.$$


9. In a one-way layout, show that for all values of $i$, $i'$,
and $j$, where $j = 1, \ldots, n_i$, $i = 1, \ldots, p$, and $i' = 1, \ldots, p$,
the following three random variables $W_1$, $W_2$, and $W_3$ are
uncorrelated with each other:

$$W_1 = Y_{ij} - \bar{Y}_{i+}, \quad W_2 = \bar{Y}_{i'+} - \bar{Y}_{++}, \quad W_3 = \bar{Y}_{++}.$$


10. In 1973, the President of Texaco, Inc., made a statement
to a U.S. Senate subcommittee concerned with air
and water pollution. The committee was concerned with,
among other things, the noise levels associated with automobile
filters. He cited the data in Table 11.19 from a
study that included vehicles of three different sizes.


Table 11.19 Data for Exercise 10 


Vehicle size Noise values 


Small 810, 820, 820, 835, 835, 835 
Medium 840, 840, 840, 845, 855, 850 
Large 785, 790, 785, 760, 760, 770 


a. Construct the ANOVA table for these data. 

b. Compute the p-value for the null hypothesis that all 
three sizes of vehicles produce the same level of noise 
on average. 


11. Assume that the null hypothesis $H_0$ in (11.6.5) is false.
Prove that the expected value of $S^2_{\mathrm{Betw}}$ is
$(p - 1)\sigma^2 + \sum_{i=1}^{p} n_i (\mu_i - \bar{\mu})^2$, where $\bar{\mu} = \frac{1}{n} \sum_{i=1}^{p} n_i \mu_i$.


12. Prove that the level $\alpha_0$ likelihood ratio test of hypotheses
(11.6.5) in the one-way layout is to reject $H_0$ if
$U^2 \ge F^{-1}_{p-1,n-p}(1 - \alpha_0)$. Hint: First, partition $\sum_{j=1}^{n_i} (Y_{ij} - \mu_i)^2$ in
a manner similar to Eq. (11.6.8). Then, replace $\bar{Y}_{++}$ by a
constant, say, $\mu$, in the formula for $S^2_{\mathrm{Tot}}$ and partition the
result in a manner similar to Eq. (11.6.7). There will be
one extra term.


13. Suppose that the null hypothesis in (11.6.5) is true.
Prove that $S^2_{\mathrm{Tot}}/\sigma^2$ has the $\chi^2$ distribution with $n - 1$ degrees
of freedom.


14. A popular alternative parameterization of the one-way
layout is the following. Let $\mu = \frac{1}{n} \sum_{i=1}^{p} n_i \mu_i$, and define
$\alpha_i = \mu_i - \mu$. This makes $E(Y_{ij}) = \mu + \alpha_i$.

a. Prove that $\sum_{i=1}^{p} n_i \alpha_i = 0$.

b. Prove that the M.L.E. of $\alpha_i$ is $\bar{Y}_{i+} - \bar{Y}_{++}$.

c. Prove that the null hypothesis $H_0$ in (11.6.5) is equivalent
to $\alpha_1 = \cdots = \alpha_p = 0$.

d. Prove that the mean of $S^2_{\mathrm{Betw}}$ is $(p - 1)\sigma^2 + \sum_{i=1}^{p} n_i \alpha_i^2$.


In Sec. 11.6, we learned how to analyze several samples that differed in some
characteristic. For example, we analyzed data collected from hot dogs that differed
by the type of meat from which they were made. Suppose that, in addition to
differing by the type of meat, the hot dogs had also differed by being labeled either
“low fat” or not. This would have given us two different characteristics to form the
basis for comparisons. In this section, we shall study how to analyze data consisting
of observations that differ on two characteristics.


The Two-Way Layout with One Observation in Each Cell


Example 11.7.1  Radioactive Isotope in Milk. Suppose that in an experiment to measure the
concentration of a certain radioactive isotope in milk, specimens of milk are obtained from four
different dairies, and the concentration of the isotope in each specimen is measured
by three different methods. If we let $Y_{ij}$ denote the measurement that is made for the
specimen from the $i$th dairy by using the $j$th method, for $i = 1, 2, 3, 4$ and $j = 1, 2, 3$,
then in this example there will be a total of 12 measurements. There are two main
questions of interest in this example. The first is whether the concentration of the
isotope is the same in the milk of all four dairies. The second question is whether the
three different methods produce concentration measurements that appear to differ. ◄


A problem of the type in Example 11.7.1, in which the value of the random
variable being observed is affected by two factors, is called a two-way layout. In
the general two-way layout, there are two factors, which we shall call A and B. We
shall assume that there are $I$ possible different values, or different levels, of factor
A, and that there are $J$ possible different values, or different levels, of factor B.
For $i = 1, \ldots, I$ and $j = 1, \ldots, J$, an observation $Y_{ij}$ of the variable being studied
is obtained when factor A has the value $i$ and factor B has the value $j$. If the $IJ$
observations are arranged in a matrix as in Table 11.20, then $Y_{ij}$ is the observation in
the $(i, j)$ cell of the matrix.


Table 11.20  Generic data for two-way layout

                   Factor B
Factor A   1          2          ...   J
1          $Y_{11}$   $Y_{12}$   ...   $Y_{1J}$
2          $Y_{21}$   $Y_{22}$   ...   $Y_{2J}$
...        ...        ...        ...   ...
I          $Y_{I1}$   $Y_{I2}$   ...   $Y_{IJ}$


We shall continue to make the assumptions of the general linear model for the
two-way layout. Thus, we shall assume that all the observations $Y_{ij}$ are independent,
each observation has a normal distribution, and all the observations have the same
variance $\sigma^2$. In this section, we specialize the assumption about the mean $E(Y_{ij})$ as
follows: We shall assume not only that $E(Y_{ij})$ depends on the values $i$ and $j$ of the
two factors, but also that there exist numbers $\theta_1, \ldots, \theta_I$ and $\psi_1, \ldots, \psi_J$ such that

$$E(Y_{ij}) = \theta_i + \psi_j \quad \text{for } i = 1, \ldots, I \text{ and } j = 1, \ldots, J. \tag{11.7.1}$$

Thus, Eq. (11.7.1) states that the value of $E(Y_{ij})$ is the sum of the following two
effects: an effect $\theta_i$ due to factor A having the value $i$, and an effect $\psi_j$ due to factor B
having the value $j$. For this reason, the assumption that $E(Y_{ij})$ has the form given in
Eq. (11.7.1) is called an assumption of additivity of the effects of the factors.

The meaning of the assumption of additivity can be clarified by the following
example. Consider the sales of $I$ different magazines at $J$ different newsstands. Suppose
that a particular newsstand sells on the average 30 more copies per week of
magazine 1 than of magazine 2. Then by the assumption of additivity, it must also be
true that each of the other $J - 1$ newsstands sells on the average 30 more copies per
week of magazine 1 than of magazine 2. Similarly, suppose that the sales of a particular
magazine are on the average 50 more copies per week at newsstand 1 than at
newsstand 2. Then by the assumption of additivity, it must also be true that the sales
of each of the other $I - 1$ magazines are on the average 50 more copies per week at
newsstand 1 than at newsstand 2. The assumption of additivity is a very restrictive
assumption because it does not allow for the possibility that a particular magazine
may sell unusually well at some particular newsstand. In Sec. 11.8, we shall consider
models in which we do not make the assumption of additivity.

Even though we assume in the general two-way layout that the effects of the
factors A and B are additive, the numbers $\theta_i$ and $\psi_j$ that satisfy Eq. (11.7.1) are
not uniquely defined. We can add an arbitrary constant $c$ to each of the numbers
$\theta_1, \ldots, \theta_I$ and subtract the same constant $c$ from each of the numbers $\psi_1, \ldots, \psi_J$
without changing the value of $E(Y_{ij})$ for any of the $IJ$ observations. Hence, it does not
make sense to try to estimate the value of $\theta_i$ or $\psi_j$ from the given observations, since
neither $\theta_i$ nor $\psi_j$ is uniquely defined. In order to avoid this difficulty, we shall express
$E(Y_{ij})$ in terms of different parameters. The following assumption is equivalent to the
assumption of additivity.

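We shall assume that there exist numbers $\mu$, $\alpha_1, \ldots, \alpha_I$, and $\beta_1, \ldots, \beta_J$ such that

$$\sum_{i=1}^{I} \alpha_i = 0 \quad \text{and} \quad \sum_{j=1}^{J} \beta_j = 0, \tag{11.7.2}$$

and

$$E(Y_{ij}) = \mu + \alpha_i + \beta_j \quad \text{for } i = 1, \ldots, I \text{ and } j = 1, \ldots, J. \tag{11.7.3}$$

There is an advantage in expressing $E(Y_{ij})$ in this way. If the values of $E(Y_{ij})$ for
$i = 1, \ldots, I$ and $j = 1, \ldots, J$ are a set of numbers that satisfy Eq. (11.7.1) for some
set of values of $\theta_1, \ldots, \theta_I$ and $\psi_1, \ldots, \psi_J$, then there exists a unique set of values of
$\mu$, $\alpha_1, \ldots, \alpha_I$, and $\beta_1, \ldots, \beta_J$ that satisfy Eqs. (11.7.2) and (11.7.3) (see Exercise 3).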
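
One way to see why the constrained parameters are unique is to compute them directly from
any additive representation: averaging the additive form of $E(Y_{ij})$ over $i$ and $j$ suggests taking
$\mu$ to be the sum of the two averages of the additive effects and taking each effect as a
deviation from its own average. The short Python sketch below illustrates this with made-up
numbers; the function name `additive_to_constrained` and all numerical values are hypothetical.
It shows that two different additive representations of the same means lead to the same
constrained parameters.

```python
def additive_to_constrained(theta, psi):
    """Map additive effects (theta, psi) to (mu, alpha, beta) with sum(alpha) = sum(beta) = 0."""
    theta_bar = sum(theta) / len(theta)
    psi_bar = sum(psi) / len(psi)
    mu = theta_bar + psi_bar                    # overall mean
    alpha = [t - theta_bar for t in theta]      # effects of factor A
    beta = [s - psi_bar for s in psi]           # effects of factor B
    return mu, alpha, beta


theta = [1.0, 2.0, 6.0]  # hypothetical effects of factor A (I = 3)
psi = [10.0, 20.0]       # hypothetical effects of factor B (J = 2)
c = 5.0                  # arbitrary shift that leaves every E(Y_ij) unchanged
shifted_theta = [t + c for t in theta]
shifted_psi = [s - c for s in psi]

first = additive_to_constrained(theta, psi)
second = additive_to_constrained(shifted_theta, shifted_psi)
print(first)
print(second)
```

Both calls describe the same cell means, since the shift by `c` cancels in every sum of a
`theta` entry and a `psi` entry, and both produce the same triple of constrained parameters.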

The parameter $\mu$ is called the overall mean, or the grand mean, since it follows
from Eqs. (11.7.2) and (11.7.3) that

$$\mu = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} E(Y_{ij}). \tag{11.7.4}$$

The parameters $\alpha_1, \ldots, \alpha_I$ are called the effects of factor A, and the parameters
$\beta_1, \ldots, \beta_J$ are called the effects of factor B.

It follows from Eq. (11.7.2) that $\alpha_I = -\sum_{i=1}^{I-1} \alpha_i$ and $\beta_J = -\sum_{j=1}^{J-1} \beta_j$. Hence, each
expectation $E(Y_{ij})$ in Eq. (11.7.3) can be expressed as a particular linear combination
of the $I + J - 1$ parameters $\mu$, $\alpha_1, \ldots, \alpha_{I-1}$, and $\beta_1, \ldots, \beta_{J-1}$. Therefore, if we regard
the $IJ$ observations as elements of a single long $IJ$-dimensional vector, then the two-way
layout satisfies the conditions of the general linear model. In a practical problem,
however, it is not convenient to actually replace $\alpha_I$ and $\beta_J$ with their expressions in
terms of the other $\alpha_i$’s and $\beta_j$’s, because this replacement would destroy the symmetry
that is present in the experiment among the different levels of each factor.


Estimating the Parameters


The following result is straightforward, but tedious, to prove.


Theorem 11.7.1  Define

$$\begin{aligned}
\bar{Y}_{i+} &= \frac{1}{J} \sum_{j=1}^{J} Y_{ij} \quad \text{for } i = 1, \ldots, I, \\
\bar{Y}_{+j} &= \frac{1}{I} \sum_{i=1}^{I} Y_{ij} \quad \text{for } j = 1, \ldots, J, \\
\bar{Y}_{++} &= \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} Y_{ij} = \frac{1}{I} \sum_{i=1}^{I} \bar{Y}_{i+} = \frac{1}{J} \sum_{j=1}^{J} \bar{Y}_{+j}.
\end{aligned} \tag{11.7.5}$$

Then the M.L.E.’s (and least-squares estimators) of $\mu$, $\alpha_1, \ldots, \alpha_I$, and $\beta_1, \ldots, \beta_J$
are as follows:

$$\begin{aligned}
\hat{\mu} &= \bar{Y}_{++}, \\
\hat{\alpha}_i &= \bar{Y}_{i+} - \bar{Y}_{++} \quad \text{for } i = 1, \ldots, I, \\
\hat{\beta}_j &= \bar{Y}_{+j} - \bar{Y}_{++} \quad \text{for } j = 1, \ldots, J.
\end{aligned} \tag{11.7.6}$$

The M.L.E. of $\sigma^2$ will be

$$\hat{\sigma}^2 = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} (Y_{ij} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j)^2 = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} (Y_{ij} - \bar{Y}_{i+} - \bar{Y}_{+j} + \bar{Y}_{++})^2.$$


It is easily verified (see Exercise 6) that $\sum_{i=1}^{I} \hat{\alpha}_i = \sum_{j=1}^{J} \hat{\beta}_j = 0$; $E(\hat{\mu}) = \mu$; $E(\hat{\alpha}_i) = \alpha_i$
for $i = 1, \ldots, I$; and $E(\hat{\beta}_j) = \beta_j$ for $j = 1, \ldots, J$. Because $E(Y_{ij}) = \mu + \alpha_i + \beta_j$, the
M.L.E. of $E(Y_{ij})$ is

$$\hat{Y}_{ij} = \bar{Y}_{i+} + \bar{Y}_{+j} - \bar{Y}_{++} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j,$$

which is also called the fitted value for $Y_{ij}$.


Example 11.7.2  Radioactive Isotope in Milk. Consider again Example 11.7.1. Suppose that the
concentrations of the radioactive isotope measured in picocuries per liter by three different
methods in specimens of milk from four dairies are as shown in Table 11.21. From
Table 11.21, the row averages are $\bar{Y}_{1+} = 5.5$, $\bar{Y}_{2+} = 8.8$, $\bar{Y}_{3+} = 8.3$, and $\bar{Y}_{4+} = 7.6$; the
column averages are $\bar{Y}_{+1} = 8.25$, $\bar{Y}_{+2} = 5.65$, and $\bar{Y}_{+3} = 8.75$; and the average of all
the observations is $\bar{Y}_{++} = 7.55$. Hence, by Eq. (11.7.6), the values of the M.L.E.’s
are $\hat{\mu} = 7.55$, $\hat{\alpha}_1 = -2.05$, $\hat{\alpha}_2 = 1.25$, $\hat{\alpha}_3 = 0.75$, $\hat{\alpha}_4 = 0.05$, $\hat{\beta}_1 = 0.70$, $\hat{\beta}_2 = -1.90$, and
$\hat{\beta}_3 = 1.20$.


Table 11.21  Data for Example 11.7.2

              Method
Dairy   1      2      3
1       6.4    3.2    6.9
2       8.5    7.8    10.1
3       9.3    6.0    9.6
4       8.8    5.6    8.4


Table 11.22  Fitted values for observations in Example 11.7.2

              Method
Dairy   1      2      3
1       6.2    3.6    6.7
2       9.5    6.9    10.0
3       9.0    6.4    9.5
4       8.3    5.7    8.8


The fitted values $\hat{Y}_{ij}$ for all of the observations are given in Table 11.22. By
comparing the observed values in Table 11.21 with the fitted values in Table 11.22,
we see that the differences between corresponding terms are generally small. These
small differences indicate that the model used in the two-way layout, which assumes
the additivity of the effects of the two factors, provides a good fit for the observed
values. Finally, it is found from Tables 11.21 and 11.22 that

$$\sum_{i=1}^{4} \sum_{j=1}^{3} (Y_{ij} - \hat{Y}_{ij})^2 = 2.74.$$

Hence, by Theorem 11.7.1, $\hat{\sigma}^2 = 2.74/12 = 0.228$. ◄
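
The calculations in this example can be reproduced with a few lines of Python. The sketch
below is only an illustration (the variable names are hypothetical); it applies the formulas
of Theorem 11.7.1 to the observations in Table 11.21.

```python
# Observations from Table 11.21: rows are dairies (factor A), columns are methods (factor B).
y = [
    [6.4, 3.2, 6.9],
    [8.5, 7.8, 10.1],
    [9.3, 6.0, 9.6],
    [8.8, 5.6, 8.4],
]
I, J = len(y), len(y[0])

row_means = [sum(row) / J for row in y]                             # averages over methods
col_means = [sum(y[i][j] for i in range(I)) / I for j in range(J)]  # averages over dairies
grand_mean = sum(sum(row) for row in y) / (I * J)                   # average of all observations

# M.L.E.'s from Eq. (11.7.6).
mu_hat = grand_mean
alpha_hat = [m - grand_mean for m in row_means]
beta_hat = [m - grand_mean for m in col_means]

# Fitted values and the M.L.E. of the variance.
fitted = [[mu_hat + alpha_hat[i] + beta_hat[j] for j in range(J)] for i in range(I)]
resid_ss = sum((y[i][j] - fitted[i][j]) ** 2 for i in range(I) for j in range(J))
sigma2_hat = resid_ss / (I * J)

print([round(a, 2) for a in alpha_hat], [round(b, 2) for b in beta_hat])
print(round(resid_ss, 2), round(sigma2_hat, 3))
```

Note that the divisor for the M.L.E. of the variance is the number of observations, not the
residual degrees of freedom.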


Partitioning the Sum of Squares


We shall partition the total sum of squares in much the same way that we did in
Sec. 11.6. Begin with

$$S^2_{\mathrm{Tot}} = \sum_{i=1}^{I} \sum_{j=1}^{J} (Y_{ij} - \bar{Y}_{++})^2. \tag{11.7.7}$$

We shall now partition the sum of squares $S^2_{\mathrm{Tot}}$ into three smaller sums of squares.
Each of these smaller sums of squares will be associated with a certain type of
variation among the observations $Y_{ij}$. Each of them (divided by $\sigma^2$) will have a $\chi^2$
distribution if certain null hypotheses are true, and they will be mutually independent
whether or not the null hypotheses are true. Therefore, just as in the one-way layout,
we can construct tests of certain null hypotheses based on an analysis of variance,
that is, on an analysis of these different types of variation.


Theorem 11.7.2  Partitioning the Sum of Squares. Let $S^2_{\mathrm{Tot}}$ be as defined in Eq. (11.7.7). Then

$$S^2_{\mathrm{Tot}} = S^2_{\mathrm{Resid}} + S^2_A + S^2_B, \tag{11.7.8}$$

where

$$\begin{aligned}
S^2_{\mathrm{Resid}} &= \sum_{i=1}^{I} \sum_{j=1}^{J} (Y_{ij} - \bar{Y}_{i+} - \bar{Y}_{+j} + \bar{Y}_{++})^2, \\
S^2_A &= J \sum_{i=1}^{I} (\bar{Y}_{i+} - \bar{Y}_{++})^2, \\
S^2_B &= I \sum_{j=1}^{J} (\bar{Y}_{+j} - \bar{Y}_{++})^2.
\end{aligned}$$

Furthermore, $S^2_{\mathrm{Resid}}/\sigma^2$ has the $\chi^2$ distribution with $(I - 1)(J - 1)$ degrees of freedom,
and the three component sums of squares are mutually independent.


Proof  We shall begin by rewriting $S^2_{\mathrm{Tot}}$ as follows:

$$S^2_{\mathrm{Tot}} = \sum_{i=1}^{I} \sum_{j=1}^{J} \left[ (Y_{ij} - \bar{Y}_{i+} - \bar{Y}_{+j} + \bar{Y}_{++}) + (\bar{Y}_{i+} - \bar{Y}_{++}) + (\bar{Y}_{+j} - \bar{Y}_{++}) \right]^2. \tag{11.7.9}$$

By expanding the right side of Eq. (11.7.9), we obtain (see Exercise 8) Eq. (11.7.8).
It can be shown that the random variables $S^2_{\mathrm{Resid}}$, $S^2_A$, and $S^2_B$ are independent.
(See Exercise 9 for a related result.) Furthermore, it can be shown that $S^2_{\mathrm{Resid}}/\sigma^2$ has the
$\chi^2$ distribution with $IJ - (I + J - 1) = (I - 1)(J - 1)$ degrees of freedom. ■


It is easy to see that $S^2_A$ measures the variation of the sample means for the different
levels of factor A around the overall sample mean. Similarly, $S^2_B$ measures the
variation of the sample means for the different levels of factor B around the overall
sample mean. By using relations (11.7.6), we can rewrite $S^2_{\mathrm{Resid}}$ as

$$S^2_{\mathrm{Resid}} = \sum_{i=1}^{I} \sum_{j=1}^{J} (Y_{ij} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j)^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} (Y_{ij} - \hat{Y}_{ij})^2.$$

This makes it clear that $S^2_{\mathrm{Resid}}$ measures the residual variation, that is, the variation
between the observations not explained by the model. The partitioning is summarized
in Table 11.23, which is the ANOVA table for the two-way layout. As in the
case of the one-way layout, the degrees of freedom will turn out to be degrees of
freedom for various $\chi^2$ random variables when certain null hypotheses are true.


Table 11.23  General ANOVA table for two-way layout

Source of variation   Degrees of freedom   Sum of squares           Mean square
Factor A              $I - 1$              $S^2_A$                  $S^2_A/(I - 1)$
Factor B              $J - 1$              $S^2_B$                  $S^2_B/(J - 1)$
Residuals             $(I - 1)(J - 1)$     $S^2_{\mathrm{Resid}}$   $S^2_{\mathrm{Resid}}/[(I - 1)(J - 1)]$
Total                 $IJ - 1$             $S^2_{\mathrm{Tot}}$


Table 11.24  ANOVA table for Example 11.7.3

Source of variation   Degrees of freedom   Sum of squares   Mean square
Dairy                 3                    18.99            6.33
Method                2                    22.16            11.08
Residuals             6                    2.74             0.4567
Total                 11                   43.89


Example 11.7.3  Radioactive Isotope in Milk. Using the estimates calculated in Example 11.7.2, we
can compute the ANOVA table in Table 11.24. After we develop appropriate test
statistics, we can use Table 11.24 to test hypotheses about the effects of the two factors. ◄
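
As a numerical check on the two-way partition, the following Python sketch recomputes the
sums of squares and mean squares of Table 11.24 from the data in Table 11.21. The function
name `two_way_anova` and the dictionary keys are hypothetical; the formulas are those of the
two-way partition, with one observation per cell.

```python
def two_way_anova(y):
    """Sums of squares and mean squares for an I x J table with one observation per cell."""
    I, J = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (I * J)
    row_means = [sum(row) / J for row in y]
    col_means = [sum(y[i][j] for i in range(I)) / I for j in range(J)]

    s2_a = J * sum((m - grand) ** 2 for m in row_means)   # factor A (rows)
    s2_b = I * sum((m - grand) ** 2 for m in col_means)   # factor B (columns)
    s2_resid = sum((y[i][j] - row_means[i] - col_means[j] + grand) ** 2
                   for i in range(I) for j in range(J))
    s2_tot = sum((v - grand) ** 2 for row in y for v in row)
    df_resid = (I - 1) * (J - 1)
    return {
        "S2_A": s2_a, "S2_B": s2_b, "S2_resid": s2_resid, "S2_tot": s2_tot,
        "MS_A": s2_a / (I - 1), "MS_B": s2_b / (J - 1), "MS_resid": s2_resid / df_resid,
    }


# Data from Table 11.21 (rows: dairies, columns: methods).
milk = [[6.4, 3.2, 6.9], [8.5, 7.8, 10.1], [9.3, 6.0, 9.6], [8.8, 5.6, 8.4]]
anova = two_way_anova(milk)
for name, value in anova.items():
    print(name, round(value, 4))
```

Because the total sum of squares is computed from its own definition, comparing it with the
sum of the three components gives an independent check of the partition.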


Testing Hypotheses 


Radioactive Isotope in Milk. Consider again the situation described in Example 11.7.2
involving four dairies and three measurement methods. We might be interested in
testing that, for each of the three methods of measurement, the distributions of
concentration of isotope do not differ from one dairy to the next. If we regard the dairy
as factor A and the measurement method as factor B, then the hypothesis that $\alpha_i = 0$
for $i = 1, \ldots, I$ means that for each method of measurement, the concentration of
the isotope has the same distribution for all four dairies. In other words, there are
no differences among the dairies. Alternatively, we might be interested in testing the
hypothesis that, for each dairy, the three methods of measurement all produce the
same distribution of concentration of isotope. For this case, the hypothesis that $\beta_j = 0$
for $j = 1, \ldots, J$ means that for each dairy, the three methods of measurement yield
the same distribution for the concentration of the isotope. However, this hypothesis
does not state that regardless of which of the three different methods is applied to
a particular specimen of milk, the same value would be obtained. Because of the
inherent variability of the measurements, the hypothesis states only that the values
yielded by the three methods have the same normal distribution. ◄


In a problem involving a two-way layout, we are often interested in testing the
hypothesis that one or both of the factors has no effect on the distribution of the
observations. In other words, we are often interested either in testing the hypothesis
that all of the effects $\alpha_1, \ldots, \alpha_I$ of factor A are equal to 0 or in testing the hypothesis
that all of the effects $\beta_1, \ldots, \beta_J$ of factor B are equal to 0 or in testing that all of the
$\alpha_i$ and $\beta_j$ are 0. For the remainder of the discussion of testing hypotheses, it will be
useful to define

$$\sigma' = \left( \frac{S^2_{\mathrm{Resid}}}{(I - 1)(J - 1)} \right)^{1/2}. \tag{11.7.10}$$


Theorem 11.7.3  Consider the following hypotheses:

$$\begin{aligned} H_0 &: \alpha_i = 0 \text{ for } i = 1, \ldots, I, \\ H_1 &: \text{The hypothesis } H_0 \text{ is not true.} \end{aligned} \tag{11.7.11}$$

If $H_0$ is true, then the following random variable has the $F$ distribution with $I - 1$
and $(I - 1)(J - 1)$ degrees of freedom:

$$U_A^2 = \frac{S^2_A}{(I - 1)\sigma'^2}. \tag{11.7.12}$$

Similarly, suppose next that the following hypotheses are to be tested:

$$\begin{aligned} H_0 &: \beta_j = 0 \text{ for } j = 1, \ldots, J, \\ H_1 &: \text{The hypothesis } H_0 \text{ is not true.} \end{aligned} \tag{11.7.13}$$

When the null hypothesis $H_0$ is true, the following statistic has the $F$ distribution with
$J - 1$ and $(I - 1)(J - 1)$ degrees of freedom:

$$U_B^2 = \frac{S^2_B}{(J - 1)\sigma'^2}. \tag{11.7.14}$$

Finally, suppose that the following hypotheses are to be tested:

$$\begin{aligned} H_0 &: \alpha_i = 0 \text{ for } i = 1, \ldots, I, \text{ and } \beta_j = 0 \text{ for } j = 1, \ldots, J, \\ H_1 &: \text{The hypothesis } H_0 \text{ is not true.} \end{aligned} \tag{11.7.15}$$

When the null hypothesis $H_0$ is true, the following statistic has the $F$ distribution with
$I + J - 2$ and $(I - 1)(J - 1)$ degrees of freedom:

$$U_{A+B}^2 = \frac{S^2_A + S^2_B}{(I + J - 2)\sigma'^2}. \tag{11.7.16}$$

For each case above, a level $\alpha_0$ test of the hypotheses is to reject $H_0$ if the corresponding
statistic ($U_A^2$, $U_B^2$, or $U_{A+B}^2$) is at least as large as the $1 - \alpha_0$ quantile of the
corresponding $F$ distribution.


Proof We shall prove the claim for hypotheses (11.7.11). The proof for hypotheses
(11.7.13) is virtually identical. The proof for hypotheses (11.7.15) is similar and is left
for Exercise 16. Since $\sum_{j=1}^{J} \beta_j = 0$ and $\alpha_i = 0$ under $H_0$, we conclude that $\bar{Y}_{i+}$ has the
normal distribution with mean $\mu$ and variance $\sigma^2/J$ for each $i = 1, \ldots, I$. Since the
$\bar{Y}_{i+}$ are independent and $\bar{Y}_{++}$ is the average of $\bar{Y}_{1+}, \ldots, \bar{Y}_{I+}$, Theorem 8.3.1 says that
the following random variable has the $\chi^2$ distribution with $I - 1$ degrees of freedom:

$$\frac{J}{\sigma^2} \sum_{i=1}^{I} (\bar{Y}_{i+} - \bar{Y}_{++})^2 = \frac{S_A^2}{\sigma^2}.$$

Since $S_{\text{Resid}}^2/\sigma^2$ has the $\chi^2$ distribution with $(I - 1)(J - 1)$ degrees of freedom, we
now conclude that

$$\frac{S_A^2/(I - 1)}{S_{\text{Resid}}^2/[(I - 1)(J - 1)]} \tag{11.7.17}$$

has the $F$ distribution with $I - 1$ and $(I - 1)(J - 1)$ degrees of freedom. It is easy to
see that the random variable in (11.7.17) is the same as $U_A^2$ defined in Eq. (11.7.12).

Let $F^{-1}_{I-1,(I-1)(J-1)}(1 - \alpha_0)$ denote the $1 - \alpha_0$ quantile of the $F$ distribution with
$I - 1$ and $(I - 1)(J - 1)$ degrees of freedom. Let $\delta$ be the test that rejects $H_0$ if
$U_A^2 \ge F^{-1}_{I-1,(I-1)(J-1)}(1 - \alpha_0)$, and let $\pi(\theta \mid \delta)$ be its power function for each parameter
vector $\theta$. Since $U_A^2$ has the stated $F$ distribution for all parameter vectors $\theta$ that are
consistent with $H_0$, it follows that for each such $\theta$, $\pi(\theta \mid \delta) = \alpha_0$, and $\delta$ is a level $\alpha_0$ test. ■


Notice that $U_A^2$ in Theorem 11.7.3 is the ratio of the factor A mean square to the
residuals mean square in Table 11.23. When the null hypothesis $H_0$ in (11.7.11) is
not true, the value of $\alpha_i = E(\bar{Y}_{i+} - \bar{Y}_{++})$ is not 0 for at least one value of $i$. Hence,
the expectation of the numerator of $U_A^2$ will be larger than it would be when $H_0$ is
true. (See Exercise 1.) The distribution of the denominator of $U_A^2$ remains the same
regardless of whether $H_0$ is true. It can also be shown that the test in Theorem 11.7.3
is the level $\alpha_0$ likelihood ratio test procedure for the hypotheses (11.7.11).


Testing for Differences among the Dairies. Suppose now that it is desired to use the
observed values in Table 11.21 to test the hypothesis that there are no differences
among the dairies, that is, to test the hypotheses (11.7.11). In this example, the statistic
$U_A^2$ defined by Eq. (11.7.12) has the $F$ distribution with three and six degrees of
freedom. Using the ANOVA table in Table 11.24, we find that $U_A^2 = 6.33/0.4567 =
13.86$. The corresponding p-value is smaller than 0.025, the smallest value in the tables
in this book. Using statistical software, we compute the p-value to be about 0.004.
So the hypothesis that there are no differences among the dairies would be rejected
at all levels of significance of 0.004 or more. ◄


Testing for Differences among the Methods of Measurement. Suppose next that it is
desired to use the observed values in Table 11.21 to test the hypothesis that each of
the effects of the different methods of measurement is equal to 0, that is, to test the
hypotheses (11.7.13). In this example, the statistic $U_B^2$ defined by Eq. (11.7.14) has
the $F$ distribution with two and six degrees of freedom. Using the ANOVA table in
Table 11.24, we find that $U_B^2 = 11.08/0.4567 = 24.26$. The p-value corresponding to
this observation is about 0.001, so the hypothesis that there are no differences among
the methods would be rejected at all levels of significance greater than 0.001. ◄
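
The two p-values quoted in these examples can be checked without statistical tables. The following is a minimal sketch, not the software used for the text: it assumes the standard identity that, for an $F$ random variable with $d_1$ and $d_2$ degrees of freedom, $\Pr(F > x)$ equals the regularized incomplete beta function $I_z(d_2/2, d_1/2)$ with $z = d_2/(d_2 + d_1 x)$, and it evaluates that integral by Simpson's rule. The function name `f_upper_tail` and the number of subintervals are illustrative choices.

```python
import math


def f_upper_tail(x, d1, d2, n=2000):
    """Approximate Pr(F > x) for the F distribution with d1 and d2 degrees of freedom.

    Uses Pr(F > x) = I_z(d2/2, d1/2) with z = d2 / (d2 + d1 * x), where I_z is the
    regularized incomplete beta function, integrated numerically by Simpson's rule.
    Assumes x > 0, d2 >= 2, and an even number n of subintervals.
    """
    a, b = d2 / 2.0, d1 / 2.0
    z = d2 / (d2 + d1 * x)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def g(t):
        # Unnormalized beta density with parameters a and b.
        return t ** (a - 1.0) * (1.0 - t) ** (b - 1.0)

    h = z / n
    total = g(0.0) + g(z)
    for m in range(1, n):
        total += (4 if m % 2 else 2) * g(m * h)
    return (h / 3.0) * total / math.exp(log_beta)


# Mean squares read from the ANOVA table (Table 11.24) as quoted in the examples.
u_a = 6.33 / 0.4567    # dairies: F with 3 and 6 degrees of freedom
u_b = 11.08 / 0.4567   # methods: F with 2 and 6 degrees of freedom

p_dairies = f_upper_tail(u_a, 3, 6)
p_methods = f_upper_tail(u_b, 2, 6)
print(round(u_a, 2), round(p_dairies, 4))
print(round(u_b, 2), round(p_methods, 4))
```

Both values agree with the p-values of about 0.004 and 0.001 reported above. A library routine for the $F$ distribution would normally replace the hand-written integration.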


Summary 


The two-way layout can be considered as a general linear model, but the hypotheses 
of interest concern more than one linear combination of the regression coefficients. 
An ANOVA table was developed for the two-way layout that can be used for forming 
test statistics for various hypotheses. When we have only one observation at each
combination of factor levels, we assume that the effects of the two factors are additive.
Then we can test the two null hypotheses that each of the two factors makes no
difference to the means of the observations. These tests make use of the test statistics
$U_A^2$ in Eq. (11.7.12) and $U_B^2$ in Eq. (11.7.14). If the corresponding null hypotheses are
true, each of these statistics has an $F$ distribution.


Exercises 


1. Suppose that the null hypothesis $H_0$ in (11.7.11) is false.
Show that $E(S_A^2) = (I - 1)\sigma^2 + J \sum_{i=1}^{I} \alpha_i^2$.


2. Consider a two-way layout in which the values of $E(Y_{ij})$
for $i = 1, \ldots, I$ and $j = 1, \ldots, J$ are as given in each of
the following four matrices. For each matrix, state whether
the effects of the factors are additive.


a.
            Factor B
Factor A    1     2
   1        5     7
   2       10    14
b.
            Factor B
Factor A    1     2
   1        3     6
   2        4     7
c. 
Factor B 
Factor A 1 2 3 4 
3 —1 0 3 
4 5 8 
3 0 4 
d. 
Factor B 
Factor A 1 2 3 4 
2 3 4 
4 6 8 
3 3 6 9 12 


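3. Show that if the effects of the factors in a two-way
layout are additive, then there exist unique numbers $\mu$,
$\alpha_1, \ldots, \alpha_I$, and $\beta_1, \ldots, \beta_J$ that satisfy Eqs. (11.7.2) and
(11.7.3). Hint: Let $\mu$ be the average of all $\theta_i + \psi_j$ values,
let $\alpha_i$ equal $\theta_i$ minus the average of the $\theta_i$'s, and similarly
for $\beta_j$.


4. Suppose that in a two-way layout, with $I = 2$ and $J = 2$,
the values of $E(Y_{ij})$ are as given in part (b) of Exercise 2.
Determine the values of $\mu$, $\alpha_1$, $\alpha_2$, $\beta_1$, and $\beta_2$ that satisfy
Eqs. (11.7.2) and (11.7.3).


5. Suppose that in a two-way layout, with $I = 3$ and $J = 4$,
the values of $E(Y_{ij})$ are as given in part (c) of Exercise 2.
Determine the values of $\mu$, $\alpha_1$, $\alpha_2$, $\alpha_3$, and $\beta_1, \ldots, \beta_4$ that
satisfy Eqs. (11.7.2) and (11.7.3).


6. Verify that if $\hat{\mu}$, $\hat{\alpha}_i$, and $\hat{\beta}_j$ are defined by Eq. (11.7.6),
then $\sum_{i=1}^{I} \hat{\alpha}_i = \sum_{j=1}^{J} \hat{\beta}_j = 0$; $E(\hat{\mu}) = \mu$; $E(\hat{\alpha}_i) = \alpha_i$ for
$i = 1, \ldots, I$; and $E(\hat{\beta}_j) = \beta_j$ for $j = 1, \ldots, J$.


7. Show that if $\hat{\mu}$, $\hat{\alpha}_i$, and $\hat{\beta}_j$ are defined by Eq. (11.7.6),
then

$$\operatorname{Var}(\hat{\mu}) = \frac{1}{IJ}\sigma^2,$$
$$\operatorname{Var}(\hat{\alpha}_i) = \frac{I - 1}{IJ}\sigma^2 \quad \text{for } i = 1, \ldots, I,$$
$$\operatorname{Var}(\hat{\beta}_j) = \frac{J - 1}{IJ}\sigma^2 \quad \text{for } j = 1, \ldots, J.$$


8. Show that the right sides of Eqs. (11.7.9) and (11.7.8)
are equal.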


9. Show that in a two-way layout, for all values of $i$, $j$,
$i'$, and $j'$ ($i$ and $i' = 1, \ldots, I$; $j$ and $j' = 1, \ldots, J$), the
following four random variables $W_1$, $W_2$, $W_3$, and $W_4$ are
uncorrelated with one another:

$$W_1 = Y_{ij} - \bar{Y}_{i+} - \bar{Y}_{+j} + \bar{Y}_{++},$$
$$W_2 = \bar{Y}_{i'+} - \bar{Y}_{++}, \quad W_3 = \bar{Y}_{+j'} - \bar{Y}_{++}, \quad W_4 = \bar{Y}_{++}.$$


10. Show that

$$\sum_{i=1}^{I} (\bar{Y}_{i+} - \bar{Y}_{++})^2 = \sum_{i=1}^{I} \bar{Y}_{i+}^2 - I\bar{Y}_{++}^2$$

and

$$\sum_{j=1}^{J} (\bar{Y}_{+j} - \bar{Y}_{++})^2 = \sum_{j=1}^{J} \bar{Y}_{+j}^2 - J\bar{Y}_{++}^2.$$


11. Show that

$$\sum_{i=1}^{I} \sum_{j=1}^{J} (Y_{ij} - \bar{Y}_{i+} - \bar{Y}_{+j} + \bar{Y}_{++})^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} Y_{ij}^2 - J \sum_{i=1}^{I} \bar{Y}_{i+}^2 - I \sum_{j=1}^{J} \bar{Y}_{+j}^2 + IJ\bar{Y}_{++}^2.$$


12. In a study to compare the reflective properties of various
paints and various plastic surfaces, three different
types of paint were applied to specimens of five different
types of plastic surfaces. Suppose that the observed results
in appropriate coded units were as shown in Table 11.25.
Determine the values of $\hat{\mu}$, $\hat{\alpha}_1$, $\hat{\alpha}_2$, $\hat{\alpha}_3$, and $\hat{\beta}_1, \ldots, \hat{\beta}_5$.


Table 11.25 Data for Exercises 12-15

                      Type of surface
Type of paint     1      2      3      4      5
      1         14.5   13.6   16.3   23.2   19.4
      2         14.6   16.2   14.8   16.8   17.3
      3         16.2   14.0   15.5   18.7   21.0


13. For the conditions of Exercise 12 and the data in
Table 11.25, determine the value of the least-squares
estimate of $E(Y_{ij})$ for $i = 1, 2, 3$ and $j = 1, \ldots, 5$, and
determine the value of $\hat{\sigma}^2$.


14. For the conditions of Exercise 12 and the data in
Table 11.25, test the hypothesis that the reflective properties
of the three different types of paint are the same.


15. For the conditions of Exercise 12 and the data in
Table 11.25, test the hypothesis that the reflective properties
of the five different types of plastic surfaces are the
same.


16. Prove the claim in Theorem 11.7.3 about the distribution
of $U_{A+B}^2$.

* 11.8 The Two-Way Layout with Replications 
Suppose that we obtain more than one observation in each cell of a two-way layout. 
In addition to testing hypotheses about the separate effects of the two factors, we 
can also test the hypothesis that the additivity assumption (11.7.3) holds. However, 
the interpretations of the separate effects of the two factors are more complicated 
if the additivity assumption fails. When the additivity assumption fails, we say that 
there is interaction between the two factors. 
The Two-Way Layout with K Observations in Each Cell 
Example 11.8.1 Gasoline Consumption. Suppose that an experiment is carried out by an automobile
manufacturer to investigate whether a certain device, installed on an automobile,
affects the amount of gasoline consumed by the automobile. The manufacturer pro- 
duces three different models of automobiles, namely, a compact model, an interme- 
diate model, and a standard model. Five cars of each model, which were equipped 
with this device, were driven over a fixed route through city traffic, and the gasoline 
consumption of each car was measured. Also, five cars of each model, which were 
not equipped with this device, were driven over the same route, and the gasoline 
consumption of each of these cars was measured. The results, in liters of gasoline 
consumed, are given in Table 11.26. 

The same sorts of questions that arose in Sec. 11.7 arise here. For example, are
the mean gasoline consumptions different for cars with and without the device? Are
the mean gasoline consumptions different for the three car models? An additional
question can be addressed in an example like this in which there are multiple observations
under each combination of factors. We can ask whether the effect (if any) of
the device is different for the different car models. ◄


Table 11.26 Data for Example 11.8.1

                            Compact   Intermediate   Standard
                            model     model          model
Equipped with device          8.3        9.2           11.6
                              8.9       10.2           10.2
                              7.8        9.5           10.7
                              8.5       11.3           11.9
                              9.4       10.4           11.0
Not equipped with device      8.7        8.2           12.4
                             10.0       10.6           11.7
                              9.7       10.1           10.0
                              7.9       11.3           11.1
                              8.4       10.8           11.8


We shall continue to consider problems of ANOVA involving a two-way layout. Now,
however, instead of having just a single observation $Y_{ij}$ for each combination of $i$
and $j$, we shall have $K$ independent observations $Y_{ijk}$ for $k = 1, \ldots, K$. In other
words, instead of having just one observation in each cell of Table 11.20, we have
$K$ i.i.d. observations. The $K$ observations in each cell are obtained under similar
experimental conditions and are called replications. The total number of observations
in this two-way layout with replications is $IJK$. We continue to assume that all the
observations are independent, each observation has a normal distribution, and all
the observations have the same variance $\sigma^2$.

We shall let $\theta_{ij}$ denote the mean of each of the $K$ observations in the $(i, j)$ cell.
Thus, for $i = 1, \ldots, I$; $j = 1, \ldots, J$; and $k = 1, \ldots, K$, we have

$$E(Y_{ijk}) = \theta_{ij}. \tag{11.8.1}$$


In a two-way layout with replications, we shall no longer assume, as we did in
Sec. 11.7, that the effects of the two factors are additive. Here we can assume that
the expectations $\theta_{ij}$ are arbitrary numbers. As we shall see later in this section, we
can then test the hypothesis that the effects are additive.

It is easy to verify that the M.L.E., or least-squares estimator, of $\theta_{ij}$ is simply the
sample mean of the $K$ observations in the $(i, j)$ cell. Thus,

$$\hat{\theta}_{ij} = \frac{1}{K} \sum_{k=1}^{K} Y_{ijk} = \bar{Y}_{ij+}. \tag{11.8.2}$$

The M.L.E. of $\sigma^2$ is therefore

$$\hat{\sigma}^2 = \frac{1}{IJK} \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} (Y_{ijk} - \bar{Y}_{ij+})^2. \tag{11.8.3}$$

In order to identify and discuss the effects of the two factors, and to examine
the possibility that these effects are additive, it is helpful to replace the parameters
$\theta_{ij}$, for $i = 1, \ldots, I$ and $j = 1, \ldots, J$, with a new set of parameters $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_{ij}$.
These new parameters are defined by the following relations:

$$\theta_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij} \quad \text{for } i = 1, \ldots, I \text{ and } j = 1, \ldots, J, \tag{11.8.4}$$

and

$$\sum_{i=1}^{I} \alpha_i = 0, \qquad \sum_{j=1}^{J} \beta_j = 0,$$
$$\sum_{i=1}^{I} \gamma_{ij} = 0 \quad \text{for } j = 1, \ldots, J, \tag{11.8.5}$$
$$\sum_{j=1}^{J} \gamma_{ij} = 0 \quad \text{for } i = 1, \ldots, I.$$

It can be shown (see Exercise 1) that corresponding to each set of numbers $\theta_{ij}$ for
$i = 1, \ldots, I$ and $j = 1, \ldots, J$, there exist unique numbers $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_{ij}$ that satisfy
Eqs. (11.8.4) and (11.8.5).

The parameter $\mu$ is called the overall mean or the grand mean. The parameters
$\alpha_1, \ldots, \alpha_I$ are called the main effects of factor A, and the parameters $\beta_1, \ldots, \beta_J$
are called the main effects of factor B. The parameters $\gamma_{ij}$, for $i = 1, \ldots, I$ and
$j = 1, \ldots, J$, are called the interactions. It can be seen from Eqs. (11.8.1) and (11.8.4)
that the effects of the factors A and B are additive if and only if all the interactions
vanish, that is, if and only if $\gamma_{ij} = 0$ for every combination of values of $i$ and $j$.
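
To make the reparameterization concrete, here is a minimal sketch that computes $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_{ij}$ from a matrix of cell means and then checks additivity by testing whether every interaction vanishes. It assumes the solution obtained by averaging: $\mu$ is the average of all $\theta_{ij}$, $\alpha_i$ is the $i$th row average minus $\mu$, $\beta_j$ is the $j$th column average minus $\mu$, and $\gamma_{ij}$ is whatever remains. The function names and the two small matrices are hypothetical illustrations, not taken from the text.

```python
def decompose(theta):
    """Split cell means theta[i][j] into mu, alpha, beta, gamma.

    The returned values satisfy theta = mu + alpha_i + beta_j + gamma_ij together
    with the sum-to-zero constraints on alpha, beta, and the rows and columns of gamma.
    """
    n_rows, n_cols = len(theta), len(theta[0])
    mu = sum(sum(row) for row in theta) / (n_rows * n_cols)
    alpha = [sum(theta[i]) / n_cols - mu for i in range(n_rows)]
    beta = [sum(theta[i][j] for i in range(n_rows)) / n_rows - mu
            for j in range(n_cols)]
    gamma = [[theta[i][j] - mu - alpha[i] - beta[j] for j in range(n_cols)]
             for i in range(n_rows)]
    return mu, alpha, beta, gamma


def is_additive(theta, tol=1e-9):
    """Effects are additive exactly when every interaction gamma_ij is zero."""
    gamma = decompose(theta)[3]
    return all(abs(g) <= tol for row in gamma for g in row)


# Hypothetical cell means: in the first matrix the two rows differ by a constant.
additive_means = [[1.0, 2.0, 4.0], [3.0, 4.0, 6.0]]
non_additive_means = [[1.0, 2.0], [2.0, 5.0]]

print(is_additive(additive_means))       # every interaction is zero
print(is_additive(non_additive_means))   # some interaction is nonzero
print(decompose(non_additive_means)[3])
```

The tolerance `tol` only guards against floating-point rounding; with exact arithmetic the check would be for interactions exactly equal to 0.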

The notation that has been developed in Sections 11.6 and 11.7 will again be
used here. We shall replace a subscript of $Y_{ijk}$ with a plus sign to indicate that we
have summed the values of $Y_{ijk}$ over all possible values of that subscript. If we have
made two or three summations, we shall use two or three plus signs. We shall then
place a bar over $Y$ to indicate that we have divided this sum by the number of terms
in the summation and have thereby obtained an average of the values of $Y_{ijk}$ for the
subscript or subscripts involved in the summation. For example,

$$\bar{Y}_{ij+} = \frac{1}{K} \sum_{k=1}^{K} Y_{ijk},$$

$$\bar{Y}_{+j+} = \frac{1}{IK} \sum_{i=1}^{I} \sum_{k=1}^{K} Y_{ijk},$$

and $\bar{Y}_{+++}$ denotes the average of all $IJK$ observations.
Similar logic to that used in the proof of Theorem 11.2.1 can be used to prove
the following result. The details are left to Exercises 2 and 5.


Theorem 11.8.1 The M.L.E.'s (and least-squares estimators) of $\mu$, $\alpha_i$, and $\beta_j$ are as follows:

$$\hat{\mu} = \bar{Y}_{+++},$$
$$\hat{\alpha}_i = \bar{Y}_{i++} - \bar{Y}_{+++} \quad \text{for } i = 1, \ldots, I, \tag{11.8.6}$$
$$\hat{\beta}_j = \bar{Y}_{+j+} - \bar{Y}_{+++} \quad \text{for } j = 1, \ldots, J.$$

Also, for $i = 1, \ldots, I$ and $j = 1, \ldots, J$,

$$\hat{\gamma}_{ij} = \bar{Y}_{ij+} - (\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j) = \bar{Y}_{ij+} - \bar{Y}_{i++} - \bar{Y}_{+j+} + \bar{Y}_{+++}. \tag{11.8.7}$$

Moreover, for all values of $i$ and $j$, $E(\hat{\mu}) = \mu$, $E(\hat{\alpha}_i) = \alpha_i$, $E(\hat{\beta}_j) = \beta_j$, and $E(\hat{\gamma}_{ij}) = \gamma_{ij}$.


Example 11.8.2 Gasoline Consumption. In Example 11.8.1, let the A factor be the device, and let the
B factor be the car model. Then we have $I = 2$, $J = 3$, and $K = 5$. The average value
$\bar{Y}_{ij+}$ for each of the six cells in Table 11.26 is presented in Table 11.27, which also
gives the average value $\bar{Y}_{i++}$ for each of the two rows, the average value $\bar{Y}_{+j+}$ for
each of the three columns, and the average value $\bar{Y}_{+++}$ of all 30 observations.


Table 11.27 Cell averages in Example 11.8.2

                            Compact                 Intermediate             Standard                 Average
                            model                   model                    model                    for row
Equipped with device        $\bar{Y}_{11+} = 8.58$   $\bar{Y}_{12+} = 10.12$   $\bar{Y}_{13+} = 11.08$   $\bar{Y}_{1++} = 9.9267$
Not equipped with device    $\bar{Y}_{21+} = 8.94$   $\bar{Y}_{22+} = 10.20$   $\bar{Y}_{23+} = 11.40$   $\bar{Y}_{2++} = 10.1800$
Average for column          $\bar{Y}_{+1+} = 8.76$   $\bar{Y}_{+2+} = 10.16$   $\bar{Y}_{+3+} = 11.24$   $\bar{Y}_{+++} = 10.0533$


It follows from Table 11.27 and Eqs. (11.8.6) and (11.8.7) that the values of the
M.L.E.'s, or least-squares estimators, in this example are

$$\hat{\mu} = 10.0533, \quad \hat{\alpha}_1 = -0.1267, \quad \hat{\alpha}_2 = 0.1267,$$
$$\hat{\beta}_1 = -1.2933, \quad \hat{\beta}_2 = 0.1067, \quad \hat{\beta}_3 = 1.1867,$$
$$\hat{\gamma}_{11} = -0.0533, \quad \hat{\gamma}_{12} = 0.0867, \quad \hat{\gamma}_{13} = -0.0333,$$
$$\hat{\gamma}_{21} = 0.0533, \quad \hat{\gamma}_{22} = -0.0867, \quad \hat{\gamma}_{23} = 0.0333.$$

In this example, the estimates of the interactions $\gamma_{ij}$ are small for all values of $i$ and $j$. ◄
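
The estimates in this example can be reproduced directly from the raw data. The sketch below is one possible way to do so, under the assumption that the first intermediate-model value for the equipped cars in Table 11.26 is 9.2 (the value consistent with the cell average 10.12). Variable names such as `cell`, `row`, and `col` are illustrative.

```python
# Gasoline consumption data from Table 11.26. data[i][j] holds the K = 5 observations
# for device level i (0 = equipped, 1 = not equipped) and car model j
# (0 = compact, 1 = intermediate, 2 = standard).
data = [
    [[8.3, 8.9, 7.8, 8.5, 9.4], [9.2, 10.2, 9.5, 11.3, 10.4], [11.6, 10.2, 10.7, 11.9, 11.0]],
    [[8.7, 10.0, 9.7, 7.9, 8.4], [8.2, 10.6, 10.1, 11.3, 10.8], [12.4, 11.7, 10.0, 11.1, 11.8]],
]
I, J, K = len(data), len(data[0]), len(data[0][0])

# Cell, row, column, and grand averages (all cells have the same size K).
cell = [[sum(data[i][j]) / K for j in range(J)] for i in range(I)]
row = [sum(cell[i]) / J for i in range(I)]
col = [sum(cell[i][j] for i in range(I)) / I for j in range(J)]
grand = sum(row) / I

# Estimators from Eqs. (11.8.6) and (11.8.7).
mu_hat = grand
alpha_hat = [row[i] - grand for i in range(I)]
beta_hat = [col[j] - grand for j in range(J)]
gamma_hat = [[cell[i][j] - (mu_hat + alpha_hat[i] + beta_hat[j]) for j in range(J)]
             for i in range(I)]

print("mu_hat    =", round(mu_hat, 4))
print("alpha_hat =", [round(a, 4) for a in alpha_hat])
print("beta_hat  =", [round(b, 4) for b in beta_hat])
print("gamma_hat =", [[round(g, 4) for g in g_row] for g_row in gamma_hat])
```

Because every cell contains the same number of observations, averaging the cell averages gives the same row, column, and grand averages as averaging the raw observations.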
Partitioning the Sum of Squares
Consider now the total sum of squares,

$$S_{\text{Tot}}^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} (Y_{ijk} - \bar{Y}_{+++})^2. \tag{11.8.8}$$

We shall now indicate how $S_{\text{Tot}}^2$ can be partitioned into four smaller independent sums
of squares, each of which is associated with a particular type of variation among the
observations. Under various null hypotheses, each sum of squares (divided by $\sigma^2$)
will have a $\chi^2$ distribution.


Theorem 11.8.2 Let $S_{\text{Tot}}^2$ be as defined in Eq. (11.8.8). Then

$$S_{\text{Tot}}^2 = S_A^2 + S_B^2 + S_{\text{Int}}^2 + S_{\text{Resid}}^2, \tag{11.8.9}$$

where

$$S_A^2 = JK \sum_{i=1}^{I} (\bar{Y}_{i++} - \bar{Y}_{+++})^2,$$
$$S_B^2 = IK \sum_{j=1}^{J} (\bar{Y}_{+j+} - \bar{Y}_{+++})^2, \tag{11.8.10}$$
$$S_{\text{Int}}^2 = K \sum_{i=1}^{I} \sum_{j=1}^{J} (\bar{Y}_{ij+} - \bar{Y}_{i++} - \bar{Y}_{+j+} + \bar{Y}_{+++})^2,$$
$$S_{\text{Resid}}^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} (Y_{ijk} - \bar{Y}_{ij+})^2.$$

In addition, $S_{\text{Resid}}^2/\sigma^2$ has the $\chi^2$ distribution with $IJ(K - 1)$ degrees of freedom.
If all $\alpha_i = 0$, then $S_A^2/\sigma^2$ has the $\chi^2$ distribution with $I - 1$ degrees of freedom. If
all $\beta_j = 0$, then $S_B^2/\sigma^2$ has the $\chi^2$ distribution with $J - 1$ degrees of freedom. If all
$\gamma_{ij} = 0$, then $S_{\text{Int}}^2/\sigma^2$ has the $\chi^2$ distribution with $(I - 1)(J - 1)$ degrees of freedom.
The four sums of squares are mutually independent.


Proof The proof of (11.8.9) is left to the reader in Exercise 7.

The random variable $S_{\text{Resid}}^2/\sigma^2$ is the sum of $IJ$ independent random variables
of the form $\sum_{k=1}^{K} (Y_{ijk} - \bar{Y}_{ij+})^2/\sigma^2$. According to Theorem 8.3.1, each of these $IJ$
random variables has the $\chi^2$ distribution with $K - 1$ degrees of freedom. Hence, the
sum of all $IJ$ of them has the $\chi^2$ distribution with $IJ(K - 1)$ degrees of freedom. If
all of the $\alpha_i = 0$, then $\bar{Y}_{1++}, \ldots, \bar{Y}_{I++}$ all have the normal distribution with mean $\mu$
and variance $\sigma^2/(JK)$. Theorem 8.3.1 implies that $S_A^2/\sigma^2$ has the $\chi^2$ distribution with
$I - 1$ degrees of freedom. Similarly, if all $\beta_j = 0$, then $S_B^2/\sigma^2$ has the $\chi^2$ distribution
with $J - 1$ degrees of freedom.

The number of degrees of freedom for $S_{\text{Int}}^2$ can be determined as follows: If all of
the $\gamma_{ij} = 0$, then the additivity assumption holds, and $S_{\text{Int}}^2$ is the same as $S_{\text{Resid}}^2$ from
Sec. 11.7 except for the fact that each $\bar{Y}_{ij+}$ has the normal distribution with mean
$\mu + \alpha_i + \beta_j$ and variance $\sigma^2/K$ instead of variance $\sigma^2$. This means that if all $\gamma_{ij} = 0$,
then $S_{\text{Int}}^2/\sigma^2$ has the $\chi^2$ distribution with $(I - 1)(J - 1)$ degrees of freedom.

Finally, it can be shown that all of the sums of squares in relations (11.8.10) are
independent (see Exercise 8 for a related result). ■


The claims in Theorem 11.8.2 are summarized in Table 11.28, which is the 
ANOVA table for the two-way layout with K observations per cell. 


Example 11.8.3 Gasoline Consumption. Using the sample means computed in Example 11.8.2, we can
form the ANOVA table in Table 11.29. We shall use the mean squares in Table 11.29
to test various hypotheses about the effects of the factors after we develop test
procedures. ◄
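
The sums of squares and mean squares for the gasoline data can be computed from the definitions in (11.8.10). The following sketch does this and also checks numerically that the four pieces add up to the total sum of squares, as (11.8.9) states. It again assumes that the first intermediate-model value for the equipped cars is 9.2; the variable names are illustrative.

```python
# Gasoline consumption data from Table 11.26: data[i][j] lists the K observations
# for device level i (0 = equipped, 1 = not equipped) and car model j.
data = [
    [[8.3, 8.9, 7.8, 8.5, 9.4], [9.2, 10.2, 9.5, 11.3, 10.4], [11.6, 10.2, 10.7, 11.9, 11.0]],
    [[8.7, 10.0, 9.7, 7.9, 8.4], [8.2, 10.6, 10.1, 11.3, 10.8], [12.4, 11.7, 10.0, 11.1, 11.8]],
]
I, J, K = len(data), len(data[0]), len(data[0][0])

cell = [[sum(data[i][j]) / K for j in range(J)] for i in range(I)]
row = [sum(cell[i]) / J for i in range(I)]
col = [sum(cell[i][j] for i in range(I)) / I for j in range(J)]
grand = sum(row) / I

# Sums of squares from (11.8.10) and the total from (11.8.8).
s_a = J * K * sum((row[i] - grand) ** 2 for i in range(I))
s_b = I * K * sum((col[j] - grand) ** 2 for j in range(J))
s_int = K * sum((cell[i][j] - row[i] - col[j] + grand) ** 2
                for i in range(I) for j in range(J))
s_resid = sum((y - cell[i][j]) ** 2
              for i in range(I) for j in range(J) for y in data[i][j])
s_tot = sum((y - grand) ** 2
            for i in range(I) for j in range(J) for y in data[i][j])

# (source, degrees of freedom, sum of squares), in the order of the ANOVA table.
table = [
    ("Main effects of device", I - 1, s_a),
    ("Main effects of model", J - 1, s_b),
    ("Interactions", (I - 1) * (J - 1), s_int),
    ("Residuals", I * J * (K - 1), s_resid),
]
for source, df, ss in table:
    print(f"{source:24s} {df:3d} {ss:9.4f} {ss / df:9.4f}")
print(f"{'Total':24s} {I * J * K - 1:3d} {s_tot:9.4f}")
```

The printed rows match the entries of Table 11.29 up to rounding.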


Table 11.28 General ANOVA table for two-way layout with replication

Source of             Degrees of          Sum of
variation             freedom             squares                  Mean square
Main effects of A     $I - 1$             $S_A^2$                  $S_A^2/(I - 1)$
Main effects of B     $J - 1$             $S_B^2$                  $S_B^2/(J - 1)$
Interactions          $(I - 1)(J - 1)$    $S_{\text{Int}}^2$       $S_{\text{Int}}^2/[(I - 1)(J - 1)]$
Residuals             $IJ(K - 1)$         $S_{\text{Resid}}^2$     $S_{\text{Resid}}^2/[IJ(K - 1)]$
Total                 $IJK - 1$           $S_{\text{Tot}}^2$


Table 11.29 ANOVA table for data from Example 11.8.2

Source of                  Degrees of    Sum of
variation                  freedom       squares     Mean square
Main effects of device          1         0.4813       0.4813
Main effects of model           2        30.92        15.46
Interactions                    2         0.1147       0.0573
Residuals                      24        18.22         0.7590
Total                          29        49.73


Testing Hypotheses


As mentioned before, the effects of the factors A and B are additive if and only if
all the interactions $\gamma_{ij}$ vanish. Hence, to test whether the effects of the factors are
additive, we must test the following hypotheses:

$$\begin{aligned} H_0 &: \gamma_{ij} = 0 \text{ for } i = 1, \ldots, I \text{ and } j = 1, \ldots, J, \\ H_1 &: \text{The hypothesis } H_0 \text{ is not true.} \end{aligned} \tag{11.8.11}$$

It follows from Theorem 11.8.2 that when the null hypothesis $H_0$ is true, the
random variable $S_{\text{Int}}^2/\sigma^2$ has the $\chi^2$ distribution with $(I - 1)(J - 1)$ degrees of freedom.
Furthermore, regardless of whether or not $H_0$ is true, the independent random
variable $S_{\text{Resid}}^2/\sigma^2$ has the $\chi^2$ distribution with $IJ(K - 1)$ degrees of freedom. Thus,
when $H_0$ is true, the following random variable $U_{AB}^2$ has the $F$ distribution with
$(I - 1)(J - 1)$ and $IJ(K - 1)$ degrees of freedom:

$$U_{AB}^2 = \frac{IJ(K - 1)S_{\text{Int}}^2}{(I - 1)(J - 1)S_{\text{Resid}}^2}, \tag{11.8.12}$$

which is also the ratio of the interaction mean square to the residual mean square.
The null hypothesis $H_0$ would be rejected at level $\alpha_0$ if

$$U_{AB}^2 \ge F^{-1}_{(I-1)(J-1),\,IJ(K-1)}(1 - \alpha_0),$$

where $F^{-1}_{(I-1)(J-1),\,IJ(K-1)}$ is the quantile function of the $F$ distribution with
$(I - 1)(J - 1)$ and $IJ(K - 1)$ degrees of freedom.


Example 11.8.4 Gasoline Consumption. Suppose that it is desired to use the data from Example 11.8.2
to test the null hypothesis that the effects of equipping a car with the device and
using a particular model are additive, against the alternative that these effects are not
additive. In other words, suppose that it is desired to test the hypotheses (11.8.11).
Using the mean squares in Table 11.29 and Eq. (11.8.12), we compute that $U_{AB}^2 =
0.0573/0.7590 = 0.076$. The corresponding p-value can be found using statistical
software, and its value is 0.9275. Hence, the null hypothesis that the effects are
additive would not be rejected at any common level of significance. ◄
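
The interaction test in this example can be checked with a few lines of arithmetic. The sketch below computes $U_{AB}^2$ from the sums of squares in Table 11.29 and then evaluates the p-value. It relies on a closed form that holds only when the numerator has 2 degrees of freedom: for an $F$ random variable with 2 and $d$ degrees of freedom, $\Pr(F > x) = (1 + 2x/d)^{-d/2}$. The helper name `f_upper_tail_2` is an illustrative choice, and a general $F$ routine would be needed for other numerator degrees of freedom.

```python
def f_upper_tail_2(x, d):
    """Pr(F > x) when F has 2 and d degrees of freedom (closed form for this case only)."""
    return (1.0 + 2.0 * x / d) ** (-d / 2.0)


I, J, K = 2, 3, 5
s_int, s_resid = 0.1147, 18.22          # sums of squares from Table 11.29

df_int = (I - 1) * (J - 1)              # numerator degrees of freedom
df_resid = I * J * (K - 1)              # denominator degrees of freedom
assert df_int == 2, "the closed form above requires 2 numerator degrees of freedom"

# Eq. (11.8.12): ratio of the interaction mean square to the residual mean square.
u_ab = (s_int / df_int) / (s_resid / df_resid)
p_value = f_upper_tail_2(u_ab, df_resid)

print(round(u_ab, 4), round(p_value, 4))
```

The p-value agrees with the value 0.9275 quoted in the example, up to the rounding of the table entries.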


If the null hypothesis $H_0$ in (11.8.11) is rejected, then it suggests that at least
some of the interactions $\gamma_{ij}$ are not 0. Therefore, the means of the observations for
certain combinations of $i$ and $j$ will be larger than the means of the observations for
other combinations, and both factor A and factor B affect these means. In this case,
because both factor A and factor B affect the means of the observations, there is not
usually any further interest in testing whether either the main effects $\alpha_1, \ldots, \alpha_I$ or
the main effects $\beta_1, \ldots, \beta_J$ are zero.

On the other hand, if the null hypothesis $H_0$ in (11.8.11) is not rejected (as is
the case in Example 11.8.4), then we might decide to act as if all the interactions
are 0. If, in addition, all the main effects $\alpha_1, \ldots, \alpha_I$ were 0, then the mean value of
each observation would not depend in any way on the value of $i$. In this case, factor
A would have no effect on the observations. Therefore, if the null hypothesis $H_0$ in
(11.8.11) is not rejected, we might be interested in testing the following hypotheses:

$$\begin{aligned} H_0 &: \alpha_i = 0 \text{ and } \gamma_{ij} = 0 \text{ for } i = 1, \ldots, I \text{ and } j = 1, \ldots, J, \\ H_1 &: \text{The hypothesis } H_0 \text{ is not true.} \end{aligned} \tag{11.8.13}$$

Indeed, we might be interested in testing these hypotheses even if we had not first
tested the hypotheses (11.8.11).

According to Theorem 11.8.2, if $H_0$ is true, then $S_A^2/\sigma^2$ and $S_{\text{Int}}^2/\sigma^2$ are independent,
having $\chi^2$ distributions with $I - 1$ and $(I - 1)(J - 1)$ degrees of freedom,
respectively. It follows that, when $H_0$ in (11.8.13) is true, the following random variable
$U_A^2$ has the $F$ distribution with $(I - 1) + (I - 1)(J - 1) = (I - 1)J$ and $IJ(K - 1)$
degrees of freedom:

$$U_A^2 = \frac{IJ(K - 1)[S_A^2 + S_{\text{Int}}^2]}{(I - 1)J S_{\text{Resid}}^2}. \tag{11.8.14}$$

If we did not test the hypotheses (11.8.11) first, then we can reject $H_0$ in (11.8.13) at
level $\alpha_0$ if $U_A^2 \ge F^{-1}_{(I-1)J,\,IJ(K-1)}(1 - \alpha_0)$.

If we first tested (11.8.11) and failed to reject the null hypothesis, there are two
important considerations to emphasize before proceeding with a test of (11.8.13).
First, the size of the second test, the test of (11.8.13), should be calculated conditional
on having failed to reject the null hypothesis in (11.8.11). That is, if the second test is
to reject the null hypothesis in (11.8.13) if $T \ge c$ (for some statistic $T$), then the size
of the second test should be the conditional probability

$$\Pr\left(T \ge c \,\middle|\, U_{AB}^2 < F^{-1}_{(I-1)(J-1),\,IJ(K-1)}(1 - \alpha_0)\right). \tag{11.8.15}$$

Calculation of this conditional probability is beyond the scope of this book, but it can
be approximated using simulation methods that will be introduced in Chapter 12.
(See Example 12.3.4 for an illustration.)

The second consideration involves the choice of test statistic $T$ for testing
(11.8.13). For the case in which we did not first test (11.8.11), the statistic $U_A^2$ in
(11.8.14) is a sensible test statistic. However, if we have already failed to reject the
null hypothesis in (11.8.11), a better test statistic might be

$$V_A^2 = \frac{IJ(K - 1)S_A^2}{(I - 1)S_{\text{Resid}}^2}. \tag{11.8.16}$$

One reason for this is that, with $T = V_A^2$, the probability in (11.8.15) will often be
closer to $\alpha_0$ than with $T = U_A^2$. For instance, if $IJ(K - 1)$ is large and $H_0$ is true,
then $S_{\text{Resid}}^2/[IJ(K - 1)]$ should be close to $\sigma^2$ with high probability. In this case, since $S_A^2$ and
$S_{\text{Int}}^2$ are independent random variables, the random variables $V_A^2$ and $U_{AB}^2$ should be
nearly independent as well. This will make the test based on $V_A^2$ nearly independent
of whether or not the test based on $U_{AB}^2$ rejected its null hypothesis. On the other
hand, because

$$U_A^2 = \frac{1}{J}\left[V_A^2 + (J - 1)U_{AB}^2\right],$$

we see that the dependence between $U_A^2$ and $U_{AB}^2$ is likely to be quite high under all
circumstances.

So, if we first test (11.8.11) and fail to reject the null hypothesis, we should then
use $V_A^2$ to test (11.8.13). We would then reject the null hypothesis if $V_A^2 \ge c$, where
$c$ is some constant. Unfortunately, we still cannot find a usable expression for $c$
other than to note that the size of this second test, conditional on the first test, is
(11.8.15) with $T = V_A^2$. We can use simulation methods to compute this if necessary.
(See Example 12.3.4.) The overall size of this two-stage procedure is larger than
$\alpha_0$. (See Exercise 19.) In practice, it is common to let $c = F^{-1}_{I-1,\,IJ(K-1)}(1 - \alpha_0)$ and
pretend as if (11.8.15) with $T = V_A^2$ is essentially $\alpha_0$.
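
The conditional probability (11.8.15) can be approximated by straightforward Monte Carlo simulation. The sketch below is only a rough illustration under stated assumptions and is not the method of Example 12.3.4: it takes $I = 2$, $J = 3$, $K = 5$, and $\alpha_0 = 0.05$, simulates standard normal data (the three statistics do not depend on $\mu$, the $\beta_j$, or $\sigma^2$ when the null hypothesis in (11.8.13) holds), and hard-codes 0.95 quantiles of the relevant $F$ distributions taken from standard tables. The function names, the seed, and the number of simulations are arbitrary choices.

```python
import random

I, J, K = 2, 3, 5
DF_RESID = I * J * (K - 1)

# 0.95 quantiles of F distributions, taken from standard tables (assumed values).
C_AB = 3.4028   # 2 and 24 degrees of freedom, for the additivity test
C_V = 4.2597    # 1 and 24 degrees of freedom, for V_A^2
C_U = 3.0088    # 3 and 24 degrees of freedom, for U_A^2


def simulate_statistics(rng):
    """Draw one data set with all alpha_i = gamma_ij = 0; return (U_AB^2, V_A^2, U_A^2)."""
    y = [[[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(J)] for _ in range(I)]
    cell = [[sum(y[i][j]) / K for j in range(J)] for i in range(I)]
    row = [sum(cell[i]) / J for i in range(I)]
    col = [sum(cell[i][j] for i in range(I)) / I for j in range(J)]
    grand = sum(row) / I
    s_a = J * K * sum((row[i] - grand) ** 2 for i in range(I))
    s_int = K * sum((cell[i][j] - row[i] - col[j] + grand) ** 2
                    for i in range(I) for j in range(J))
    s_resid = sum((v - cell[i][j]) ** 2
                  for i in range(I) for j in range(J) for v in y[i][j])
    ms_resid = s_resid / DF_RESID
    u_ab = s_int / ((I - 1) * (J - 1)) / ms_resid
    v_a = s_a / (I - 1) / ms_resid
    u_a = (s_a + s_int) / ((I - 1) * J) / ms_resid
    return u_ab, v_a, u_a


def conditional_sizes(n_sim=20000, seed=1):
    """Estimate (11.8.15) for T = V_A^2 and for T = U_A^2."""
    rng = random.Random(seed)
    kept = reject_v = reject_u = 0
    for _ in range(n_sim):
        u_ab, v_a, u_a = simulate_statistics(rng)
        if u_ab < C_AB:                 # first-stage test failed to reject
            kept += 1
            reject_v += v_a >= C_V
            reject_u += u_a >= C_U
    return reject_v / kept, reject_u / kept


size_v, size_u = conditional_sizes()
print("conditional size with T = V_A^2:", round(size_v, 3))
print("conditional size with T = U_A^2:", round(size_u, 3))
```

The estimate for $V_A^2$ should land near the nominal level 0.05, while the estimate for $U_A^2$ should be noticeably smaller, in line with the discussion above. The exact figures depend on the seed and the number of simulations.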


Example 11.8.5 Gasoline Consumption. Suppose now that it is desired to test the null hypothesis that
the device has no effect on gasoline consumption for all of the car models tested,
against the alternative that the device does affect gasoline consumption. In other
words, suppose that it is desired to test the hypotheses (11.8.13). If we had not first
tested (11.8.11), then we would use Eq. (11.8.14) and the numbers in Table 11.29 to
compute $U_A^2 = 24(0.4813 + 0.1147)/[3(18.22)] = 0.2616$. The corresponding p-value
from the $F$ distribution with 3 and 24 degrees of freedom is 0.8523. Hence, the null
hypothesis would not be rejected at the usual levels of significance.

On the other hand, since we did test (11.8.11) first, we should instead use $V_A^2 =
0.4813/0.7590 = 0.6341$. We cannot compute the exact conditional p-value associated
with this observed value. However, using the method to be described in Example 12.3.4,
we can approximate the p-value to be about 0.43, given that we failed
to reject the null hypothesis in (11.8.11). We can also use the method of Example 12.3.4
to approximate the probabilities in (11.8.15) for $T = U_A^2$ and for $T = V_A^2$.
With $\alpha_0 = 0.05$, these approximations are, respectively, 0.019 and 0.048. Notice how
close the test based on $V_A^2$ comes to having the nominal size $\alpha_0 = 0.05$, while the
conditional size of the test based on $U_A^2$ is much smaller. ◄
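
The two test statistics in this example, and the algebraic relation linking $U_A^2$ to $V_A^2$ and $U_{AB}^2$, can be verified numerically. The following is a small sketch that uses the rounded sums of squares from Table 11.29; the variable names are illustrative.

```python
I, J, K = 2, 3, 5
s_a, s_int, s_resid = 0.4813, 0.1147, 18.22   # sums of squares from Table 11.29
df_resid = I * J * (K - 1)

# Eq. (11.8.14): statistic for testing alpha_i = 0 and gamma_ij = 0 jointly.
u_a = df_resid * (s_a + s_int) / ((I - 1) * J * s_resid)

# Eq. (11.8.16): statistic recommended after failing to reject additivity.
v_a = df_resid * s_a / ((I - 1) * s_resid)

# Eq. (11.8.12): statistic for the additivity test.
u_ab = df_resid * s_int / ((I - 1) * (J - 1) * s_resid)

# U_A^2 is a weighted average of V_A^2 and U_AB^2, which is why U_A^2 and U_AB^2
# are strongly dependent.
weighted_average = (v_a + (J - 1) * u_ab) / J

print(round(u_a, 4), round(v_a, 4), round(u_ab, 4))
print(abs(u_a - weighted_average) < 1e-12)
```

The small differences from the values printed in the example arise because the example divides by the rounded residual mean square 0.7590.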


Similarly, we may want to find out whether all the main effects of factor B, as
well as the interactions, are 0. In this case, we would test the following hypotheses:

$$\begin{aligned} H_0 &: \beta_j = 0 \text{ and } \gamma_{ij} = 0 \text{ for } i = 1, \ldots, I \text{ and } j = 1, \ldots, J, \\ H_1 &: \text{The hypothesis } H_0 \text{ is not true.} \end{aligned} \tag{11.8.17}$$

By analogy with Eq. (11.8.14), it follows that when $H_0$ is true, the following random
variable $U_B^2$ has the $F$ distribution with $I(J - 1)$ and $IJ(K - 1)$ degrees of freedom:

$$U_B^2 = \frac{IJ(K - 1)[S_B^2 + S_{\text{Int}}^2]}{I(J - 1)S_{\text{Resid}}^2}. \tag{11.8.18}$$

Again, if we do not first test (11.8.11), then the hypothesis $H_0$ should be rejected
at level $\alpha_0$ if $U_B^2 \ge F^{-1}_{I(J-1),\,IJ(K-1)}(1 - \alpha_0)$. If we test (11.8.11) first and fail to reject
the null hypothesis, then we should reject $H_0$ in (11.8.17) if $V_B^2$ is too large, where

$$V_B^2 = \frac{IJ(K - 1)S_B^2}{(J - 1)S_{\text{Resid}}^2}.$$

The conditional level of this test must also be computed by simulation.


In a given problem, if the null hypothesis in (11.8.11) is not rejected and the null 
hypotheses in both (11.8.13) and (11.8.17) are rejected, then we may be willing to 
proceed with further studies and experimentation by using a model in which it is 
assumed that the effects of factor A and factor B are approximately additive and the 
effects of both factors are important. 

The results obtained in Example 11.8.5 do not provide any indication that the 
device is effective. Nevertheless, it can be seen from Table 11.27 that for each of the 
three models, the average consumption of gasoline for the cars that were equipped 
with the device is smaller than the average consumption for the cars that were not so 
equipped. If we assume that the effects of the device and the model of automobile 
are additive, then regardless of the model of the automobile that is used, the M.L.E. 
of the reduction in gasoline consumption over the given route that is achieved by 
equipping an automobile with the device is $\hat{\alpha}_2 - \hat{\alpha}_1 = 0.2534$ liter.


The Two-Way Layout with Unequal Numbers of Observations 
in the Cells 


Consider again a two-way layout with $I$ rows and $J$ columns, but suppose now that
instead of there being $K$ observations in each cell, some cells have more observations
than others. For $i = 1, \ldots, I$ and $j = 1, \ldots, J$, we shall let $K_{ij}$ denote the
number of observations in the $(i, j)$ cell. Thus, the total number of observations
is $\sum_{i=1}^{I} \sum_{j=1}^{J} K_{ij}$. We shall assume that every cell contains at least one observation,
and we shall again let $Y_{ijk}$ denote the $k$th observation in the $(i, j)$ cell. For each value
of $i$ and $j$, the values of the subscript $k$ are $1, \ldots, K_{ij}$. We shall also assume, as before,
that all the observations $Y_{ijk}$ are independent; each has a normal distribution;
$\operatorname{Var}(Y_{ijk}) = \sigma^2$ for all values of $i$, $j$, and $k$; and $E(Y_{ijk}) = \mu + \alpha_i + \beta_j + \gamma_{ij}$, where
these parameters satisfy the conditions given in Eq. (11.8.5).

As usual, we shall let $\bar{Y}_{ij+}$ denote the average of the observations in the $(i, j)$
cell. It can then be shown that for $i = 1, \ldots, I$ and $j = 1, \ldots, J$, the M.L.E.'s, or
least-squares estimators, are as follows:

$$\hat{\mu} = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} \bar{Y}_{ij+}, \qquad \hat{\alpha}_i = \frac{1}{J} \sum_{j=1}^{J} \bar{Y}_{ij+} - \hat{\mu},$$
$$\hat{\beta}_j = \frac{1}{I} \sum_{i=1}^{I} \bar{Y}_{ij+} - \hat{\mu}, \qquad \hat{\gamma}_{ij} = \bar{Y}_{ij+} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j. \tag{11.8.19}$$

These estimators are intuitively reasonable and analogous to those given in Eqs.
(11.8.6) and (11.8.7).
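
As a small illustration of estimation with unequal cell sizes, the sketch below computes the estimators under the assumption that they are formed from unweighted averages of the cell means: $\hat{\mu}$ is the average of the $IJ$ cell averages, $\hat{\alpha}_i$ and $\hat{\beta}_j$ are the row and column averages of the cell averages minus $\hat{\mu}$, and $\hat{\gamma}_{ij}$ is the remainder. The function name and the unbalanced data set are hypothetical and serve only to show the mechanics.

```python
def unbalanced_estimates(data):
    """Estimate mu, alpha, beta, gamma from cells that may have different sizes.

    data[i][j] is the list of observations in cell (i, j); every cell must be nonempty.
    The estimates are built from unweighted averages of the cell means.
    """
    n_rows, n_cols = len(data), len(data[0])
    cell = [[sum(data[i][j]) / len(data[i][j]) for j in range(n_cols)]
            for i in range(n_rows)]
    mu = sum(sum(r) for r in cell) / (n_rows * n_cols)
    alpha = [sum(cell[i]) / n_cols - mu for i in range(n_rows)]
    beta = [sum(cell[i][j] for i in range(n_rows)) / n_rows - mu
            for j in range(n_cols)]
    gamma = [[cell[i][j] - mu - alpha[i] - beta[j] for j in range(n_cols)]
             for i in range(n_rows)]
    return mu, alpha, beta, gamma


# Hypothetical unbalanced layout with I = 2 rows and J = 3 columns.
example = [
    [[4.0, 6.0], [7.0], [9.0, 10.0, 11.0]],
    [[5.0], [8.0, 10.0], [12.0]],
]
mu_hat, alpha_hat, beta_hat, gamma_hat = unbalanced_estimates(example)
print(mu_hat, alpha_hat, beta_hat)
print(gamma_hat)
```

Note that each cell average receives the same weight regardless of how many observations it contains, so $\hat{\mu}$ generally differs from the average of all the raw observations.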

Suppose now, however, that it is desired to test hypotheses such as (11.8.11), 
(11.8.13), or (11.8.17). The construction of appropriate tests becomes somewhat 
more difficult because, in general, the sums of squares analogous to those given in 
Eq. (11.8.10) will not be independent when there are unequal numbers of observa- 
tions in the different cells. Hence, the test procedures presented earlier in this section 
cannot be directly copied here. It is necessary to develop other sums of squares that 
will be independent and will reflect the different types of variations in the data in 
which we are interested. We shall not consider the problem further in this book. 
This and other problems of ANOVA are described in the advanced book by Scheffé 
(1959). 


Summary 


We extended the analysis of the two-way layout to cases in which we have equal 
numbers of observations at all combinations of levels of the two factors. One addi- 
tional null hypothesis that we can test in this case is that the effects of the two factors 
are additive. (We assumed that the effects were additive when we had only one ob- 
servation per cell.) If we reject the null hypothesis of additivity, we typically do not 
test any further hypotheses. If we don’t reject this null hypothesis, we might still be 
interested in whether one of the two factors has any effect at all on the means of the 
observations. Even if we do not first test the null hypothesis that the effects of the 
two factors are additive, we might still be interested in whether one of the factors 
has an effect. The precise form of a test of one of these last hypotheses depends on 
whether we first test that the effects are additive. 


Exercises 
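1. Show that for every set of numbers $\theta_{ij}$ ($i = 1, \ldots, I$
and $j = 1, \ldots, J$), there exists a unique set of numbers $\mu$,
$\alpha_i$, $\beta_j$, and $\gamma_{ij}$ ($i = 1, \ldots, I$ and $j = 1, \ldots, J$) that satisfy
Eqs. (11.8.4) and (11.8.5).


2. Verify that Eq. (11.8.6) gives the M.L.E.'s of the parameters
of the two-way layout with replication.


3. Suppose that in a two-way layout, the values of $\theta_{ij}$ are
as given in each of the four matrices presented in parts
(a), (b), (c), and (d) of Exercise 2 of Sec. 11.7. For each
matrix, determine the values of $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_{ij}$ that
satisfy Eqs. (11.8.4) and (11.8.5).


4. Verify that if $\hat{\alpha}_i$, $\hat{\beta}_j$, and $\hat{\gamma}_{ij}$ are as given by Eqs. (11.8.6)
and (11.8.7), then $\sum_{i=1}^{I} \hat{\alpha}_i = 0$, $\sum_{j=1}^{J} \hat{\beta}_j = 0$, $\sum_{i=1}^{I} \hat{\gamma}_{ij} = 0$
for $j = 1, \ldots, J$, and $\sum_{j=1}^{J} \hat{\gamma}_{ij} = 0$ for $i = 1, \ldots, I$.


5. Verify that if $\hat{\mu}$, $\hat{\alpha}_i$, $\hat{\beta}_j$, and $\hat{\gamma}_{ij}$ are as given by Eqs.
(11.8.6) and (11.8.7), then $E(\hat{\mu}) = \mu$, $E(\hat{\alpha}_i) = \alpha_i$, $E(\hat{\beta}_j) =
\beta_j$, and $E(\hat{\gamma}_{ij}) = \gamma_{ij}$ for all values of $i$ and $j$. Hint: Each of
the random variables in this exercise is a linear function of
the $Y_{ijk}$'s, and hence the expectations are the same linear
combinations of the expectations of the $Y_{ijk}$'s.


6. Show that if $\hat{\mu}$, $\hat{\alpha}_i$, $\hat{\beta}_j$, and $\hat{\gamma}_{ij}$ are as given by Eqs.
(11.8.6) and (11.8.7), then the following results are true
for all values of $i$ and $j$:

$$\operatorname{Var}(\hat{\mu}) = \frac{1}{IJK}\sigma^2, \qquad \operatorname{Var}(\hat{\alpha}_i) = \frac{I - 1}{IJK}\sigma^2,$$
$$\operatorname{Var}(\hat{\beta}_j) = \frac{J - 1}{IJK}\sigma^2, \qquad \operatorname{Var}(\hat{\gamma}_{ij}) = \frac{(I - 1)(J - 1)}{IJK}\sigma^2.$$


7. Verify Eq. (11.8.9).


8. In a two-way layout with $K$ observations in each cell,
show that for all values of $i$, $i_1$, $i_2$, $j$, $j_1$, $j_2$, and $k$, the
following five random variables are uncorrelated with one
another:

$$Y_{ijk} - \bar{Y}_{ij+}, \quad \hat{\alpha}_{i_1}, \quad \hat{\beta}_{j_1}, \quad \hat{\gamma}_{i_2 j_2}, \quad \text{and } \hat{\mu}.$$


9. Verify that $U_{AB}^2$ also equals

$$\frac{IJK(K - 1)\left(\sum_{i=1}^{I} \sum_{j=1}^{J} \bar{Y}_{ij+}^2 - J \sum_{i=1}^{I} \bar{Y}_{i++}^2 - I \sum_{j=1}^{J} \bar{Y}_{+j+}^2 + IJ\bar{Y}_{+++}^2\right)}{(I - 1)(J - 1)\left(\sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} Y_{ijk}^2 - K \sum_{i=1}^{I} \sum_{j=1}^{J} \bar{Y}_{ij+}^2\right)}.$$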


10. Suppose that in an experimental study to determine 
the combined effects of receiving both a stimulant and a 
tranquilizer, three different types of stimulants and four 
different types of tranquilizers are administered to a group 
of rabbits. Each rabbit in the experiment receives one of 
the stimulants and then, 20 minutes later, receives one of 
the tranquilizers. After one hour, the response of the rab- 
bit is measured in appropriate units. In order that each 
possible pair of drugs may be administered to two dif- 
ferent rabbits, 24 rabbits are used in the experiment. The 
responses of these 24 rabbits are given in Table 11.30. Determine
the values of $\hat{\mu}$, $\hat{\alpha}_i$, $\hat{\beta}_j$, and $\hat{\gamma}_{ij}$ for $i = 1, 2, 3$ and
$j = 1, 2, 3, 4$, and determine also the value of $\hat{\sigma}^2$.


Table 11.30 Data for Exercises 10-15 


Tranquilizer 
Stimulant 1 2 3 4 

1 11.2 7.4 7A 9.6 
11.6 8.1 7.0 7.6 

2 12.7 10.3 8.8 11.3 
14.0 7.9 8.5 10.8 

3 10.1 5.5 5.0 6.5
9.6 6.9 7.3 5.7


11. For the conditions of Exercise 10 and the data in 
Table 11.30, test the hypothesis that every interaction be- 
tween a stimulant and a tranquilizer is 0. 


12. For the conditions of Exercise 10 and the data in 
Table 11.30, test the hypothesis that all three stimulants 
yield the same responses. 


13. For the conditions of Exercise 10 and the data in 
Table 11.30, test the hypothesis that all four tranquilizers 
yield the same responses. 


14. For the conditions of Exercise 10 and the data in 
Table 11.30, test the following hypotheses: 

$H_0$: $\mu = 8$,

$H_1$: $\mu \ne 8$.
15. For the conditions of Exercise 10 and the data in 
Table 11.30, test the following hypotheses: 

$H_0$: $\alpha_2 \le 1$,

$H_1$: $\alpha_2 > 1$.
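16. In a two-way layout with unequal numbers of observations
in the cells, show that if $\hat{\mu}$, $\hat{\alpha}_i$, $\hat{\beta}_j$, and $\hat{\gamma}_{ij}$ are as given
by Eq. (11.8.19), then $E(\hat{\mu}) = \mu$, $E(\hat{\alpha}_i) = \alpha_i$, $E(\hat{\beta}_j) = \beta_j$,
and $E(\hat{\gamma}_{ij}) = \gamma_{ij}$ for all values of i and j.

17. Verify that if $\hat{\mu}$, $\hat{\alpha}_i$, $\hat{\beta}_j$, and $\hat{\gamma}_{ij}$ are as given by Eq.
(11.8.19), then $\sum_{i=1}^{I} \hat{\alpha}_i = 0$, $\sum_{j=1}^{J} \hat{\beta}_j = 0$, $\sum_{i=1}^{I} \hat{\gamma}_{ij} = 0$ for
$j = 1, \ldots, J$, and $\sum_{j=1}^{J} \hat{\gamma}_{ij} = 0$ for $i = 1, \ldots, I$.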


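18. Show that if $\hat{\mu}$ and $\hat{\alpha}_i$ are as given by Eq. (11.8.19),
then for $i = 1, \ldots, I$,

$\operatorname{Cov}(\hat{\mu}, \hat{\alpha}_i) = \frac{\sigma^2}{I J^2} \left[ \sum_{j=1}^{J} \frac{1}{K_{ij}} - \frac{1}{I} \sum_{r=1}^{I} \sum_{j=1}^{J} \frac{1}{K_{rj}} \right]$.

Also, show that this covariance is 0 if all $K_{ij}$ are the same.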


19. Recall the two-stage testing procedure described in 
this section: First test (11.8.11) at level $\alpha_0$. If you reject
the null hypothesis, stop. If you don't reject the null hypothesis,
then test (11.8.13). Let $\beta_0$ be the conditional size
of the second test given that the first test doesn’t reject 
the null hypothesis. Assume that both null hypotheses are 
true. Find the probability that this two-stage procedure 
rejects at least one of the null hypotheses. 


20. The study referred to in Exercise 10 in Sec. 11.6 actu- 
ally included another factor in addition to size of vehicle. 
There were two different filters, a standard filter and a 
newly developed filter. Table 11.19 has data only from the 
standard filter. The corresponding data for the new filter 
are in Table 11.31. 


Table 11.31 Data for Exercise 20. This table has data for 
the vehicles with the new filter. 


Vehicle size Noise values 


Small 820, 820, 820, 825, 825, 825 
Medium 820, 820, 825, 815, 825, 825 
Large 775, 775, 775, 770, 760, 765


a. Construct the ANOVA table for the two-way lay- 
out that includes the data from both Tables 11.19 
and 11.31. 


b. Compute the p-value for testing the null hypothesis 
that there is no interaction. 
c. Compute the p-value for testing the null hypothesis
that the vehicles of all three sizes produce the same 
level of noise on average. 

d. Compute the p-value for testing the null hypothesis 
that both filters result in the same level of noise on 
average. 


11.9 Supplementary Exercises 


1. Consider the data in Example 11.2.2 on page 703. Sup- 
pose that we fit a simple linear regression of the natural 
logarithm of pressure on boiling point. 


a. Find a 90 percent confidence interval (bounded interval)
for the slope $\beta_1$.


b. Test the null hypothesis $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$
at level $\alpha_0 = 0.1$.


c. Find a 90 percent prediction interval for pressure 
(not logarithm of pressure) when the boiling point 
is 204.6. 


2. Suppose that $(X_i, Y_i)$, $i = 1, \ldots, n$, form a random sample
of size n from the bivariate normal distribution with
means $\mu_1$ and $\mu_2$, variances $\sigma_1^2$ and $\sigma_2^2$, and correlation $\rho$,
and let $\hat{\mu}_i$, $\hat{\sigma}_i^2$, and $\hat{\rho}$ denote their M.L.E.'s. Also, let $\hat{\beta}_2$ denote
the M.L.E. of $\beta_2$ in the regression of Y on X. Show
that

$\hat{\beta}_2 = \hat{\rho}\,\hat{\sigma}_2 / \hat{\sigma}_1$.

Hint: See Exercise 24 of Sec. 7.6.


3. Suppose that $(X_i, Y_i)$, $i = 1, \ldots, n$, form a random sample
of size n from the bivariate normal distribution with
means $\mu_1$ and $\mu_2$, variances $\sigma_1^2$ and $\sigma_2^2$, and correlation
$\rho$. Determine the mean and the variance of the following
statistic T, given the observed values $X_1 = x_1, \ldots, X_n = x_n$:

$T = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n) Y_i}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}$.


4. Let $\theta_1$, $\theta_2$, and $\theta_3$ denote the unknown angles of a triangle,
measured in degrees ($\theta_i > 0$ for $i = 1, 2, 3$, and
$\theta_1 + \theta_2 + \theta_3 = 180$). Suppose that each angle is measured by
an instrument that is subject to error, and the measured
values of $\theta_1$, $\theta_2$, and $\theta_3$ are found to be $y_1 = 83$, $y_2 = 47$,
and $y_3 = 56$, respectively. Determine the least-squares
estimates of $\theta_1$, $\theta_2$, and $\theta_3$.


5. Suppose that a straight line is to be fitted to n points
$(x_1, y_1), \ldots, (x_n, y_n)$ such that $x_2 = x_3 = \cdots = x_n$ but $x_1 \ne x_2$.
Show that the least-squares line will pass through the
point $(x_1, y_1)$.


6. Suppose that a least-squares line is fitted to the n points
$(x_1, y_1), \ldots, (x_n, y_n)$ in the usual way by minimizing the
sum of squares of the vertical deviations of the points 
from the line, and another least-squares line is fitted by 
minimizing the sum of squares of the horizontal deviations 
of the points from the line. Under what conditions will 
these two lines coincide? 


7. Suppose that a straight line $y = \beta_1 + \beta_2 x$ is to be fitted
to the n points $(x_1, y_1), \ldots, (x_n, y_n)$ in such a way that
the sum of the squared perpendicular (or orthogonal) distances
from the points to the line is a minimum. Determine
the optimal values of $\beta_1$ and $\beta_2$.


8. Suppose that twin sisters are each to take a certain
mathematics examination. They know that the scores they
will obtain on the examination have the same mean $\mu$,
the same variance $\sigma^2$, and positive correlation $\rho$. Assuming
that their scores have a bivariate normal distribution,
show that after each twin learns her own score, the expected
value of her sister's score is between her own score
and $\mu$.


9. Suppose that a sample of n observations is formed
from k subsamples containing $n_1, \ldots, n_k$ observations
($n_1 + \cdots + n_k = n$). Let $x_{ij}$ ($j = 1, \ldots, n_i$) denote the
observations in the ith subsample, and let $\bar{x}_{i+}$ and $v_i^2$ denote
the sample mean and the sample variance of that subsample:

$\bar{x}_{i+} = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}$, $v_i^2 = \frac{1}{n_i} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i+})^2$.

Finally, let $\bar{x}_{++}$ and $v^2$ denote the sample mean and the
sample variance of the entire sample of n observations:

$\bar{x}_{++} = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}$, $v^2 = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{++})^2$.

Determine an expression for $v^2$ in terms of $\bar{x}_{++}$, $\bar{x}_{i+}$, and
$v_i^2$ ($i = 1, \ldots, k$).


10. Consider the linear regression model 


$Y_i = \beta_1 w_i + \beta_2 x_i + \varepsilon_i$ for $i = 1, \ldots, n$,

where $(w_1, x_1), \ldots, (w_n, x_n)$ are given pairs of constants
and $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. random variables, each of which
has the normal distribution with mean 0 and variance $\sigma^2$.
Determine explicitly the M.L.E.'s of $\beta_1$ and $\beta_2$.


11. Determine an unbiased estimator of $\sigma^2$ in a two-way
layout with K observations in each cell ($K \ge 2$).


12. In a two-way layout with one observation in each cell,
construct a test of the null hypothesis that all the effects 
of both factor A and factor B are 0. 


13. In a two-way layout with K observations in each cell
($K \ge 2$), construct a test of the null hypothesis that all the
main effects for factor A and factor B, and also all the 
interactions, are 0. 


14. Suppose that each of two different varieties of corn 
is treated with two different types of fertilizer in order 
to compare the yields, and that K independent replica- 
tions are obtained for each of the four combinations. Let 
$X_{ijk}$ denote the yield on the kth replication of the combination
of variety i with fertilizer j ($i = 1, 2$; $j = 1, 2$;
$k = 1, \ldots, K$). Assume that all the observations are independent
and normally distributed, each distribution has
the same unknown variance, and $E(X_{ijk}) = \mu_{ij}$ for
$k = 1, \ldots, K$. Explain in words what the following hypotheses
mean, and describe how to carry out a test of them:

$H_0$: $\mu_{11} - \mu_{12} = \mu_{21} - \mu_{22}$,
$H_1$: The hypothesis $H_0$ is not true.


15. Suppose that $W_1$, $W_2$, and $W_3$ are independent random
variables, each of which has a normal distribution with the
following means and variances:

$E(W_1) = \theta_1 + \theta_2$, $\operatorname{Var}(W_1) = \sigma^2$,
$E(W_2) = \theta_1 + \theta_2 - 5$, $\operatorname{Var}(W_2) = \sigma^2$,
$E(W_3) = 2\theta_1 - 2\theta_2$, $\operatorname{Var}(W_3) = 4\sigma^2$.

Determine the M.L.E.'s of $\theta_1$, $\theta_2$, and $\sigma^2$, and determine
also the joint distribution of these estimators.


16. Suppose that it is desired to fit a curve of the form
$y = \alpha x^{\beta}$ to a given set of n points $(x_i, y_i)$ with $x_i > 0$ and
$y_i > 0$ for $i = 1, \ldots, n$. Explain how this curve can be fitted
either by direct application of the method of least squares
or by first transforming the problem into one of fitting a
straight line to the n points $(\log x_i, \log y_i)$ and then applying
the method of least squares. Discuss the conditions
under which each of these methods is appropriate.


17. Consider a problem of simple linear regression, and
let $e_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$ denote the residual of the observation
$Y_i$ ($i = 1, \ldots, n$), as defined in Definition 11.3.2.
Evaluate $\operatorname{Var}(e_i)$ for given values of $x_1, \ldots, x_n$, and show
that it is a decreasing function of the distance between $x_i$
and $\bar{x}$.


18. Consider a general linear model with $n \times p$ design matrix
$Z$, and let $W = Y - Z\hat{\beta}$ denote the vector of residuals.
(In other words, the ith coordinate of $W$ is $Y_i - \hat{Y}_i$, where
$\hat{Y}_i = z_{i0}\hat{\beta}_0 + \cdots + z_{i,p-1}\hat{\beta}_{p-1}$.)

a. Show that $W = DY$, where

$D = I - Z(Z'Z)^{-1}Z'$.

b. Show that the matrix $D$ is idempotent; that is, $DD = D$.

c. Show that $\operatorname{Cov}(W) = \sigma^2 D$.

19. Consider a two-way layout in which the effects of
the factors are additive so that Eq. (11.7.1) is satisfied,
and let $v_1, \ldots, v_I$ and $w_1, \ldots, w_J$ be arbitrary given
positive numbers. Show that there exist unique numbers
$\mu$, $\alpha_1, \ldots, \alpha_I$, and $\beta_1, \ldots, \beta_J$ such that

$\sum_{i=1}^{I} v_i \alpha_i = \sum_{j=1}^{J} w_j \beta_j = 0$

and

$E(Y_{ij}) = \mu + \alpha_i + \beta_j$ for $i = 1, \ldots, I$ and $j = 1, \ldots, J$.

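20. Consider a two-way layout in which the effects of
the factors are additive, as in Exercise 19; suppose also
that there are $K_{ij}$ observations per cell, where $K_{ij} > 0$
for $i = 1, \ldots, I$ and $j = 1, \ldots, J$. Let $v_i = K_{i+}$ for
$i = 1, \ldots, I$, and let $w_j = K_{+j}$ for $j = 1, \ldots, J$. Assume that
$E(Y_{ijk}) = \mu + \alpha_i + \beta_j$ for $k = 1, \ldots, K_{ij}$, $i = 1, \ldots, I$, and
$j = 1, \ldots, J$, where $\sum_{i=1}^{I} v_i \alpha_i = \sum_{j=1}^{J} w_j \beta_j = 0$, as in Exercise 19.
Verify that the least-squares estimators of $\mu$, $\alpha_i$,
and $\beta_j$ are as follows:

$\hat{\mu} = \bar{Y}_{+++}$,

$\hat{\alpha}_i = \frac{1}{K_{i+}} Y_{i++} - \bar{Y}_{+++}$ for $i = 1, \ldots, I$,

$\hat{\beta}_j = \frac{1}{K_{+j}} Y_{+j+} - \bar{Y}_{+++}$ for $j = 1, \ldots, J$.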


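21. Consider again the conditions of Exercises 19 and
20, and let the estimators $\hat{\mu}$, $\hat{\alpha}_i$, and $\hat{\beta}_j$ be as given in
Exercise 20. Show that $\operatorname{Cov}(\hat{\mu}, \hat{\alpha}_i) = \operatorname{Cov}(\hat{\mu}, \hat{\beta}_j) = 0$.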


22. Consider again the conditions of Exercises 19 and 20,
and suppose that the numbers $K_{ij}$ have the following
proportionality property:

$K_{ij} = \frac{K_{i+} K_{+j}}{n}$ for $i = 1, \ldots, I$ and $j = 1, \ldots, J$.

Show that $\operatorname{Cov}(\hat{\alpha}_i, \hat{\beta}_j) = 0$, where the estimators $\hat{\alpha}_i$ and $\hat{\beta}_j$
are as given in Exercise 20.


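23. In a three-way layout with one observation in each
cell, the observations $Y_{ijk}$ ($i = 1, \ldots, I$; $j = 1, \ldots, J$;
$k = 1, \ldots, K$) are assumed to be independent and normally
distributed, with a common variance $\sigma^2$. Suppose
that $E(Y_{ijk}) = \theta_{ijk}$. Show that for every set of numbers
$\theta_{ijk}$, there exists a unique set of numbers $\mu$, $\alpha_i^{A}$, $\alpha_j^{B}$,
$\alpha_k^{C}$, $\beta_{ij}^{AB}$, $\beta_{ik}^{AC}$, $\beta_{jk}^{BC}$, and $\gamma_{ijk}$ ($i = 1, \ldots, I$; $j = 1, \ldots, J$;
$k = 1, \ldots, K$) such that

$\alpha_{+}^{A} = \alpha_{+}^{B} = \alpha_{+}^{C} = 0$,

$\beta_{i+}^{AB} = \beta_{+j}^{AB} = \beta_{i+}^{AC} = \beta_{+k}^{AC} = \beta_{j+}^{BC} = \beta_{+k}^{BC} = 0$,

$\gamma_{ij+} = \gamma_{i+k} = \gamma_{+jk} = 0$,

and

$\theta_{ijk} = \mu + \alpha_i^{A} + \alpha_j^{B} + \alpha_k^{C} + \beta_{ij}^{AB} + \beta_{ik}^{AC} + \beta_{jk}^{BC} + \gamma_{ijk}$

for all values of i, j, and k.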


24. The 2000 U.S. presidential election was very close,
especially in the state of Florida. Indeed, newscasters were
unable to predict a winner the day after the election be- 
cause they could not decide who was going to win Florida’s 
25 electoral votes. Many voters in Palm Beach County 
complained that they were confused by the design of the 
ballot and might have voted for Patrick Buchanan instead 
of Al Gore, as they had intended. Table 11.32 displays the 
official ballot counts (after all official recounts) for each 
county. There was no reason, prior to the election, to think 
that Patrick Buchanan would gather a significantly higher 
percent of the vote in Palm Beach County than in any 
other Florida county. 


a. Draw a plot of the vote count for Patrick Buchanan
against the total vote count with one point for each
county. Identify the point corresponding to Palm
Beach County.


b. Given the complaints about the Palm Beach ballot,
it might be sensible to treat the data point for Palm
Beach County as being different from the others. Fit 
a simple linear regression model with Y being the 
vote for Buchanan and X being the total vote in each 
county, excluding Palm Beach County. 


c. Plot the residuals from the regression in part (b)
against the X variable. Do you notice any pattern in
the plot? 


d. The variance of the vote for each candidate in a
county ought to depend on the total vote in the
county. The larger the total vote, the more variance 
you expect in the vote for each candidate. For this 
reason, the assumptions of the simple linear regres- 
sion model would not hold. As an alternative, fit a 
simple linear regression with Y being the logarithm 
of the vote for Buchanan and X being the logarithm 
of the total vote in each county. Continue to exclude 
Palm Beach County. 


e. Plot the residuals from the regression in part (d)
against the X variable. Do you notice any pattern in
the plot? 


f. Using the model fit in part (d), form a 99 percent
prediction interval for the Buchanan vote (not the logarithm
of the Buchanan vote) in Palm Beach County.


g. Let B be the upper endpoint of the interval you
found in part (f). Just suppose that the actual number
of people in Palm Beach County who voted for
Buchanan had actually been B instead of 3411. Also 
suppose that the remaining 3411 — B voters had ac- 
tually voted for Gore. Would this have changed the 
winner of the total popular vote for the State of 
Florida? 
Table 11.32 County votes for Bush, Gore, and Buchanan in the 2000 presidential election for the state of Florida. The
total column includes all 11 candidates that were on the ballot. The absentee row includes overseas absentee 
ballots that were not included in individual county totals. These data came from the official state of Florida 
election Web site, which has since been moved or deleted. 


County Bush Gore Buchanan Total County Bush Gore Buchanan Total 

Alachua 34,124 47,365 263 85,729 Lee 106,141 73,560 305 184,377 
Baker 5610 2392 73 8154 Leon 39,062 61,427 282 103,124 
Bay 38,637 18,850 248 58,805 Levy 6858 5398 67 12,724 
Bradford 5414 3075 65 8673 Liberty 1317 1017 39 2410 
Brevard 115,185 97,318 570 218,395 Madison 3038 3014 29 6162 
Broward 177,902 387,703 795 575,143 Manatee 57,952 49,177 271 110,221 
Calhoun 2873 2155 90 5174 Marion 55,141 44,665 563 102,956 
Charlotte 35,426 29,645 182 66,896 Martin 33,970 26,620 112 62,013 
Citrus 29,767 25,525 270 57,204 Miami-Dade 289,533 328,808 560 625,449 
Clay 41,736 14,632 186 57,353 Monroe 16,059 16,483 47 33,887 
Collier 60,450 29,921 122 92,162 Nassau 16,404 6952 90 23,780 
Columbia 10,964 7047 89 18,508 Okaloosa 52,093 16,948 267 70,680 
Desoto 4256 3320 36 7811 Okeechobee 5057 4588 43 9853 
Dixie 2697 1826 29 4666 Orange 134,517 140,220 446 280,125
Duval 152,098 107,864 652 264,636 Osceola 26,212 28,181 145 55,658 
Escambia 73,017 40,943 502 116,648 Palm Beach 152,951 269,732 3411 433,186
Flagler 12,613 13,897 83 27,111 Pasco 68,582 69,564 570 142,731 
Franklin 2454 2046 33 4644 Pinellas 184,825 200,630 1013 398,472 
Gadsden 4767 9735 38 14,727 Polk 90,295 75,200 533 168,607 
Gilchrist 3300 1910 29 5395 Putnam 13,447 12,102 148 26,222 
Glades 1841 1442 9 3365 Santa Rosa 36,274 12,802 311 50,319 
Gulf 3550 2397 71 6144 Sarasota 83,100 72,853 305 160,942 
Hamilton 2146 1722 23 3964 Seminole 75,677 59,174 194 137,634 
Hardee 3765 2339 30 6233 St. Johns 39,546 19,502 229 60,746 
Hendry 4747 3240 22 8139 St. Lucie 34,705 41,559 124 77,989 
Hernando 30,646 32,644 242 65,219 Sumter 12,127 9637 114 22,261 
Highlands 20,206 14,167 127 35,149 Suwannee 8006 4075 108 12,457 
Hillsborough 180,760 169,557 847 360,295 Taylor 4056 2649 27 6808 
Holmes 5011 2177 76 7395 Union 2332 1407 37 3826 
Indian River 28,635 19,768 105 49,622 Volusia 82,357 97,304 498 183,653 
Jackson 9138 6868 102 16,300 Wakulla 4512 3838 46 8587 
Jefferson 2478 3041 29 5643 Walton 12,182 5642 120 18,318 
Lafayette 1670 789 10 2505 Washington 4994 2798 88 8025 


Lake 50,010 36,571 289 88,611 Absentee 1575 836 5 2490 
12.1 What Is Simulation?


Simulation is a way to use high-speed computer power to substitute for analytical 
calculation. The law of large numbers tells us that if we observe a large sample of 
i.i.d. random variables with finite mean, then the average of these random variables 
should be close to their mean. If we can get a computer to produce such a large 
i.i.d. sample, then we can average the random variables instead of trying (and
possibly failing) to calculate their mean analytically. For a specific problem, one 
needs to figure out what types of random variables one needs, how to make a 
computer produce them, and how many one needs in order to have any confidence 
in the numerical result. Each of these issues will be addressed to some extent in this 
chapter. 


Proof of Concept 


We begin with some simple examples of simulation to answer questions that we can 
already answer analytically just to show that simulation does what it advertises. Also, 
these simple examples will raise some of the issues to which one must attend when 
trying to answer more difficult questions using simulation. 


Example 12.1.1. The Mean of a Distribution. The mean of the uniform distribution on the interval
[0, 1] is known to be 1/2. If we had available a large number of i.i.d. uniform random
variables on the interval [0, 1], say, $X_1, \ldots, X_n$, the law of large numbers says that
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ should be close to the mean 1/2. Table 12.1 gives the averages of
several different simulated samples of size n from the uniform distribution on [0, 1]
for several different values of n. It is not difficult to see that the averages are close
to 0.5 in most cases, but there is quite a bit of variation, especially for n = 100. There
seems to be less variation for n = 1000, and even less for the two largest values of n. ◄
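An experiment of the kind summarized in Table 12.1 can be sketched in a few lines of Python. This is only an illustration, not the program used to produce the table: the function name `uniform_sample_mean` and the seed are our own choices, and the numbers printed will differ from those in the table.

```python
import random


def uniform_sample_mean(n, rng):
    """Average of n simulated i.i.d. uniform [0, 1] random variables."""
    return sum(rng.random() for _ in range(n)) / n


# A fixed seed (an arbitrary choice) makes the run reproducible.
rng = random.Random(12345)
for n in (100, 1000, 10_000, 100_000):
    # Five replications for each sample size, as in Table 12.1.
    averages = [uniform_sample_mean(n, rng) for _ in range(5)]
    print(n, [round(a, 3) for a in averages])
```

Running the loop with different seeds shows the same pattern as the table: the averages for n = 100 scatter noticeably around 0.5, while those for the larger values of n cluster much more tightly.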


Example 12.1.2. A Normal Probability. The probability that a standard normal random variable is at
least 1.0 is known to be 0.1587. If we had available a large number of i.i.d. standard
normal random variables, say, $X_1, \ldots, X_n$, we could create Bernoulli random
variables $Y_1, \ldots, Y_n$ defined by $Y_i = 1$ if $X_i \ge 1.0$ and $Y_i = 0$ if not. Then the law of
large numbers says that $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ should be close to the mean of $Y_i$, namely,
$\Pr(X_i \ge 1.0) = 0.1587$. Notice that $\bar{Y}$ is merely the proportion of the simulated $X_i$
values that are at least 1.0. Table 12.2 gives the proportions of $X_i \ge 1.0$ for several
different simulated samples of size n from the standard normal distribution for several
different values of n. It is not difficult to see that the proportions are somewhat
close to 0.1587, but there is still quite a bit of variability from one simulation to the
next. ◄


Table 12.1 Results of several different simulations in Example 12.1.1


n Replications of the simulation


100 0.485 0.481 0.484 0.569 0.441
1000 0.497 0.506 0.480 0.498 0.499
10,000 0.502 0.501 0.499 0.498 0.498
100,000 0.502 0.499 0.500 0.498 0.499


Table 12.2 Results of several different simulations in Example 12.1.2


n Replications of the simulation


100 0.16 0.18 0.17 0.22 0.14
1000 0.135 0.171 0.174 0.159 0.171
10,000 0.160 0.163 0.158 0.152 0.156
100,000 0.158 0.158 0.158 0.159 0.161
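The proportion described in Example 12.1.2 can be computed along the following lines. The sketch assumes that Python's standard generator `random.gauss` is an adequate source of standard normal random variables; the helper name `proportion_at_least` and the seed are hypothetical choices of ours.

```python
import random


def proportion_at_least(n, threshold, rng):
    """Proportion of n simulated standard normal values that are >= threshold."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) >= threshold)
    return hits / n


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
for n in (100, 1000, 10_000, 100_000):
    props = [proportion_at_least(n, 1.0, rng) for _ in range(5)]
    print(n, [round(p, 3) for p in props])
```

Each printed proportion is the average of n Bernoulli random variables, so it should settle near 0.1587 as n grows, with the same kind of replication-to-replication variability seen in Table 12.2.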


As we mentioned earlier, there is no need for simulation in the above examples. 
These were just to illustrate that simulation can do what it claims. However, one 
needs to be aware that, no matter how large a sample is simulated, the average of 
an i.i.d. sample of random variables is not necessarily going to be equal to its mean.
One needs to be able to take the variability into account. The variability is apparent 
in Tables 12.1 and 12.2. We shall address the issue of the variability of simulations 
later in the chapter. 

The reader might also be wondering how we obtained all of the uniform and 
normal random variables used in the examples. Virtually every commercial statistical 
software package has a simulator for i.i.d. uniform random variables on the interval 
[0, 1]. Later in the chapter, we shall discuss ways to make use of these for simulating 
other distributions. One such method was already discussed in Chapter 3 on page 170. 


Examples in which Simulation Might Help 


Next, we present some examples where the basic questions are relatively simple to 
describe, but analytic solution would be tedious at best. 


Example 12.1.3. Waiting for a Break. Two servers, A and B, in a fast-food restaurant start serving
customers at the same time. They agree to meet for a break after each of them has
served 10 customers. Presumably, one of them will finish before the other and have
to wait. How long, on average, will one of the servers have to wait for the other?

[Figure 12.1: Histogram of sample of 10,000 simulated waiting times Z in Example 12.1.3. Horizontal axis: waiting time Z.]

Suppose that we model all service times, regardless of the server, as i.i.d. random
variables having the exponential distribution with parameter 0.3 customers per
minute. Then the time it takes one server to serve 10 customers has the gamma
distribution with parameters 10 and 0.3. Let X be the time it takes A to serve 10 
customers, and let Y be the time it takes B to serve 10 customers. We are asked to 
compute the mean of |X — Y|. The most straightforward way of finding this mean 
analytically would require a two-dimensional integral over the union of two non- 
rectangular regions. 

On the other hand, suppose that a computer can provide us with as many 
independent gamma random variables as we desire. We can then obtain a pair (X, Y) 
and compute Z = |X — Y|. We then repeat this process independently as many times 
as we want and average all the observed Z values. The average should be close to the 
mean of Z. 

Without going into details, we actually simulated 10,000 pairs of (X, Y) values and 
averaged the resulting Z values to get 11.71 minutes. A histogram of the simulated 
Z values is in Fig. 12.1. As a confidence builder, we simulated another 10,000 pairs 
and got an average of 11.77. ◄
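A minimal sketch of the waiting-time simulation follows. It is not the program used for the figures quoted above; it assumes that `random.gammavariate(alpha, beta)` from the Python standard library, whose second argument is a scale (the reciprocal of the rate 0.3), is an acceptable source of gamma random variables. The function names and the seed are our own.

```python
import random


def simulate_wait(rng, customers=10, rate=0.3):
    """One simulated value of Z = |X - Y|, the wait in minutes."""
    # gammavariate takes a shape and a SCALE, so the scale is 1 / rate.
    x = rng.gammavariate(customers, 1.0 / rate)  # time for A to serve 10 customers
    y = rng.gammavariate(customers, 1.0 / rate)  # time for B to serve 10 customers
    return abs(x - y)


def mean_wait(v, rng):
    """Average of v independent simulated waiting times."""
    return sum(simulate_wait(rng) for _ in range(v)) / v


rng = random.Random(7)  # arbitrary seed
print(round(mean_wait(10_000, rng), 2))
print(round(mean_wait(10_000, rng), 2))
```

The two printed averages play the role of the two runs of 10,000 pairs described in the example: they should be close to each other, but not identical.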


Example 12.1.4. Long Run of Heads. You overheard someone say that they just got 12 consecutive
heads while flipping a seemingly fair coin. The probability of getting 12 heads in a
row in 12 independent flips of a fair coin is $(0.5)^{12}$, a very small number. If the person
had obtained 12 tails in a row, you probably would have heard about that instead.
Even so, the probability of 12 of the same side is only $(0.5)^{11}$. But then you learn that
the person actually flipped the coin 100 times, and the 12 heads in a row appeared 
somewhere during those 100 flips. Presumably, you are less surprised to learn that 
the person got a run of 12 of the same side somewhere in a sequence of 100 flips. But 
how much larger is the probability of a run of 12 when one flips 100 times? 

Suppose that we can make a computer flip a fair coin as many times as we wish. 
We could ask it to flip 100 times and then check whether there was a run of length 12 
or more. Let X = 1 if there is a run of 12 or more, and let X = 0 if not. We then repeat
this process independently as many times as we want and average all the observed 
X values. The average should be close to the mean of X, which is the probability of 
obtaining a run of 12 or more in 100 flips. 

Without going into details, Fig. 12.2 shows a histogram of the longest runs in
10,000 repetitions of the experiment described above. For each of the 10,000 repetitions,
we calculated X as above and found the average to be 0.0214, still a small number,
but not nearly so small as $(0.5)^{11}$. We also repeated the calculation of the average
with another 10,000 sets of 100 flips and got 0.0229. ◄

[Figure 12.2: Histogram of sample of 10,000 longest runs (head or tail). Each run was observed in 100 flips of a fair coin.]
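The coin-flipping experiment can be sketched as follows. The helper names `longest_run` and `prob_long_run` are hypothetical, the seed is arbitrary, and a fair coin is simulated by comparing a uniform random number with 0.5; this is one reasonable way to carry out the computation, not necessarily the one used for the figures above.

```python
import random


def longest_run(flips):
    """Length of the longest run of identical consecutive outcomes."""
    best = current = 1
    for prev, cur in zip(flips, flips[1:]):
        current = current + 1 if cur == prev else 1
        best = max(best, current)
    return best


def prob_long_run(v, rng, n_flips=100, run_length=12):
    """Proportion of v simulated sequences whose longest run is >= run_length."""
    hits = 0
    for _ in range(v):
        flips = [rng.random() < 0.5 for _ in range(n_flips)]  # True = heads
        if longest_run(flips) >= run_length:
            hits += 1  # this repetition has X = 1
    return hits / v


rng = random.Random(100)  # arbitrary seed
print(prob_long_run(10_000, rng))
```

Keeping the values returned by `longest_run`, rather than only the indicator X, would also allow one to approximate the expected length of the longest run from the same simulation.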


A number of details were left out of exactly how the simulations were performed 
in the above examples. However, it is clear what random variables we wanted to 
observe, namely, Z in Example 12.1.3 and X in Example 12.1.4. Many simulations can 
address more than one question. For instance, in Example 12.1.4, we recorded the 
10,000 lengths of the longest runs even though our primary interest was in whether or 
not the longest run was 12 or more. We could also have tried to calculate the expected 
length of the longest run or other properties of the distribution of the longest run. In 
Example 12.1.3, we could have tried to approximate the probability that one person 
has to wait at least 15 minutes, etc. 

Figures 12.1 and 12.2 illustrate that there is variation among the 10,000 repeti- 
tions of a simulated experiment. Furthermore, each of the examples showed that a 
complete rerunning of all 10,000 simulated experiments can be expected to produce a 
different answer to each of our questions. How much different the answers should be 
is a matter that we shall address in Sec. 12.2, where we use the Chebyshev inequality 
and the central limit theorem to help us decide how many times to repeat the basic 
experiment. Exactly how one simulates 100 flips of a coin or a pair of gamma random 
variables will be taken up in Sec. 12.3. 


Summary 


Suppose that we want to know the mean of some function g of a random variable 
or random vector W. For instance, in Example 12.1.3 we can let W = (X, Y) and 
$g(W) = |X - Y|$. If a computer can supply a large number of i.i.d. random variables
(or random vectors) with the distribution of W, one can use the average of the
simulated values of g(W) to approximate the mean of g(W). One must be careful
to take the variability in g(W) into account when deciding how much confidence to 
place in the approximation. 
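The recipe in this summary can be written once for an arbitrary function g and an arbitrary source of simulated values of W. The sketch below is illustrative only: `simulate_mean`, `draw_pair`, and `abs_diff` are names of our own, and reporting the sample standard deviation divided by the square root of the number of simulations is just one common way to gauge the variability, offered here as an assumption rather than as the method developed later in the chapter.

```python
import math
import random


def simulate_mean(g, draw_w, v, rng):
    """Average of g(W) over v simulated values of W, with a rough standard error."""
    values = [g(draw_w(rng)) for _ in range(v)]
    z = sum(values) / v
    s2 = sum((y - z) ** 2 for y in values) / (v - 1)  # sample variance of g(W)
    return z, math.sqrt(s2 / v)


def draw_pair(rng):
    """W = (X, Y) as in Example 12.1.3: two gamma(10, rate 0.3) service totals."""
    return rng.gammavariate(10, 1.0 / 0.3), rng.gammavariate(10, 1.0 / 0.3)


def abs_diff(w):
    """g(W) = |X - Y|."""
    return abs(w[0] - w[1])


estimate, std_err = simulate_mean(abs_diff, draw_pair, 10_000, random.Random(3))
print(round(estimate, 2), round(std_err, 3))
```

A small standard error relative to the estimate suggests that rerunning the simulation would change the answer only slightly.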


For each of the exercises in this section, you could also perform the simulations 
described with various numbers of replications if you have appropriate computer 
software available. Most of the distributions involved are commonly available in 
computer software. If a distribution is not available, the simulations can wait until 
methods for simulating specific distributions are introduced in Sec. 12.3. 


1. Assume that one can simulate as many i.i.d. exponential
random variables with parameter 1 as one wishes. Explain
plain how one could use simulation to approximate the 
mean of the exponential distribution with parameter 1. 


2. If X has the p.d.f. $1/x^2$ for $x > 1$, the mean of X is infinite.
What would you expect to happen if you simulated
a large number of random variables with this p.d.f. and 
computed their average? 


3. If X has the Cauchy distribution, the mean of X does 
not exist. What would you expect to happen if you sim- 
ulated a large number of Cauchy random variables and 
computed their average? 


4. Suppose that one can simulate as many i.i.d. Bernoulli
random variables with parameter p as one wishes. Explain
how to use these to approximate the mean of the
geometric distribution with parameter p. 


5. Two servers A and B in a fast-food restaurant each start
their first customers at the same time. After finishing her 
second customer, A notices that B has not yet finished 
his first customer. A then chides B for being slow, and
B responds that A just got a couple of easier customers. 
Suppose that we model all service times, regardless of the 
server, as i.i.d. random variables having the exponential 
distribution with parameter 0.4. Let X be the sum of the 
first two service times for server A, and let Y be the first 
service time for server B. Assume that you can simulate as 
many i.i.d. exponential random variables with parameter 
0.4 as you wish. 


a. Explain how to use such random variables to approx- 
imate Pr(X < Y). 


b. Explain why Pr(X < Y) is the same no matter what 
the common parameter is of the exponential distributions.
That is, we don't need to simulate exponentials
with parameter 0.4. We could use any parameter
that is convenient, and we should get the same an- 
swer. 


c. Find the joint p.d.f. of X and Y, and write the
two-dimensional integral whose value would be 
Pr(X <Y). 
12.2 Why Is Simulation Useful?


Statistical simulations are used to estimate features of distributions such as means 
of functions, quantiles, and other features that we cannot compute in closed form. 
When using a simulation estimator, it is good to compute a measure of how precise 
the estimator is, in addition to the estimate itself.


Examples of Simulation 


Simulation is a technique that can be used to help shed light on how a complicated 
system works even if detailed analysis is unavailable. For example, engineers can 
simulate traffic patterns in the vicinity of a construction project to see what effects 
various proposed restrictions might have. A physicist can simulate the behavior of 
gas molecules under conditions that are covered by no known theory. Statistical 
simulations are used to estimate probabilistic features of our models that we cannot 
compute analytically. Because simulation introduces an element of randomness into 
an analysis, it is sometimes called Monte Carlo analysis, named after the famous 
European gambling center. 


Example 12.2.1. The M.S.E. of the Sample Median. Suppose that we are about to observe a random
sample of size n from a Cauchy distribution centered at an unknown value $\mu$. The
p.d.f. of each observation is

$f(x) = \frac{1}{\pi\{1 + [x - \mu]^2\}}$,

and the parameter $\mu$ is the median of the distribution. Suppose that we are interested
in how well the sample median M performs as an estimator of $\mu$. In particular, we
want to calculate the M.S.E. $E([M - \mu]^2)$. If we could generate a sample of n random
variables from a Cauchy distribution centered at $\mu$, we could compute the sample
median M and calculate $Y = (M - \mu)^2$. The M.S.E. is then $\theta = E(Y)$. If we could
generate a large number v of i.i.d. random variables with the same distribution as Y,
say, $Y^{(1)}, \ldots, Y^{(v)}$, then the law of large numbers would tell us that $Z = \frac{1}{v}\sum_{i=1}^{v} Y^{(i)}$
should be close to $\theta$. To do this, we could generate nv i.i.d. Cauchy random variables
centered at $\mu$. Then we could divide them into v sets of n each and use each set of n to
compute a sample median $M^{(i)}$ for $i = 1, \ldots, v$ and then compute $Y^{(i)} = (M^{(i)} - \mu)^2$.
This is actually how several of the numbers in the tables in Sec. 10.7 were computed.
These tables contain the M.S.E.'s of various estimators computed from random
samples with various distributions. For example, the numbers corresponding to the
sample median in Table 10.39 on page 675 are precisely what we have been discussing
in this example. ◄
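The procedure in Example 12.2.1 can be sketched as follows. Several details here are assumptions of ours and not taken from the text: Cauchy random variables are produced by applying the Cauchy quantile function to a uniform random number, the sample size n = 20 and the number of simulations v = 10,000 are arbitrary, and the function names are hypothetical.

```python
import math
import random
import statistics


def cauchy(mu, rng):
    """One Cauchy random variable centered at mu (quantile-function method)."""
    return mu + math.tan(math.pi * (rng.random() - 0.5))


def mse_of_median(n, v, mu, rng):
    """Z = (1/v) * sum of Y^(i), where Y^(i) = (M^(i) - mu)^2."""
    total = 0.0
    for _ in range(v):
        # M^(i): sample median of a fresh set of n Cauchy random variables.
        m = statistics.median(cauchy(mu, rng) for _ in range(n))
        total += (m - mu) ** 2
    return total / v


rng = random.Random(2010)  # arbitrary seed
print(round(mse_of_median(n=20, v=10_000, mu=0.0, rng=rng), 4))
```

Because the distribution of $M - \mu$ does not depend on $\mu$, the value of `mu` used in the simulation can be any convenient number, such as 0.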


Note: Notation to Distinguish Simulations. We shall use superscripts in parentheses
to distinguish different simulated values of the same random variable from each
other. For instance, in Example 12.2.1, we used $Y^{(i)}$ to stand for the ith simulated
value of Y. In what follows, we may be simulating subscripted random variables. For
example, $\mu_i^{(j)}$ would stand for the jth simulated value of $\mu_i$.

Example 12.2.1 illustrates the main features of many statistical simulations. Suppose that the quantity in which we are interested can be expressed as the expected value of some random variable that has the distribution $F$. Then we should try to generate a large sample of random variables with the distribution $F$ and average them. It is often the case, as in Example 12.2.1, that the distribution $F$ is itself very complicated. In such cases, we need to construct random variables with the distribution $F$ from simpler random variables whose distributions are more familiar. In Example 12.2.1, the M.S.E. is the mean of the random variable $Y = (M - \mu)^2$, where $M$ is itself the sample median of a sample of $n$ Cauchy random variables centered at $\mu$. We cannot easily simulate a random variable with the distribution of $Y$ in one step, but we can simulate $n$ Cauchy random variables and then find their sample median $M$ and finally compute $Y = (M - \mu)^2$, which will have the desired distribution. We then repeat the simulation of $Y$ many times.
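The recipe just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the seed, and the choices $n = 20$ and $v = 1000$ are ours, and the Cauchy variates are generated by inverting the Cauchy c.d.f., which is one of several possible methods.

```python
import math
import random
import statistics


def cauchy(mu, rng):
    """Draw one Cauchy variate centered at mu by inverting the c.d.f."""
    return mu + math.tan(math.pi * (rng.random() - 0.5))


def simulate_mse_of_median(n, v, mu=0.0, seed=0):
    """Return (Z, ys), where ys holds v simulated values of Y = (M - mu)^2
    and Z is their average, the simulation estimate of the M.S.E."""
    rng = random.Random(seed)
    ys = []
    for _ in range(v):
        sample = [cauchy(mu, rng) for _ in range(n)]  # one sample of size n
        m = statistics.median(sample)                 # sample median M^(i)
        ys.append((m - mu) ** 2)                      # Y^(i)
    return sum(ys) / v, ys


z, ys = simulate_mse_of_median(n=20, v=1000)
print(f"estimated M.S.E. of the sample median: {z:.4f}")
```

Rerunning with a different seed gives a slightly different value of $Z$, which is the simulation variability discussed later in this section.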

Not all statistical simulations involve the mean of a random variable. 


The Median of a Complicated Distribution. Let $X$ be an exponential random variable with unknown parameter $\mu$. Suppose that $\mu$ has a distribution with the p.d.f. $g$. We are interested in the median of $X$. The marginal distribution of $X$ has the p.d.f.

\[ f(x) = \int_0^\infty \mu e^{-\mu x} g(\mu)\, d\mu. \]

This integral might not be one that we can compute. However, suppose that we can generate a large sample of random variables $\mu^{(1)}, \ldots, \mu^{(v)}$ having the p.d.f. $g$. Then, for each $i = 1, \ldots, v$, we can simulate $X^{(i)}$ having the exponential distribution with parameter $\mu^{(i)}$. The random variables $X^{(1)}, \ldots, X^{(v)}$ would then be a random sample from the distribution with the p.d.f. $f$. The median of the sample $X^{(1)}, \ldots, X^{(v)}$ should be close to the median of the distribution with the p.d.f. $f$. ◄
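As a small illustration of this two-stage simulation, the sketch below takes $g$ to be the gamma p.d.f. with parameters 3 and 1 (the choice used later in this section). The function names and the simulation size are our own assumptions. For this particular $g$ the integral can in fact be computed, giving $f(x) = 3/(1+x)^4$ with median $2^{1/3} - 1 \approx 0.2599$, so the simulated value can be checked.

```python
import random
import statistics


def simulate_marginal_median(v, draw_mu, seed=0):
    """Simulate mu^(i) from g, then X^(i) ~ exponential(mu^(i)),
    and return the sample median of X^(1), ..., X^(v)."""
    rng = random.Random(seed)
    xs = []
    for _ in range(v):
        mu = draw_mu(rng)                # mu^(i) with p.d.f. g
        xs.append(rng.expovariate(mu))   # X^(i); expovariate takes the rate
    return statistics.median(xs)


def gamma_3_1(rng):
    """One draw from the gamma distribution with parameters 3 and 1."""
    return rng.gammavariate(3.0, 1.0)


estimate = simulate_marginal_median(v=100_000, draw_mu=gamma_3_1)
print(f"simulated median: {estimate:.4f}")
```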


A Clinical Trial. Consider the four treatment groups described in Example 2.1.4 on page 57. For $i = 1, 2, 3, 4$, let $P_i$ be the probability that a patient in treatment group $i$ will not relapse after treatment. We might be interested in how likely it is that the $P_i$'s differ by certain amounts. We might assume that the $P_i$'s are independent a priori with beta distributions having parameters $\alpha_0$ and $\beta_0$. The posterior distributions of the $P_i$'s are also independent beta distributions with parameters $\alpha_0 + x_i$ and $\beta_0 + n_i - x_i$, where $n_i$ is the number of subjects in group $i$, and $x_i$ is the number of patients in group $i$ who do not relapse. We could simulate a large number $v$ of vectors $(P_1, P_2, P_3, P_4)$ with the above beta distributions. Then we could try to answer any question we wanted about the posterior distribution of $(P_1, P_2, P_3, P_4)$. For example, we could estimate $\Pr(P_i > P_4)$ for $i = 1, 2, 3$, where $i = 4$ stands for the placebo group. This probability tells us how likely it is that each treatment is better than no treatment. We could estimate $\Pr(P_i > P_4)$ by finding the proportion of sampled $(P_1, P_2, P_3, P_4)$ vectors in which the $i$th coordinate is greater than the fourth coordinate. We could also estimate the probability that $P_1$ is the largest, or the probability that all four $P_i$ are within $\epsilon$ of each other. ◄
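The proportion-counting idea can be sketched as follows. The counts `x` and `n` below are hypothetical placeholders, not the data of Example 2.1.4, and the prior parameters $\alpha_0 = \beta_0 = 1$ are likewise an assumption made only for illustration.

```python
import random


def posterior_prob_better(x, n, alpha0, beta0, v, seed=0):
    """Estimate Pr(P_i > P_last) for every group i except the last
    (the placebo group) from v simulated posterior vectors."""
    rng = random.Random(seed)
    k = len(x)
    wins = [0] * (k - 1)
    for _ in range(v):
        # One simulated vector (P_1, ..., P_k) from independent beta posteriors.
        p = [rng.betavariate(alpha0 + x[i], beta0 + n[i] - x[i]) for i in range(k)]
        for i in range(k - 1):
            if p[i] > p[-1]:
                wins[i] += 1
    return [w / v for w in wins]


# Hypothetical data: x[i] non-relapsing patients out of n[i] in group i.
x = [24, 28, 30, 12]
n = [40, 40, 40, 40]
probs = posterior_prob_better(x, n, alpha0=1.0, beta0=1.0, v=20_000)
print(probs)
```

Other posterior questions, such as the probability that all four $P_i$ are within $\epsilon$ of each other, can be answered from the same simulated vectors by changing the condition that is counted.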


Comparing Two Normal Means with Unequal Variances. On page 593 in Chapter 9, we considered how to test hypotheses concerning the means of two different normal distributions when the variances are unknown and different. This problem has a relatively simple solution in the Bayesian framework using simulation. Our parameters will be $\mu_x$, $\tau_x$, $\mu_y$, and $\tau_y$. Conditional on the parameters, let $X_1, \ldots, X_m$ be i.i.d. having the normal distribution with mean $\mu_x$ and precision $\tau_x$. Also let $Y_1, \ldots, Y_n$ be i.i.d. (and independent of the $X$'s) having the normal distribution with mean $\mu_y$ and precision $\tau_y$. Assume that we use natural conjugate priors for the parameters with $(\mu_x, \tau_x)$ independent of $(\mu_y, \tau_y)$ in the prior distribution. (It is not necessary for the $X$ parameters to be independent of the $Y$ parameters, but it makes the presentation simpler.) Sec. 8.6 contains details on how to obtain the posterior distributions of the parameters. Since the $X$ data and $X$ parameters are independent of the $Y$ data and $Y$ parameters, we can calculate each posterior distribution separately. Let the hyperparameters of the posterior distribution of $(\mu_x, \tau_x)$ be $\alpha_{x1}$, $\beta_{x1}$, $\mu_{x1}$, and $\lambda_{x1}$. Similarly, let the hyperparameters of the posterior distribution of $(\mu_y, \tau_y)$ be $\alpha_{y1}$, $\beta_{y1}$, $\mu_{y1}$, and $\lambda_{y1}$. In order to test hypotheses about $\mu_x - \mu_y$, we need the posterior distribution of $\mu_x - \mu_y$. This distribution is not analytically tractable. If we can simulate a large collection of parameter vectors from their joint posterior distribution, we can compute $\mu_x - \mu_y$ for each sampled vector, and these values will form a sample from the posterior distribution of $\mu_x - \mu_y$. To be more specific, let $v$ be a large number, and for each $i = 1, \ldots, v$, we want to simulate $(\mu_x^{(i)}, \mu_y^{(i)}, \tau_x^{(i)}, \tau_y^{(i)})$ from the joint posterior distribution. To do this, we need to simulate independent gamma random variables $\tau_x^{(i)}$ and $\tau_y^{(i)}$ with the appropriate posterior distributions. After simulating these, we can simulate $\mu_x^{(i)}$ from the normal distribution with mean $\mu_{x1}$ and variance $1/(\lambda_{x1}\tau_x^{(i)})$. Similarly, we can simulate $\mu_y^{(i)}$ from the normal distribution with mean $\mu_{y1}$ and variance $1/(\lambda_{y1}\tau_y^{(i)})$. Then $\mu_x^{(i)} - \mu_y^{(i)}$ for $i = 1, \ldots, v$ is a sample from the posterior distribution of $\mu_x - \mu_y$. We shall illustrate this methodology in Example 12.3.8 after we discuss some methods for simulating pseudo-random numbers with various distributions. ◄
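A minimal sketch of this scheme is given below. The posterior hyperparameter values are hypothetical, chosen only so that the code runs, and the function names are ours. We assume the gamma distribution is parameterized by shape $\alpha$ and rate $\beta$, so the scale passed to Python's `gammavariate` is $1/\beta$.

```python
import random


def draw_mean(alpha1, beta1, mu1, lam1, rng):
    """Draw tau from gamma(alpha1, rate beta1), then mu given tau from the
    normal distribution with mean mu1 and variance 1/(lam1 * tau)."""
    tau = rng.gammavariate(alpha1, 1.0 / beta1)   # second argument is the scale
    return rng.gauss(mu1, (lam1 * tau) ** -0.5)   # second argument is the s.d.


def simulate_mean_difference(post_x, post_y, v, seed=0):
    """Return v simulated values of mu_x - mu_y from the joint posterior."""
    rng = random.Random(seed)
    return [draw_mean(*post_x, rng) - draw_mean(*post_y, rng) for _ in range(v)]


# Hypothetical posterior hyperparameters (alpha, beta, mu, lambda).
post_x = (5.0, 8.0, 10.0, 12.0)
post_y = (6.0, 30.0, 9.0, 15.0)
diffs = simulate_mean_difference(post_x, post_y, v=50_000)
prob_positive = sum(d > 0 for d in diffs) / len(diffs)
print(f"estimated Pr(mu_x > mu_y | data): {prob_positive:.3f}")
```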


The simulation in Example 12.2.4 can be extended in a straightforward fashion to a comparison of three or more normal distributions with unequal variances. With more than two means to compare, questions arise about what exactly to calculate to summarize the comparison. That is, there is not just one difference like $\mu_x - \mu_y$ that captures the differences between three or more means. We shall consider this situation in more detail in Examples 12.3.7 and 12.5.6.


Estimating a Standard Deviation. Let $X$ be a random variable whose standard deviation $\theta$ is important to estimate. Suppose that we cannot calculate $\theta$ in closed form, but we can simulate many pseudo-random values $X^{(1)}, \ldots, X^{(v)}$ with the same distribution as $X$. Then we could compute the sample standard deviation

\[ S_v = \left(\frac{1}{v}\sum_{i=1}^{v}\left(X^{(i)} - \overline{X}\right)^2\right)^{1/2} \]

as an estimator of $\theta$, where $\overline{X} = \frac{1}{v}\sum_{i=1}^{v} X^{(i)}$. Since $S_v$ is not an average, the law of large numbers does not tell us that it converges in probability to $\theta$. However, if we let $Y^{(i)} = (X^{(i)})^2$, we can rewrite $S_v$ as $(\overline{Y} - \overline{X}^2)^{1/2}$. In this form, we see that $S_v = g(\overline{X}, \overline{Y})$, where $g(x, y) = (y - x^2)^{1/2}$. Notice that $g$ is continuous at every point $(x, y)$ such that $y > x^2$. The law of large numbers tells us that $\overline{Y}$ converges in probability to $E(X^2)$ and that $\overline{X}$ converges in probability to $E(X)$. Since $E(X^2) > E(X)^2$, we can apply Exercise 16 in Sec. 6.2 to conclude that $S_v$ converges in probability to $g(E(X), E(X^2)) = \theta$. ◄


All of the examples above involve the generation of a large number of random 
variables with specific distributions. Some discussion of this topic appeared in Chap- 
ter 3 beginning on page 170. Sections 12.3 and 12.5 will also discuss methods for 
generating random variables with specific distributions. Sections 12.4 and 12.6 will 
present particular classes of problems in which statistical simulation is used success- 
fully. 


Which Mean Do You Mean? 


Simulation analyses add an additional layer of probability distributions and sam- 
pling distributions of statistics to an already probability-laden statistical analysis. A 
typical statistical analysis involves a probability model for a random sample of data 
$X_1, \ldots, X_n$. This probability model specifies the distribution of each $X_i$, and this 
distribution might have parameters such as its mean, median, variance, and other 
measures that we are interested in estimating or testing. We then form statistics (func- 
tions of the data), say, Y. These functions might include sample versions of the very 
parameters that we wish to estimate, such as a sample mean, sample median, sample 
variance, and the like. The distribution of Y has been called its sampling distribu- 
tion. This sampling distribution also might have a mean, median, variance, and other 
measures that we need to calculate or deal with in some way. So far, we have three ver- 
sions of mean, median, variance, and others, and we have not even begun discussing 
simulation. 

A simulation analysis might be used to try to estimate a parameter $\theta$ of the sampling distribution of the statistic $Y$. Typically, one would simulate i.i.d. pseudo-random $Y^{(1)}, \ldots, Y^{(v)}$, each with the same distribution as (the sampling distribution of) $Y$. We then compute a summary statistic $Z$ of $Y^{(1)}, \ldots, Y^{(v)}$ and use $Z$ to estimate $\theta$. This $Z$ might itself be a sample mean, sample median, sample variance, or other measure of the $Y^{(1)}, \ldots, Y^{(v)}$ sample. The distribution of $Z$ will be called its simulation distribution or Monte Carlo distribution. Features of the simulation distribution, such as its mean, median, and variance, will be called the simulation mean, simulation median, and simulation variance to make clear to which level we have climbed in this ever-expanding tree of terminology. Here is an example to illustrate all of the various levels.


Five or More Variances. Let $X_1, \ldots, X_n$ be i.i.d. random variables with a continuous distribution having c.d.f. $F$. Let $\psi$ denote the variance of $X_i$. Suppose that we decide to use the sample variance $Y = \sum_{i=1}^{n}(X_i - \overline{X})^2/n$ to estimate $\psi$. As part of deciding how good $Y$ is as an estimator of $\psi$, we are interested in its variance $\theta = \operatorname{Var}(Y)$. That is, $\theta$ is the variance of the sampling distribution of $Y$. Suppose that we cannot calculate $\theta$ in closed form, but suppose that it is easy to simulate from the distribution $F$. We might then simulate $nv$ values $X_i^{(j)}$ for $j = 1, \ldots, v$, $i = 1, \ldots, n$. For each $j$, we compute the sample variance $Y^{(j)}$ of the sample $X_1^{(j)}, \ldots, X_n^{(j)}$. That is, $Y^{(j)} = \sum_{i=1}^{n}(X_i^{(j)} - \overline{X}^{(j)})^2/n$. The $Y^{(j)}$ values all have the same distribution as $Y$ itself, the sampling distribution of $Y$. Since we are interested in $\operatorname{Var}(Y)$, we might compute the sample variance $Z$ of the sample $Y^{(1)}, \ldots, Y^{(v)}$. That is, $Z = \sum_{j=1}^{v}(Y^{(j)} - \overline{Y})^2/v$. We would then use $Z$ to estimate $\theta$. If $Z$ is large, it suggests that $Y$ has large variance, and so $Y$ is not a very good estimator of $\psi$. Unless we are willing to collect more data or search for a better estimator, we are stuck with a poor estimator of $\psi$.

Table 12.3 Levels of probability distributions, statistics, and parameters in a typical simulation analysis

| Distribution (D) or sample (S) | Parameter (P) or statistic (S) |
|---|---|
| (D) Population distribution $F$ | (P) Mean, variance, median, etc., $\psi$ |
| (S) Sample $X = (X_1, \ldots, X_n)$ from $F$ | (S) Estimator $Y$ of $\psi$ based on $X$, e.g., sample mean, sample variance, sample median, etc. |
| (D) Sampling distribution $G$ of $Y$ | (P) Mean, variance, median, etc., $\theta$ of the sampling distribution of $Y$ |
| (S) Simulated sample $Y = (Y^{(1)}, \ldots, Y^{(v)})$ from $G$ | (S) Estimator $Z$ of $\theta$ based on $Y$, e.g., sample mean, sample variance, sample median, etc., of $Y$ |
| (D) Simulation distribution $H$ of $Z$ | (P) Variance of simulation distribution (simulation variance) |
| (S) Simulated data (differs by example) | (S) Estimator of simulation variance (depends on specific example) |

Finally, $Z$ might not be a good estimator of $\theta$ because our simulation size $v$ might not be large enough. If this is the case, we can simulate more $Y^{(j)}$ values. That is, we can increase the simulation size $v$ to get a better simulation estimator of $\theta$. (This will not make $Y$ a better estimator of $\psi$, but it will give us a better idea of how good or bad an estimator it is.) Hence, we shall also try to estimate the variance of $Z$ (its simulation variance). Precisely how to do this varies from one example to the next, so we shall not give any details here. However, we shall explain how to estimate the simulation variance of $Z$ for the most popular types of simulation later in this section.

This estimation of variance has to end somewhere, and we shall end it with $\operatorname{Var}(Z)$. That is, we shall not try to assess how good our estimator of $\operatorname{Var}(Z)$ is. All of these levels of distributions and estimation are illustrated in Table 12.3. ◄
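To make the levels concrete, here is a sketch in which $F$ is taken to be the standard normal distribution. That choice, the sizes $n = 10$ and $v = 5000$, and the function names are our assumptions; the normal case is convenient because $\operatorname{Var}(Y) = 2(n-1)/n^2$ is known and gives a check on the simulation.

```python
import random


def sample_variance(values):
    """Sample variance with divisor equal to the sample size."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def simulate_theta(n, v, draw_x, seed=0):
    """Return (Z, ys): ys holds the v simulated sample variances Y^(j),
    and Z is their sample variance, the simulation estimate of Var(Y)."""
    rng = random.Random(seed)
    ys = [sample_variance([draw_x(rng) for _ in range(n)]) for _ in range(v)]
    return sample_variance(ys), ys


n, v = 10, 5000
z, ys = simulate_theta(n, v, draw_x=lambda rng: rng.gauss(0.0, 1.0))
print(f"Z = {z:.4f}, exact Var(Y) for normal data = {2 * (n - 1) / n**2:.4f}")
```

Here `sample_variance` is applied at two different levels: once to each simulated sample from $F$ (producing $Y^{(j)}$) and once to the simulated $Y^{(j)}$ values (producing $Z$).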


Example 12.2.6 is not intended to illustrate any simulation methodology. It is 
intended to illustrate the various levels at which probability concepts (such as vari- 
ance) and their sample versions enter into a simulation study of a statistical analysis. 
It is important to be able to tell which variance or which sample variance is being 
discussed if one is to avoid becoming hopelessly confused. In this chapter, we shall 
focus on the features of the simulated samples, in particular the simulation distribu- 
tion of statistics computed from the simulated samples. However, our examples will 
necessarily involve parameters and statistics that arose at earlier levels. Furthermore, 
the analysis of a simulation distribution will make use of the same methods (central 
limit theorem, law of large numbers, delta method, etc.) that we learned how to use 
with nonsimulated data. 


Assessing Uncertainty about Simulation Results 


The last step in Example 12.2.6 (summarized in the last two rows of Table 12.3) is 
an important part of every simulation analysis. That is, we should always attempt to 
assess the uncertainty in a simulation. This uncertainty is most easily assessed via 
the simulation variance of the simulated quantity. For instance, in Example 12.2.1, 
let $v = 1000$ and $\mu = 0$. We can create 1000 samples of $n$ Cauchy random variables, 
calculate $M^{(i)}$, the median of the $i$th sample, and compute the value $Y^{(i)} = (M^{(i)} - 0)^2$. 
We can then average the 1000 values of $Y^{(i)}$. We could repeat this exercise several 
times, and we would not get the same result every time. This is due to the fact that, 
even with a large $v$ like 1000, an estimator such as $Z = \frac{1}{v}\sum_{i=1}^{v} Y^{(i)}$ is still a random 
variable with positive variance (its simulation variance). The smaller the simulation 
variance is, the more certain we can be that our estimator $Z$ is close to what we are 
trying to estimate. But we need to estimate or bound the simulation variance before 
we can assess the amount of uncertainty. How we estimate the simulation variance of 
a result $Z$ depends on whether $Z$ is an average of simulated values, a smooth function 
of one or more averages, or a sample quantile of simulated values. The square root 
of our estimate of the simulation variance will be called the simulation standard 
error, and it is an estimate of the simulation standard deviation of Z. The simulation 
standard error is a popular way to summarize uncertainty about a simulation for two 
reasons. First, it has the same units of measurement as the quantity that was estimated 
(unlike the simulation variance). Second, the simulation standard error is useful for 
saying how likely it is that the simulation estimator is close to the parameter being 
estimated. We shall explain this second point in more detail after we show how to 
calculate the simulation standard error in several common cases. 


The Simulation Standard Error of an Average. Suppose that the goal of the simulation analysis is to estimate the mean $\theta$ of some random variable $Y$. The simulation estimator $Z$ will generally be the average of a large number of simulated values. A straightforward way to estimate the simulation variance for an average is the following: Suppose that we simulate some quantity $Y$ a large number $v$ of times in order to estimate the mean $\theta$. That is, suppose that we simulate independent $Y^{(1)}, \ldots, Y^{(v)}$ for large $v$. Suppose also that the estimator of $\theta$ is $Z = \frac{1}{v}\sum_{i=1}^{v} Y^{(i)}$, and each $Y^{(i)}$ has mean $\theta$ and finite variance $\sigma^2$. The sample standard deviation of the sample $Y^{(1)}, \ldots, Y^{(v)}$ is the square root of the sample variance, namely,

\[ \hat{\sigma} = \left(\frac{1}{v}\sum_{i=1}^{v}\left(Y^{(i)} - Z\right)^2\right)^{1/2}. \qquad (12.2.1) \]

If $v$ is large, then $\hat{\sigma}$ should be close to $\sigma$. The central limit theorem says that $Z$ should have approximately the normal distribution with mean $\theta$ and variance $\sigma^2/v$. Since we usually do not know $\sigma^2$, we shall estimate it by $\hat{\sigma}^2$. This makes our estimator of the simulation variance of $Z$ equal to $\hat{\sigma}^2/v$, and the simulation standard error is $\hat{\sigma}/v^{1/2}$. ◄
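The calculation in Eq. (12.2.1) is short enough to write out directly. The following sketch uses a function name of our own choosing and a tiny made-up list of values only to show the arithmetic.

```python
import math


def average_with_standard_error(ys):
    """Return (Z, sigma_hat, standard error) for simulated values ys,
    where sigma_hat follows Eq. (12.2.1) with divisor v."""
    v = len(ys)
    z = sum(ys) / v
    sigma_hat = math.sqrt(sum((y - z) ** 2 for y in ys) / v)
    return z, sigma_hat, sigma_hat / math.sqrt(v)


# Made-up values, used only to illustrate the arithmetic.
z, sigma_hat, se = average_with_standard_error([1.0, 2.0, 3.0, 4.0])
print(z, sigma_hat, se)
```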


The Simulation Standard Error of a Smooth Function of Another Estimator. Sometimes, after estimating a quantity $\psi$, we also wish to estimate a smooth function of it: $g(\psi)$. For example, we might need to estimate the square root or the logarithm of some mean. Or, we might have estimated a variance $\theta^2$, and now we want an estimator of $\theta$, the corresponding standard deviation. In general, suppose that the parameter that we wish to estimate by simulation is $\theta = g(\psi)$, where we already have an estimator $W$ of $\psi$. Suppose further that our estimator $W$ has approximately the normal distribution with mean $\psi$ and variance $\sigma^2/v$, where $v$ is large compared to $\sigma^2$. Finally, suppose that we also have an estimator $\hat{\sigma}$ of $\sigma$ that we obtained while calculating $W$. For example, $W$ might itself be the average of $v$ i.i.d. simulated random variables $Y^{(i)}$ with mean $\psi$ and variance $\sigma^2$. In this case, Eq. (12.2.1) will be our estimator of $\sigma$. Let $Z = g(W)$ be our estimator of $\theta$. The delta method (see Sec. 6.3) says that $Z$ has approximately the normal distribution with mean $\theta = g(\psi)$ and variance $[g'(\psi)]^2\sigma^2/v$. For example, if $g(\psi) = \psi^{1/2}$, then $W^{1/2}$ has approximately the normal distribution with mean $\psi^{1/2}$ and variance $\sigma^2/[4\psi v]$. We already have estimates of $\sigma$ and $\psi$, so our simulation standard error of $Z$ is $|g'(W)|\hat{\sigma}/v^{1/2}$. ◄
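The delta-method standard error can be sketched as a one-line calculation once $W$, $\hat{\sigma}$, and the derivative of $g$ are available. The function name and the numerical inputs below are illustrative assumptions, using the square-root case from the text.

```python
import math


def delta_method_standard_error(w, sigma_hat, v, g, g_prime):
    """Return (Z, standard error), where Z = g(W) and the standard error
    is |g'(W)| * sigma_hat / v^(1/2)."""
    return g(w), abs(g_prime(w)) * sigma_hat / math.sqrt(v)


# Illustrative inputs: W = 4, sigma_hat = 2, v = 10,000, g = square root.
z, se = delta_method_standard_error(
    w=4.0,
    sigma_hat=2.0,
    v=10_000,
    g=math.sqrt,
    g_prime=lambda t: 0.5 / math.sqrt(t),
)
print(z, se)
```

With these inputs, $g'(W) = 1/(2 \cdot 4^{1/2}) = 0.25$, so the standard error is $0.25 \times 2/100 = 0.005$.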


The Simulation Standard Error of a Sample Quantile. Suppose that the goal of a simulation analysis is to estimate the $p$ quantile $\theta_p$ of some distribution $G$. Typically, we simulate a large number $v$ of pseudo-random values $Y^{(1)}, \ldots, Y^{(v)}$ with distribution $G$ and use the sample $p$ quantile as our estimator. On page 676, we pointed out that the sample $p$ quantile from a large random sample of size $m$ has approximately the normal distribution with mean $\theta_p$ and variance $p(1-p)/[m g^2(\theta_p)]$, where $g$ is the p.d.f. of the distribution $G$. All we care about right now is that this approximate variance has the form $\sigma^2/m$, where $\sigma^2 = p(1-p)/g^2(\theta_p)$ is some number that does not depend on $m$. Suppose that we simulate $k$ independent random samples each of size $m$ from the distribution $G$. Typically, this is done by choosing the size $v$ of the original simulated sample $Y^{(1)}, \ldots, Y^{(v)}$ to be $v = km$, and then splitting the $v$ simulated values into $k$ subsamples of size $m$ each. Compute the sample $p$ quantile of each of the $k$ random samples and call these simulated sample $p$ quantiles $Z_1, \ldots, Z_k$. To make use of the approximate normal distribution for the sample quantiles, $m$ needs to be large. Next, compute the sample standard deviation of $Z_1, \ldots, Z_k$:

\[ S = \left(\frac{1}{k}\sum_{i=1}^{k}\left(Z_i - \overline{Z}\right)^2\right)^{1/2}, \qquad (12.2.2) \]

where $\overline{Z}$ is the average of the $k$ sample $p$ quantiles. If we treat each $Z_i$ as a single simulation, then $S^2$ is an estimator of the variance of $Z_i$. But we just pointed out that the variance of $Z_i$ is approximately $\sigma^2/m$. Hence, $S^2$ is an estimator of $\sigma^2/m$. In other words, an estimator of $\sigma$ is $\hat{\sigma} = m^{1/2}S$. Finally, combine all $k$ samples into a single sample of size $v = km$, and compute the sample $p$ quantile $Z$ as our Monte Carlo estimator of $\theta_p$. As we noted earlier, $Z$ has approximately the normal distribution with mean $\theta_p$ and variance $\sigma^2/v$. We just constructed an estimator $\hat{\sigma}$ of $\sigma$, so our estimator of the simulation variance of $Z$ is $\hat{\sigma}^2/v = mS^2/v = S^2/k$, and the simulation standard error is $S/k^{1/2}$. ◄
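The subsample method can be sketched as follows. The helper `sample_quantile` uses a simple order-statistic convention that we chose for illustration (other definitions of a sample quantile differ slightly), and the usage example with standard normal data, $p = 1/2$, and $k = 20$ is likewise our assumption.

```python
import math
import random


def sample_quantile(values, p):
    """Sample p quantile taken as the ceil(p * size)-th order statistic
    (one simple convention among several)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(p * len(ordered)) - 1))
    return ordered[index]


def quantile_with_standard_error(ys, p, k):
    """Return (Z, S / k^(1/2)): the sample p quantile of all values used,
    and its simulation standard error from k subsamples, Eq. (12.2.2)."""
    m = len(ys) // k
    ys = ys[: k * m]                       # use exactly v = k * m values
    zs = [sample_quantile(ys[j * m:(j + 1) * m], p) for j in range(k)]
    z_bar = sum(zs) / k
    s = math.sqrt(sum((z - z_bar) ** 2 for z in zs) / k)
    return sample_quantile(ys, p), s / math.sqrt(k)


rng = random.Random(0)
ys = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
z, se = quantile_with_standard_error(ys, p=0.5, k=20)
print(f"sample median {z:.4f}, simulation standard error {se:.4f}")
```

For standard normal data the approximation in the text gives $\sigma/v^{1/2} = [0.25/g^2(0)]^{1/2}/100 \approx 0.0125$, which is the value the printed standard error should be near.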


The Simulation Standard Error of a Sample Variance. Suppose that the goal of a simulation analysis is to estimate the variance $\theta$ of some estimator $Y$. (Example 12.2.6 was based on such a situation.) Suppose that we simulate $Y^{(1)}, \ldots, Y^{(v)}$ and use $Z = \frac{1}{v}\sum_{i=1}^{v}(Y^{(i)} - \overline{Y})^2$ to estimate $\theta$. We now need to estimate the simulation variance of $Z$. We shall rewrite $Z$ as a smooth function of two averages and then apply a two-dimensional generalization of the delta method (see Exercise 12) in order to estimate the simulation variance. Let $W^{(i)} = (Y^{(i)})^2$ so that $Z = \overline{W} - \overline{Y}^2$, where $\overline{W}$ is the average of $W^{(1)}, \ldots, W^{(v)}$. Now $Z$ is a smooth function of two averages. The two-dimensional delta method developed in Exercise 12 can be applied. (The details for this very case can be derived in Exercise 13.) The results of Exercise 13 provide the following approximation to the asymptotic variance of $Z$. First, compute the sample variance of $W^{(1)}, \ldots, W^{(v)}$ and call it $V$. Next, compute the sample covariance between the $Y$'s and $W$'s:

\[ C = \frac{1}{v}\sum_{i=1}^{v}\left(Y^{(i)} - \overline{Y}\right)\left(W^{(i)} - \overline{W}\right). \]

The estimator of $\operatorname{Var}(Z)$ is then

\[ \widehat{\operatorname{Var}}(Z) = \frac{1}{v}\left(4\overline{Y}^2 Z - 4\overline{Y}C + V\right). \qquad (12.2.3) \]

Also, the simulation distribution of $Z$ is approximately the normal distribution with mean $\theta$ and variance that is estimated by Eq. (12.2.3). The simulation standard error is the square root of (12.2.3). ◄
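Equation (12.2.3) translates directly into code. The sketch below uses a function name of our own and a tiny made-up data set chosen so that the arithmetic can be followed by hand; all sample variances and covariances use the divisor $v$, as in the text.

```python
import math


def variance_with_standard_error(ys):
    """Return (Z, standard error), where Z is the sample variance of ys
    and the standard error is the square root of Eq. (12.2.3)."""
    v = len(ys)
    y_bar = sum(ys) / v
    ws = [y * y for y in ys]                      # W^(i) = (Y^(i))^2
    w_bar = sum(ws) / v
    z = w_bar - y_bar ** 2                        # Z = W-bar - Y-bar^2
    big_v = sum((w - w_bar) ** 2 for w in ws) / v # sample variance of the W's
    c = sum((y - y_bar) * (w - w_bar) for y, w in zip(ys, ws)) / v
    var_z = (4 * y_bar ** 2 * z - 4 * y_bar * c + big_v) / v
    return z, math.sqrt(max(var_z, 0.0))          # guard against rounding


# Made-up values: Y-bar = 1.5, Z = 1.25, C = 3.75, V = 12.25.
z, se = variance_with_standard_error([0.0, 1.0, 2.0, 3.0])
print(z, se)
```

For these four values, (12.2.3) gives $(11.25 - 22.5 + 12.25)/4 = 0.25$, so the standard error is 0.5.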


Do We Have Enough Simulations? Let $Z$ be our Monte Carlo estimator of some parameter $\theta$ based on $v$ simulations. Now that we are able to estimate the simulation variance of $Z$, we can begin to answer questions about how close we think $Z$ is to $\theta$. We can also try to see if we need to do more simulations in order to be confident that $Z$ is close enough to $\theta$. Suppose, as in all of the cases considered so far, that $Z$ has approximately the normal distribution with mean $\theta$ and variance $\sigma^2/v$, where $\sigma^2$ is a number that does not depend on the simulation size. For each $\epsilon > 0$,

\[ \Pr(|Z - \theta| \le \epsilon) \approx 2\Phi\left(\epsilon v^{1/2}/\sigma\right) - 1, \qquad (12.2.4) \]

where $\Phi$ is the standard normal c.d.f. We can use this type of approximation to help us to say how likely it is that $Z$ is close to $\theta$. We can replace $v^{1/2}/\sigma$ by 1 over the simulation standard error of $Z$ in Eq. (12.2.4) to approximate the probability that $|Z - \theta| \le \epsilon$. We can also use (12.2.4) to decide how many more simulations to do if $v$ was not large enough. For example, suppose that we want the probability in Eq. (12.2.4) to be $\gamma$. Then we should let

\[ v = \left[\Phi^{-1}\left(\frac{1+\gamma}{2}\right)\frac{\sigma}{\epsilon}\right]^2. \qquad (12.2.5) \]

Since we will hardly ever know $\sigma$ ahead of time, it is common to estimate it by doing a preliminary simulation of size $v_0$ and computing $\hat{\sigma}$ based on that preliminary simulation.
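Equations (12.2.4) and (12.2.5) can be evaluated with the standard normal c.d.f. and quantile function in Python's standard library. The function names below are ours; the inputs $\gamma = 0.95$, $\epsilon = 0.01$, and $\hat{\sigma} = 0.3892$ are the values used for the sample median later in this section.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def prob_within(eps, v, sigma):
    """Approximate Pr(|Z - theta| <= eps) from Eq. (12.2.4)."""
    return 2.0 * STANDARD_NORMAL.cdf(eps * math.sqrt(v) / sigma) - 1.0


def required_simulations(gamma, eps, sigma):
    """Simulation size v from Eq. (12.2.5); round up before using it."""
    quantile = STANDARD_NORMAL.inv_cdf((1.0 + gamma) / 2.0)
    return (quantile * sigma / eps) ** 2


v_needed = required_simulations(gamma=0.95, eps=0.01, sigma=0.3892)
print(math.ceil(v_needed), prob_within(0.01, v_needed, 0.3892))
```

Because `inv_cdf` uses the normal quantile to full precision rather than the rounded value 1.96, the result can differ by a simulation or two from a hand calculation.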


The M.S.E. of the Sample Median. It is not difficult to see that we can take $\mu = 0$ in Example 12.2.1 without loss of generality. The reason is the following: Let $M^{(i)}$ be the sample median of $X_1^{(i)}, \ldots, X_n^{(i)}$, where each $X_j^{(i)}$ is a Cauchy random variable centered at $\mu$. Then $M^{(i)} - \mu$ is also the sample median of $X_1^{(i)} - \mu, \ldots, X_n^{(i)} - \mu$, and each $X_j^{(i)} - \mu$ is a Cauchy random variable centered at 0. Because our calculation is based on the values $Y^{(i)} = (M^{(i)} - \mu)^2$ for $i = 1, \ldots, v$, we get the same result whether $\mu = 0$ or not. So, let $\mu = 0$. This makes $Y^{(i)} = (M^{(i)})^2$, and $\sigma^2$ is now the variance of $(M^{(i)})^2$. (Even though a Cauchy random variable does not even have a first moment defined, it can be shown that the sample median of at least nine i.i.d. Cauchy random variables has a finite fourth moment.) Suppose that we want our estimator $Z = \overline{Y}$ of $\theta$ to be within $\epsilon = 0.01$ of $\theta$ with probability $\gamma = 0.95$. That is, we want $\Pr(|Z - \theta| \le 0.01) = 0.95$. Since $Z$ is an average, we can compute an estimate $\hat{\sigma}$ of $\sigma$ by using Eq. (12.2.1). Suppose that we simulate $v_0 = 1000$ samples of size $n = 20$ from a Cauchy distribution, compute the 1000 values of $Y^{(i)}$, and then compute $\hat{\sigma} = 0.3892$. According to Eq. (12.2.5) with $\sigma$ replaced by 0.3892, we need $v = [1.96 \times 0.3892/0.01]^2 = 5820$. Hence, we need approximately 4820 additional simulations. ◄


After performing any additional simulations suggested by Eq. (12.2.5), one 
should recompute $\hat{\sigma}$. If it is much larger than the preliminary estimate, then additional 
simulations should be performed. 


The Median of a Complicated Distribution. In Example 12.2.2, suppose that the p.d.f. $g$ is the p.d.f. of the gamma distribution with parameters 3 and 1. Suppose that we want the probability to be 0.99 that our estimator of the median is within 0.001 of the true median. We begin with an initial simulation of size $v_0 = 10{,}000$. We then simulate $\mu^{(1)}, \ldots, \mu^{(10{,}000)}$ from the gamma distribution with parameters 3 and 1. For each $i$, we simulate $X^{(i)}$ having the exponential distribution with parameter $\mu^{(i)}$. We treat $X^{(1)}, \ldots, X^{(10{,}000)}$ as $k = 20$ samples of size $m = 500$ each, and we compute the sample median $Z_1, \ldots, Z_{20}$ of each of the 20 samples. After performing these initial simulations, suppose that we observe the value $S = 0.01597$ for Eq. (12.2.2). This makes $\hat{\sigma} = 0.3570$. Plugging this value into (12.2.5) for $\sigma$ with $\gamma = 0.99$ and $\epsilon = 0.001$ yields $v = 845{,}747.4$. This means that we need a total of 845,748 simulations to reach our desired level of confidence in the simulated result. Just to check, we simulated a total of 900,000 values and computed the sample median 0.2593 as well as a new value of $S$ based on $k = 100$ subsamples of size $m = 6200$ each. The new value of $\hat{\sigma}$ is 0.4529. Substituting 0.4529 for $\sigma$ in Eq. (12.2.5) yields a new $v = 1{,}360{,}939$, which means that we still need another 460,939 simulations. ◄


Simulating Real Processes 

In many scientific fields, real physical or social processes are modeled as having random 
components. For example, stock prices are often modeled as having lognormal 
distributions as in Example 5.6.10. Many processes involving waiting times and ser- 
vice are modeled using Poisson processes. The simple probability models that have 
been developed earlier in this text are merely the building blocks of which such mod- 
els of real processes are constructed. Here, we shall give two examples of slightly 
more complicated models that can be constructed using the distributions we already 
know. The analyses of these models can be simplified by the use of simulation. 


Option Pricing. In Example 5.6.10, we introduced the formula of Black and Scholes (1973) for pricing options. In that example, the option was to buy shares at price $q$ of a stock whose value at time $u$ (in the future) is a random variable $S_u$ with a known lognormal distribution. Many financial analysts believe that the standard deviation $\sigma$ of $\log(S_u)$ in Example 5.6.10 should not be treated as a known constant. For example, we could treat $\sigma$ as a random variable with a p.d.f. $f(\sigma)$. To be precise, we shall continue to assume that $S_u = S_0 e^{(r - \sigma^2/2)u + \sigma u^{1/2} Z}$, but now we shall assume that both $Z$ and $\sigma$ are random variables. For convenience, we shall assume that they are independent. We shall let $Z$ have the standard normal distribution, and we shall let $\tau = 1/\sigma^2$ have the gamma distribution with known parameters $\alpha$ and $\beta$. The parameters $\alpha$ and $\beta$ might result from estimating the variance of stock prices based on historical data combined with expert opinion of stock analysts. For example, they might be the posterior hyperparameters that result from applying a Bayesian analysis to a sample of stock prices. It is easy to see that $E(S_u \mid \sigma) = S_0 e^{ru}$ for all $\sigma$, and hence the law of total probability for expectations (Theorem 4.7.1) implies that $E(S_u) = S_0 e^{ru}$. This is what we need for risk neutrality. The price for the option considered in Example 5.6.10 is the mean of the random variable $e^{-ru}h(S_u)$, where

\[ h(s) = \begin{cases} s - q & \text{if } s > q, \\ 0 & \text{otherwise.} \end{cases} \]

The Black-Scholes formula (5.6.18) is just the conditional mean of $e^{-ru}h(S_u)$ given $\sigma$. To estimate the marginal mean of $e^{-ru}h(S_u)$, we could simulate a large number of values $\sigma^{(i)}$ ($i = 1, \ldots, v$) from the distribution of $\sigma$, substitute each $\sigma^{(i)}$ into (5.6.18), and average the results.

As an example, suppose that we take the same numerical situation from the end of Example 5.6.10 with $u = 1$, $r = 0.06$, and $q = S_0$. This time, suppose that $1/\sigma^2$ has the gamma distribution with parameters 2 and 0.0127. (These numbers make $E(\sigma) = 0.1$, but $\sigma$ has substantial variability.) We can sample $v = 1{,}000{,}000$ values of $\sigma$ from this distribution and compute (5.6.18) for each value. The average, in our simulation, is $0.0756 S_0$, and the simulation standard error is $1.814 S_0 \times 10^{-5}$. The option price is only slightly higher than it was when we assumed that we knew $\sigma$. When the distribution of $S_u$ is even more complicated, one can simulate $S_u$ directly and estimate the mean of $h(S_u)$. ◄
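The averaging just described can be sketched as follows. Formula (5.6.18) is not reproduced in this section, so the code uses the standard Black-Scholes call-price formula and assumes that it coincides with (5.6.18). We also assume that the gamma distribution is parameterized by shape and rate, and we use a smaller simulation size than the text; the function names are ours.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal c.d.f.


def black_scholes_call(s0, q, r, u, sigma):
    """Standard Black-Scholes price of an option to buy at price q at time u
    (assumed here to match formula (5.6.18))."""
    d1 = (math.log(s0 / q) + (r + sigma ** 2 / 2.0) * u) / (sigma * math.sqrt(u))
    d2 = d1 - sigma * math.sqrt(u)
    return s0 * PHI(d1) - q * math.exp(-r * u) * PHI(d2)


def price_with_random_sigma(s0, q, r, u, alpha, beta, v, seed=0):
    """Average the conditional price over v draws of sigma, where
    tau = 1/sigma^2 has the gamma distribution with shape alpha, rate beta.
    Returns (average price, simulation standard error)."""
    rng = random.Random(seed)
    prices = []
    for _ in range(v):
        tau = rng.gammavariate(alpha, 1.0 / beta)   # second argument is the scale
        prices.append(black_scholes_call(s0, q, r, u, tau ** -0.5))
    z = sum(prices) / v
    sigma_hat = math.sqrt(sum((p - z) ** 2 for p in prices) / v)
    return z, sigma_hat / math.sqrt(v)


# Prices are in units of S_0 (s0 = q = 1).
z, se = price_with_random_sigma(1.0, 1.0, 0.06, 1.0, alpha=2.0, beta=0.0127, v=100_000)
print(f"price = {z:.4f} S_0, simulation standard error = {se:.2e} S_0")
```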


In the following example, each simulation requires a large number of steps, 
but each step is relatively simple. The combination of several simple steps into one 
complicated step is very common in simulations of real processes. 


A Service Queue with Impatient Customers. Consider a queue to which customers arrive according to a Poisson process with rate $\lambda$ per hour. Suppose that the queue has a single server. Each customer who arrives at the queue counts the length $r$ of the queue (including the customer being served) and decides to leave with probability $p_r$, for $r = 1, 2, \ldots$. A customer who leaves does not enter the queue. Each customer who enters the queue waits in the order of arrival until the customer immediately in front is done being served, and then moves to the head of the queue. The time (in hours) to serve a customer, after reaching the head of the queue, is an exponential random variable with parameter $\mu$. Assume that all service times are independent of each other and of all arrival times.

We can use simulation to learn about the behavior of such a queue. For example, we could estimate the expected number of customers in the queue at a particular time $t$ after the queue opens for business. To do this, we could simulate many, say, $v$, realizations of the queue operation. For each realization $i$, we count how many customers $N^{(i)}$ are in the queue at time $t$. Then our estimator is $\frac{1}{v}\sum_{i=1}^{v} N^{(i)}$. To simulate a single realization, we could proceed as follows: Simulate interarrival times $X_1, X_2, \ldots$ of the Poisson process as i.i.d. exponential random variables with parameter $\lambda$. Let $T_j = \sum_{i=1}^{j} X_i$ be the time at which customer $j$ arrives. Stop simulating at the first $k$ such that $T_k > t$. Only the first $k - 1$ customers have even arrived at the queue by time $t$. For each $j = 1, \ldots, k - 1$, simulate a service time $Y_j$ having the exponential distribution with parameter $\mu$. Let $Z_j$ stand for the time at which the $j$th customer reaches the head of the queue, and let $W_j$ stand for the time at which the $j$th customer leaves the queue. For example, $Z_1 = X_1$ and $W_1 = X_1 + Y_1$. For $j > 1$, the $j$th customer first counts the length of the queue and decides whether or not to leave. Let $U_{i,j} = 1$ if customer $i$ is still in the queue when customer $j$ arrives ($i < j$), and let $U_{i,j} = 0$ if customer $i$ has already left the queue. Then

\[ U_{i,j} = \begin{cases} 1 & \text{if } W_i > T_j, \\ 0 & \text{otherwise.} \end{cases} \]

The number of customers in the queue when the $j$th customer arrives is $r = \sum_{i=1}^{j-1} U_{i,j}$. We then simulate a random variable $V_j$ having the Bernoulli distribution with parameter $p_r$. If $V_j = 1$, customer $j$ leaves the queue so that $W_j = T_j$. If customer $j$ stays in the queue, then this customer reaches the head of the queue at time

\[ Z_j = \max\{T_j, W_1, \ldots, W_{j-1}\}. \]

That is, the $j$th customer either reaches the head of the queue immediately upon arrival (if nobody is still being served) or as soon as all of the previous $j - 1$ customers have left, whichever comes later. Also, $W_j = Z_j + Y_j$ if customer $j$ stays. For each $j = 1, \ldots, k - 1$, the $j$th customer is in the queue at time $t$ if and only if $W_j > t$.
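The steps for one realization, and the averaging over $v$ realizations, can be sketched as below. The function names, the use of generators for the exponential streams, and the parameter values in the usage lines are our assumptions. A customer who finds the queue empty is taken to stay, and no Bernoulli variable is drawn when $p_r = 0$.

```python
import random


def customers_in_queue(interarrivals, services, p_leave, t, rng):
    """Count the customers still in the queue at time t for one realization.
    interarrivals and services are iterables of X_j and Y_j;
    p_leave(r) is the probability of leaving when the queue length is r >= 1."""
    leave_times = []                                    # W_1, W_2, ...
    arrival = 0.0
    for x, y in zip(interarrivals, services):
        arrival += x                                    # T_j
        if arrival > t:
            break                                       # customer k arrives too late
        r = sum(1 for w in leave_times if w > arrival)  # queue length seen
        p = p_leave(r) if r > 0 else 0.0
        if p > 0.0 and rng.random() < p:                # V_j = 1: customer leaves
            leave_times.append(arrival)                 # W_j = T_j
        else:
            head = max([arrival] + leave_times)         # Z_j
            leave_times.append(head + y)                # W_j = Z_j + Y_j
    return sum(1 for w in leave_times if w > t)


def estimate_mean_queue_length(lam, mu, t, p_leave, v, seed=0):
    """Average N^(i) over v simulated realizations."""
    rng = random.Random(seed)

    def exponential_stream(rate):
        while True:
            yield rng.expovariate(rate)

    counts = [
        customers_in_queue(exponential_stream(lam), exponential_stream(mu), p_leave, t, rng)
        for _ in range(v)
    ]
    return sum(counts) / v


estimate = estimate_mean_queue_length(2.0, 1.0, 3.0, lambda r: 1.0 - 1.0 / r, v=2000)
print(f"estimated expected number in queue at t = 3: {estimate:.3f}")
```

Passing the interarrival and service times as iterables makes it possible to feed in fixed lists, which is convenient for checking the code against a hand-worked realization.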

As a numerical example, suppose that $\lambda = 2$, $\mu = 1$, $t = 3$, and
$p_r = 1 - 1/r$, for $r \ge 1$. Suppose that the first $k = 6$ simulated interarrival
times are


0.215, 0.713, 1.44, 0.174, 0.342, 0.382. 


The sum of the first five of these times is 2.884, but the sum of all six is 3.266. So, at 
most five customers are in the queue at time t = 3. Suppose that the simulated service 
times for the first five customers are 


0.251, 2.215, 2.855, 0.666, 2.505. 


We cannot simulate the $V_j$'s in advance, because we do not yet know how many
customers will be in the queue when each customer $j$ arrives. Figure 12.3 shows a
time line of the simulation of the process that we are about to describe. Begin with
customer 1, who has $T_1 = Z_1 = 0.215$ and $W_1 = 0.215 + 0.251 = 0.466$. For
customer 2, $T_2 = T_1 + 0.713 = 0.928 > W_1$, so nobody is in the queue when
customer 2 arrives and $Z_2 = T_2 = 0.928$. Then $W_2 = Z_2 + 2.215 = 3.143$. For
customer 3, $T_3 = T_2 + 1.44 = 2.368 < W_2$, so $r = 1$. Because $p_1 = 0$,
customer 3 stays, and there is no need to simulate $V_3$. Then $Z_3 = W_2 = 3.143$,
and $W_3 = Z_3 + 2.855 = 5.998$. For customer 4, $T_4 = T_3 + 0.174 = 2.542$. Since
$W_1 < T_4 < W_2, W_3$, we have $r = 2$ customers in the queue. We then simulate
$V_4$ having the Bernoulli distribution with parameter $p_2 = 1/2$. Suppose that we
simulate $V_4 = 1$, so customer 4 leaves, and we ignore the fourth simulated service
time. This makes $W_4 = T_4 = 2.542$. For customer 5, $T_5 = T_4 + 0.342 = 2.884$,
and customers 2 and 3 are still in the queue. We need to simulate $V_5$ having the
Bernoulli distribution with parameter $p_2 = 1/2$. Suppose that $V_5 = 0$, so
customer 5 stays. Then $Z_5 = W_3 = 5.998$, and $W_5 = Z_5 + 2.505 = 8.503$. Finally,
$W_j > 3$ for $j = 2, 3, 5$. This means that there are $N^{(1)} = 3$ customers in the
queue at time $t = 3$, as illustrated in Fig. 12.3. Needless to say, a computer
should be programmed to do this calculation for a large simulation. ◄

Figure 12.3 One simulation of a service queue. The bottom line is the time line for
Example 12.2.14. Each customer is represented by one horizontal line segment. The
vertical line at $t = 3$ crosses the horizontal lines for those customers still in
the queue at time $t = 3$.
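The bookkeeping in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `queue_count` and `estimate_mean_queue` are hypothetical, the standard `random` module stands in for whatever pseudo-random source is available, and the leave/stay decision is passed in as a function so that the worked example above can be replayed exactly.

```python
import random
from itertools import accumulate


def queue_count(arrivals, services, decide_leave, t):
    """Number of customers still in the queue at time t for one realization.

    arrivals[j-1] -- arrival time T_j (increasing, all at most t)
    services[j-1] -- service time Y_j (ignored if customer j leaves)
    decide_leave  -- callable (j, r) -> bool playing the role of V_j = 1,
                     asked only when r >= 1 customers are already present
    """
    departures = []  # W_1, ..., W_{j-1} for the customers handled so far
    for j, (T_j, Y_j) in enumerate(zip(arrivals, services), start=1):
        r = sum(1 for W_i in departures if W_i > T_j)  # r = sum_i U_{i,j}
        if r >= 1 and decide_leave(j, r):
            departures.append(T_j)             # customer leaves: W_j = T_j
        else:
            Z_j = max([T_j] + departures)      # time at head of the queue
            departures.append(Z_j + Y_j)       # W_j = Z_j + Y_j
    return sum(1 for W_j in departures if W_j > t)


def estimate_mean_queue(lam, mu, t, p, v, seed=0):
    """Average of v simulated counts N^(1), ..., N^(v)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(v):
        arrivals, clock = [], rng.expovariate(lam)
        while clock <= t:                      # stop at the first T_k > t
            arrivals.append(clock)
            clock += rng.expovariate(lam)
        services = [rng.expovariate(mu) for _ in arrivals]
        total += queue_count(arrivals, services,
                             lambda j, r: rng.random() < p(r), t)
    return total / v


# Replay of the worked example: customer 4 leaves, customer 5 stays.
T = list(accumulate([0.215, 0.713, 1.44, 0.174, 0.342]))
Y = [0.251, 2.215, 2.855, 0.666, 2.505]
n_example = queue_count(T, Y, lambda j, r: j == 4, t=3)
```

With the numbers from the example, `n_example` counts customers 2, 3, and 5. A call such as `estimate_mean_queue(2.0, 1.0, 3.0, lambda r: 1 - 1 / r, v=10_000)` would then average many realizations.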
Summary


If we wish to compute the expected value $\theta$ of some random variable $Y$, but
cannot perform the necessary calculation in closed form, we can use simulation. In
general, we would simulate a large random sample $Y^{(1)}, \ldots, Y^{(v)}$ from the
same distribution as $Y$, and then compute the sample mean $Z$ as our estimator. We
can also estimate a quantile $\theta_p$ of a distribution in a similar fashion. If
$Y^{(1)}, \ldots, Y^{(v)}$ is a large sample from the distribution, we can compute
the sample $p$ quantile $Z$. It is always a good idea to compute some measure of how
good a simulation estimator is. One common measure is the simulation standard error
of $Z$, an estimate of the standard deviation of the simulation distribution of $Z$.
Alternatively, one could perform enough simulations to make sure that the probability
is high that $Z$ is close to the parameter being estimated.


Exercises


1. Eq. (12.2.4) is based on the assumption that $Z$ has approximately a normal
distribution. Occasionally, the normal approximation is not good enough. In such
cases, one can let

$$v = \frac{\sigma^2}{\epsilon^2 (1 - \gamma)}. \tag{12.2.6}$$

To be precise, let $Z$ be the average of $v$ independent random variables with mean
$\mu$ and variance $\sigma^2$. Prove that if $v$ is at least as large as the number
in Eq. (12.2.6), then $\Pr(|Z - \mu| \le \epsilon) \ge \gamma$. Hint: Use the
Chebyshev inequality (6.2.3).


2. In Example 12.2.11, how large would v need to be 
according to Eq. (12.2.6)? 


3. Suppose that we have available as many i.i.d. standard 
normal random variables as we desire. Let X stand for 
a random variable having the normal distribution with 
mean 2 and variance 49. Describe a method for estimating 
E(log(|X| + 1)) using simulation. 


4. Use a pseudo-random number generator to simulate a sample of 15 independent
observations in which 13 of the 15 are drawn from the uniform distribution on the
interval $[-1, 1]$ and the other two are drawn from the uniform distribution on the
interval $[-10, 10]$. For the 15 values that are obtained, calculate the values of
(a) the sample mean, (b) the trimmed means for $k = 1, 2, 3,$ and 4 (see Sec. 10.7),
and (c) the sample median. Which of these estimators is closest to 0?


5. Repeat Exercise 4 ten times, using a different pseudo- 
random sample each time. In other words, construct 10 
independent samples, each of which contains 15 observa- 
tions and each of which satisfies the conditions of Exer- 
cise 4. 


a. For each sample, which of the estimators listed in 
Exercise 4 is closest to 0? 


b. For each of the estimators listed in Exercise 4, determine the square of the
distance between the estimator and 0 in each of the 10 samples, and determine the
average of these 10 squared distances. For which of the estimators is this average
squared distance from 0 smallest?


6. Suppose that X and Y are independent, that X has the 
beta distribution with parameters 3.5 and 2.7, and that Y 
has the beta distribution with parameters 1.8 and 4.2. We 
are interested in the mean of X/(X + Y). You may assume 
that you have the ability to simulate as many random 
variables with whatever beta distributions you wish. 


a. Describe a simulation plan that will produce a good 
estimator of the mean of X/(X + Y) if enough simu- 
lations are performed. 


b. Suppose that you want to be 98 percent confident 
that your estimator is no more than 0.01 away from 
the actual value of E[X/(X + Y)]. Describe how you 
would determine an appropriate size for the simula- 
tion. 


7. Consider the numbers in Table 10.40 on page 676. Suppose that you have available
as many standard normal random variables and as many uniform random variables on the
interval [0, 1] as you desire. You want to perform a simulation to obtain the number
in the “Sample median” row and $\epsilon = 0.05$ column.


a. Describe how to perform such a simulation. Hint: Let $X$ and $U$ be independent
such that $X$ has the standard normal distribution and $U$ has the uniform
distribution on the interval [0, 1]. Let $0 < \epsilon < 1$, and find the
distribution of

$$Y = \begin{cases} X & \text{if } U > \epsilon, \\ 10X & \text{if } U < \epsilon. \end{cases}$$


b. Perform the simulation on a computer.


8. Consider the same situation described in Exercise 7. This time, consider the
number in the “Trimmed mean for $k = 2$” row and $\epsilon = 0.1$ column.


a. Describe how to perform a simulation to produce 
this number. 


b. Perform the simulation on a computer. 


9. In Example 12.2.12, we can actually compute the median $\theta$ of the
distribution of the $X_i$ in closed form. Calculate the true median, and see how far
the simulated value was from the true value. Hint: Find the marginal p.d.f. of $X$ by
using the law of total probability for random variables (3.6.12) together with
Eq. (5.7.10). The c.d.f. and quantile function are then easy to derive.


10. Let $X_1, \ldots, X_{21}$ be i.i.d. with the exponential distribution that has
parameter $\lambda$. Let $M$ stand for the sample median. We wish to compute the
M.S.E. of $M$ as an estimator of the median of the distribution of the $X_i$'s.


a. Determine the median of the distribution of $X_1$.


b. Let $\theta$ be the M.S.E. of the sample median when $\lambda = 1$. Prove that
the M.S.E. of the sample median equals $\theta/\lambda^2$ in general.

c. Describe a simulation method for estimating $\theta$.
11. In Example 12.2.4, there is a slightly simpler way to simulate a sample from the
posterior distribution of $\mu_x - \mu_y$. Suppose that we can simulate as many
independent $t$ pseudo-random variables as we wish with whatever degrees of freedom
we want. Explain how we could use these $t$ random variables to simulate a sample
from the posterior distribution of $\mu_x - \mu_y$.


12. Let $(Y_1, W_1), \ldots, (Y_n, W_n)$ be an i.i.d. sample of random vectors with
finite covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_{yy} & \sigma_{yw} \\ \sigma_{yw} & \sigma_{ww} \end{pmatrix}.$$

Let $\bar{Y}$ and $\bar{W}$ be the sample averages. Let $g(y, w)$ be a function with
continuous partial derivatives $g_1$ and $g_2$ with respect to $y$ and $w$,
respectively. Let $Z = g(\bar{Y}, \bar{W})$. The two-dimensional Taylor expansion of
$g$ around a point $(y_0, w_0)$ is

$$g(y, w) = g(y_0, w_0) + g_1(y_0, w_0)(y - y_0) + g_2(y_0, w_0)(w - w_0), \tag{12.2.7}$$

plus an error term that we shall ignore here. Let $(y, w) = (\bar{Y}, \bar{W})$ and
$(y_0, w_0) = (E(Y), E(W))$ in Eq. (12.2.7). To the level of approximation of
Eq. (12.2.7), prove that

$$\operatorname{Var}(Z) = g_1(E(Y), E(W))^2 \sigma_{yy}
 + 2 g_1(E(Y), E(W))\, g_2(E(Y), E(W))\, \sigma_{yw}
 + g_2(E(Y), E(W))^2 \sigma_{ww}.$$

Hint: Use the formula for the variance of a linear combination of random variables
derived in Sec. 4.6.


13. Use the two-dimensional delta method from Exercise 12 to derive the estimator of
the simulation variance of a sample variance as given in Eq. (12.2.3). Hint: Replace
$E(Y)$ and $E(W)$ by $\bar{Y}$ and $\bar{W}$, respectively, and replace $\Sigma$ by
the sample variances and sample covariance.


14. Let Y be a random variable with some distribution. 
Suppose that you have available as many pseudo-random 
variables as you want with the same distribution as Y. 
Describe a simulation method for estimating the skewness 
of the distribution of Y. (See Definition 4.4.1.) 


15. Suppose that the price of a stock at time $u$ in the future is a random variable
$S_u = S_0 e^{\alpha u + W_u}$, where $S_0$ is the current price, $\alpha$ is a
constant, and $W_u$ is a random variable with known distribution. Suppose that you
have available as many i.i.d. random variables as you wish with the distribution of
$W_u$. Suppose that the m.g.f. $\psi(t)$ of $W_u$ is known and finite on an interval
that contains $t = 1$.


a. What number should $\alpha$ equal in order that $E(S_u) = e^{ru} S_0$?

b. We wish to price an option to purchase one share of this stock at time $u$ for
the price $q$. Describe how you could use simulation to estimate the price of such
an option.
16. Consider a queue to which customers arrive according to a Poisson process with
rate $\lambda$ per hour. Suppose that the queue has two servers. Each customer who
arrives at the queue counts the length $r$ of the queue (including any customers
being served) and decides to leave with probability $p_r$, for $r = 2, 3, \ldots$.
A customer who leaves does not enter the queue. Each customer who enters the queue
waits in the order of arrival until at least one of the two servers is available,
and then begins being served by the available server. If both servers are available,
the customer chooses randomly between the two servers with probability 1/2 for each,
independent of all other random variables. For server $i$ ($i = 1, 2$), the time (in
hours) to serve a customer, after beginning service, is an exponential random
variable with parameter $\mu_i$. Assume that all service times are independent of
each other and of all arrival times. Describe how to simulate the number of
customers in the queue (including any being served) at a specific time $t$.


12.3 Simulating Specific Distributions


In order to perform statistical simulations, we must be able to obtain pseudo-
random values from a variety of distributions. In this section, we introduce some
methods for simulating from specific distributions.


Most computer packages with statistical capability are able to generate pseudo-
random numbers with the uniform distribution on the interval [0, 1]. We shall assume
throughout the remainder of this section that one has available an arbitrarily large
sample of what appear to be i.i.d. random variables (pseudo-random numbers) with
the uniform distribution on the interval [0, 1]. Usually, we need random variables
with other distributions, and the purpose of this section is to review some common
methods for transforming uniform random variables into random variables with
other distributions.


The Probability Integral Transformation


In Chapter 3, we introduced the probability integral transformation for transforming
a uniform random variable $X$ on the interval [0, 1] into a random variable $Y$ with a
continuous strictly increasing c.d.f. $G$. The method is to set $Y = G^{-1}(X)$. This
method works well if $G^{-1}$ is easily computed.


Example 12.3.1
Generating Exponential Pseudo-Random Variables. Suppose that we want $Y$ to have the
exponential distribution with parameter $\lambda$, where $\lambda$ is a known
constant. The c.d.f. of $Y$ is

$$G(y) = \begin{cases} 1 - e^{-\lambda y} & \text{if } y \ge 0, \\ 0 & \text{if } y < 0. \end{cases}$$

We can easily invert this function to obtain

$$G^{-1}(x) = -\log(1 - x)/\lambda, \quad \text{if } 0 < x < 1.$$

If $X$ has the uniform distribution on the interval [0, 1], then
$-\log(1 - X)/\lambda$ has the exponential distribution with parameter $\lambda$. ◄
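As a small illustration of the probability integral transformation just described, the following Python sketch applies $G^{-1}$ to uniform pseudo-random numbers. The function name `exponential_from_uniform`, the seed, and the rate 2 are assumptions made only for this illustration.

```python
import math
import random


def exponential_from_uniform(x, lam):
    """Apply G^{-1}(x) = -log(1 - x)/lam to a uniform value x in [0, 1)."""
    return -math.log(1.0 - x) / lam


rng = random.Random(0)          # seed chosen only for reproducibility
lam = 2.0                       # illustrative rate parameter
sample = [exponential_from_uniform(rng.random(), lam) for _ in range(100_000)]
sample_mean = sum(sample) / len(sample)   # should be near 1/lam = 0.5
```

Because `random.random()` returns values in [0, 1), the argument of the logarithm stays strictly positive.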


Special-Purpose Algorithms 


There are cases in which the desired c.d.f. $G$ is not easy to invert. For example,
if $G$ is the standard normal c.d.f., then $G^{-1}$ must be obtained by numerical
approximation. However, there is a clever method for transforming two independent
uniform random variables on the interval [0, 1] into two standard normal random
variables. The method was described by Box and Müller (1958).


Example 12.3.2
Generating Two Independent Standard Normal Variables. Let $X_1, X_2$ be independent
with the uniform distribution on the interval [0, 1]. The joint p.d.f. of
$(X_1, X_2)$ is

$$f(x_1, x_2) = 1, \quad \text{for } 0 < x_1, x_2 < 1.$$

Define

$$Y_1 = [-2 \log(X_1)]^{1/2} \sin(2\pi X_2),$$
$$Y_2 = [-2 \log(X_1)]^{1/2} \cos(2\pi X_2).$$

The inverse of this transformation is

$$X_1 = \exp[-(Y_1^2 + Y_2^2)/2],$$
$$X_2 = \frac{1}{2\pi} \arctan(Y_1/Y_2).$$

Using the methods of Sec. 3.9, we compute the Jacobian, which is the determinant
of the matrix of partial derivatives of the inverse function:

$$\begin{pmatrix}
-y_1 \exp[-(y_1^2 + y_2^2)/2] & -y_2 \exp[-(y_1^2 + y_2^2)/2] \\[4pt]
\dfrac{1}{2\pi y_2} \dfrac{1}{1 + (y_1/y_2)^2} & -\dfrac{y_1}{2\pi y_2^2} \dfrac{1}{1 + (y_1/y_2)^2}
\end{pmatrix}.$$

The determinant of this matrix is $J = \exp[-(y_1^2 + y_2^2)/2]/(2\pi)$. The joint
p.d.f. of $(Y_1, Y_2)$ is then

$$g(y_1, y_2) = f\bigl(\exp[-(y_1^2 + y_2^2)/2],\ \arctan(y_1/y_2)/(2\pi)\bigr)\,|J|
 = \exp[-(y_1^2 + y_2^2)/2]/(2\pi).$$

This is the joint p.d.f. of two independent standard normal variables. ◄
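The transformation in this example is short enough to try directly. The sketch below is one possible rendering in Python; the name `box_muller`, the seed, and the sample size are illustrative choices, and `1 - random()` is used for $X_1$ only to avoid taking the logarithm of zero.

```python
import math
import random


def box_muller(x1, x2):
    """Map uniforms x1 in (0, 1], x2 in [0, 1) to two standard normals."""
    radius = math.sqrt(-2.0 * math.log(x1))
    angle = 2.0 * math.pi * x2
    return radius * math.sin(angle), radius * math.cos(angle)


rng = random.Random(0)
values = []
for _ in range(50_000):
    # 1 - random() lies in (0, 1], which keeps log() away from zero.
    y1, y2 = box_muller(1.0 - rng.random(), rng.random())
    values.extend((y1, y2))

mean = sum(values) / len(values)
variance = sum((y - mean) ** 2 for y in values) / (len(values) - 1)
```

The sample mean and variance of the pooled values should be close to 0 and 1, respectively.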


Acceptance/Rejection 


Many other special-purpose methods exist for other distributions, also. We would like
to present here one more general-purpose method that has wide applicability. The
method is called acceptance/rejection. Let $f$ be a p.d.f. and assume that we would
like to sample a pseudo-random variable with this p.d.f. Assume that there exists
another p.d.f. $g$ with the following two properties:


• We know how to simulate a pseudo-random variable with p.d.f. $g$.
• There exists a constant $k$ such that $kg(x) \ge f(x)$ for all $x$.


To simulate a single $Y$ with p.d.f. $f$, perform the following steps:


1. Simulate a pseudo-random $X$ with p.d.f. $g$ and an independent uniform pseudo-
random variable $U$ on the interval [0, 1].


2. If
$$\frac{f(X)}{g(X)} \ge kU, \tag{12.3.1}$$
let $Y = X$, and stop the process.
3. If (12.3.1) fails, throw away $X$ and $U$, and return to the first step.

If we need more than one $Y$, we repeat the entire process as often as needed. We
now show that the p.d.f. of each $Y$ is $f$.


Theorem 12.3.1
The p.d.f. of $Y$ in the acceptance/rejection method is $f$.


Proof First, we note that the distribution of $Y$ is the conditional distribution of
$X$ given that (12.3.1) holds. That is, let $A$ be the event that (12.3.1) holds, and
let $h(x, u|A)$ be the conditional joint p.d.f. of $(X, U)$ given $A$. Then the
p.d.f. of $Y$ is $\int h(x, u|A)\,du$. This is because $Y$ is constructed to be $X$
conditional on (12.3.1) holding. The conditional p.d.f. of $(X, U)$ given $A$ is

$$h(x, u|A) = \begin{cases} \dfrac{1}{\Pr(A)}\, g(x) & \text{if } f(x)/g(x) \ge ku \text{ and } 0 < u < 1, \\ 0 & \text{otherwise.} \end{cases}$$

It is straightforward to calculate $\Pr(A)$, that is, the probability that
$U \le f(X)/[kg(X)]$:

$$\Pr(A) = \int_{-\infty}^{\infty} \int_0^{f(x)/[kg(x)]} g(x)\,du\,dx
 = \frac{1}{k} \int_{-\infty}^{\infty} f(x)\,dx = \frac{1}{k}.$$

So,

$$h(x, u|A) = \begin{cases} k\,g(x) & \text{if } f(x)/g(x) \ge ku \text{ and } 0 < u < 1, \\ 0 & \text{otherwise.} \end{cases}$$

The integral of this function over all $u$ values for fixed $x$ is the p.d.f. of $Y$
evaluated at $x$:

$$\int h(x, u|A)\,du = k \int_0^{f(x)/[kg(x)]} g(x)\,du = f(x). \qquad \blacksquare$$

Here is an example of the use of acceptance/rejection.


Example 12.3.3
Simulating a Beta Distribution. Suppose that we wish to simulate a random variable
$Y$ having the beta distribution with parameters 1/2 and 1/2. The p.d.f. of $Y$ is

$$f(y) = \frac{1}{\pi}\, y^{-1/2} (1 - y)^{-1/2}, \quad \text{for } 0 < y < 1.$$

Note that this p.d.f. is unbounded. However, it is easy to see that

$$f(y) \le \frac{1}{\pi} \left( \frac{1}{y^{1/2}} + \frac{1}{(1 - y)^{1/2}} \right), \tag{12.3.2}$$

for all $0 < y < 1$. The right side of Eq. (12.3.2) can be written as $kg(y)$ with
$k = 4/\pi$ and

$$g(y) = \frac{1}{2} \left[ \frac{1}{2y^{1/2}} + \frac{1}{2(1 - y)^{1/2}} \right].$$

This $g$ is a half-and-half mixture of two p.d.f.'s $g_1$ and $g_2$:

$$g_1(x) = \frac{1}{2x^{1/2}}, \quad \text{for } 0 < x < 1,$$
$$g_2(x) = \frac{1}{2(1 - x)^{1/2}}, \quad \text{for } 0 < x < 1. \tag{12.3.3}$$

We can easily simulate random variables from these distributions using the
probability integral transformation. To simulate a random variable $X$ with p.d.f.
$g$, simulate three independent random variables $U_1, U_2, U_3$ with uniform
distributions on the interval [0, 1]. If $U_1 \le 1/2$, simulate $X$ from $g_1$
using the probability integral transformation applied to $U_2$. If $U_1 > 1/2$,
simulate $X$ from $g_2$ using the probability integral transformation and $U_2$. If
$f(X)/g(X) \ge kU_3$, let $Y = X$. If not, repeat the process. ◄
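The steps of this example translate fairly directly into code. The Python sketch below is one way to do it, under a few stated assumptions: the inverse c.d.f.'s $G_1^{-1}(u) = u^2$ and $G_2^{-1}(u) = 1 - (1 - u)^2$ are worked out here from (12.3.3) rather than quoted from the text, the function names are hypothetical, and a proposal that lands exactly on 0 or 1 through floating-point rounding is simply treated as a rejection.

```python
import math
import random

K = 4.0 / math.pi  # the constant k = 4/pi from the text


def f(y):
    """Beta(1/2, 1/2) p.d.f. for 0 < y < 1."""
    return 1.0 / (math.pi * math.sqrt(y * (1.0 - y)))


def g(x):
    """Mixture p.d.f. g = (g1 + g2)/2 for 0 < x < 1."""
    return 0.5 * (0.5 / math.sqrt(x) + 0.5 / math.sqrt(1.0 - x))


def sample_g(rng):
    """Draw X from g: U1 picks the component, U2 is pushed through its inverse c.d.f."""
    u1, u2 = rng.random(), rng.random()
    if u1 <= 0.5:
        return u2 ** 2                   # G1(x) = x^(1/2)
    return 1.0 - (1.0 - u2) ** 2         # G2(x) = 1 - (1 - x)^(1/2)


def sample_beta_half_half(rng):
    """Return (Y, number of proposals used) by acceptance/rejection."""
    proposals = 0
    while True:
        proposals += 1
        x = sample_g(rng)
        if not 0.0 < x < 1.0:
            continue                     # floating-point endpoint: treat as rejected
        if f(x) / g(x) >= K * rng.random():   # condition (12.3.1) with U3
            return x, proposals


rng = random.Random(0)
draws = [sample_beta_half_half(rng) for _ in range(20_000)]
mean_y = sum(y for y, _ in draws) / len(draws)            # beta(1/2, 1/2) mean is 1/2
mean_proposals = sum(n for _, n in draws) / len(draws)    # should be near k
```

Tracking the number of proposals gives a rough empirical check that about $k = 4/\pi$ proposals are needed per accepted value.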


When using the acceptance/rejection method, one must usually reject simulated 
values and resimulate. The probability of accepting a value is Pr(A) in the proof 
of Theorem 12.3.1, namely, 1/k. The larger k is, the harder it will be to accept. 
In Exercise 5, you will prove that the expected number of iterations until the first 
acceptance is k. 

A common special case of acceptance/rejection is the simulation of a random
variable conditional on some event. For example, let $X$ be a random variable with
the p.d.f. $g$, and suppose that we want the conditional distribution of $X$ given
that $X > 2$. Then the conditional p.d.f. of $X$ given $X > 2$ is

$$f(x) = \begin{cases} kg(x) & \text{if } x > 2, \\ 0 & \text{if } x \le 2, \end{cases}$$

where $k = 1/\int_2^{\infty} g(x)\,dx$. Note that $f(x) \le kg(x)$ for all $x$, so
acceptance/rejection is applicable. In fact, since $f(X)/g(X)$ only takes the two
values $k$ and 0, we don’t need to simulate the uniform $U$ in the
acceptance/rejection algorithm. We don’t even need to compute the value $k$. We just
reject each $X \le 2$. Here is a version of the same algorithm to solve a question
that was left open in Sec. 11.8.


Example 12.3.4
Computing the Size of a Two-Stage Test. In Sec. 11.8, we studied the analysis of data
from a two-way layout with replication. In that section, we introduced a two-stage
testing procedure. First, we tested the hypotheses (11.8.11), and then, if we accepted
the null hypothesis, we proceeded to test the hypotheses (11.8.13). Unfortunately,
we were unable to compute the conditional size of the second test given that the first
test accepted the null hypothesis. That is, we could not calculate (11.8.15) in closed
form. However, we can use simulation to estimate the conditional size.
The two tests are based on $U_{AB}^2$, defined in Eq. (11.8.12), and $V_A^2$,
defined in Eq. (11.8.16). The first test rejects the null hypothesis in (11.8.11) if
$U_{AB}^2 \ge d$, where $d$ is a quantile of the appropriate $F$ distribution. The
second test rejects its null hypothesis if $V_A^2 \ge c$, where $c$ is yet to be
determined. The random variables $U_{AB}^2$ and $V_A^2$ are both ratios of various
mean squares. In particular, they share a common denominator
$MS_{\text{Resid}} = S_{\text{Resid}}^2/[IJ(K - 1)]$. In order to determine an
appropriate critical value $c$ for the second test, we need the conditional
distribution of $V_A^2$ given that $U_{AB}^2 < d$, and given that both null
hypotheses are true. We can sample from that conditional distribution as follows:
Let the interaction mean square be $MS_{AB} = S_{\text{Int}}^2/[(I - 1)(J - 1)]$,
and let the mean square for factor $A$ be $MS_A = S_A^2/(I - 1)$. Then
$U_{AB}^2 = MS_{AB}/MS_{\text{Resid}}$ and $V_A^2 = MS_A/MS_{\text{Resid}}$. All of
these mean squares are independent, and they all have different gamma distributions
when the null hypotheses are both true. Most statistical computer packages will
allow simulation of gamma random variables. So, we start by simulating many triples
$(MS_{AB}, MS_{\text{Resid}}, MS_A)$. Then, for each simulated triple, we compute
$U_{AB}^2$ and $V_A^2$. If $U_{AB}^2 \ge d$, we discard the corresponding $V_A^2$.
The undiscarded $V_A^2$ values are a random sample from the conditional distribution
that we need. The efficiency of this algorithm could be improved slightly by
simulating $MS_A$ and then computing $V_A^2$ only when $U_{AB}^2 < d$ is observed.
◄
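The discard-on-rejection idea in this example might be sketched as follows in Python. Several things here are assumptions rather than statements from the text: each sum of squares divided by $\sigma^2$ is taken to be chi-square with the degrees of freedom suggested by the divisors above, $\sigma^2 = 1$ is used because the ratios do not depend on the scale, the function names are hypothetical, and the critical value `d` must be supplied by the caller.

```python
import random


def conditional_va2_sample(I, J, K, d, v, seed=0):
    """Simulate V_A^2 values given U_AB^2 < d when both null hypotheses hold.

    The ratios are scale-free, so sigma^2 = 1 is used without loss of generality.
    A chi-square variable with m degrees of freedom is gamma(shape m/2, scale 2).
    """
    rng = random.Random(seed)
    df_ab, df_a, df_resid = (I - 1) * (J - 1), I - 1, I * J * (K - 1)
    kept = []
    for _ in range(v):
        ms_resid = rng.gammavariate(df_resid / 2, 2.0) / df_resid
        ms_ab = rng.gammavariate(df_ab / 2, 2.0) / df_ab
        if ms_ab / ms_resid >= d:
            continue                      # first test rejects: discard this triple
        ms_a = rng.gammavariate(df_a / 2, 2.0) / df_a   # simulated only when needed
        kept.append(ms_a / ms_resid)
    return kept


def sample_quantile(values, q):
    """Simple order-statistic estimate of the q quantile."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]
```

For instance, `sample_quantile(conditional_va2_sample(3, 4, 5, d=2.3, v=20_000), 0.95)` would give a simulation estimate of a critical value $c$ for a conditional level of 0.05; the layout sizes and `d = 2.3` are arbitrary placeholders.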
Generating Functions of Other Random Variables


It often happens that there is more than one way to simulate from a particular
distribution. For example, suppose that a distribution is defined as the distribution
of a particular function of other random variables (in the way that the $\chi^2$,
$t$, and $F$ distributions are). In such cases, there is a straightforward way to
simulate the desired distribution. First, simulate the random variables in terms of
which the distribution is defined, and then calculate the appropriate function.


Example 12.3.5
Alternate Method for Simulating a Beta Distribution. In Exercise 6 in Sec. 5.8, you
proved the following: If $U$ and $V$ are independent, with $U$ having the gamma
distribution with parameters $\alpha_1$ and $\beta$, and $V$ having the gamma
distribution with parameters $\alpha_2$ and $\beta$, then $U/(U + V)$ has the beta
distribution with parameters $\alpha_1$ and $\alpha_2$. So, if we have a method for
simulating gamma random variables, we can simulate beta random variables. The case
handled in Example 12.3.3 is $\alpha_1 = \alpha_2 = 1/2$. Let $\beta = 1/2$ so that
$U$ and $V$ would both have gamma distributions with parameters 1/2 and 1/2, also
known as the $\chi^2$ distribution with one degree of freedom. If we simulate two
independent standard normal random variables $X_1, X_2$ (for example, by the method
of Example 12.3.2), then $X_1^2$ and $X_2^2$ are independent and have the $\chi^2$
distribution with one degree of freedom. It follows that
$Y = X_1^2/(X_1^2 + X_2^2)$ has the beta distribution with parameters 1/2 and 1/2. ◄
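A brief Python sketch of this alternate method follows. It uses `random.gauss` as a stand-in source of standard normal values instead of the method of Example 12.3.2, and the function name and seed are illustrative assumptions.

```python
import random


def beta_half_half_from_normals(rng):
    """Return X1^2 / (X1^2 + X2^2) for independent standard normals X1, X2."""
    x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return x1 * x1 / (x1 * x1 + x2 * x2)


rng = random.Random(0)
ys = [beta_half_half_from_normals(rng) for _ in range(50_000)]
mean_y = sum(ys) / len(ys)                                   # theory: 1/2
var_y = sum((y - mean_y) ** 2 for y in ys) / (len(ys) - 1)   # theory: 1/8
```

Unlike acceptance/rejection, this approach never discards a simulated value, at the cost of two normal variables per beta variable.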


As another example, to simulate a $\chi^2$ random variable with 10 degrees of
freedom, one could simulate 10 i.i.d. standard normals, square them, and add up the
squares. Alternatively, one could simulate five random variables having the
exponential distribution with parameter 1/2 and add them up.
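Both routes to a $\chi^2$ variable with 10 degrees of freedom can be compared in a few lines of Python. The function names and seed below are illustrative assumptions; note that `random.expovariate` takes the rate parameter, which matches the parameter 1/2 in the text.

```python
import random


def chi2_10_from_normals(rng):
    """Sum of squares of 10 independent standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(10))


def chi2_10_from_exponentials(rng):
    """Sum of five exponentials with parameter (rate) 1/2, each with mean 2."""
    return sum(rng.expovariate(0.5) for _ in range(5))


rng = random.Random(0)
a = [chi2_10_from_normals(rng) for _ in range(20_000)]
b = [chi2_10_from_exponentials(rng) for _ in range(20_000)]
mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)   # both should be near 10
```

The second method needs half as many pseudo-random draws per simulated value.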


Example 12.3.6
Generating Pseudo-Random Bivariate Normal Vectors. Suppose that we wish to simulate
a bivariate normal vector with the p.d.f. given in Eq. (5.10.2). This p.d.f. was
constructed as the joint p.d.f. of

$$X_1 = \sigma_1 Z_1 + \mu_1,$$
$$X_2 = \sigma_2\bigl[\rho Z_1 + (1 - \rho^2)^{1/2} Z_2\bigr] + \mu_2, \tag{12.3.4}$$

where $Z_1$ and $Z_2$ are i.i.d. with the standard normal distribution. If we use
the method of Example 12.3.2 to generate independent $Z_1$ and $Z_2$ with the
standard normal distribution, we can use the formulas in (12.3.4) to transform these
into $X_1$ and $X_2$, which will then have the desired bivariate normal
distribution. ◄
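The formulas in (12.3.4) can be sketched in Python as follows. Here `random.gauss` supplies the standard normal values in place of the method of Example 12.3.2, and the function names are hypothetical.

```python
import math
import random


def bivariate_normal(mu1, mu2, sigma1, sigma2, rho, rng):
    """One draw (X1, X2) built from i.i.d. standard normals via (12.3.4)."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x1 = sigma1 * z1 + mu1
    x2 = sigma2 * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2) + mu2
    return x1, x2


def sample_correlation(pairs):
    """Sample correlation, used only as a rough check on the simulated vectors."""
    n = len(pairs)
    m1 = sum(x for x, _ in pairs) / n
    m2 = sum(y for _, y in pairs) / n
    s12 = sum((x - m1) * (y - m2) for x, y in pairs)
    s11 = sum((x - m1) ** 2 for x, _ in pairs)
    s22 = sum((y - m2) ** 2 for _, y in pairs)
    return s12 / math.sqrt(s11 * s22)
```

Calling `bivariate_normal(1.0, -2.0, 2.0, 0.5, 0.6, rng)` many times (illustrative parameter values) and passing the pairs to `sample_correlation` should return a value close to 0.6.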


Most statistical computer packages have the capability of simulating pseudo- 
random variables with each of the continuous distributions that have been named 
in this text. The techniques of this section are really needed only for simulating less 
common distributions or when a statistical package is not available. 


Some Examples Involving Simulation of Common Distributions 


Example 12.3.7
Bayesian Analysis of One-Way Layout. We can perform a Bayesian analysis of a one-
way layout using the same statistical model presented in Sec. 11.6 together with
an improper prior for the model parameters. (We could use a proper prior, but
the additional calculations would divert our attention from the simulation issues.)
Let $\tau = 1/\sigma^2$, as we did in Sec. 8.6. The usual improper prior for the
parameters $(\mu_1, \ldots, \mu_p, \tau)$ has “p.d.f.” $1/\tau$. The posterior joint
p.d.f. is then proportional to $1/\tau$ times the likelihood. The observed data are
$y_{ij}$ for $j = 1, \ldots, n_i$ and $i = 1, \ldots, p$. The likelihood function is

$$(2\pi)^{-n/2}\, \tau^{n/2} \exp\left( -\frac{\tau}{2} \sum_{i=1}^{p} \sum_{j=1}^{n_i} (y_{ij} - \mu_i)^2 \right),$$

where $n = n_1 + \cdots + n_p$. To simplify the likelihood function, we can rewrite
the sum of squares that appears in the exponent as

$$\sum_{i=1}^{p} \sum_{j=1}^{n_i} (y_{ij} - \mu_i)^2 = \sum_{i=1}^{p} n_i (\bar{y}_{i+} - \mu_i)^2 + S_{\text{Resid}}^2,$$

where $\bar{y}_{i+}$ is the average of $y_{i1}, \ldots, y_{in_i}$ and

$$S_{\text{Resid}}^2 = \sum_{i=1}^{p} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i+})^2$$

is the residual sum of squares. Then, the posterior p.d.f. is proportional to

$$\tau^{p/2} \exp\left( -\frac{\tau}{2} \sum_{i=1}^{p} n_i (\bar{y}_{i+} - \mu_i)^2 \right) \tau^{(n-p)/2 - 1} \exp\left( -\frac{\tau}{2} S_{\text{Resid}}^2 \right).$$

This expression is easily recognized as the product of the gamma p.d.f. for $\tau$
with parameters $(n - p)/2$ and $S_{\text{Resid}}^2/2$ and the product of $p$ normal
p.d.f.'s for $\mu_1, \ldots, \mu_p$ with means $\bar{y}_{i+}$ and precisions
$n_i \tau$ for $i = 1, \ldots, p$. Hence, the posterior joint distribution of the
parameters is the following: Conditional on $\tau$, the $\mu_i$'s are independent
with $\mu_i$ having the normal distribution with mean $\bar{y}_{i+}$ and precision
$n_i \tau$. The marginal distribution of $\tau$ is the gamma distribution with
parameters $(n - p)/2$ and $S_{\text{Resid}}^2/2$.
If we simulate a large sample of parameters from the posterior distribution, we
could begin to answer questions about what we have learned from the data. To do
this, we would first simulate a large number of $\tau$ values
$\tau^{(1)}, \ldots, \tau^{(v)}$. Most statistical programs allow the user to
simulate gamma random variables with arbitrary first parameter and second parameter
1. So, we could simulate $T^{(1)}, \ldots, T^{(v)}$ having the gamma distribution
with parameters $(n - p)/2$ and 1. We could then let
$\tau^{(\ell)} = 2T^{(\ell)}/S_{\text{Resid}}^2$ for $\ell = 1, \ldots, v$. Then, for
each $\ell$ simulate independent $\mu_1^{(\ell)}, \ldots, \mu_p^{(\ell)}$ with
$\mu_i^{(\ell)}$ having the normal distribution with mean $\bar{y}_{i+}$ and
variance $1/[n_i \tau^{(\ell)}]$.

As a specific example, consider the hot dog data in Example 11.6.2. We begin
by simulating $v = 60{,}000$ sets of parameters as described above. Now we can
address the question of how much difference there is between the means. There are
several ways to do this. We could compute the probability that all
$|\mu_i - \mu_j| > c$ for each positive $c$. We could compute the probability that
at least one $|\mu_i - \mu_j| > c$ for each positive $c$. We could compute the
quantiles of $\max_{i,j} |\mu_i - \mu_j|$, of $\min_{i,j} |\mu_i - \mu_j|$, or of
the average of all $|\mu_i - \mu_j|$. For example, in 99 percent of the 60,000
simulations, at least one $|\mu_i^{(\ell)} - \mu_j^{(\ell)}| > 27.94$. The
simulation standard error of this estimator of the 0.99 quantile of
$\max_{i,j} |\mu_i - \mu_j|$ is 0.1117. (For the remainder of this example, we
shall present only the simulation estimates and not their simulation standard
errors.) In about 1/2 of the simulations, all
$|\mu_i^{(\ell)} - \mu_j^{(\ell)}| > 2.379$. And in 99 percent of the simulations,
the average of the differences was at least 14.59. Whether 27.94, 14.59, or 2.379
count as large differences depends on what decisions we need to make about the hot
dogs. A useful way to summarize all of these calculations is through a plot of the
sample c.d.f.'s of the largest, smallest, and average of the six $|\mu_i - \mu_j|$
differences. (The sample c.d.f. of a set of numbers is defined at the very beginning
of Sec. 10.6.)
Figure 12.4 Sample c.d.f.'s of the maximum, average, and minimum of the six
$|\mu_i - \mu_j|$ differences for Example 12.3.7.

Table 12.4 Posterior probabilities that each $\mu_i$ is largest and
smallest in Example 12.3.7


Type                        Beef     Meat     Poultry   Specialty
i                           1        2        3         4
Pr(μ_i largest | y)         0.1966   0.3211   0         0.4823
Pr(μ_i smallest | y)        0        0        1         0


Figure 12.4 contains such a plot for this example. If we are simply concerned with 
whether or not there are any differences at all between the four types of hot dogs, 
then the “Maximum” curve in Fig. 12.4 is the one to examine. (Can you explain why 
this is the case?) 

We can also attempt to answer questions that we would have great difficulty
addressing in the ANOVA framework of Chapter 11. For example, we could ask
what is the probability that each $\mu_i$ is the largest or smallest of the four.
For each $i$, let $N_i$ be the number of simulations $j$ such that $\mu_i^{(j)}$ is
the smallest of $\mu_1^{(j)}, \ldots, \mu_4^{(j)}$. Also let $M_i$ be the number of
simulations $j$ such that $\mu_i^{(j)}$ is the largest of the four means. Then
$N_i/60{,}000$ is our simulation estimate of the probability that $\mu_i$ is the
smallest mean, and $M_i/60{,}000$ is our estimate of the probability that $\mu_i$ is
the largest mean. The results are summarized in Table 12.4. We see that $\mu_3$ is
almost certainly the smallest, while $\mu_4$ has almost a 50 percent chance of being
the largest. ◄
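The posterior simulation described in this example can be sketched in Python as follows. The hot dog data themselves are not reproduced in this section, so the summary statistics in `ybar_demo`, `n_demo`, and `s2_resid_demo` are purely hypothetical placeholders, as are the function names; the sketch only shows the mechanics of drawing $\tau^{(\ell)}$, then the $\mu_i^{(\ell)}$, and counting how often each mean is largest.

```python
import random


def simulate_posterior(ybar, n, s2_resid, v, seed=0):
    """Draw v parameter vectors (mu_1, ..., mu_p) from the posterior."""
    rng = random.Random(seed)
    shape = (sum(n) - len(n)) / 2           # (n - p)/2
    draws = []
    for _ in range(v):
        T = rng.gammavariate(shape, 1.0)    # gamma with second parameter 1
        tau = 2.0 * T / s2_resid            # tau^(l) = 2 T^(l) / S^2_Resid
        draws.append([rng.gauss(m, (1.0 / (k * tau)) ** 0.5)
                      for m, k in zip(ybar, n)])
    return draws


def prob_largest(draws):
    """Fraction of simulations in which each mu_i is the largest."""
    counts = [0] * len(draws[0])
    for mu in draws:
        counts[mu.index(max(mu))] += 1
    return [c / len(draws) for c in counts]


# Hypothetical summary statistics, NOT the hot dog data of Example 11.6.2.
ybar_demo = [10.0, 11.0, 7.0, 11.5]
n_demo = [10, 10, 10, 10]
s2_resid_demo = 144.0
probs_demo = prob_largest(simulate_posterior(ybar_demo, n_demo, s2_resid_demo, 5_000))
```

Replacing `max` with `min` in `prob_largest` would give the corresponding estimates for the smallest mean.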


Example 12.3.8
Comparing Copper Ores. We shall illustrate the method of Example 12.2.4 using the
data on copper ores from Example 9.6.5. Suppose that the prior distributions for
all parameters are improper. The observed data consist of one sample of size 8
and another sample of size 10 with $\bar{X} = 2.6$,
$\sum_{i=1}^{8} (X_i - \bar{X})^2 = 0.32$, $\bar{Y} = 2.3$, and
$\sum_{j=1}^{10} (Y_j - \bar{Y})^2 = 0.22$. The posterior distributions then have
hyperparameters $\mu_{x1} = 2.6$, $\lambda_{x1} = 8$, $\alpha_{x1} = 3.5$,
$\beta_{x1} = 0.16$, $\mu_{y1} = 2.3$, $\lambda_{y1} = 10$, $\alpha_{y1} = 4.5$, and
$\beta_{y1} = 0.11$. The posterior distributions of $\tau_x$ and $\tau_y$ are,
respectively, the gamma distribution with parameters 3.5 and 0.16 and the gamma
distribution with parameters 4.5 and 0.11. We can easily simulate, say, 10,000
pseudo-random values from each of these two distributions. For each simulated
$\tau_x$, we simulate a $\mu_x$ that has the normal distribution with mean 2.6 and
variance $1/(8\tau_x)$. For each simulated $\tau_y$, we simulate a $\mu_y$ that has
the normal distribution with mean 2.3 and variance $1/(10\tau_y)$. Figure 12.5
contains a histogram of the 10,000 simulated $\mu_x - \mu_y$ values together with
the sample c.d.f. of $|\mu_x - \mu_y|$. It appears that $\mu_x - \mu_y$ is almost
always positive; indeed, it was positive for over 99 percent of the sampled values.
The probability is quite high that $|\mu_x - \mu_y| < 0.5$, so that if 0.5 is not a
large difference in this problem, we can be confident that $\mu_x$ and $\mu_y$ are
pretty close. On the other hand, if 0.1 is a large difference, we can be confident
that $\mu_x$ and $\mu_y$ are pretty far apart. ◄

Figure 12.5 Histogram of simulated $\mu_x - \mu_y$ values together with posterior
c.d.f. of $|\mu_x - \mu_y|$ for Example 12.3.8.
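The simulation in this example can be reproduced approximately with the Python sketch below, using the hyperparameters stated in the text. One detail is an assumption about conventions: the second gamma parameter in the text is treated as a rate, whereas `random.gammavariate` expects a scale, so its reciprocal is passed. The function name and seed are illustrative, and the simulated fractions will vary with the seed.

```python
import random


def simulate_difference(v, seed=0):
    """Simulate v values of mu_x - mu_y from the posterior described in the text."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(v):
        # gammavariate takes (shape, scale); the text's second parameter is a rate.
        tau_x = rng.gammavariate(3.5, 1.0 / 0.16)
        tau_y = rng.gammavariate(4.5, 1.0 / 0.11)
        mu_x = rng.gauss(2.6, (1.0 / (8 * tau_x)) ** 0.5)
        mu_y = rng.gauss(2.3, (1.0 / (10 * tau_y)) ** 0.5)
        diffs.append(mu_x - mu_y)
    return diffs


diffs = simulate_difference(10_000)
frac_positive = sum(d > 0 for d in diffs) / len(diffs)
frac_within_half = sum(abs(d) < 0.5 for d in diffs) / len(diffs)
frac_beyond_tenth = sum(abs(d) > 0.1 for d in diffs) / len(diffs)
```

The three fractions correspond to the three statements made about Fig. 12.5: how often the difference is positive, how often it is smaller than 0.5 in absolute value, and how often it exceeds 0.1 in absolute value.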


If all we care about in Example 12.3.8 is the distribution of $\mu_x - \mu_y$, then
we could simulate $\mu_x$ and $\mu_y$ directly without first simulating $\tau_x$ and
$\tau_y$. Since $\mu_x$ and $\mu_y$ are independent in this example, we could
simulate each of them from their respective marginal distributions.


Example 12.3.9
Power of the $t$ Test. In Theorem 9.5.3, we showed how the power function of the $t$
test can be computed from the noncentral $t$ distribution function. Not all
statistical packages compute noncentral $t$ probabilities. We can use simulation to
estimate these probabilities. Let $Y$ have the noncentral $t$ distribution with $m$
degrees of freedom and noncentrality parameter $\psi$. Then $Y$ has the distribution
of $X_1/(X_2/m)^{1/2}$ where $X_1$ and $X_2$ are independent with $X_1$ having the
normal distribution with mean $\psi$ and variance 1 and $X_2$ having the $\chi^2$
distribution with $m$ degrees of freedom. A simple way to estimate the c.d.f. of $Y$
is to simulate a large number of $(X_1, X_2)$ pairs and compute the sample c.d.f. of
the values of $X_1/(X_2/m)^{1/2}$. ◄
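A minimal Python sketch of this estimate follows. The function name `noncentral_t_cdf` and its default simulation size are assumptions for illustration; the $\chi^2$ variable is generated as a gamma variable with shape $m/2$ and scale 2.

```python
import random


def noncentral_t_cdf(y, m, psi, v=100_000, seed=0):
    """Estimate Pr(Y <= y) by the sample c.d.f. of v simulated values."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(v):
        x1 = rng.gauss(psi, 1.0)                 # normal, mean psi, variance 1
        x2 = rng.gammavariate(m / 2, 2.0)        # chi-square with m degrees of freedom
        if x1 / (x2 / m) ** 0.5 <= y:
            hits += 1
    return hits / v
```

With `psi = 0` the distribution is the ordinary $t$ distribution, so `noncentral_t_cdf(0.0, 10, 0.0)` should be close to 1/2, which gives a quick sanity check.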


The Simulation Standard Error of a Sample c.d.f. In Examples 12.3.7 and 12.3.8, 
we plotted the sample c.d.f.'s of functions of simulated data. We did not associate 
simulation standard errors with these functions. We could compute simulation standard 
errors for every value of the sample c.d.f., but there is a simpler way to summarize 
the uncertainty about a sample c.d.f. We can make use of the Glivenko-Cantelli 
lemma (Theorem 10.6.1). To summarize that result in the context of simulation, let 
$Y_i$ ($i = 1, \ldots, v$) be a simulated i.i.d. sample with c.d.f. G. Let $G_v$ be the sample 
c.d.f. For each real x, $G_v(x)$ is the proportion of the simulated sample that is less 
than or equal to x. That is, $G_v(x)$ is $1/v$ times the number of i's such that $Y_i \le x$. 
Theorem 10.6.1 says that if v is large, then 

$$\Pr\left(|G_v(x) - G(x)| \le \frac{t}{v^{1/2}}, \text{ for all } x\right) \approx H(t),$$

where H is the function in Table 10.32 on page 661. In particular, with t = 2, H(t) = 
0.9993. So we can declare (at least approximately) that $|G_v(x) - G(x)| \le 2/v^{1/2}$ 
simultaneously for all x with probability 0.9993. In Example 12.3.7, we had v = 60,000, 
so each curve in Fig. 12.4 should be accurate to within 0.008 with probability 0.9993. 
Indeed, all three curves simultaneously should be accurate to within 0.008 with 
probability 0.9979. (Prove this in Exercise 14.) 
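
The two numbers quoted above are simple arithmetic, which the following lines reproduce. The function name is ours, and the value 0.9993 is taken from the text rather than computed; the joint probability uses the Bonferroni bound suggested in the hint to Exercise 14.

```python
import math

def cdf_band_half_width(v, t=2.0):
    """Half-width t / sqrt(v) of the simultaneous band around a sample c.d.f."""
    return t / math.sqrt(v)

H_AT_2 = 0.9993  # H(2), as quoted in the text

half_width = cdf_band_half_width(60000)        # roughly 0.008
# Bonferroni: the chance that at least one of three bands fails is at most
# three times the chance that a single band fails.
joint_lower_bound = 1.0 - 3.0 * (1.0 - H_AT_2)  # roughly 0.9979
```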


Simulating a Discrete Random Variable 


All of the examples so far in this section have concerned simulations of random 
variables with continuous distributions. Occasionally, one needs random variables 
with discrete distributions. Algorithms for simulating discrete random variables exist, 
and we shall describe some here. 


Simulating a Bernoulli Random Variable. It is simple to simulate a pseudo-random 
Bernoulli random variable X with parameter p. Start with U having the uniform 
distribution on the interval [0, 1], and let X = 1 if $U \le p$. Otherwise, let X = 0. Since 
$\Pr(U \le p) = p$, X has the correct distribution. This method can be used to simulate 
from any distribution that is supported on only two values. If 

$$f(x) = \begin{cases} p & \text{if } x = t_1, \\ 1 - p & \text{if } x = t_2, \\ 0 & \text{otherwise,} \end{cases}$$

then let $X = t_1$ if $U \le p$, and let $X = t_2$ otherwise. 
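
A minimal sketch of this two-point method follows. The function names are our own, and `random.Random.random()` stands in for the pseudo-random uniform U.

```python
import random

def simulate_two_point(p, t1, t2, rng):
    """Return t1 with probability p and t2 with probability 1 - p."""
    u = rng.random()          # pseudo-random uniform on [0, 1)
    return t1 if u <= p else t2

def simulate_bernoulli(p, rng):
    """Bernoulli draw: the two-point method with t1 = 1 and t2 = 0."""
    return simulate_two_point(p, 1, 0, rng)

rng = random.Random(0)        # arbitrary seed
sample = [simulate_bernoulli(0.3, rng) for _ in range(20000)]
proportion_of_ones = sum(sample) / len(sample)   # should be near 0.3
```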


Example 12.3.11. Simulating a Discrete Uniform Random Variable. Suppose that we wish 
to simulate pseudo-random variables from a distribution that has the p.f. 

$$f(x) = \begin{cases} \dfrac{1}{n} & \text{if } x \in \{t_1, \ldots, t_n\}, \\ 0 & \text{otherwise.} \end{cases} \tag{12.3.5}$$

The uniform distribution on the integers 1, ..., n is an example of such a distribution. 
A simple way to simulate a random variable with the p.f. (12.3.5) is the following: Let 
U have the uniform distribution on the interval [0, 1], and let Z be the greatest integer 
less than or equal to $nU + 1$. It is easy to see that Z takes the values 1, ..., n with 
equal probability, and so $X = t_Z$ has the p.f. (12.3.5). 
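
The construction with Z can be written directly. The sketch below uses names of our own choosing; the guard with `min` is our addition and only protects against floating-point rounding at the upper end.

```python
import math
import random

def simulate_discrete_uniform(values, rng):
    """Pick one of values[0..n-1] with equal probability via Z = floor(nU + 1)."""
    n = len(values)
    u = rng.random()                       # uniform on [0, 1)
    z = min(math.floor(n * u + 1), n)      # integer in 1..n
    return values[z - 1]                   # X = t_Z (lists are 0-indexed)

rng = random.Random(5)                     # arbitrary seed
data = [2.1, 3.4, 0.7, 5.0]                # hypothetical observed sample
# Resampling with replacement, as done in bootstrap analyses:
resample = [simulate_discrete_uniform(data, rng) for _ in data]
```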


The method described in Example 12.3.11 does not apply to more general dis- 
crete distributions. However, the method of Example 12.3.11 is useful in simulations 
that are done in bootstrap analyses described in Sec. 12.6. 

For general discrete distributions, there is an analog to the probability integral 
transformation. Suppose that a discrete distribution is concentrated on the values 
$t_1 < \cdots < t_n$ and that the c.d.f. is 

$$F(x) = \begin{cases} 0 & \text{if } x < t_1, \\ q_i & \text{if } t_i \le x < t_{i+1}, \text{ for } i = 1, \ldots, n-1, \\ 1 & \text{if } x \ge t_n. \end{cases} \tag{12.3.6}$$

The following is the quantile function from Definition 3.3.2: 

$$F^{-1}(p) = \begin{cases} t_1 & \text{if } 0 < p \le q_1, \\ t_{i+1} & \text{if } q_i < p \le q_{i+1}, \text{ for } i = 1, \ldots, n-2, \\ t_n & \text{if } q_{n-1} < p < 1. \end{cases} \tag{12.3.7}$$

You can prove (see Exercise 13) that if U has the uniform distribution on the interval 
[0, 1], then $F^{-1}(U)$ has the c.d.f. in Eq. (12.3.6). This gives a straightforward, but 
inefficient, method for simulating arbitrary discrete distributions. Notice that the 
restriction that n be finite is not actually necessary. Even if the distribution has 
infinitely many possible values, $F^{-1}$ can be defined by (12.3.7) by replacing $n - 2$ 
by $\infty$ and removing the last branch. 
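
As an illustration of Eq. (12.3.7), the quantile function can be evaluated by scanning the cumulative probabilities in order. The names below, and the three-point distribution used as an example, are hypothetical.

```python
import random

def discrete_quantile(ts, qs, p):
    """Evaluate the quantile function of Eq. (12.3.7).

    ts holds t_1 < ... < t_n and qs holds q_1, ..., q_{n-1}
    (the c.d.f. values at t_1, ..., t_{n-1}; the value at t_n is 1).
    """
    for t, q in zip(ts, qs):
        if p <= q:            # first i with p <= q_i gives t_i
            return t
    return ts[-1]             # q_{n-1} < p < 1 gives t_n

def simulate_discrete(ts, qs, rng):
    """Probability integral transformation for a discrete distribution."""
    return discrete_quantile(ts, qs, rng.random())

# Hypothetical p.f.: Pr(0) = 0.2, Pr(1) = 0.5, Pr(2) = 0.3
ts, qs = [0, 1, 2], [0.2, 0.7]
rng = random.Random(8)        # arbitrary seed
sample = [simulate_discrete(ts, qs, rng) for _ in range(20000)]
```

The scan takes a number of comparisons that grows with n, which is why the text calls the method inefficient when there are many possible values.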


Example 12.3.12. Simulating a Geometric Random Variable. Suppose that we wish to 
simulate a pseudo-random X having the geometric distribution with parameter p. In the 
notation of Eq. (12.3.7), $t_i = i - 1$ for $i = 1, 2, \ldots$, and $q_i = 1 - (1 - p)^i$. Using 
the probability integral transformation, we would first simulate U with the uniform 
distribution on the interval [0, 1]. Then we would compare U to $q_i$ for $i = 1, 2, \ldots$, 
until the first time that $q_i \ge U$ and set $X = t_i$. In this example, we can avoid the 
sequence of comparisons because we have a simple formula for $q_i$. The first i such that 
$q_i \ge U$ is the smallest integer greater than or equal to $\log(1 - U)/\log(1 - p)$, so 
$X = t_i = i - 1$ is the greatest integer strictly less than $\log(1 - U)/\log(1 - p)$. 
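
To make the shortcut concrete, the sketch below implements both the sequence of comparisons and the closed-form version, so that the two can be checked against each other. The function names are ours. The `max(..., 0)` guard covers the probability-zero case U = 0.

```python
import math
import random

def geometric_by_search(u, p):
    """Compare u to q_i = 1 - (1 - p)^i until q_i >= u; return t_i = i - 1."""
    i = 1
    while 1.0 - (1.0 - p) ** i < u:
        i += 1
    return i - 1

def geometric_from_uniform(u, p):
    """Greatest integer strictly less than log(1 - u) / log(1 - p)."""
    ratio = math.log(1.0 - u) / math.log(1.0 - p)
    return max(math.ceil(ratio) - 1, 0)

def simulate_geometric(p, rng):
    return geometric_from_uniform(rng.random(), p)

rng = random.Random(9)        # arbitrary seed
sample = [simulate_geometric(0.25, rng) for _ in range(40000)]
sample_mean = sum(sample) / len(sample)   # should be near (1 - p)/p = 3
```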


The probability integral transformation is very inefficient for discrete distributions 
that do not have a simple formula for $q_i$ if the number of possible values is large. 
Walker (1974) and Kronmal and Peterson (1979) describe a more efficient method 
called the alias method. The alias method works as follows: Let f be the p.f. from 
which we wish to simulate a random variable X. Suppose that f(x) > 0 for only n 
different values of x. First, we write f as an average of n p.f.'s that are concentrated 
on one or two values each. That is, 

$$f(x) = \frac{1}{n}\left[g_1(x) + \cdots + g_n(x)\right], \tag{12.3.8}$$

where each $g_i$ is the p.f. of a distribution concentrated on one or two values only. We 
shall show how to do this in Example 12.3.13. To simulate X, first simulate an integer 
I that has the uniform distribution over the integers 1, ..., n. (Use the method of 
Example 12.3.11.) Then simulate X from the distribution with the p.f. $g_I$. The reader 
can prove in Exercise 17 that X has the p.f. f. 


Example 12.3.13. Simulating a Binomial Random Variable Using the Alias Method. Suppose 
that we need to simulate many random variables with a binomial distribution having 
parameters 9 and 0.4. The p.f. f of this distribution is given in a table at the end of 
this book. The distribution has n = 10 different values with positive probability. Since 
the n probabilities must add to 1, there must be $x_1$ and $y_1$ such that $f(x_1) \le 1/n$ and 
$f(y_1) \ge 1/n$. For example, $x_1 = 0$ and $y_1 = 2$ have $f(x_1) = 0.0101$ and $f(y_1) = 0.1612$. 
Define the first two-point p.f., $g_1$, as 

$$g_1(x) = \begin{cases} n f(x_1) & \text{if } x = x_1, \\ 1 - n f(x_1) & \text{if } x = y_1, \\ 0 & \text{otherwise.} \end{cases}$$

In our case, $g_1(0) = 0.101$ and $g_1(2) = 0.899$. We then write f as $f(x) = g_1(x)/n + f_1^*(x)$, 
where 

$$f_1^*(x) = \begin{cases} 0 & \text{if } x = x_1, \\ f(y_1) - g_1(y_1)/n & \text{if } x = y_1, \\ f(x) & \text{otherwise.} \end{cases}$$

In our example, $f_1^*(2) = 0.0713$. Now, $f_1^*$ is positive at only $n - 1$ different values, 
and the sum of the positive values of $f_1^*$ is $(n - 1)/n$. Hence, there must exist $x_2$ 
and $y_2$ such that $f_1^*(x_2) \le 1/n$ and $f_1^*(y_2) \ge 1/n$. For example, $x_2 = 2$ and $y_2 = 3$ have 
$f_1^*(x_2) = 0.0713$ and $f_1^*(y_2) = 0.2508$. Define $g_2$ by 

$$g_2(x) = \begin{cases} n f_1^*(x_2) & \text{if } x = x_2, \\ 1 - n f_1^*(x_2) & \text{if } x = y_2, \\ 0 & \text{otherwise.} \end{cases}$$

Here, $g_2(2) = 0.713$. Now write $f_1^*(x) = g_2(x)/n + f_2^*(x)$, where 

$$f_2^*(x) = \begin{cases} 0 & \text{if } x = x_2, \\ f_1^*(y_2) - g_2(y_2)/n & \text{if } x = y_2, \\ f_1^*(x) & \text{otherwise.} \end{cases}$$

In our example, $f_2^*(3) = 0.2221$. Now, $f_2^*$ takes only $n - 2$ positive values that add up 
to $(n - 2)/n$. We can repeat this process $n - 3$ more times, obtaining $g_1, \ldots, g_{n-1}$ and 
$f_{n-1}^*$. Here, $f_{n-1}^*(x)$ takes only one positive value, at $x = x_n$, say, and $f_{n-1}^*(x_n) = 1/n$. 
Let $g_n$ be a degenerate distribution at $x_n$. Then $f(x) = [g_1(x) + \cdots + g_n(x)]/n$ for all x. 

After all of this initial setup, the alias method allows rapid simulation from f as 
follows: Simulate independent U and I with U having the uniform distribution on the 
interval [0, 1] and I having the uniform distribution on the integers 1, ..., n (n = 10 
in our example). If $U \le g_I(x_I)$, set $X = x_I$. If $U > g_I(x_I)$, set $X = y_I$. Here, the values 
we need to perform the simulation are 

i           1      2      3      4      5      6      7      8      9      10 
x_i         0      2      1      6      7      3      8      9      4      5 
y_i         2      3      3      3      3      4      4      4      5      — 
g_i(x_i)    0.101  0.713  0.605  0.743  0.212  0.781  0.035  0.003  0.327  1 

There is even a clever way to replace the two simulations of U and I with a single 
simulation. Simulate Y with the uniform distribution on the interval [0, 1], and let I 
be the greatest integer less than or equal to $nY + 1$. Then let $U = nY + 1 - I$. (See 
Exercise 19.) 

As an example, suppose that we simulate Y with the uniform distribution on the 
interval [0, 1], and we obtain Y = 0.4694. Then I = 5 and U = 0.694. Since 0.694 > 
$g_5(x_5) = 0.212$, we set $X = y_5 = 3$. Figure 12.6 shows a histogram of 10,000 simulated 
values using the alias method. 


All of the overhead required to set up the alias method is worth the effort only if 
we are going to simulate many random variables with the same discrete distribution. 
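
The setup and the sampling step of the alias method can be sketched as follows. This is one possible implementation, not the one used to produce the table in the text: at each step it takes the value with the smallest remaining probability as $x_i$ and the value with the largest remaining probability as $y_i$, which is a legitimate choice but generally yields a different table from the one printed above. All function and variable names are ours.

```python
import math
import random

def build_alias_table(pf):
    """Return a list of (x_i, y_i, g_i(x_i)) triples for a p.f. given as a dict."""
    n = len(pf)
    remaining = dict(pf)                 # plays the role of f, f_1^*, f_2^*, ...
    table = []
    while len(remaining) > 1:
        x = min(remaining, key=remaining.get)    # a value with f*(x) <= 1/n
        fx = remaining.pop(x)
        y = max(remaining, key=remaining.get)    # a value with f*(y) >= 1/n
        gx = min(max(n * fx, 0.0), 1.0)          # g_i(x_i), clamped for rounding
        remaining[y] -= (1.0 - gx) / n           # f*(y) loses g_i(y_i)/n
        table.append((x, y, gx))
    (last,) = remaining                          # degenerate g_n at the last value
    table.append((last, last, 1.0))
    return table

def alias_value(table, y_uniform):
    """Turn one uniform number in [0, 1) into a draw (single-uniform version)."""
    n = len(table)
    i = min(math.floor(n * y_uniform + 1), n)    # I, an integer in 1..n
    u = n * y_uniform + 1 - i                    # U, uniform on [0, 1)
    x_i, y_i, g = table[i - 1]
    return x_i if u <= g else y_i

# Binomial p.f. with parameters 9 and 0.4, as in Example 12.3.13.
pf = {k: math.comb(9, k) * 0.4**k * 0.6**(9 - k) for k in range(10)}
table = build_alias_table(pf)
rng = random.Random(2024)                        # arbitrary seed
draws = [alias_value(table, rng.random()) for _ in range(10000)]

# The table printed in the text, used to replay its worked example (Y = 0.4694).
TEXT_TABLE = [(0, 2, 0.101), (2, 3, 0.713), (1, 3, 0.605), (6, 3, 0.743),
              (7, 3, 0.212), (3, 4, 0.781), (8, 4, 0.035), (9, 4, 0.003),
              (4, 5, 0.327), (5, 5, 1.0)]
x_from_text = alias_value(TEXT_TABLE, 0.4694)    # the text obtains X = 3
```

Building the table costs on the order of $n^2$ operations in this naive form, but each subsequent draw needs only one uniform number and one comparison, which is the point of the method.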



Figure 12.6 Histogram of 10,000 simulated binomial random variables in Example 12.3.13. 
The X marks appear at heights equal to 10,000 f(x) to illustrate the close agreement of 
the simulated and actual distributions. 

Summary 


We have seen several examples of how to transform pseudo-random uniform vari- 
ables into pseudo-random variables with other distributions. The acceptance/rejec- 
tion method is widely applicable, but it might require many rejected simulations for 
each accepted one. Also, we have seen how we can simulate random variables that 
are functions of other random variables (such as a noncentral t random variable). 
Several examples illustrated how we can make use of simulated random variables 
with some of the common distributions. Readers who desire a thorough treatment 
of the generation of pseudo-random variables with distributions other than uniform 
can consult Devroye (1986). 


Exercises 


1. Return to Exercise 10 in Sec. 12.2. Now that we know 
how to simulate exponential random variables, perform 
the simulation developed in that exercise as follows: 


a. Perform $v_0 = 2000$ simulations and compute both the 
estimate of $\theta$ and its simulation standard error. 

b. Suppose that we want our estimator of $\theta$ to be within 
0.01 of $\theta$ with probability 0.99. How many simulations 
should we perform? 


2. Describe how to convert a random sample $U_1, \ldots, U_n$ 
from the uniform distribution on the interval [0, 1] to a 
random sample of size n from the uniform distribution on 
the interval [a, b]. 


3. Show how to use the probability integral transforma- 
tion to simulate random variables with the two p.d.f.’s in 
Eq. (12.3.3). 


4. Show how to simulate Cauchy random variables using 
the probability integral transformation. 


5. Prove that the expected number of iterations of the 
acceptance/rejection method until the first acceptance is 
k. (Hint: Think of each iteration as a Bernoulli trial. What 
is the expected number of trials (not failures) until the first 
success?) 


6. a. Show how to simulate a random variable having the 
Laplace distribution with parameters 0 and 1. The 
p.d.f. of the Laplace distribution with parameters $\theta$ 
and $\sigma$ is given in Eq. (10.7.5). 

b. Show how to simulate a standard normal random 
variable by first simulating a Laplace random variable 
and then using acceptance/rejection. Hint: Maximize 
$e^{-x^2/2}/e^{-x}$ for $x \ge 0$, and notice that both 
distributions are symmetric around 0. 


7. Suppose that you have available as many i.i.d. standard 
normal pseudo-random numbers as you desire. Describe 
how you could simulate a pseudo-random number with an 
F distribution with four and seven degrees of freedom. 


8. Let X and Y be independent random variables with X 
having the t distribution with five degrees of freedom and 
Y having the t distribution with seven degrees of freedom. 
We are interested in $E(|X - Y|)$. 

a. Simulate 1000 pairs of $(X_i, Y_i)$ each with the above 
joint distribution and estimate $E(|X - Y|)$. 

b. Use your 1000 simulated pairs to estimate the variance 
of $|X - Y|$ also. 


c. Based on your estimated variance, how many sim- 
ulations would you need to be 99 percent confident 
that your estimator of E(|X — Y|) is within 0.01 of the 
actual mean? 


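9. Show how to use acceptance/rejection to simulate random 
variables with the following p.d.f.: 

$$f(x) = \begin{cases} \frac{4}{3}x & \text{if } 0 < x \le 0.5, \\ \frac{2}{3} & \text{if } 0.5 < x \le 1.5, \\ \frac{8}{3} - \frac{4}{3}x & \text{if } 1.5 < x \le 2, \\ 0 & \text{otherwise.} \end{cases}$$
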


10. Implement the simulation in Example 12.2.3 for the 
clinical trial of Example 2.1.4 on page 57. Simulate 5000 
parameter vectors. Use a prior distribution with $\alpha_0 = 1$ 
and $\beta_0 = 1$. Estimate the probability that the imipramine 
group has the highest probability of no relapse. Calculate 
how many simulations you would need to be 95 percent 
confident that your estimator is within 0.01 of the true 
probability. 


11. In Example 12.3.7, we simulated the $\tau$ values by first 
simulating gamma random variables with parameters 
$(n - p)/2$ and 1. Suppose that our statistical software allows us 
to simulate $\chi^2$ random variables instead. Which $\chi^2$ 
distribution should we use and how would we convert the 
simulated $\chi^2$'s to have the appropriate gamma distribution? 


12. Use the blood pressure data in Table 9.2 that was 
described in Exercise 10 of Sec. 9.6. Suppose now that 
we are not confident that the variances are the same for 
the two treatment groups. Perform a simulation of the 
sort done in Example 12.3.8 to obtain a sample from the 
posterior distribution of the parameters when we allow 
the variances to be unequal. 


a. Draw a plot of the sample c.d.f. of the absolute value 
of the difference between the two group means. 


b. Draw a histogram of the logarithm of the ratio of the 
two variances to see how close together they seem to 
be. 


13. Let $F^{-1}$ be defined as in Eq. (12.3.7). Let U have 
the uniform distribution on the interval [0, 1]. Prove that 
$F^{-1}(U)$ has the c.d.f. in Eq. (12.3.6). 

14. Refer to the three curves in Fig. 12.4. Call those 
three sample c.d.f.'s $G_{v,1}$, $G_{v,2}$, and $G_{v,3}$, and call the 
three c.d.f.'s that they estimate $G_1$, $G_2$, and $G_3$. Use the 
Glivenko-Cantelli lemma (Theorem 10.6.1) to show that 

$$\Pr\left(|G_{v,i}(x) - G_i(x)| \le 0.0082, \text{ for all } x \text{ and all } i\right)$$

is about 0.9979 or larger. Hint: Use the Bonferroni 
inequality (Theorem 1.5.8). 


15. Prove that the acceptance/rejection method works for 
discrete distributions. That is, let f and g be p.f’s rather 
than p.d.f’s, but let the rest of the acceptance/rejection 
method be exactly as stated. Hint: The proof can be trans- 
lated by replacing integrals over x by sums. Integrals over 
u should be left as integrals. 


16. Describe how to use the discrete version of the proba- 
bility integral transformation to simulate a Poisson 
pseudo-random variable with mean 6. 


17. Let f be a p.f., and assume that Eq. (12.3.8) holds, 
where each $g_i$ is another p.f. Assume that X is simulated 
using the method described immediately after Eq. 
(12.3.8). Prove that X has the p.f. f. 


18. Use the alias method to simulate a random variable 
having the Poisson distribution with mean 5. Use the 
table of Poisson probabilities in the back of the book, and 
assume that 16 is the largest value that a Poisson random 
variable can equal. Assume that all of the probability not 
accounted for by the values 0,..., 15 is the value of the 
p.f. at k = 16. 

19. Let Y have the uniform distribution on the interval 
[0, 1]. Define I to be the greatest integer less than or equal 
to $nY + 1$, and define $U = nY + 1 - I$. Prove that I and U 
are independent and that U has the uniform distribution on 
the interval [0, 1]. 


12.4 Importance Sampling 


Many integrals can usefully be rewritten as means of functions of random variables. 
If we can simulate large numbers of random variables with the appropriate 
distributions, we can use these to estimate integrals that might not be possible to 
compute in closed form. 


Simulation methods are particularly well suited to estimating means of random variables. 
If we can simulate many random variables with the appropriate distribution, 
we can average the simulated values to estimate the mean. Because means of random 
variables with continuous distributions are integrals, we might wonder whether 
other integrals can also be estimated by simulation methods. In principle, all finite 
integrals can be estimated by simulation, although some care is needed to ensure that 
the simulation results have finite variance. 

Suppose that we wish to calculate $\int_a^b g(x)\,dx$ for some function g with a and b 
both finite. We can rewrite this integral as 

$$\int_a^b g(x)\,dx = \int_a^b (b - a)\,g(x)\,\frac{1}{b - a}\,dx = E[(b - a)g(X)], \tag{12.4.1}$$

where X is a random variable with the uniform distribution on the interval [a, b]. 
A simple Monte Carlo method is to simulate a large number of pseudo-random 
values $X_1, \ldots, X_v$ with the uniform distribution on the interval [a, b] and estimate the 
integral by $\frac{b - a}{v}\sum_{i=1}^{v} g(X_i)$. The method just described has two commonly recognized 
drawbacks. First, it cannot be applied to estimate integrals over unbounded regions. 
Second, it can be very inefficient. If g is much larger over one portion of the interval 
than over another, then the values $g(X_i)$ will have large variance, and it will take a 
very large value v to get a good estimator of the integral. 
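
The simple Monte Carlo method of Eq. (12.4.1) amounts to a few lines of code. In the sketch below, the function name and the test integrand $x^2$ on [0, 2] (whose integral is 8/3) are our own illustrative choices.

```python
import random

def uniform_monte_carlo(g, a, b, v, rng):
    """Estimate the integral of g over [a, b] by (b - a)/v times the sum of g(X_i)."""
    total = 0.0
    for _ in range(v):
        x = a + (b - a) * rng.random()   # X_i uniform on [a, b]
        total += g(x)
    return (b - a) * total / v

rng = random.Random(11)                  # arbitrary seed
estimate = uniform_monte_carlo(lambda x: x * x, 0.0, 2.0, 50000, rng)
```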

A method that attempts to overcome both of the shortcomings just mentioned 
is called importance sampling. The idea of importance sampling is to do something 
very much like what we did in Eq. (12.4.1). That is, we shall rewrite the integral as the 
mean of some function of X, where X has a distribution that we can simulate easily. 

Suppose that we are able to simulate a pseudo-random variable X with the p.d.f. 
f where f(x) > 0 whenever g(x) > 0. Then we can write 

$$\int g(x)\,dx = \int \frac{g(x)}{f(x)}\,f(x)\,dx = E(Y), \tag{12.4.2}$$

where $Y = g(X)/f(X)$. (If f(x) = 0 for some x such that g(x) > 0, then the two 
integrals in Eq. (12.4.2) might not be equal.) If we simulate v independent values 
$X_1, \ldots, X_v$ with the p.d.f. f, we can estimate the integral by $\frac{1}{v}\sum_{i=1}^{v} Y_i$, where 
$Y_i = g(X_i)/f(X_i)$. The p.d.f. f is called the importance function. It is acceptable, 
although inefficient, to have f(x) > 0 for some x such that g(x) = 0. The key to efficient 
importance sampling is choosing a good importance function. The smaller the variance 
of Y, the better the estimator should be. That is, we would like $g(X)/f(X)$ to be 
close to being a constant random variable. 


Example 12.4.1. Choosing an Importance Function. Suppose that we want to estimate 
$\int_0^1 e^{-x}/(1 + x^2)\,dx$. Here are five possible choices of importance function: 

$f_0(x) = 1$, for $0 < x < 1$, 
$f_1(x) = e^{-x}$, for $0 < x < \infty$, 
$f_2(x) = (1 + x^2)^{-1}/\pi$, for $-\infty < x < \infty$, 
$f_3(x) = e^{-x}/(1 - e^{-1})$, for $0 < x < 1$, 
$f_4(x) = 4(1 + x^2)^{-1}/\pi$, for $0 < x < 1$. 

Each of these p.d.f.'s is positive wherever g is positive, and each one can be simulated 
using the probability integral transformation. As an example, we have simulated 
10,000 uniforms on the interval [0, 1], $U^{(1)}, \ldots, U^{(10,000)}$. We then applied the five 
probability integral transformations to this single set of uniforms so that our 
comparisons do not suffer from variation due to different underlying uniform samples. 
Table 12.5 Monte Carlo estimates and $\hat\sigma_j$ for Example 12.4.1 

j                 0       1       2       3       4 
$\bar{Y}_j$       0.5185  0.5110  0.5128  0.5224  0.5211 
$\hat\sigma_j$    0.2440  0.4217  0.9312  0.0973  0.1409 



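Since the five p.d.f.'s are positive over different ranges, we should define 

$$g(x) = \begin{cases} e^{-x}/(1 + x^2) & \text{if } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases}$$

Let $F_j$ stand for the c.d.f. corresponding to $f_j$, and let $X_j^{(i)} = F_j^{-1}(U^{(i)})$ for 
$i = 1, \ldots, 10{,}000$ and $j = 0, \ldots, 4$. Let $Y_j^{(i)} = g(X_j^{(i)})/f_j(X_j^{(i)})$. Then we obtain five 
different estimators of $\int g(x)\,dx$, namely, 

$$\bar{Y}_j = \frac{1}{10{,}000}\sum_{i=1}^{10{,}000} Y_j^{(i)}, \quad \text{for } j = 0, \ldots, 4.$$

For each j, we also compute the sample variance of the $Y_j^{(i)}$ values, 

$$\hat\sigma_j^2 = \frac{1}{10{,}000}\sum_{i=1}^{10{,}000}\left(Y_j^{(i)} - \bar{Y}_j\right)^2.$$
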
The simulation standard error of $\bar{Y}_j$ is $\hat\sigma_j/100$. We list the five estimates together 
with the corresponding values of $\hat\sigma_j$ in Table 12.5. The estimates are relatively close 
together, but some values of $\hat\sigma_j$ are almost 10 times others. This can be understood 
in terms of how well each $f_j$ approximates the function g. First, note that the two 
worst cases are those in which $f_j$ is positive on an unbounded interval. This causes 
us to simulate a large number of $X_j^{(i)}$ values for which $g(X_j^{(i)}) = 0$ and hence $Y_j^{(i)} = 0$. 
This is highly inefficient. For example, with j = 2, 75 percent of the $X_2^{(i)}$ values are 
outside of the interval (0, 1). The remaining $Y_2^{(i)}$ values must be very large in order 
for the average to come out near the correct answer. In other words, because the $Y_2^{(i)}$ 
values are so spread out (they range between 0 and $\pi$), we get a large value of $\hat\sigma_2$. On 
the other hand, with j = 3, there are no 0 values for $Y_3^{(i)}$. Indeed, the $Y_3^{(i)}$ values only 
range from 0.3161 to 0.6321. This allows $\hat\sigma_3$ to be quite small. The goal in choosing an 
importance function is to make the $Y^{(i)}$ values have small variance. This is achieved 
by making the ratio g/f as close to constant as we can. 
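
The comparison of the five importance functions can be reproduced along the following lines. The quantile functions were derived by us from the five p.d.f.'s listed in the example (for instance, $F_3^{-1}(u) = -\log[1 - u(1 - e^{-1})]$ and $F_4^{-1}(u) = \tan(\pi u/4)$), and the seed and names are arbitrary. Because the underlying uniforms differ from those used in the text, the numbers will be close to, but not identical with, the entries of Table 12.5.

```python
import math
import random
import statistics

def g(x):
    """Integrand, set to 0 outside (0, 1)."""
    return math.exp(-x) / (1.0 + x * x) if 0.0 < x < 1.0 else 0.0

C3 = 1.0 - math.exp(-1.0)   # normalizing constant of f_3

# (quantile function F_j^{-1}, p.d.f. f_j) for j = 0, ..., 4
IMPORTANCE = [
    (lambda u: u, lambda x: 1.0),
    (lambda u: -math.log(1.0 - u), lambda x: math.exp(-x)),
    (lambda u: math.tan(math.pi * (u - 0.5)),
     lambda x: 1.0 / (math.pi * (1.0 + x * x))),
    (lambda u: -math.log(1.0 - C3 * u), lambda x: math.exp(-x) / C3),
    (lambda u: math.tan(math.pi * u / 4.0),
     lambda x: 4.0 / (math.pi * (1.0 + x * x))),
]

def compare_importance_functions(v, seed):
    """Return a list of (estimate, sigma-hat) pairs, one per importance function."""
    rng = random.Random(seed)
    uniforms = [rng.random() for _ in range(v)]   # one shared set of uniforms
    results = []
    for quantile, pdf in IMPORTANCE:
        ys = []
        for u in uniforms:
            x = quantile(u)
            ys.append(g(x) / pdf(x))              # Y = g(X)/f_j(X)
        results.append((statistics.fmean(ys), statistics.pstdev(ys)))
    return results

results = compare_importance_functions(10000, seed=2024)
std_errors = [sd / math.sqrt(10000) for _, sd in results]
```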


Calculating a Mean with No Closed-Form Expression. Let X have the gamma distribution 
with parameters $\alpha$ and 1. Suppose that we want the mean of $1/(1 + X + X^2)$. We 
might wish to think of this mean as 

$$\int_0^\infty \frac{1}{1 + x + x^2}\,f_\alpha(x)\,dx, \tag{12.4.3}$$

where $f_\alpha$ is the p.d.f. of the gamma distribution with parameters $\alpha$ and 1. If $\alpha$ is 
not small, $f_\alpha(x)$ is close to 0 near x = 0 and is only sizeable for x near $\alpha$. For large x, 
$1/(1 + x + x^2)$ is a lot like $1/x^2$. If $\alpha$ and x are both large, the integrand in (12.4.3) is 
approximately $x^{-2} f_\alpha(x)$. Since $x^{-2} f_\alpha(x)$ is a constant times $f_{\alpha-2}(x)$, we could do 
importance sampling with importance function $f_{\alpha-2}$. For example, with $\alpha = 5$, we simulate 
10,000 pseudo-random variables $X^{(1)}, \ldots, X^{(10,000)}$ having the gamma distribution 
with parameters 3 and 1. The sample mean of $[1/(1 + X^{(i)} + X^{(i)2})]\,f_5(X^{(i)})/f_3(X^{(i)})$ 
is 0.05184 with sample standard deviation 0.01465. For comparison, we also simulate 
10,000 pseudo-random variables $Y^{(1)}, \ldots, Y^{(10,000)}$ with the gamma distribution having 
parameters 5 and 1. The average of the values of $1/(1 + Y^{(i)} + Y^{(i)2})$ is 0.05226 
with sample standard deviation 0.05103, about 3.5 times as large as we get using the $f_3$ 
importance function. With $\alpha = 3$, however, the two methods have nearly equal sample 
standard deviations. With $\alpha = 10$, the importance sampling has sample standard 
deviation about one-tenth as large as sampling directly from the distribution of X. 
As we noted earlier, when $\alpha$ is large, $1/x^2$ is a better approximation to $1/(1 + x + x^2)$ 
than it is when $\alpha$ is small. 
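
A sketch of this comparison is given below. It uses the fact, which follows from the gamma p.d.f., that $f_\alpha(x)/f_{\alpha-2}(x) = x^2/[(\alpha - 1)(\alpha - 2)]$, so the weight for $\alpha = 5$ is $x^2/12$. The function names and seed are ours, and the simulated values will differ somewhat from those reported in the text.

```python
import random
import statistics

def h(x):
    """Function whose mean is wanted."""
    return 1.0 / (1.0 + x + x * x)

def gamma_mean_estimates(alpha, v, seed):
    """Return ((mean, sd) with importance function f_{alpha-2}, (mean, sd) direct)."""
    rng = random.Random(seed)
    weight_const = (alpha - 1.0) * (alpha - 2.0)   # f_alpha/f_{alpha-2} = x^2 / this
    importance = []
    for _ in range(v):
        x = rng.gammavariate(alpha - 2.0, 1.0)     # draw from f_{alpha-2}
        importance.append(h(x) * x * x / weight_const)
    direct = [h(rng.gammavariate(alpha, 1.0)) for _ in range(v)]
    return ((statistics.fmean(importance), statistics.pstdev(importance)),
            (statistics.fmean(direct), statistics.pstdev(direct)))

importance_result, direct_result = gamma_mean_estimates(alpha=5.0, v=10000, seed=7)
sd_ratio = direct_result[1] / importance_result[1]   # the text reports about 3.5
```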


Bivariate Normal Probabilities. Let $(X_1, X_2)$ have a bivariate normal distribution, and 
suppose that we are interested in the probability of the event $\{X_1 \le c_1, X_2 \le c_2\}$ for 
specific values $c_1$, $c_2$. In general, we cannot explicitly calculate the double integral 

$$\int_{-\infty}^{c_2}\int_{-\infty}^{c_1} f(x_1, x_2)\,dx_1\,dx_2, \tag{12.4.4}$$

where $f(x_1, x_2)$ is the joint p.d.f. of $(X_1, X_2)$. We can write the joint p.d.f. as 
$f(x_1, x_2) = g_1(x_1|x_2)\,f_2(x_2)$, where $g_1$ is the conditional p.d.f. of $X_1$ given $X_2 = x_2$ and $f_2$ 
is the marginal p.d.f. of $X_2$. Both of these p.d.f.'s are normal p.d.f.'s, as we learned in 
Sec. 5.10. In particular, the conditional distribution of $X_1$ given $X_2 = x_2$ is the normal 
distribution with mean and variance given by Eq. (5.10.8). We can explicitly perform 
the inner integration in (12.4.4) as 

$$\int_{-\infty}^{c_1} f(x_1, x_2)\,dx_1 = \int_{-\infty}^{c_1} g_1(x_1|x_2)\,f_2(x_2)\,dx_1 = f_2(x_2)\,\Phi\left(\frac{c_1 - \mu_1 - \rho\sigma_1(x_2 - \mu_2)/\sigma_2}{\sigma_1(1 - \rho^2)^{1/2}}\right),$$

where $\Phi$ is the standard normal c.d.f. The integral in (12.4.4) is then the integral of 
this last expression with respect to $x_2$. An efficient importance function might be the 
conditional p.d.f. of $X_2$ given that $X_2 \le c_2$. That is, let h be the p.d.f. 

$$h(x_2) = \frac{(2\pi\sigma_2^2)^{-1/2}\exp\left(-\dfrac{1}{2\sigma_2^2}(x_2 - \mu_2)^2\right)}{\Phi\left(\dfrac{c_2 - \mu_2}{\sigma_2}\right)}, \quad \text{for } -\infty < x_2 \le c_2. \tag{12.4.5}$$

It is not difficult to see that if U has the uniform distribution on the interval [0, 1], 
then 

$$W = \mu_2 + \sigma_2\,\Phi^{-1}\left[U\,\Phi\left(\frac{c_2 - \mu_2}{\sigma_2}\right)\right] \tag{12.4.6}$$

has the p.d.f. h. (See Exercise 5.) If we use h as an importance function and simulate 
$W^{(1)}, \ldots, W^{(v)}$ with this p.d.f., then our estimator of the integral (12.4.4) is 

$$\frac{1}{v}\sum_{i=1}^{v}\Phi\left(\frac{c_1 - \mu_1 - \rho\sigma_1(W^{(i)} - \mu_2)/\sigma_2}{\sigma_1(1 - \rho^2)^{1/2}}\right)\Phi\left(\frac{c_2 - \mu_2}{\sigma_2}\right).$$

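
The estimator of the bivariate normal probability can be sketched with the standard library's `statistics.NormalDist`, which supplies $\Phi$ (`cdf`) and $\Phi^{-1}$ (`inv_cdf`). The function name, argument order, and the small guards that keep the argument of `inv_cdf` strictly inside (0, 1) are our own choices.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()   # standard normal: cdf is Phi, inv_cdf is Phi^{-1}

def bivariate_normal_cdf(c1, c2, mu1, mu2, s1, s2, rho, v, rng):
    """Importance sampling estimate of Pr(X1 <= c1, X2 <= c2)."""
    p2 = STD_NORMAL.cdf((c2 - mu2) / s2)          # Pr(X2 <= c2)
    total = 0.0
    for _ in range(v):
        u = 1.0 - rng.random()                    # uniform on (0, 1], avoids u = 0
        arg = min(u * p2, 1.0 - 2.0 ** -53)       # keep inv_cdf argument below 1
        w = mu2 + s2 * STD_NORMAL.inv_cdf(arg)    # W as in Eq. (12.4.6)
        z = (c1 - mu1 - rho * s1 * (w - mu2) / s2) / (s1 * math.sqrt(1.0 - rho * rho))
        total += STD_NORMAL.cdf(z) * p2
    return total / v

rng = random.Random(21)     # arbitrary seed
# Standard bivariate normal with correlation 0.5, c1 = 2, c2 = 1
estimate = bivariate_normal_cdf(2.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.5, 10000, rng)
```

When the correlation is 0, every term of the sum equals $\Phi((c_1 - \mu_1)/\sigma_1)\,\Phi((c_2 - \mu_2)/\sigma_2)$, so the estimator returns the exact answer with no simulation error, which is a convenient check on an implementation.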
It is not always possible to guarantee that an importance sampling estimator 
will have finite variance. In the examples in this section, we have managed to find 
importance functions with the following property. The ratio of the function being 
integrated to the importance function is bounded. This property guarantees finite 
variance for the importance sampling estimator. (See Exercise 8.) 


Stratified Importance Sampling 


Suppose that we are trying to estimate $\theta = \int g(x)\,dx$, and that we contemplate using 
the importance function f. The simulation variance of the importance sampling 
estimator of $\theta$ arises from the variance of $Y = g(X)/f(X)$, where X has the p.d.f. 
f. Indeed, if we simulate an importance sample of size n, the simulation variance of 
our estimator is $\sigma^2/n$, where $\sigma^2 = \mathrm{Var}(Y)$. Stratified importance sampling attempts 
to reduce the simulation variance by splitting $\theta$ into $\theta = \sum_{j=1}^{k}\theta_j$ and then estimating 
each $\theta_j$ with much smaller simulation variance. 

The stratified importance sampling algorithm is easiest to describe when X is 
simulated using the probability integral transformation. Let F be the c.d.f. 
corresponding to the p.d.f. f. First, we split $\theta$ as follows. Define $q_0 = -\infty$, $q_j = F^{-1}(j/k)$ 
for $j = 1, \ldots, k - 1$, and $q_k = \infty$. Then define 

$$\theta_j = \int_{q_{j-1}}^{q_j} g(x)\,dx,$$

for $j = 1, \ldots, k$. Clearly, $\theta = \sum_{j=1}^{k}\theta_j$. Next, we estimate each $\theta_j$ by importance 
sampling using the same importance function f, but restricted to the range of integration 
for $\theta_j$. That is, we estimate $\theta_j$ using importance sampling with the importance function 

$$f_j(x) = \begin{cases} k f(x) & \text{if } q_{j-1} \le x < q_j, \\ 0 & \text{otherwise.} \end{cases}$$

(See Exercise 9 to see that $f_j$ is indeed a p.d.f.) To simulate a random variable with 
the p.d.f. $f_j$, let V have the uniform distribution on the interval $[(j - 1)/k, j/k]$ and 
set $X_j = F^{-1}(V)$. The reader can prove (see Exercise 9) that $X_j$ has the p.d.f. $f_j$. Let 
$\sigma_j^2$ be the variance of $g(X_j)/f_j(X_j)$. Suppose that, for each $j = 1, \ldots, k$, we simulate 
an importance sample of size m with the same distribution as $X_j$. The variance of the 
estimator of $\theta_j$ will be $\sigma_j^2/m$. Since the k estimators of $\theta_1, \ldots, \theta_k$ are independent, 
the variance of the estimator of $\theta$ will be $\sum_{j=1}^{k}\sigma_j^2/m$. To facilitate comparison to 
nonstratified importance sampling, let $n = mk$. Stratification will be an improvement 
if its variance is smaller than $\sigma^2/n$. Since $n = mk$, we would like to prove that at least 

$$\sigma^2 \ge k\sum_{j=1}^{k}\sigma_j^2, \tag{12.4.7}$$

and preferably with strict inequality. 

To prove (12.4.7), we note a close connection between the random variables $X_j$ 
with the p.d.f. $f_j$ and X with the p.d.f. f. Let J be a random variable with the discrete 
uniform distribution on the integers 1, ..., k. Define $X^* = X_J$, so that the conditional 
p.d.f. of $X^*$ given $J = j$ is $f_j$. You can prove (Exercise 11) that $X^*$ and X have the 
same p.d.f. Let $Y = g(X)/f(X)$ and 

$$Y^* = \frac{g(X^*)}{f_J(X^*)} = \frac{g(X^*)}{k f(X^*)}.$$

Then $\mathrm{Var}(Y^*|J = j) = \sigma_j^2$ and $kY^*$ has the same distribution as Y. So, 

$$\sigma^2 = \mathrm{Var}(Y) = \mathrm{Var}(kY^*) = k^2\,\mathrm{Var}(Y^*). \tag{12.4.8}$$

Theorem 4.7.4 says that 

$$\mathrm{Var}(Y^*) = E[\mathrm{Var}(Y^*|J)] + \mathrm{Var}[E(Y^*|J)]. \tag{12.4.9}$$

By construction, $E(Y^*|J = j) = \theta_j$ and $\mathrm{Var}(Y^*|J = j) = \sigma_j^2$. Also, $\mathrm{Var}[E(Y^*|J)] \ge 0$ 
with strict inequality if the $\theta_j$ are not all the same. Since $\Pr(J = j) = 1/k$ for 
$j = 1, \ldots, k$, we have 

$$E[\mathrm{Var}(Y^*|J)] = \frac{1}{k}\sum_{j=1}^{k}\sigma_j^2. \tag{12.4.10}$$

Combining Eqs. (12.4.8), (12.4.9), and (12.4.10), we obtain (12.4.7), with strict 
inequality if the $\theta_j$ are not all equal. 


Example 12.4.4. Illustration of Stratified Importance Sampling. Consider the integral 
that we wanted to estimate in Example 12.4.1. The best importance function appeared to 
be $f_3$, with a simulation standard error of $\hat\sigma_3/100 = 9.73 \times 10^{-4}$. In the present example, 
we allocate 10,000 simulations among k = 10 subsets of size m = 1000 each and do 
stratified importance sampling by dividing the range of integration [0, 1] into 10 
equal-length subintervals. Doing this, we get a Monte Carlo estimate of the integral of 
0.5248. To estimate the simulation standard error, we need to estimate each $\sigma_j$ by $\sigma_j^*$ 
and compute $\left(\sum_{j=1}^{10}\sigma_j^{*2}/1000\right)^{1/2}$. In the simulation that we are discussing, the simulation 
standard error for stratified importance sampling is $1.05 \times 10^{-4}$, about one-tenth as 
large as the unstratified version. We can also do stratified importance sampling using 
k = 100 subsets of size m = 100. In our simulation, the estimate of the integral is the 
same with simulation standard error of $1.036 \times 10^{-5}$. 

The reason that stratified importance sampling works so well in Example 12.4.4 is 
that the function $g(x)/f_3(x)$ is monotone, and this makes $\theta_j$ change about as much as 
it can as j changes. Hence, $\mathrm{Var}[E(Y^*|J)]$ is large, making stratification very effective. 
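
The stratified algorithm can be sketched for the integral of Example 12.4.1 with importance function $f_3$. The sketch follows the general description given earlier in this section, so the strata are the k intervals of equal probability under $f_3$, with endpoints $q_j = F_3^{-1}(j/k)$; the function names and the seed are ours, and the simulated values will not match those in the text exactly. Calling the function with k = 1 gives ordinary (unstratified) importance sampling for comparison.

```python
import math
import random
import statistics

C3 = 1.0 - math.exp(-1.0)            # normalizing constant of f_3

def g(x):
    return math.exp(-x) / (1.0 + x * x)

def f3(x):
    return math.exp(-x) / C3

def f3_quantile(p):
    """F_3^{-1}(p) for the p.d.f. f_3 on (0, 1)."""
    return -math.log(1.0 - C3 * p)

def stratified_estimate(k, m, rng):
    """Return (estimate of theta, simulation standard error) using k strata of size m."""
    estimate = 0.0
    variance = 0.0
    for j in range(1, k + 1):
        ys = []
        for _ in range(m):
            v = (j - 1 + rng.random()) / k       # V uniform on [(j-1)/k, j/k)
            x = f3_quantile(v)                   # X_j has the p.d.f. f_j = k f_3
            ys.append(g(x) / (k * f3(x)))        # g(X_j)/f_j(X_j)
        estimate += statistics.fmean(ys)         # estimate of theta_j
        variance += statistics.pvariance(ys) / m # estimated sigma_j^2 / m
    return estimate, math.sqrt(variance)

stratified = stratified_estimate(k=10, m=1000, rng=random.Random(3))
unstratified = stratified_estimate(k=1, m=10000, rng=random.Random(3))
```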


Summary 


We introduced the method of importance sampling for calculating integrals by 
simulation. The idea of importance sampling for estimating $\int g(x)\,dx$ is to choose a p.d.f. 
f from which we can simulate and such that $g(x)/f(x)$ is nearly constant. Then we 
rewrite the integral as $\int [g(x)/f(x)]\,f(x)\,dx$. We can estimate this last integral by 
averaging $g(X^{(i)})/f(X^{(i)})$, where $X^{(1)}, \ldots, X^{(v)}$ form a random sample with the p.d.f. f. A 
stratified version of importance sampling can produce estimators with even smaller 
variance. 


Exercises 


1. Prove that the formula in Eq. (12.4.1) is the same as 
importance sampling in which the importance function is 
the p.d.f. of the uniform distribution on the interval [a, b]. 


2. Let g be a function, and suppose that we wish to compute 
the mean of g(X) where X has the p.d.f. f. Suppose 
that we can simulate pseudo-random values with the p.d.f. 
f. Prove that the following are the same: 

• Simulate $X^{(i)}$ values with the p.d.f. f, and average 
the values of $g(X^{(i)})$ to obtain the estimator. 

• Do importance sampling with importance function 
f to estimate the integral $\int g(x) f(x)\,dx$. 

3. Let Y have the F distribution with m and n degrees 
of freedom. We wish to estimate Pr(Y > c). Consider the 
p.d.f. 

$$f(x) = \begin{cases} \dfrac{(n/2)\,c^{n/2}}{x^{n/2+1}} & \text{if } x > c, \\ 0 & \text{otherwise.} \end{cases}$$



a. Explain how to simulate pseudo-random numbers 
with the p.d.f. f. 


b. Explain how to estimate Pr(Y > c) using importance 
sampling with the importance function f. 


c. Look at the form of the p.d.f. of Y, Eq. (9.7.2), and 
explain why importance sampling might be more ef- 
ficient than sampling i.i.d. F random variables with 
m and n degrees of freedom if c is not small. 


4. We would like to calculate the integral 
$\int_0^\infty \log(1 + x)\exp(-x)\,dx$. 


a. Simulate 10,000 exponential random variables with 
parameter 1 and use these to estimate the integral. 
Also, find the simulation standard error of your esti- 
mator. 


b. Simulate 10,000 gamma random variables with 
parameters 1.5 and 1 and use these to estimate the 
integral (importance sampling). Find the simulation 
standard error of the estimator. (In case you do not 
have the gamma function available, $\Gamma(1.5) = \sqrt{\pi}/2$.) 


c. Which of the two methods appears to be more effi- 
cient? Can you explain why? 


5. Let U have the uniform distribution on the interval 
[0, 1]. Show that the random variable W defined in 
Eq. (12.4.6) has the p.d.f. h defined in Eq. (12.4.5). 


6. Suppose that we wish to estimate the integral 

$$\int_1^\infty \frac{x^2}{\sqrt{2\pi}}\,e^{-0.5x^2}\,dx.$$

In parts (a) and (b) below, use simulation sizes of 1000. 

a. Estimate the integral by importance sampling using 
random variables having a truncated normal distribution. 
That is, the importance function is 

$$\frac{1}{\sqrt{2\pi}\,[1 - \Phi(1)]}\,e^{-0.5x^2}, \quad \text{for } x > 1.$$

b. Estimate the integral by importance sampling using 
random variables with the p.d.f. $x\exp(0.5[1 - x^2])$, 
for x > 1. Hint: Prove that such random variables can 
be obtained as follows: Start with a random variable 
that has the exponential distribution with parameter 
0.5, add 1, then take the square root. 


c. Compute and compare simulation standard errors 
for the two estimators in parts (a) and (b). Can you 
explain why one is so much smaller than the other? 


7. Let $(X_1, X_2)$ have the bivariate normal distribution 
with both means equal to 0, both variances equal to 1, 
and the correlation equal to 0.5. We wish to estimate 
$\theta = \Pr(X_1 \le 2, X_2 \le 1)$ using simulation. 

a. Simulate a sample of 10,000 bivariate normal vectors 
with the above distribution. Use the proportion 
of vectors satisfying the two inequalities $X_1 \le 2$ and 
$X_2 \le 1$ as the estimator Z of $\theta$. Also compute the 
simulation standard error of Z. 

b. Use the method described in Example 12.4.3 with 
10,000 simulations to produce an alternative estimator 
Z′ of $\theta$. Compute the simulation standard error 
of Z′ and compare Z′ to the estimate in part (a). 


8. Suppose that we wish to approximate the integral 
$\int g(x)\,dx$. Suppose that we have a p.d.f. $f$ that we shall
use as an importance function. Suppose that $g(x)/f(x)$ is
bounded. Prove that the importance sampling estimator 
has finite variance. 


9. Let $F$ be a continuous strictly increasing c.d.f. with
p.d.f. $f$. Let $V$ have the uniform distribution on the
interval $[a, b]$ with $0 \le a < b \le 1$. Prove that the p.d.f. of
$X = F^{-1}(V)$ is $f(x)/(b - a)$ for $F^{-1}(a) < x < F^{-1}(b)$. (If
$a = 0$, let $F^{-1}(a) = -\infty$. If $b = 1$, let $F^{-1}(b) = \infty$.)


10. For the situation described in Exercise 6, use strat- 
ified importance sampling as follows: Divide the interval 
$(1, \infty)$ into five intervals that each have probability 0.2 under
the importance distribution. Sample 200 observations
from each interval. Compute the simulation standard er- 
ror. Compare this simulation to the simulation in Exer- 
cise 6 for each of parts (a) and (b). 


11. In the notation used to develop stratified importance 
sampling, prove that $X^* = X_J$ and $X$ have the same distribution.
Hint: The conditional p.d.f. of $X^*$ given $J = j$ is $f_j$.
Use the law of total probability. 


12. Consider again the situation described in Exercise 15 
of Sec. 12.2. Suppose that $W_u$ has the Laplace distribution
with parameters $\theta = 0$ and $\sigma = 0.1u^{1/2}$. See Eq. (10.7.5) for
the p.d.f.


a. Prove that the m.g.f. of $W_u$ is

$$\psi(t) = \left(1 - \frac{t^2 u}{100}\right)^{-1}, \quad \text{for } -10u^{-1/2} < t < 10u^{-1/2}.$$


b. Let $r = 0.06$ be the risk-free interest rate. Simulate
a large number $v$ of values of $W_u$ with $u = 1$ and use
these to estimate the price of an option to buy one
share of this stock at time $u = 1$ in the future for
the current price $S_0$. Also compute the simulation
standard error.


c. Use importance sampling to improve on the simulation
in part (b). Instead of simulating $W_u$ values
directly, simulate from the conditional distribution
of $W_u$ given that $S_u > S_0$. How much smaller is the
simulation standard error?


13. The method of control variates is a technique for reducing
the variance of a simulation estimator. Suppose
that we wish to estimate $\theta = E(W)$. A control variate is another
random variable $V$ that is positively correlated with
$W$ and whose mean $\mu$ we know. Then, for every constant
$k > 0$, $E(W - kV + k\mu) = \theta$. Also, if $k$ is chosen carefully,
$\operatorname{Var}(W - kV + k\mu) < \operatorname{Var}(W)$. In this exercise, we shall see
how to use control variates for importance sampling, but
the method is very general. Suppose that we wish to compute
$\int g(x)\,dx$, and we wish to use the importance function
$f$. Suppose that there is a function $h$ such that $h$ is similar
to $g$ but $\int h(x)\,dx$ is known to equal the value $c$. Let $k$ be
a constant. Simulate $X^{(1)}, \ldots, X^{(v)}$ with the p.d.f. $f$, and


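define

$$W^{(i)} = \frac{g(X^{(i)})}{f(X^{(i)})}, \qquad V^{(i)} = \frac{h(X^{(i)})}{f(X^{(i)})}, \qquad Y^{(i)} = W^{(i)} - kV^{(i)},$$

for all $i$. Our estimator of $\int g(x)\,dx$ is then
$Z = \frac{1}{v} \sum_{i=1}^{v} Y^{(i)} + kc$.

a. Prove that $E(Z) = \int g(x)\,dx$.

b. Let $\operatorname{Var}(W^{(i)}) = \sigma_W^2$ and $\operatorname{Var}(V^{(i)}) = \sigma_V^2$. Let $\rho$ be
the correlation between $W^{(i)}$ and $V^{(i)}$. Prove that
the value of $k$ that makes $\operatorname{Var}(Z)$ the smallest is
$k = \sigma_W \rho/\sigma_V$.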


14. Suppose that we wish to integrate the same function 
g(x) as in Example 12.4.1. 


a. Use the method of control variates that was described
in Exercise 13 to estimate $\int g(x)\,dx$. Let
$h(x) = 1/(1 + x^2)$ for $0 < x < 1$, and k = e. (This
makes h about the same size as g.) Let $f(x)$ be the
function $f_3$ in Example 12.4.1. How does the simulation
standard error using control variates compare
to not using control variates?


b. Estimate the variances and correlation of the W's 
and V’s (notation of Exercise 13) to see what a 
good value for k might be. 


15. The method of antithetic variates is a technique for reducing
the variance of simulation estimators. Antithetic
variates are negatively correlated random variables that
share a common mean and common variance. The variance
of the average of two antithetic variates is smaller
than the variance of the average of two i.i.d. variables. In
this exercise, we shall see how to use antithetic variates for
importance sampling, but the method is very general. Suppose
that we wish to compute $\int g(x)\,dx$, and we wish to
use the importance function $f$. Suppose that we generate
pseudo-random variables with the p.d.f. $f$ using the probability
integral transformation. That is, for $i = 1, \ldots, v$, let
$X^{(i)} = F^{-1}(U^{(i)})$, where $U^{(i)}$ has the uniform distribution
on the interval $[0, 1]$ and $F$ is the c.d.f. corresponding to
the p.d.f. $f$. For each $i = 1, \ldots, v$, define


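$$T^{(i)} = F^{-1}(1 - U^{(i)}), \qquad W^{(i)} = \frac{g(X^{(i)})}{f(X^{(i)})}, \qquad V^{(i)} = \frac{g(T^{(i)})}{f(T^{(i)})}, \qquad Y^{(i)} = 0.5\left[W^{(i)} + V^{(i)}\right].$$


Our estimator of $\int g(x)\,dx$ is then $Z = \frac{1}{v} \sum_{i=1}^{v} Y^{(i)}$.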


a. Prove that $T^{(i)}$ has the same distribution as $X^{(i)}$.

b. Prove that $E(Z) = \int g(x)\,dx$.

c. If $g(x)/f(x)$ is a monotone function, explain why we
would expect $W^{(i)}$ and $V^{(i)}$ to be negatively correlated.

d. If $W^{(i)}$ and $V^{(i)}$ are negatively correlated, show that
$\operatorname{Var}(Z)$ is less than the variance one would get with
$2v$ simulations without antithetic variates.


16. Use the method of antithetic variates that was de- 
scribed in Exercise 15. Let g(x) be the function that we 
tried to integrate in Example 12.4.1. Let f(x) be the func- 
tion f; in Example 12.4.1. Estimate Var(Y), and com- 
pare it to Ge from Example 12.4.1. 


17. For each of the exercises in this section that requires 
a simulation, see if you can think of a way to use control 
variates or antithetic variates to reduce the variance of the 
simulation estimator. 


* 12.5 Markov Chain Monte Carlo 


The techniques described in Sec. 12.3 for generating pseudo-random numbers with 
particular distributions are most useful for univariate distributions. They can be 
applied in many multivariate cases, but they often become unwieldy. A method 
based on Markov chains (see Sec. 3.10) became popular after publications by 
Metropolis et al. (1953) and Gelfand and Smith (1990). We shall present only the 
simplest form of Markov chain Monte Carlo in this section. 


The Gibbs Sampling Algorithm


We shall begin with an attempt to simulate a bivariate distribution. Suppose that the 
joint p.d.f. of $(X_1, X_2)$ is $f(x_1, x_2) = cg(x_1, x_2)$, where we know the function $g$ but
not necessarily the value of the constant $c$. This type of situation arises often when
computing posterior distributions. If $X_1$ and $X_2$ are the parameters, the function $g$
might be the product of the prior p.d.f. times the likelihood function (in which the data
are treated as known values). The constant $c = 1/\int\!\!\int g(x_1, x_2)\,dx_1\,dx_2$ makes $cg(x_1, x_2)$
the posterior p.d.f. Often it is difficult to compute c, although the methods of Sec. 12.4 
might be helpful. Even if we can approximate the constant c, there are other features 
of the posterior distribution that we might not be able to compute easily, so simulation 
would be useful. 

If the function $g(x_1, x_2)$ has a special form, then there is a powerful algorithm for
simulating vectors with the p.d.f. $f$. The required form can be described as follows:
First, consider $g(x_1, x_2)$ as a function of $x_1$ for fixed $x_2$. This function needs to look
like a p.d.f. (for $X_1$) from which we know how to simulate pseudo-random values.
Similarly, if we consider $g(x_1, x_2)$ as a function of $x_2$ for fixed $x_1$, the function needs
to look like a p.d.f. for $X_2$ from which we can simulate.


Example 12.5.1. Sample from a Normal Distribution. Suppose that we have observed a sample from the
normal distribution with unknown mean $\mu$ and unknown precision $\tau$. Suppose that
we use a natural conjugate prior of the form described in Sec. 8.6. The product of
the prior and the likelihood is given by Eq. (8.6.7) without the appropriate constant
factor. We reproduce a version of that equation here for convenience:

$$\xi(\mu, \tau \mid \boldsymbol{x}) \propto \tau^{\alpha_1 + 1/2 - 1} \exp\left(-\tau\left[\frac{1}{2}\lambda_1(\mu - \mu_1)^2 + \beta_1\right]\right),$$

where $\alpha_1$, $\beta_1$, $\mu_1$, and $\lambda_1$ are known values once the data have been observed.
Considering this as a function of $\mu$ for fixed $\tau$, it looks like the p.d.f. of the normal
distribution with mean $\mu_1$ and variance $(\tau\lambda_1)^{-1}$. Considering it as a function of $\tau$ for
fixed $\mu$, it looks like the p.d.f. of the gamma distribution with parameters $\alpha_1 + 1/2$
and $\lambda_1(\mu - \mu_1)^2/2 + \beta_1$. Both of these distributions are easy to simulate. ◄


When we consider $g(x_1, x_2)$ as a function of $x_1$ for fixed $x_2$, we are looking at the
conditional p.d.f. of $X_1$ given $X_2 = x_2$, except for a multiplicative factor that does not
depend on $x_1$. (See Exercise 1.) Similarly, when we consider $g(x_1, x_2)$ as a function
of $x_2$ for fixed $x_1$, we are looking at the conditional p.d.f. of $X_2$ given $X_1 = x_1$.

Once we have determined that the function $g(x_1, x_2)$ has the desired form, our
algorithm proceeds as follows:


1. Pick a starting value $x_2^{(0)}$ for $X_2$, and set $i = 0$.

2. Simulate a new value $x_1^{(i+1)}$ from the conditional distribution of $X_1$ given $X_2 = x_2^{(i)}$.

3. Simulate a new value $x_2^{(i+1)}$ from the conditional distribution of $X_2$ given $X_1 = x_1^{(i+1)}$.

4. Replace $i$ by $i + 1$ and return to step 2.


The algorithm typically terminates when i reaches a sufficiently large value. Although 
there currently are no truly satisfactory convergence criteria, we shall introduce one 
convergence criterion later in this section. This algorithm is commonly called Gibbs 
sampling. The name derives from an early use of the technique by Geman and Geman
(1984) for sampling from a distribution that was known as the Gibbs distribution. 
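
The four steps of the algorithm can be sketched in Python for the conditional distributions of Example 12.5.1. This is only a minimal illustration under stated assumptions: the function name `gibbs_normal_gamma`, the seed, and the hyperparameter values (`alpha1 = 3`, `beta1 = 2`, `mu1 = 0`, `lam1 = 5`) are hypothetical and are not taken from the text. Here $\tau$ plays the role of $X_1$ and $\mu$ the role of $X_2$. Note that `random.gammavariate` takes a scale parameter, so the second gamma parameter used in the text (a rate) is inverted.

```python
import random

def gibbs_normal_gamma(alpha1, beta1, mu1, lam1, mu_start, n_iter, seed=0):
    """Gibbs sampler for (mu, tau) using the conditionals of Example 12.5.1.

    tau plays the role of X_1 and mu the role of X_2, so mu_start is the
    starting value x_2^(0) of step 1.
    """
    rng = random.Random(seed)
    mu = mu_start
    draws = []
    for _ in range(n_iter):
        # Step 2: tau | mu ~ gamma(alpha1 + 1/2, rate = lam1*(mu - mu1)^2/2 + beta1)
        rate = lam1 * (mu - mu1) ** 2 / 2 + beta1
        tau = rng.gammavariate(alpha1 + 0.5, 1.0 / rate)
        # Step 3: mu | tau ~ normal(mu1, variance = 1/(tau*lam1))
        mu = rng.gauss(mu1, (tau * lam1) ** -0.5)
        draws.append((mu, tau))  # Step 4: next iteration
    return draws

# Hypothetical hyperparameters, chosen only for illustration.
draws = gibbs_normal_gamma(alpha1=3.0, beta1=2.0, mu1=0.0, lam1=5.0,
                           mu_start=10.0, n_iter=20000)
kept = draws[1000:]  # drop the early iterations
mean_mu = sum(d[0] for d in kept) / len(kept)
mean_tau = sum(d[1] for d in kept) / len(kept)
print(mean_mu, mean_tau)
```

For this conjugate model the marginal distribution of $\tau$ is the gamma distribution with parameters $\alpha_1$ and $\beta_1$, and the marginal mean of $\mu$ is $\mu_1$, so the two averages should be near 0 and $3/2$ for the values assumed here.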


Some Theoretical Justification 


So far, we have given no justification for the Gibbs sampling algorithm. The justification
stems from the fact that the successive pairs $(x_1^{(1)}, x_2^{(1)}), (x_1^{(2)}, x_2^{(2)}), \ldots$ form the
observed sequence of states from a Markov chain. This Markov chain is much more
complicated than any of the Markov chains encountered in Sec. 3.10 for two reasons. 
First, the states are two-dimensional, and second, the number of possible states is 
infinite rather than finite. Even so, one can easily recognize the basic structure of a 
Markov chain in the description of the Gibbs sampling algorithm. Suppose that i is 
the current value of the iteration index. The conditional distribution of the next state
pair $(X_1^{(i+1)}, X_2^{(i+1)})$ given all of the available state pairs $(X_1^{(1)}, X_2^{(1)}), \ldots, (X_1^{(i)}, X_2^{(i)})$
depends only on $(X_1^{(i)}, X_2^{(i)})$, the current state pair. This is the same as the defining
property of finite Markov chains in Sec. 3.10. 

Even if we agree that the sequence of pairs forms a Markov chain, why should 
we believe that they come from the desired distribution? The answer lies in a gen- 
eralization of the second part of Theorem 3.10.4 to more general Markov chains. 
The generalization is mathematically too involved to present here, and it requires 
conditions that involve concepts that we have not introduced in this book. 

Nevertheless, the Gibbs sampler is constructed from a joint distribution that one 
can show (see Exercise 2) is a stationary distribution for the resulting Markov chain. 
For the cases that we illustrate in this book, the distribution of the Gibbs sampler 
Markov chain does indeed converge to this stationary distribution as the number 
of transitions increases. (For a more general discussion, see Tierney, 1994.) Because 
of the close connection with Markov chains, Gibbs sampling (and several related 
techniques) are often called Markov chain Monte Carlo. 
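
The claim that the joint distribution is stationary for the Gibbs sampler can be checked numerically in a small discrete analog. The sketch below is an illustration only: the joint p.f. `f` on $\{0,1\} \times \{0,1,2\}$ is a made-up table, and the helper names are hypothetical. It builds the one-iteration transition probabilities (draw $x_1$ given the old $x_2$, then $x_2$ given the new $x_1$) and verifies that starting from `f` leaves the distribution unchanged.

```python
# Hypothetical joint p.f. f(x1, x2) on {0,1} x {0,1,2}; the entries sum to 1.
f = [[0.10, 0.20, 0.05],
     [0.25, 0.10, 0.30]]
n1, n2 = len(f), len(f[0])

def cond_x1_given_x2(a, b):
    """Pr(X1 = a | X2 = b)."""
    return f[a][b] / sum(f[i][b] for i in range(n1))

def cond_x2_given_x1(b, a):
    """Pr(X2 = b | X1 = a)."""
    return f[a][b] / sum(f[a])

def transition(old, new):
    """One Gibbs iteration: draw x1 given the old x2, then x2 given the new x1."""
    (_, old2), (new1, new2) = old, new
    return cond_x1_given_x2(new1, old2) * cond_x2_given_x1(new2, new1)

states = [(a, b) for a in range(n1) for b in range(n2)]
# Distribution of the next state when the current state has p.f. f.
after = {s: sum(f[o[0]][o[1]] * transition(o, s) for o in states) for s in states}
max_gap = max(abs(after[s] - f[s[0]][s[1]]) for s in states)
print(max_gap)  # zero up to rounding error
```

The same cancellation (sum over the old $x_1$, then over the old $x_2$) is what drives the continuous case treated in Exercise 2.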


When Does the Markov Chain Converge? 


Although the distribution of a Markov chain may converge to its stationary distri- 
bution, after any finite time the distribution will not necessarily be the stationary 
distribution. In general, the distribution will get pretty close to the stationary dis- 
tribution in finite time, but how do we tell, in a particular application, if we have 
sampled the Markov chain long enough to be confident that we are sampling from 
something close to the stationary distribution? Much work has been done to address 
this question, but there is no foolproof method. Several methods for assessing con- 
vergence of a Markov chain in a Monte Carlo analysis were reviewed by Cowles and 
Carlin (1996). Here we present one simple technique. 

Begin by sampling several versions of the Markov chain starting at $k$ different
initial values $x_{2,1}^{(0)}, \ldots, x_{2,k}^{(0)}$. These $k$ Markov chains will be useful not only for assessing
convergence but also for estimating the variances of our simulation estimators.
It is wise to choose the initial values $x_{2,1}^{(0)}, \ldots, x_{2,k}^{(0)}$ to be quite spread out. This will
help us to determine whether we have a Markov chain that is very slow to converge.
Next, apply the Gibbs sampling algorithm starting at each of the $k$ initial values. This
gives us $k$ independent Markov chains, all with the same stationary distribution. If
the $k$ Markov chains have been sampled for $m$ iterations, we can think of the observed
values of $X_1$ (or of $X_2$) as $k$ samples of size $m$ each. For ease of notation, let
$T_{i,j}$ stand for either the value of $X_1$ or the value of $X_2$ from the $j$th iteration of the
$i$th Markov chain. (We shall repeat the following analysis once for $X_1$ and once for
$X_2$.) Now, treat $T_{i,j}$ for $j = 1, \ldots, m$ as a sample of size $m$ from the $i$th of $k$ distributions
for $i = 1, \ldots, k$. If we have sampled long enough for the Markov chains to have
converged approximately, then all $k$ of these distributions should be nearly the same.
This suggests that we use the $F$ statistic from the discussion of analysis of variance
(Sec. 11.6) to measure how close the $k$ distributions are. The $F$ statistic can be written
as $F = B/W$ where


$$B = \frac{m}{k - 1} \sum_{i=1}^{k} \left(\bar{T}_{i+} - \bar{T}_{++}\right)^2,$$

$$W = \frac{1}{k(m - 1)} \sum_{i=1}^{k} \sum_{j=1}^{m} \left(T_{i,j} - \bar{T}_{i+}\right)^2.$$


Here we have used the same notation as in Sec. 11.6 in which the + subscript 
appears in a position wherever we have averaged over all values of the subscript 
in that position. If the $k$ distributions are different, then $F$ should be large. If the
distributions are the same, then $F$ should be close to 1. As we mentioned earlier,
we compute two $F$ statistics, one using the $X_1$ coordinates and one using the $X_2$
coordinates. Then we could declare that we have sampled long enough when both
F statistics are simultaneously less than some number slightly larger than 1. Gelman 
et al. (1995) describe essentially this same procedure and recommend comparing 
the maximum of the two F statistics to 1+ 0.44m. It is probably a good idea to 
start with at least m = 100 (if the iterations are fast enough) before beginning to 
compute the F statistics. This will help to avoid accidentally declaring success due 
to some “lucky” early simulations. The initial sequence of iterations of the Markov 
chain, before we declare convergence, is commonly called burn-in. After the burn-in 
iterations, one would typically treat the ensuing iterations as observations from the 
stationary distribution. It is common to discard the burn-in iterations because we are 
not confident that their distribution is close to the stationary distribution. Iterations 
of a Markov chain are dependent, however, so one should not treat them as an
i.i.d. sample. Even though we computed an F statistic from the various dependent 
observations, we did not claim that the statistic had an F distribution. Nor did we 
compare the statistic to a quantile of an F distribution to make our decision about 
convergence. We merely used the statistic as an ad hoc measure of how different the 
k Markov chains are. 
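
The diagnostic is short to compute. The following Python sketch follows the definitions of $B$ and $W$ given above; the function names `f_statistic` and `burned_in` and the toy numbers are hypothetical, introduced only for illustration. Each chain is passed as a list of the $m$ values of one coordinate.

```python
def f_statistic(chains):
    """F = B/W for k chains (lists) of equal length m, as defined in the text."""
    k, m = len(chains), len(chains[0])
    chain_means = [sum(c) / m for c in chains]           # the averages T_{i+}
    grand_mean = sum(chain_means) / k                    # the average T_{++}
    b = m / (k - 1) * sum((cm - grand_mean) ** 2 for cm in chain_means)
    w = sum((t - cm) ** 2
            for c, cm in zip(chains, chain_means) for t in c) / (k * (m - 1))
    return b / w

def burned_in(coordinate_chains, m):
    """True when the largest F over all coordinates is below 1 + 0.44 m."""
    return max(f_statistic(ch) for ch in coordinate_chains) < 1 + 0.44 * m

# Made-up illustration: k = 3 chains, m = 4 iterations each.
toy = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.0, 3.0]]
print(f_statistic(toy))
```

For the toy numbers, $B = (4/2)(0 + 1 + 1) = 4$ and $W = 15/9$, so $F = 2.4$. In practice `burned_in` would be called with one list of chains per coordinate (for example, one for $X_1$ and one for $X_2$).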


Example 12.5.2. Nursing Homes in New Mexico. We shall use the data from Sec. 8.6 on the numbers of
medical in-patient days in 18 nonrural nursing homes in New Mexico in 1988. There,
we modeled the observations as a random sample from the normal distribution with
unknown mean $\mu$ and unknown precision $\tau$. We used a natural conjugate prior and
found the posterior hyperparameters to be $\alpha_1 = 11$, $\beta_1 = 50925.37$, $\mu_1 = 183.95$, and
$\lambda_1 = 20$. We shall illustrate the above convergence diagnostic for the Gibbs sampling
algorithm described in Example 12.5.1. As we found in Example 12.5.1, the
conditional distribution of $\mu$ given $\tau$ is the normal distribution with mean 183.95
and variance $(20\tau)^{-1}$. The conditional distribution of $\tau$ given $\mu$ is the gamma
distribution with parameters 11.5 and $50925.37 + 20(\mu - 183.95)^2/2$. We shall start with
the following $k = 5$ initial values for $\mu$: 182.17, 227, 272, 137, 82. These were chosen
by making a crude approximation to the posterior standard deviation of $\mu$, namely,
$(\beta_1/[\lambda_1\alpha_1])^{1/2} \approx 15$, and then using the posterior mean together with values 3 and 6
posterior standard deviations above and below the posterior mean. We have to run
the five Markov chains to the $m = 2$ iteration before we can compute the $F$ statistics.
In our simulation, at $m = 2$, the larger of the two $F$ statistics was already as low as
0.8862, and it stayed very close to 1 all the way to $m = 100$, at which time it seemed
clear that we should stop the burn-in. ◄


Estimation Based on Gibbs Sampling 


So far, we have argued (without proof) that if we run the Gibbs sampling algorithm
for many iterations (through burn-in), we should start to see pairs $(X_1^{(i)}, X_2^{(i)})$ whose
joint p.d.f. is nearly the function $f$ from which we wanted to sample. Unfortunately,
the successive pairs are not independent of each other even if they do have the
same distribution. The law of large numbers does not tell us that the average of
dependent random variables with the same distribution converges. However, the
type of dependence that we get from a Markov chain is sufficiently regular that there
are theorems that guarantee convergence of averages and even that the averages are
asymptotically normal. That is, suppose that we wish to estimate the mean $\mu$ of some
function $h(X_1, X_2)$ based on $m$ observations from the Markov chain. We can still
assume that $\frac{1}{m} \sum_{i=1}^{m} h(X_1^{(i)}, X_2^{(i)})$ converges to $\mu$, and that it has approximately the
normal distribution with mean $\mu$ and variance $\sigma^2/m$. However, the convergence will
typically be slower than for i.i.d. samples, and $\sigma^2$ will be larger than the variance of
$h(X_1, X_2)$. The reason for this is that the successive values of $h(X_1^{(i)}, X_2^{(i)})$ are usually
positively correlated. The variance of an average of positively correlated identically
distributed random variables is higher than the variance of an average of the same
number of i.i.d. random variables. (See Exercise 4.)

We shall deal with the problems caused by correlated samples by making use of
the same $k$ independent Markov chains that we used for determining how much burn-in
to do. Discard the burn-in and continue to sample each Markov chain for $m_0$ more
iterations. From each Markov chain, we compute our desired estimator, either an
average, a sample quantile, a sample variance, or other measure, $Z_j$ for $j = 1, \ldots, k$.
We then compute $S$ as in Eq. (12.2.2); that is,

$$S = \left(\frac{1}{k - 1} \sum_{j=1}^{k} (Z_j - \bar{Z})^2\right)^{1/2}. \qquad (12.5.1)$$

Then $S^2$ is an estimator of the simulation variance of the $Z_j$'s. Write the simulation
variance as $\sigma^2/m_0$ and estimate $\sigma^2$ by $\hat\sigma^2 = m_0 S^2$ as we did in Example 12.2.9. Also,
combine all samples from all $k$ chains into a single sample, and use this single sample
to form the overall estimator $Z$. The simulation standard error of our estimator $Z$ is
then $(\hat\sigma^2/(m_0 k))^{1/2} = S/k^{1/2}$.
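
As a small numerical illustration, the following Python sketch computes $S$, $\hat\sigma$, and the simulation standard error from $k$ per-chain estimates. The function name is hypothetical, and the divisor $k - 1$ inside $S$ is an assumption about the form of Eq. (12.2.2); with that divisor, the five per-chain 0.05 quantiles reported in Example 12.5.5 give a value that agrees with the $S = 0.01567$ reported there to about three significant digits.

```python
import math

def simulation_standard_error(z, m0):
    """Return (S, sigma_hat, standard error) from the k per-chain estimates z.

    m0 is the number of post-burn-in iterations run in each chain.
    """
    k = len(z)
    z_bar = sum(z) / k
    # Divisor k - 1 is an assumption about Eq. (12.2.2).
    s = math.sqrt(sum((zj - z_bar) ** 2 for zj in z) / (k - 1))
    sigma_hat = s * math.sqrt(m0)          # sigma_hat^2 = m0 * S^2
    return s, sigma_hat, s / math.sqrt(k)  # standard error S / k^{1/2}

# Per-chain 0.05 quantiles reported in Example 12.5.5 (m0 = 10,000 iterations).
z = [-0.1452, -0.1067, -0.1181, -0.1079, -0.1142]
s, sigma_hat, se = simulation_standard_error(z, m0=10_000)
print(s, sigma_hat, se)
```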

In addition, we may wish to determine how many simulations to perform in
order to obtain a precise estimator. We can substitute $\hat\sigma$ for $\sigma$ in Eq. (12.2.5) to get
a proposed number of simulations $v$. These $v$ simulations would be divided between
the $k$ Markov chains so that each chain would be run for at least $v/k$ iterations if
$v/k > m_0$.
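
The sample-size calculation can be sketched as follows. The formula $v = [\Phi^{-1}((1+\gamma)/2)\,\hat\sigma/\epsilon]^2$ used here is an assumption about the form of Eq. (12.2.5), based on the normal approximation for an estimator with variance $\sigma^2/v$; the function name is hypothetical. With $\hat\sigma = 1.567$, $\epsilon = 0.01$, and $\gamma = 0.95$, the values used in Example 12.5.5, the sketch returns a number close to the 94,386 reported there (the small difference is consistent with rounding of $\hat\sigma$).

```python
import math
from statistics import NormalDist

def simulations_needed(sigma_hat, eps, gamma):
    """Smallest v for which a normal estimator with variance sigma_hat^2 / v
    is within eps of its mean with probability gamma.
    (Assumed form of Eq. (12.2.5).)"""
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    return math.ceil((z * sigma_hat / eps) ** 2)

k = 5
v = simulations_needed(sigma_hat=1.567, eps=0.01, gamma=0.95)
per_chain = math.ceil(v / k)  # iterations required from each of the k chains
print(v, per_chain)
```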


Some Examples 


Example 12.5.3. Nursing Homes in New Mexico. We actually do not need Gibbs sampling in order to
simulate a sample from the posterior distribution in Example 12.5.1. The reason is
that we have a closed-form expression for the joint distribution of $\mu$ and $\tau$ in that
example. Each of the marginal and conditional distributions is known and easy
to simulate. Gibbs sampling is most useful when only the conditionals are easy to
simulate. However, we can illustrate the use of Gibbs sampling in Example 12.5.1
and compare the simulated results to the known marginal distributions of $\mu$ and $\tau$.

Figure 12.7 Quantile plots of $\mu$ and $\tau$ values simulated from the
posterior distribution in Example 12.5.3. The line on each plot shows
the quantiles of the actual posterior distribution as found in Sec. 8.6. The
horizontal axis on the left plot is labeled by quantiles of the $t$ distribution
with 22 degrees of freedom. The actual posterior of $\mu$ is a rescaling and
shifting of this $t$ distribution. The horizontal axis on the right plot is
labeled by quantiles of the gamma distribution with parameters 11 and
1. The actual posterior of $\tau$ is a rescaling of this gamma distribution.

In Example 12.5.2, we started $k = 5$ Markov chains and burned them in for 100
iterations. Now we wish to produce a sample of $(\mu, \tau)$ pairs from the joint posterior
distribution. After burn-in, we run another $m_0 = 1000$ iterations for each chain. These
iterations produce five correlated sequences of $(\mu, \tau)$ pairs. The correlations between
successive pairs of $\mu$ values are quite small. The same is true of successive $\tau$ values.
To compare the results with the known posterior distributions found in Sec. 8.6,
Fig. 12.7 has a $t$ quantile plot of the $\mu$ values and a gamma quantile plot of the $\tau$
values. (Normal quantile plots were introduced on page 720. Gamma and $t$ quantile
plots are constructed in the same way using gamma and $t$ quantile functions in place
of the standard normal quantile function.) The simulated values seem to lie close to
the lines drawn on the plots in Fig. 12.7. (A few points in the tails stray a bit from
the lines, but this occurs with virtually all quantile plots.) The lines in Fig. 12.7 show
the quantiles of the actual posterior distributions, which are a $t$ distribution with 22
degrees of freedom multiplied by 15.21 and centered at 183.95 for $\mu$ and the gamma
distribution with parameters 11 and 50925.37 for $\tau$.

We can use the sample of $(\mu, \tau)$ pairs to estimate the posterior mean of an
arbitrary function of $(\mu, \tau)$. For example, suppose that we are interested in the mean
$\theta$ of $\mu + 1.645/\tau^{1/2}$, which is the 0.95 quantile of the unknown distribution of the
original observations. The average of our 5000 simulated values of $\mu + 1.645/\tau^{1/2}$ is
$Z = 299.67$. The value of $S$ from Eq. (12.5.1) is 0.4119, giving us a value of $\hat\sigma = 13.03$.
The simulation standard error of $Z$ is then $\hat\sigma/5000^{1/2} = 0.1842$. The true posterior
mean of $\mu + 1.645/\tau^{1/2}$ can be computed exactly in this example, and it is

$$\mu_1 + 1.645\,\beta_1^{1/2}\,\frac{\Gamma(\alpha_1 - 1/2)}{\Gamma(\alpha_1)} = 299.88,$$

a bit more than 1 simulation standard error away from our simulated value of $Z$.
Suppose that we want our estimator of $\theta$ to be within 0.01 of the true value with
probability 0.99. Substituting these values and $\hat\sigma = 13.03$ into Eq. (12.2.5), we find
that we need $v = 11{,}258{,}425$ total simulations. Each of our five Markov chains would
have to be run for 2,251,685 iterations. ◄
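
The calculation in this example can be imitated with a short Python sketch. It uses the posterior hyperparameters and the five initial values of $\mu$ from Example 12.5.2, a burn-in of 100, and $m_0 = 1000$ further iterations per chain. The seed, the function name `run_chain`, and the divisor $k - 1$ in $S$ are assumptions made for the illustration, and the gamma rate is computed from the formula in Example 12.5.1. The simulated numbers will differ from those reported above because the random draws differ.

```python
import math
import random

ALPHA1, BETA1, MU1, LAM1 = 11.0, 50925.37, 183.95, 20.0  # from Example 12.5.2

def run_chain(mu_start, n_burn, n_keep, rng):
    """Return n_keep values of mu + 1.645/sqrt(tau) after n_burn burn-in steps."""
    mu, out = mu_start, []
    for i in range(n_burn + n_keep):
        # tau | mu: gamma with shape alpha1 + 1/2 and rate beta1 + lam1*(mu-mu1)^2/2
        rate = BETA1 + LAM1 * (mu - MU1) ** 2 / 2
        tau = rng.gammavariate(ALPHA1 + 0.5, 1.0 / rate)  # second argument is a scale
        # mu | tau: normal with mean mu1 and variance 1/(lam1*tau)
        mu = rng.gauss(MU1, (LAM1 * tau) ** -0.5)
        if i >= n_burn:
            out.append(mu + 1.645 / math.sqrt(tau))
    return out

rng = random.Random(12345)            # arbitrary seed
starts = [182.17, 227, 272, 137, 82]  # initial values of mu from Example 12.5.2
chains = [run_chain(s0, 100, 1000, rng) for s0 in starts]

z_j = [sum(c) / len(c) for c in chains]  # per-chain estimates Z_j
z = sum(z_j) / len(z_j)                  # combined estimate (equal chain lengths)
s = math.sqrt(sum((zj - z) ** 2 for zj in z_j) / (len(z_j) - 1))
std_err = s / math.sqrt(len(z_j))        # simulation standard error of z
exact = MU1 + 1.645 * math.sqrt(BETA1) * math.gamma(ALPHA1 - 0.5) / math.gamma(ALPHA1)
print(z, std_err, exact)
```

The last line evaluates the closed-form posterior mean, which uses the fact that $E(\tau^{-1/2}) = \beta_1^{1/2}\Gamma(\alpha_1 - 1/2)/\Gamma(\alpha_1)$ when $\tau$ has the gamma distribution with parameters $\alpha_1$ and $\beta_1$.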


The true value of Gibbs sampling begins to emerge in problems with more 
than two parameters. The general Gibbs sampling algorithm for p random variables 
(X1,..., Xp) with p.d.f. f(x) = cg(x) is as follows. First, verify that g looks like an 
easy-to-simulate p.d.f. as a function of each variable for fixed values of all the others. 
Then perform these steps: 


1. Pick starting values $x_2^{(0)}, \ldots, x_p^{(0)}$ for $X_2, \ldots, X_p$, and set $i = 0$.

2. Simulate a new value $x_1^{(i+1)}$ from the conditional distribution of $X_1$ given $X_2 = x_2^{(i)}, \ldots, X_p = x_p^{(i)}$.

3. Simulate a new value $x_2^{(i+1)}$ from the conditional distribution of $X_2$ given $X_1 = x_1^{(i+1)}, X_3 = x_3^{(i)}, \ldots, X_p = x_p^{(i)}$.

$\vdots$

$p + 1$. Simulate a new value $x_p^{(i+1)}$ from the conditional distribution of $X_p$ given $X_1 = x_1^{(i+1)}, \ldots, X_{p-1} = x_{p-1}^{(i+1)}$.

$p + 2$. Replace $i$ by $i + 1$, and return to step 2.


The sequence of successive $p$-tuples of $(X_1, \ldots, X_p)$ values produced by this algorithm
rithm is a Markov chain in the same sense as before. The stationary distribution of 
this Markov chain has the p.d.f. f, and the distribution of an iteration many steps 
after the start should be approximately the stationary distribution. 
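
The general algorithm can be written as one short loop. The sketch below is a hedged illustration, not a definitive implementation: the function `gibbs`, the sampler interface, and the test distribution are all assumptions. The test case is a trivariate normal distribution with means 0, variances 1, and common correlation `RHO`, for which each coordinate given the other two is normal with mean `RHO * (sum of others) / (1 + RHO)` and variance `1 - 2 * RHO**2 / (1 + RHO)`.

```python
import random

def gibbs(samplers, start, n_iter, rng):
    """General Gibbs sampler.

    samplers[j](x, rng) draws coordinate j from its conditional distribution
    given the current values of all the other coordinates in the list x.
    The starting value of the first coordinate is never used, because that
    coordinate is replaced before anything is conditioned on it.
    """
    x = list(start)
    path = []
    for _ in range(n_iter):
        for j, draw in enumerate(samplers):
            x[j] = draw(x, rng)  # later coordinates condition on updated values
        path.append(tuple(x))
    return path

RHO = 0.5  # common correlation in the illustrative trivariate normal

def make_sampler(j):
    def draw(x, rng):
        others = sum(x) - x[j]
        mean = RHO * others / (1 + RHO)
        sd = (1 - 2 * RHO ** 2 / (1 + RHO)) ** 0.5
        return rng.gauss(mean, sd)
    return draw

rng = random.Random(1)  # arbitrary seed
path = gibbs([make_sampler(j) for j in range(3)],
             start=[5.0, -5.0, 0.0], n_iter=20000, rng=rng)
kept = path[500:]
mean0 = sum(p[0] for p in kept) / len(kept)
corr01 = sum(p[0] * p[1] for p in kept) / len(kept)  # estimates E(X1 X2) = RHO
print(mean0, corr01)
```

Passing the samplers as a list keeps the update order of the numbered steps explicit: coordinate 1 is refreshed first, and each later coordinate sees the values already refreshed in the current iteration.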


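Example 12.5.4. Multiple Regression with an Improper Prior. Consider a problem in which we observe
data consisting of triples $(Y_i, x_{1i}, x_{2i})$ for $i = 1, \ldots, n$. We assume that the $x_{ji}$ values
are known, and we model the distribution of $Y_i$ as the normal distribution with mean
$\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}$ and precision $\tau$. This is the multiple regression model introduced
in Sec. 11.5 with the variance replaced by 1 over the precision. Suppose that we use
the improper prior $\xi(\beta_0, \beta_1, \beta_2, \tau) = 1/\tau$ for the parameters. The posterior p.d.f. of
the parameters is then proportional to the likelihood times $1/\tau$, which is a constant
times

$$\tau^{n/2 - 1} \exp\left(-\frac{\tau}{2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{1i} - \beta_2 x_{2i})^2\right). \qquad (12.5.2)$$

To simplify the ensuing formulas, we shall define some summaries of the data:

$$\bar{x}_1 = \frac{1}{n}\sum_{i=1}^{n} x_{1i}, \qquad \bar{x}_2 = \frac{1}{n}\sum_{i=1}^{n} x_{2i}, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,$$

$$s_{11} = \sum_{i=1}^{n} x_{1i}^2, \qquad s_{22} = \sum_{i=1}^{n} x_{2i}^2, \qquad s_{12} = \sum_{i=1}^{n} x_{1i} x_{2i},$$

$$s_{1y} = \sum_{i=1}^{n} x_{1i} y_i, \qquad s_{2y} = \sum_{i=1}^{n} x_{2i} y_i, \qquad s_{yy} = \sum_{i=1}^{n} y_i^2.$$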


Looking at (12.5.2) as a function of $\tau$ for fixed values of $\beta_0$, $\beta_1$, and $\beta_2$, it looks
like the p.d.f. of the gamma distribution with parameters $n/2$ and
$\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_{1i} - \beta_2 x_{2i})^2/2$. Looking at (12.5.2) as a function of $\beta_j$ for fixed values of the other
parameters, it is $e$ to the power of a quadratic in $\beta_j$ with negative coefficient on the
$\beta_j^2$ term. As such, it looks like the p.d.f. of a normal random variable with a mean
that depends on the data and the other $\beta$'s and a variance that equals $1/\tau$ times
some function of the data. We can be more specific if we complete the square in the
expression $\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_{1i} - \beta_2 x_{2i})^2$ three times, each time treating a different
$\beta_j$ as the variable of interest. For example, treating $\beta_0$ as the variable of interest, we
get

$$\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{1i} - \beta_2 x_{2i})^2 = n\left(\beta_0 - [\bar{y} - \beta_1 \bar{x}_1 - \beta_2 \bar{x}_2]\right)^2,$$

plus a term that does not depend on $\beta_0$. So, the conditional distribution of $\beta_0$ given
the remaining parameters is the normal distribution with mean $\bar{y} - \beta_1 \bar{x}_1 - \beta_2 \bar{x}_2$ and
variance $1/[n\tau]$. Treating $\beta_1$ as the variable of interest, we get

$$\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{1i} - \beta_2 x_{2i})^2 = s_{11}(\beta_1 - w_1)^2,$$

plus a term that does not depend on $\beta_1$, where

$$w_1 = \frac{1}{s_{11}} \left(s_{1y} - \beta_0 n \bar{x}_1 - \beta_2 s_{12}\right).$$

This means that the conditional distribution of $\beta_1$ given the other parameters is the
normal distribution with mean $w_1$ and variance $(\tau s_{11})^{-1}$. Similarly, the conditional
distribution of $\beta_2$ given the other parameters is the normal distribution with mean
$w_2$ and variance $(\tau s_{22})^{-1}$, where

$$w_2 = \frac{1}{s_{22}} \left(s_{2y} - \beta_0 n \bar{x}_2 - \beta_1 s_{12}\right). \qquad ◄$$
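
The four conditional distributions of this example translate directly into a Gibbs sampler. The Python sketch below is illustrative only: the function name, the update order, the starting values, the seed, and the synthetic data set (generated as $y = 1 + 2x_1 - x_2$ plus noise) are all assumptions, not part of the example.

```python
import random

def gibbs_regression(y, x1, x2, n_iter, rng):
    """Gibbs sampler for (beta0, beta1, beta2, tau) under the prior 1/tau."""
    n = len(y)
    x1bar, x2bar, ybar = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    b0, b1, b2 = ybar, 0.0, 0.0  # arbitrary starting values
    out = []
    for _ in range(n_iter):
        # tau | betas: gamma with shape n/2 and rate sse/2 (scale 2/sse)
        sse = sum((c - b0 - b1 * a - b2 * b) ** 2 for a, b, c in zip(x1, x2, y))
        tau = rng.gammavariate(n / 2, 2.0 / sse)
        # beta0 | rest: normal, mean ybar - b1*x1bar - b2*x2bar, variance 1/(n*tau)
        b0 = rng.gauss(ybar - b1 * x1bar - b2 * x2bar, (n * tau) ** -0.5)
        # beta1 | rest: normal, mean w1, variance 1/(tau*s11)
        w1 = (s1y - b0 * n * x1bar - b2 * s12) / s11
        b1 = rng.gauss(w1, (tau * s11) ** -0.5)
        # beta2 | rest: normal, mean w2, variance 1/(tau*s22)
        w2 = (s2y - b0 * n * x2bar - b1 * s12) / s22
        b2 = rng.gauss(w2, (tau * s22) ** -0.5)
        out.append((b0, b1, b2, tau))
    return out

# Synthetic data, made up for illustration: y = 1 + 2*x1 - x2 + noise.
rng = random.Random(7)
x1 = [rng.gauss(0, 1) for _ in range(40)]
x2 = [rng.gauss(0, 1) for _ in range(40)]
y = [1 + 2 * a - b + rng.gauss(0, 0.5) for a, b in zip(x1, x2)]
draws = gibbs_regression(y, x1, x2, n_iter=3000, rng=rng)[500:]
post_means = [sum(d[j] for d in draws) / len(draws) for j in range(3)]
print(post_means)
```

With predictors that are far from centered, such as those in Example 12.5.5, successive draws of the $\beta_j$'s can be much more strongly correlated than in this synthetic case, which is one reason a long burn-in may be needed.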


Example 12.5.5. Unemployment in the 1950s. In Example 11.5.9, we saw that unemployment data
from the years 1951-1959 appeared to satisfy the assumptions of the multiple regression
model better than the data that included the year 1950. Let us use just the last
nine years of data from this example (in Table 11.12). We shall use an improper prior
and Gibbs sampling to obtain samples from the posterior distribution of the parameters.
The necessary conditional distributions were all given in Example 12.5.4. We
just need the values of the summary statistics and $n = 9$:

$$\bar{x}_1 = 140.7778, \qquad \bar{x}_2 = 6, \qquad \bar{y} = 2.789,$$
$$s_{11} = 179585, \qquad s_{22} = 384, \qquad s_{12} = 7837,$$
$$s_{1y} = 3580.9, \qquad s_{2y} = 169.2, \qquad s_{yy} = 78.29.$$

Once again, we shall run $k = 5$ Markov chains. In this problem, there are four
coordinates to the parameter: $\beta_i$ for $i = 0, 1, 2$ and $\tau$. So, we compute four $F$ statistics
and burn-in until the largest $F$ is less than $1 + 0.44m$. Suppose that this occurs at
$m = 4546$. We then sample 10,000 more iterations from each Markov chain.

Suppose that we want an interval $[a, b]$ that contains 90 percent of the posterior
distribution of $\beta_1$. The numbers $a$ and $b$ will be the sample 0.05 and 0.95
quantiles. Based on our combined sample of 50,000 values of $\beta_1$, the interval is
$[-0.1178, -0.0553]$. In order to assess the uncertainty in the endpoints, we compute
the 0.05 and 0.95 sample quantiles for each of the five Markov chains. Those values
are


0.05 quantiles: -0.1452, -0.1067, -0.1181, -0.1079, -0.1142
0.95 quantiles: -0.0684, -0.0610, -0.0486, -0.0594, -0.0430.

Figure 12.8 Diagram of hierarchical model in Example 12.5.6. The parameter $\psi$
influences the distributions of the $\mu_i$'s, while the $(\mu_i, \tau_i)$ parameters
influence the distributions of the $Y_{ij}$'s.

The value of $S$ based on the sample 0.05 quantiles is 0.01567, and the value of $S$
based on the sample 0.95 quantiles is 0.01142. To be safe, we shall use the larger of
these two to estimate the simulation standard errors of our interval endpoints. Since
each chain was run for $m_0 = 10{,}000$ iterations, we have $\hat\sigma = S m_0^{1/2} = 1.567$. Suppose
that we want each endpoint of the interval to be within 0.01 of the corresponding
true quantile of the distribution of $\beta_1$ with probability 0.95. (The probability that
both endpoints are within 0.01 would be a bit smaller, but is harder to compute.)
We could use Eq. (12.2.5) to compute how many simulations we would need. That
equation yields $v = 94{,}386$, which means that each of our five chains would need
to be run 18,878 iterations, about twice what we already have. For comparison, a
90 percent confidence interval for $\beta_1$ constructed using the methods of Sec. 11.3 is
$[-0.1124, -0.0579]$. This is quite close to the posterior probability interval. ◄


Although we did not do so in this text, we could have found the posterior
distribution for Example 12.5.5 in closed form. Indeed, the 90 percent confidence
interval calculated at the end of the example contains 90 percent of the posterior
distribution in much the same way that coefficient $1 - \alpha_0$ confidence intervals contain
posterior probability $1 - \alpha_0$ in Sec. 11.4 when we use improper priors. The next
example is one in which a closed-form solution is not available.


Example 12.5.6. Bayesian Analysis of One-Way Layout with Unequal Variances. Consider the one-way
layout that was introduced in Sec. 11.6. There, we assumed that data would be
observed from each of $p$ normal distributions with possibly different means but
the same variance. In order to illustrate the added power of Gibbs sampling, we
shall drop the assumption that each normal distribution has the same variance. That
is, for $i = 1, \ldots, p$, we shall assume that $Y_{i1}, \ldots, Y_{in_i}$ have the normal distribution
with mean $\mu_i$ and precision $\tau_i$, and all observations are independent conditional on
all parameters. Our prior distribution for the parameters will be the following: Let
$\mu_1, \ldots, \mu_p$ be conditionally independent given all other parameters with $\mu_i$ having
the normal distribution with mean $\psi$ and precision $\lambda_0 \tau_i$. Here, $\psi$ is another parameter
that also needs a distribution. We introduce this parameter $\psi$ as a way of saying that
we think that the $\mu_i$'s all come from a common distribution, but we are not willing
to say for sure where that distribution is located. We then say that $\psi$ has the normal
distribution with mean $\psi_0$ and precision $u_0$. For an improper prior, we could set $u_0 = 0$
in what follows, and then $\psi_0$ would not be needed either. Next, we model $\tau_1, \ldots, \tau_p$
as i.i.d. having the gamma distribution with parameters $\alpha_0$ and $\beta_0$. We model $\psi$ and
the $\tau_i$'s as independent. For an improper prior, we could set $\alpha_0 = \beta_0 = 0$. The type
of model just described is called a hierarchical model because of the way that the
distributions fall into a hierarchy of levels. Figure 12.8 illustrates the levels of the
hierarchy in this example.

The joint p.d.f. of the observations and the parameters is the product of the
likelihood function (the p.d.f. of the observations given the $\mu_i$'s and $\tau_i$'s) times the
product of the conditional prior p.d.f.'s of the $\mu_i$'s given the $\tau_i$'s and $\psi$, times the
prior p.d.f.'s of the $\tau_i$'s times the prior p.d.f. for $\psi$. Aside from constants that depend
neither on the data nor on the parameters, this product has the form

$$\exp\left(-\frac{u_0(\psi - \psi_0)^2}{2} - \sum_{i=1}^{p} \tau_i\left[\beta_0 + \frac{n_i(\mu_i - \bar{y}_i)^2 + w_i + \lambda_0(\mu_i - \psi)^2}{2}\right]\right) \prod_{i=1}^{p} \tau_i^{\alpha_0 + [n_i + 1]/2 - 1}, \qquad (12.5.3)$$

where $w_i = \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$ for $i = 1, \ldots, p$. We have arranged terms in (12.5.3)
so that the terms involving each parameter are close together. This will facilitate
describing the Gibbs sampling algorithm.

In order to set up Gibbs sampling, we need to examine (12.5.3) as a function of
each parameter separately. The parameters are $\mu_1, \ldots, \mu_p$; $\tau_1, \ldots, \tau_p$; and $\psi$. As a
function of $\tau_i$, (12.5.3) looks like the p.d.f. of the gamma distribution with parameters
$\alpha_0 + (n_i + 1)/2$ and $\beta_0 + [n_i(\mu_i - \bar{y}_i)^2 + w_i + \lambda_0(\mu_i - \psi)^2]/2$. As a function of $\psi$, it
looks like the p.d.f. of the normal distribution with mean
$[u_0\psi_0 + \lambda_0 \sum_{i=1}^{p} \tau_i\mu_i]/[u_0 + \lambda_0 \sum_{i=1}^{p} \tau_i]$ and precision $u_0 + \lambda_0 \sum_{i=1}^{p} \tau_i$.
This is obtained by completing the square for all terms involving $\psi$. Similarly, by
completing the square for all terms involving $\mu_i$, we find that (12.5.3) looks like the
normal p.d.f. with mean $[n_i\bar{y}_i + \lambda_0\psi]/[n_i + \lambda_0]$ and precision $\tau_i(n_i + \lambda_0)$ as a function
of $\mu_i$. All of these distributions are easy to simulate.
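The three conditional distributions just listed can be turned into a sampler almost line by line. The following is a minimal sketch of one Gibbs chain, not a reference implementation: the function name `gibbs_one_way`, the starting values, and the made-up data are assumptions introduced here for illustration, and the hot dog data are not reproduced. It uses only Python's standard library; note that `random.gammavariate` takes a shape and a scale, so the gamma rate parameter from the text is inverted.

```python
import random
from statistics import fmean

def gibbs_one_way(groups, lam0, alpha0, beta0, u0, psi0, n_iter, seed=0):
    """Run one Gibbs chain; returns a list of (mu, tau, psi) draws."""
    rng = random.Random(seed)
    p = len(groups)
    n = [len(g) for g in groups]
    ybar = [fmean(g) for g in groups]
    w = [sum((y - yb) ** 2 for y in g) for g, yb in zip(groups, ybar)]

    mu = list(ybar)      # starting values: an arbitrary choice for this sketch
    psi = fmean(ybar)
    draws = []
    for _ in range(n_iter):
        # tau_i | rest: gamma with shape alpha0 + (n_i + 1)/2 and the rate below
        tau = []
        for i in range(p):
            rate = beta0 + 0.5 * (n[i] * (mu[i] - ybar[i]) ** 2 + w[i]
                                  + lam0 * (mu[i] - psi) ** 2)
            tau.append(rng.gammavariate(alpha0 + (n[i] + 1) / 2, 1.0 / rate))
        # psi | rest: normal with precision u0 + lam0 * sum(tau)
        prec = u0 + lam0 * sum(tau)
        mean = (u0 * psi0 + lam0 * sum(t * m for t, m in zip(tau, mu))) / prec
        psi = rng.gauss(mean, prec ** -0.5)
        # mu_i | rest: normal with precision tau_i * (n_i + lam0)
        mu = [rng.gauss((n[i] * ybar[i] + lam0 * psi) / (n[i] + lam0),
                        (tau[i] * (n[i] + lam0)) ** -0.5) for i in range(p)]
        draws.append((mu, tau, psi))
    return draws

# Made-up data for three groups (NOT the hot dog calorie data).
data_rng = random.Random(1)
groups = [[data_rng.gauss(m, s) for _ in range(30)]
          for m, s in [(150, 20), (120, 25), (160, 30)]]
draws = gibbs_one_way(groups, lam0=1, alpha0=1, beta0=0.1,
                      u0=0.001, psi0=170, n_iter=2000)
kept = draws[100:]   # discard burn-in iterations
post_mean_mu = [fmean(d[0][i] for d in kept) for i in range(len(groups))]
```

In practice one would run several such chains from different starting values, as the text does with k = 6, so that convergence can be assessed and simulation standard errors computed.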

As an example, use the hot dog calorie data from Example 11.6.2. In this example,
$p = 4$. We shall use a prior distribution in which $\lambda_0 = \alpha_0 = 1$, $\beta_0 = 0.1$, $u_0 = 0.001$, and
$\psi_0 = 170$. We use $k = 6$ Markov chains and do $m = 100$ burn-in simulations, which
turn out to be more than enough to make the maximum of all nine $F$ statistics less
than $1 + 0.44m$. We then run each of the six Markov chains another 10,000 iterations.
The samples from the posterior distribution allow us to answer any questions that we
might have about the parameters, including some that we would not have been able
to answer using the analysis done in Chapter 11. For example, the posterior means
and standard deviations of some of the parameters are listed in Table 12.6. To see
how different the variances are, we can estimate the probability that the variance of
one group is at least 2.25 times as high as that of another group by computing the
fraction of iterations $\ell$ in which at least one $\tau_i^{(\ell)}/\tau_j^{(\ell)} > 2.25$. The result is 0.4,
indicating that there is some chance that at least some of the variances are different. If
the variances are different, the ANOVA calculations in Chapter 11 are not justified.

We can also address the question of how much difference there is between the
$\mu_i$'s. For comparison, we shall do the same calculations that we did in Example 12.3.7.
In 99 percent of the 60,000 simulations, at least one $|\mu_i^{(\ell)} - \mu_j^{(\ell)}| > 26.35$. In about
one-half of the simulations, all $|\mu_i^{(\ell)} - \mu_j^{(\ell)}| > 2.224$. And in 99 percent of the simulations,
the average of the differences was at least 13.78. Figure 12.9 contains a plot of the
sample c.d.f.'s of the largest, smallest, and average of the six $|\mu_i - \mu_j|$ differences.
Careful examination of the results in this example shows that the four $\mu_i$'s appear to
be closer together than we would have thought after the analysis of Example 12.3.7.
This is typical of what occurs when we use a proper prior in a hierarchical model.
In Example 12.3.7, the $\mu_i$'s were all independent, and they did not have a common
unknown mean in the prior. In Example 12.5.6, the $\mu_i$'s all have a common prior
distribution with mean $\psi$, which is an additional unknown parameter. The estimation


Figure 12.9 Sample c.d.f.'s of the maximum, average, and minimum of the six
$|\mu_i - \mu_j|$ differences for Example 12.5.6.

Table 12.6 Posterior means and standard deviations for some parameters in
Example 12.5.6

Type                                 Beef     Meat     Poultry   Specialty
$i$                                  1        2        3         4
$E(\mu_i \mid y)$                    156.6    158.3    120.5     159.6
$(\mathrm{Var}(\mu_i \mid y))^{1/2}$     4.893    5.825    5.521     7.615
$E(1/\tau_i \mid y)$                 495.6    608.5    542.9     568.2
$(\mathrm{Var}(1/\tau_i \mid y))^{1/2}$  166.0    221.2    201.6     307.4

$E(\psi \mid y) = 151.0$        $(\mathrm{Var}(\psi \mid y))^{1/2} = 11.16$


[Figure 12.9 plot: vertical axis, sample d.f.; horizontal axis, difference; three
curves: maximum difference, average difference, and minimum difference.]


of this additional parameter allows the posterior distributions of the $\mu_i$'s to be pulled
toward a location that is near the average of all of the samples. With these data, the
overall sample average is 147.60. ◀
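The summaries quoted in this example (the fraction of iterations with a large precision ratio, and the largest, average, and smallest of the six pairwise differences) are all computed one iteration at a time. The sketch below shows one way to do that; the function name `summarize_iteration` and the three made-up parameter vectors are assumptions for illustration and are not the posterior sample from the example.

```python
from itertools import combinations
from statistics import fmean

def summarize_iteration(mu, tau, ratio=2.25):
    """Summaries of one simulated parameter vector (mu and tau are sequences)."""
    diffs = [abs(a - b) for a, b in combinations(mu, 2)]
    return {
        "max_diff": max(diffs),
        "avg_diff": fmean(diffs),
        "min_diff": min(diffs),
        # Some tau_i / tau_j exceeds `ratio` exactly when max(tau) / min(tau) does.
        "ratio_exceeds": max(tau) / min(tau) > ratio,
    }

# Three made-up iterations standing in for the 60,000 posterior draws.
draws = [
    ([156.0, 158.0, 120.0, 160.0], [0.0020, 0.0016, 0.0018, 0.0017]),
    ([150.0, 161.0, 125.0, 155.0], [0.0030, 0.0012, 0.0020, 0.0019]),
    ([158.0, 152.0, 118.0, 163.0], [0.0021, 0.0019, 0.0040, 0.0015]),
]
summaries = [summarize_iteration(mu, tau) for mu, tau in draws]
frac_ratio = fmean(s["ratio_exceeds"] for s in summaries)
frac_max_over_30 = fmean(s["max_diff"] > 30 for s in summaries)
```

Sorting the `max_diff`, `avg_diff`, and `min_diff` values across iterations gives the three sample c.d.f.'s of the kind plotted in Fig. 12.9.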


Prediction 


All of the calculations done in the examples of this section have concerned functions
of the parameters. The sample from the posterior distribution that we obtain from
Gibbs sampling can also be used to make predictions and form prediction intervals
for future observations. The most straightforward way to make predictions is to
simulate the future data conditional on each value of the parameter from the posterior
sample. Although there are more efficient methods for predicting, this method is easy
to describe and evaluate.


Calories in Hot Dogs. In Example 12.5.6, we might be concerned with how different
we should expect the calorie counts of two hot dogs to be. For example, let $Y_1$ and $Y_3$
be future calorie counts for hot dogs of the beef and poultry varieties, respectively.
We can form a prediction interval for $D = Y_1 - Y_3$ as follows: For each iteration $\ell$, let
the simulated parameter vector be

\[
\theta^{(\ell)} = \left(\mu_1^{(\ell)}, \mu_2^{(\ell)}, \mu_3^{(\ell)}, \mu_4^{(\ell)},
\tau_1^{(\ell)}, \tau_2^{(\ell)}, \tau_3^{(\ell)}, \tau_4^{(\ell)}, \psi^{(\ell)}\right).
\]

For each $\ell$, simulate a beef hot dog calorie count $Y_1^{(\ell)}$ having the normal distribution
with mean $\mu_1^{(\ell)}$ and variance $1/\tau_1^{(\ell)}$. Also simulate a poultry hot dog calorie count $Y_3^{(\ell)}$
having the normal distribution with mean $\mu_3^{(\ell)}$ and variance $1/\tau_3^{(\ell)}$. Then compute
$D^{(\ell)} = Y_1^{(\ell)} - Y_3^{(\ell)}$. Sample quantiles of the values $D^{(1)}, \ldots, D^{(60{,}000)}$ can be used to
estimate quantiles of the distribution of $D$.

For example, suppose that we want a 90 percent prediction interval for $D$. We
simulate 60,000 $D^{(\ell)}$ values as above and find the 0.05 and 0.95 sample quantiles to be
−18.49 and 90.63, which are then the endpoints of our prediction interval. To assess
how close the simulation estimators are to the actual quantiles of the distribution of
$D$, we compute the simulation standard errors of the two endpoints. For the samples
from each of the $k = 6$ Markov chains, we can compute the sample 0.05 quantiles of
our $D$ values. We can then use these values as $Z_1, \ldots, Z_6$ in Eq. (12.5.1) to compute a
value $S$. Our simulation standard error is then $S/6^{1/2}$. We can then repeat this for the
sample 0.95 quantiles. For the two endpoints of our interval, the simulation standard
errors are 0.2228 and 0.4346, respectively. These simulation standard errors are fairly
small compared to the length of the prediction interval. ◀
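The prediction interval and its simulation standard errors can be sketched in code as follows. Several things here are assumptions made for illustration: the helper names, the simple order-statistic quantile rule, the placeholder "posterior draws" (which are not the hot dog posterior), and the reading of $S$ in Eq. (12.5.1) as the sample standard deviation of the per-chain values with divisor $k - 1$, since that equation is not reproduced in this part of the text.

```python
import random
from statistics import stdev

def sample_quantile(values, q):
    """Simple order-statistic quantile (one of several common conventions)."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]

def simulate_differences(chain, rng):
    """chain: list of (mu1, tau1, mu3, tau3) draws; returns one D per draw."""
    out = []
    for mu1, tau1, mu3, tau3 in chain:
        y1 = rng.gauss(mu1, tau1 ** -0.5)   # future beef count, variance 1/tau1
        y3 = rng.gauss(mu3, tau3 ** -0.5)   # future poultry count, variance 1/tau3
        out.append(y1 - y3)
    return out

def quantile_with_sim_se(d_by_chain, q):
    """Pooled sample quantile, and S / sqrt(k) from the k per-chain quantiles."""
    pooled = [d for chain in d_by_chain for d in chain]
    z = [sample_quantile(chain, q) for chain in d_by_chain]
    return sample_quantile(pooled, q), stdev(z) / len(z) ** 0.5

# Placeholder draws for k = 6 chains (NOT the posterior from Example 12.5.6).
rng = random.Random(0)
chains = [[(rng.gauss(156, 5), rng.gammavariate(9, 1 / 4500),
            rng.gauss(120, 5), rng.gammavariate(9, 1 / 4500))
           for _ in range(2000)] for _ in range(6)]
d_by_chain = [simulate_differences(c, rng) for c in chains]
lower, se_lower = quantile_with_sim_se(d_by_chain, 0.05)
upper, se_upper = quantile_with_sim_se(d_by_chain, 0.95)
```

The pair `(lower, upper)` plays the role of the 90 percent prediction interval, and `se_lower` and `se_upper` play the role of the two simulation standard errors.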


Censored Arsenic Measurements. Frey and Edwards (1997) describe the National 
Arsenic Occurrence Survey (NAOS). Several hundred community water systems 
submitted samples of their untreated water in an attempt to help characterize the 
distribution of arsenic across the nation. Arsenic is one of several contaminants that 
the Environmental Protection Agency (EPA) is required to regulate. One difficulty 
in modeling the occurrence of a substance like arsenic is that concentrations are often 
too low to be measured accurately. In such cases, the measurements are censored. 
That is, we only know that the concentration of arsenic is less than some censoring 
point, but not how much less. In the NAOS data set, the censoring point is 0.5 
microgram per liter. Each concentration less than 0.5 microgram per liter is censored. 

Gibbs sampling can help us to estimate the distribution of arsenic in spite of the 
censored observations. Lockwood et al. (2001) do an extensive analysis of the NAOS 
and other data and show how the distribution of arsenic differs from one state to the 
next and from one type of water source to the next. For convenience, let us focus our 
attention on the 24 observations from one state, Ohio. Of those 24 observations, 11 
were taken from groundwater sources (wells). The other 13 came from surface water 
sources (e.g., rivers and lakes). The following are seven uncensored groundwater 
observations from Ohio: 


9.62, 10.50, 2.30, 0.80, 17.04, 9.90, 1.32. 


The other four groundwater observations were censored. 

Suppose that we model groundwater arsenic concentrations in Ohio as having
the lognormal distribution with parameters $\mu$ and $\sigma^2$. One popular way to deal
with censored observations is to treat them like unknown parameters. That is, let
$Y_1, \ldots, Y_4$ be the four unknown concentrations from the four wells where the
measurements were censored. Let $X_1, \ldots, X_7$ stand for the seven uncensored values.
Suppose that $\mu$ and $\tau = 1/\sigma^2$ have the normal-gamma prior distribution with
hyperparameters $\mu_0$, $\lambda_0$, $\alpha_0$, and $\beta_0$. The joint p.d.f. of $X_1, \ldots, X_7$, $Y_1, \ldots, Y_4$, and $\mu$ and $\tau$
is proportional to

\[
\tau^{\alpha_0 + (11+1)/2 - 1}
\exp\left(-\frac{\tau}{2}\left[\lambda_0(\mu - \mu_0)^2 + \sum_{i=1}^{7} (\log(x_i) - \mu)^2
+ \sum_{j=1}^{4} (\log(y_j) - \mu)^2 + 2\beta_0\right]\right).
\]
The observed data consist of the values $x_1, \ldots, x_7$ of $X_1, \ldots, X_7$ together with the
fact that $Y_j \le 0.5$ for $j = 1, \ldots, 4$. The conditional distributions of $\mu$ and $\tau$ given the
data and the other parameters are just like what we obtained in Example 12.5.1. To
be precise, $\mu$ has the normal distribution with mean

\[
\frac{\lambda_0\mu_0 + \sum_{i=1}^{7} \log(x_i) + \sum_{j=1}^{4} \log(y_j)}{\lambda_0 + 11}
\]

and precision $\tau(\lambda_0 + 11)$ conditional on $\tau$, the $Y_j$'s, and the data. Also, $\tau$ has the
gamma distribution with parameters $\alpha_0 + (11 + 1)/2$ and

\[
\beta_0 + \frac{1}{2}\left[\sum_{i=1}^{7} (\log(x_i) - \mu)^2 + \sum_{j=1}^{4} (\log(y_j) - \mu)^2
+ \lambda_0(\mu - \mu_0)^2\right]
\]

conditional on $\mu$, the $Y_j$'s, and the data. The conditional distribution of the $Y_j$'s given
$\mu$, $\tau$, and the data is that of i.i.d. random variables with the lognormal distribution
having parameters $\mu$ and $1/\tau$ but conditional on $Y_j < 0.5$. That is, the conditional
c.d.f. of each $Y_j$ is

\[
F(y) = \frac{\Phi\left([\log(y) - \mu]\tau^{1/2}\right)}{\Phi\left([\log(0.5) - \mu]\tau^{1/2}\right)},
\quad \text{for } y < 0.5.
\]

We can simulate random variables with c.d.f. $F$ so long as we can compute the
standard normal c.d.f. and quantile function. Let $U$ have the uniform distribution
on the interval $[0, 1]$. Then

\[
Y = \exp\left(\mu + \tau^{-1/2}\,\Phi^{-1}\left[U\,\Phi\left([\log(0.5) - \mu]\tau^{1/2}\right)\right]\right)
\]

has the desired c.d.f., $F$.
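The inverse-c.d.f. formula above translates directly into code. The sketch below uses `statistics.NormalDist` from Python's standard library for $\Phi$ and $\Phi^{-1}$; the function name `draw_censored_lognormal` and the values of `mu` and `tau` are illustrative assumptions rather than draws from the arsenic analysis. Because `inv_cdf` requires an argument strictly between 0 and 1, the uniform draw is repeated in the rare event that the product falls outside that range.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides cdf and inv_cdf

def draw_censored_lognormal(mu, tau, cutoff, rng):
    """Draw Y ~ lognormal(mu, 1/tau) conditional on Y < cutoff by inverting F."""
    root_tau = math.sqrt(tau)
    upper = STD_NORMAL.cdf((math.log(cutoff) - mu) * root_tau)
    if upper <= 0.0:
        raise ValueError("P(Y < cutoff) underflowed to 0; a log-scale method is needed")
    p = 0.0
    while not 0.0 < p < 1.0:          # inv_cdf requires 0 < p < 1
        p = rng.random() * upper      # U * Phi([log(cutoff) - mu] * tau^(1/2))
    return math.exp(mu + STD_NORMAL.inv_cdf(p) / root_tau)

rng = random.Random(0)
mu, tau = 1.0, 0.5                     # illustrative values, not posterior draws
ys = [draw_censored_lognormal(mu, tau, 0.5, rng) for _ in range(2000)]
```

Within a Gibbs sampler, this function would be called once for each censored observation at every iteration, using the current values of $\mu$ and $\tau$.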

One example of the type of inference that is needed in an analysis of this sort
is to predict arsenic concentrations for different water systems. Knowing the likely
sizes of arsenic measurements can help water systems choose economical treatments
that will meet the standards set by the EPA. For simplicity, we shall simulate one
arsenic concentration at each iteration of the Markov chain. For example, suppose
that $(\mu^{(i)}, \tau^{(i)})$ are the simulated values of $\mu$ and $\tau$ at the $i$th iteration of the Markov
chain. Then we can simulate $Y^{(i)} = \exp(\mu^{(i)} + Z(\tau^{(i)})^{-1/2})$, where $Z$ is a standard
normal random variable. Figure 12.10 shows a histogram of the simulated $\log(Y^{(i)})$
values from 10 Markov chains of length 10,000 each. The proportion of predicted
values that are below the censoring point of $\log(0.5)$ is 0.335, with a simulation
standard error of 0.001. The median predicted value on the logarithmic scale is 0.208
with a simulation standard error of 0.007. We can transform this back to the original
scale of measurement using the delta method as described in Example 12.2.8. The
median predicted arsenic concentration is $\exp(0.208) = 1.231$ micrograms per liter
with a simulation standard error of $0.007 \exp(0.208) = 0.009$. ◀
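The back-transformation in the last step can be written out as follows. This is only a sketch of the delta-method arithmetic, writing $\hat{m}$ for the simulated median on the logarithmic scale (a symbol introduced here for convenience).

```latex
\[
\exp(\hat{m}) = \exp(0.208) \approx 1.231, \qquad
\text{s.e.}\{\exp(\hat{m})\} \approx
\left.\frac{d}{dm} e^{m}\right|_{m = \hat{m}} \times \text{s.e.}(\hat{m})
= 1.231 \times 0.007 \approx 0.009 .
\]
```

Because the derivative of $e^{m}$ is $e^{m}$ itself, the standard error on the original scale is simply the transformed estimate times the standard error on the logarithmic scale.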


Figure 12.10 Histogram of simulated log(arsenic) values for 10,000 iterations from
each of 10 Markov chains in Example 12.5.8. The vertical line is at the censoring point,
log(0.5). [Vertical axis: count; horizontal axis: predicted log(arsenic) values
(micrograms per liter).]

Note: There Are More-General Markov Chain Monte Carlo Algorithms. Gibbs
sampling requires a special structure for the distribution we wish to simulate. We
need to be able to simulate the conditional distribution of each coordinate given the
other coordinates. In many problems, this is not possible for at least some, if not all,
of the coordinates. If only one coordinate is difficult to simulate, one might try using
an acceptance/rejection simulator for that one coordinate. If even this does not work,
there are more-general Markov chain Monte Carlo algorithms that can be used. The
simplest of these is the Metropolis algorithm introduced by Metropolis et al. (1953).
An introduction to the Metropolis algorithm can be found in chapter 11 of Gelman
et al. (1995) together with a further generalization due to Hastings (1970).


Summary 


We introduced the Gibbs sampling algorithm that produces a Markov chain of
observations from a joint distribution of interest. The joint distribution must have a
special form. As a function of each variable, the joint p.d.f. must look like a p.d.f. from
which it is easy to simulate pseudo-random variables. The Gibbs sampling algorithm
cycles through the coordinates, simulating each one conditional on the values of the
others. The algorithm requires a burn-in period during which the distribution of states
in the Markov chain converges to the desired distribution. Assessing convergence
and computing simulation standard errors of simulated values are both facilitated by
running several independent Markov chains simultaneously.


Exercises 


1. Let $f(x_1, x_2) = c\,g(x_1, x_2)$ be a joint p.d.f. for $(X_1, X_2)$.
For each $x_2$, let $h_2(x_1) = g(x_1, x_2)$. That is, $h_2$ is what we
get by considering $g(x_1, x_2)$ as a function of $x_1$ for fixed $x_2$.
Show that there is a multiplicative factor $c_2$ that does not
depend on $x_1$ such that $h_2(x_1)c_2$ is the conditional p.d.f. of
$X_1$ given $X_2 = x_2$.


2. Let $f(x_1, x_2)$ be a joint p.d.f. Suppose that $(X_1^{(i)}, X_2^{(i)})$
has the joint p.d.f. $f$. Let $(X_1^{(i+1)}, X_2^{(i+1)})$ be the result of
applying steps 2 and 3 of the Gibbs sampling algorithm
on page 824. Prove that $(X_1^{(i+1)}, X_2^{(i)})$ and $(X_1^{(i+1)}, X_2^{(i+1)})$
also have the joint p.d.f. $f$.


3. Let $Z_1, Z_2, \ldots$ form a Markov chain, and assume that
the distribution of $Z_1$ is the stationary distribution. Show
that the joint distribution of $(Z_1, Z_2)$ is the same as the
joint distribution of $(Z_i, Z_{i+1})$ for all $i > 1$. For convenience,
you may assume that the Markov chain has finite
state space, but the result holds in general.


4. Let $X_1, \ldots, X_n$ be uncorrelated, each with variance $\sigma^2$.
Let $Y_1, \ldots, Y_n$ be positively correlated, each with variance
$\sigma^2$. Prove that the variance of $\bar{X}$ is smaller than the
variance of $\bar{Y}$.


5. Use the data consisting of 30 lactic acid concentrations
in cheese, 10 from Example 8.5.4 and 20 from Exercise 16
in Sec. 8.6. Fit the same model used in Example 8.6.2 with
the same prior distribution, but this time use the Gibbs
sampling algorithm described in Example 12.5.1. Simulate
10,000 pairs of $(\mu, \tau)$ parameters. Estimate the posterior
mean of $(\sqrt{\tau}\,\mu)^{-1}$, and compute the simulation standard
error of the estimator.


6. Use the data on dishwasher shipments in Table 11.13 on
page 744. Suppose that we wish to fit a multiple linear
regression model for predicting dishwasher shipments from
time (year minus 1960) and private residential investment.
Suppose that the parameters have the improper prior
proportional to $1/\tau$. Use the Gibbs sampling algorithm to
obtain a sample of size 10,000 from the joint posterior
distribution of the parameters.

a. Let $\beta_1$ be the coefficient of time. Draw a plot of the
sample c.d.f. of $|\beta_1|$ using your posterior sample.

b. We are interested in predicting dishwasher shipments
for 1986.

i. Draw a histogram of the values of $\beta_0 + 26\beta_1 + 67.2\beta_2$
from your posterior distribution.

ii. For each of your simulated parameters, simulate
a dishwasher sales figure for 1986 (time = 26 and
private residential investment = 67.2). Compute
a 90 percent prediction interval from the simulated
values and compare it to the interval found
in Example 11.5.7.

iii. Draw a histogram of the simulated 1986 sales
figures, and compare it to the histogram in part i.
Can you explain why one sample seems to have
larger variance than the other?


7. Use the data in Table 11.19 on page 762. This time fit the
model developed in Example 12.5.6. Use the prior
hyperparameters $\lambda_0 = \alpha_0 = 1$, $\beta_0 = 0.1$, $u_0 = 0.001$, and $\psi_0 = 800$.
Obtain a sample of 10,000 from the posterior joint
distribution of the parameters. Estimate the posterior means of
the three parameters $\mu_1$, $\mu_2$, and $\mu_3$.


8. In this problem, we shall outline a form of robust linear
regression. Assume throughout the exercise that the data
consist of pairs $(Y_i, x_i)$ for $i = 1, \ldots, n$. Assume also that
the $x_i$'s are all known and the $Y_i$'s are independent random
variables. We shall only deal with simple regression here,
but the method easily extends to multiple regression.

a. Let $\beta_0$, $\beta_1$, and $\sigma$ stand for unknown parameters, and
let $a$ be a known positive constant. Prove that the
following two models are equivalent. That is, prove
that the joint distribution of $(Y_1, \ldots, Y_n)$ is the same
in both models.

Model 1: For each $i$, $[Y_i - (\beta_0 + \beta_1 x_i)]/\sigma$ has the
$t$ distribution with $a$ degrees of freedom.

Model 2: For each $i$, $Y_i$ has the normal distribution
with mean $\beta_0 + \beta_1 x_i$ and variance $1/\tau_i$ conditional
on $\tau_i$. Also, $\tau_1, \ldots, \tau_n$ are i.i.d. having the
gamma distribution with parameters $a/2$ and $a\sigma^2/2$.

Hint: Use the same argument that produced the
marginal distribution of $\mu$ in Sec. 8.6 when $\mu$ and $\tau$
had a normal-gamma distribution.


b. Now consider Model 2 from part (a). Let $\eta = \sigma^2$,
and assume that $\eta$ has a prior distribution that is the
gamma distribution with parameters $b/2$ and $f/2$,
where $b$ and $f$ are known constants. Assume that the
parameters $\beta_0$ and $\beta_1$ have an improper prior with
"p.d.f." 1. Show that the product of likelihood and
prior "p.d.f." is a constant times

\[
\eta^{(na+b)/2 - 1} \prod_{i=1}^{n} \tau_i^{(a+1)/2 - 1}
\exp\left(-\frac{1}{2}\left[f\eta + \sum_{i=1}^{n} \tau_i\left(a\eta + (y_i - \beta_0 - \beta_1 x_i)^2\right)\right]\right).
\tag{12.5.4}
\]

c. Consider (12.5.4) as a function of each parameter
for fixed values of the others. Show that Table 12.7
specifies the appropriate conditional distribution for
each parameter given all of the others.


Table 12.7 Parameters and conditional distributions for Exercise 8

Parameter    (12.5.4) looks like the p.d.f. of this distribution

$\eta$       gamma distribution with parameters $(na + b)/2$ and
             $\left(f + a\sum_{i=1}^{n} \tau_i\right)/2$

$\tau_i$     gamma distribution with parameters $(a + 1)/2$ and
             $[a\eta + (y_i - \beta_0 - \beta_1 x_i)^2]/2$

$\beta_0$    normal distribution with mean
             $\sum_{i=1}^{n} \tau_i(y_i - \beta_1 x_i)/\sum_{i=1}^{n} \tau_i$ and
             precision $\sum_{i=1}^{n} \tau_i$

$\beta_1$    normal distribution with mean
             $\sum_{i=1}^{n} \tau_i x_i(y_i - \beta_0)/\sum_{i=1}^{n} \tau_i x_i^2$ and
             precision $\sum_{i=1}^{n} \tau_i x_i^2$


9. Use the data in Table 11.5 on page 699. Suppose that $Y_i$
is the logarithm of pressure and $x_i$ is the boiling point for
the $i$th observation, $i = 1, \ldots, 17$. Use the robust regression
scheme described in Exercise 8 with $a = 5$, $b = 0.1$,
and $f = 0.1$. Estimate the posterior means and standard
deviations of the parameters $\beta_0$, $\beta_1$, and $\eta$.


10. In this problem, we shall outline a Bayesian solution
to the problem described in Example 7.5.10 on page 423.
Let $\tau = 1/\sigma^2$ and use a proper normal-gamma prior of
the form described in Sec. 8.6. In addition to the two
parameters $\mu$ and $\tau$, introduce $n$ additional parameters.
For $i = 1, \ldots, n$, let $Y_i = 1$ if $X_i$ came from the normal
distribution with mean $\mu$ and precision $\tau$, and let $Y_i = 0$ if
$X_i$ came from the standard normal distribution.

a. Find the conditional distribution of $\mu$ given $\tau$; $Y_1, \ldots, Y_n$;
and $X_1, \ldots, X_n$.

b. Find the conditional distribution of $\tau$ given $\mu$; $Y_1, \ldots, Y_n$;
and $X_1, \ldots, X_n$.

c. Find the conditional distribution of $Y_i$ given $\mu$; $\tau$;
$X_1, \ldots, X_n$; and the other $Y_j$'s.

d. Describe how to find the posterior distribution of $\mu$
and $\tau$ using Gibbs sampling.

e. Prove that the posterior mean of $Y_i$ is the posterior
probability that $X_i$ came from the normal distribution
with unknown mean and variance.


11. Consider, once again, the model described in Example 7.5.10.
Assume that $n = 10$ and the observed values of
$X_1, \ldots, X_{10}$ are

−0.92, −0.33, −0.09, 0.27, 0.50, −0.60, 1.66, −1.86,
3.29, 2.30.

a. Fit the model to the observed data using the Gibbs
sampling algorithm developed in Exercise 10. Use
the following prior hyperparameters: $\alpha_0 = 1$, $\beta_0 = 1$,
$\mu_0 = 0$, and $\lambda_0 = 1$.

b. For each $i$, estimate the posterior probability that
$X_i$ came from the normal distribution with unknown
mean and variance.


12. Let $X_1, \ldots, X_n$ be i.i.d. with the normal distribution
having mean $\mu$ and precision $\tau$. Gibbs sampling allows
one to use a prior distribution for $(\mu, \tau)$ in which $\mu$ and
$\tau$ are independent. Let the prior distribution of $\mu$ be the
normal distribution with mean $\mu_0$ and precision $\gamma_0$. Let
the prior distribution of $\tau$ be the gamma distribution with
parameters $\alpha_0$ and $\beta_0$.

a. Show that Table 12.8 specifies the appropriate
conditional distribution for each parameter given the
other.

b. Use the New Mexico nursing home data
(Examples 12.5.2 and 12.5.3). Let the prior hyperparameters
be $\alpha_0 = 2$, $\beta_0 = 6300$, $\mu_0 = 200$, and $\gamma_0 = 6.35 \times 10^{-4}$.
Implement a Gibbs sampler to find the posterior
distribution of $(\mu, \tau)$. In particular, calculate
an interval containing 95 percent of the posterior
distribution of $\mu$.


Table 12.8 Parameters and conditional distributions for Exercise 12

Parameter    Prior times likelihood looks like the p.d.f. of this distribution

$\tau$       gamma distribution with parameters $\alpha_0 + n/2$ and
             $\beta_0 + 0.5\sum_{i=1}^{n} (x_i - \bar{x})^2 + 0.5n(\bar{x} - \mu)^2$

$\mu$        normal distribution with mean
             $(\gamma_0\mu_0 + n\tau\bar{x})/(\gamma_0 + n\tau)$ and precision $\gamma_0 + n\tau$


13. Consider again the situation described in Exercise 12.
This time, we shall let the prior distribution of $\mu$ be more
like it was in the conjugate prior. Introduce another
parameter $\gamma$, whose prior distribution is the gamma
distribution with parameters $a_0$ and $b_0$. Let the prior
distribution of $\mu$ conditional on $\gamma$ be the normal distribution with
mean $\mu_0$ and precision $\gamma$.

a. Prove that the marginal prior distribution of $\mu$
specifies that

\[
\left(\frac{a_0}{b_0}\right)^{1/2} (\mu - \mu_0)
\]

has the $t$ distribution with $2a_0$ degrees of freedom.
Hint: Look at the derivation of the marginal
distribution of $\mu$ in Sec. 8.6.

b. Suppose that we want the marginal prior distributions
of both $\mu$ and $\tau$ to be the same as they were
with the conjugate prior in Sec. 8.6. How must the
prior hyperparameters be related in order to make
this happen?

c. Show that Table 12.9 specifies the appropriate
conditional distribution for each parameter given the
others.


Table 12.9 Parameters and conditional distributions for Exercise 13

Parameter    Prior times likelihood looks like the p.d.f. of this distribution

$\tau$       gamma distribution with parameters $\alpha_0 + n/2$ and
             $\beta_0 + 0.5\sum_{i=1}^{n} (x_i - \bar{x})^2 + 0.5n(\bar{x} - \mu)^2$

$\mu$        normal distribution with mean
             $(\gamma\mu_0 + n\tau\bar{x})/(\gamma + n\tau)$ and precision $\gamma + n\tau$

$\gamma$     gamma distribution with parameters
             $a_0 + 1/2$ and $b_0 + 0.5(\mu - \mu_0)^2$


d. Use the New Mexico nursing home data
(Examples 12.5.2 and 12.5.3). Let the prior hyperparameters
be $\alpha_0 = 2$, $\beta_0 = 6300$, $\mu_0 = 200$, $a_0 = 2$, and
$b_0 = 3150$. Implement a Gibbs sampler to find the
posterior distribution of $(\mu, \tau, \gamma)$. In particular,
calculate an interval containing 95 percent of the
posterior distribution of $\mu$.


14. Consider the situation described in Example 12.5.8. In 
addition to the 11 groundwater sources, there are 13 obser- 
vations taken from surface water sources in Ohio. Of the 
13 surface water measurements, only one was censored. 
The 12 uncensored surface water arsenic concentrations 
from Ohio are 


1.93, 0.99, 2.21, 2.29, 1.15, 1.81, 2.26, 3.10, 1.18, 1.00, 
2.67, 2.15. 
Let the prior hyperparameters be
$\alpha_0 = 0.5$, $\mu_0 = 0$, $\lambda_0 = 1$, and $\beta_0 = 0.5$.


a. Fit the same model as described in Example 12.5.8, 
and predict a logarithm of surface water concentra- 
tion for each iteration of the Markov chain. 


b. Compare a histogram of your predicted measure- 
ments to the histogram of the underground well pre- 
dictions in Fig. 12.10. Describe the main differences. 


c. Estimate the median of the distribution of predicted 
surface water arsenic concentration and compare 
it to the median of the distribution of predicted 
groundwater concentration. 


15. Let $X_1, \ldots, X_{n+m}$ be a random sample from the
exponential distribution with parameter $\theta$. Suppose that $\theta$
has the gamma prior distribution with known parameters
$\alpha$ and $\beta$. Assume that we get to observe $X_1, \ldots, X_n$, but
$X_{n+1}, \ldots, X_{n+m}$ are censored.

a. First, suppose that the censoring works as follows:
For $i = 1, \ldots, m$, we learn only
that $X_{n+i} \le c$, but not the precise value of $X_{n+i}$. Set
up a Gibbs sampling algorithm that will allow us to
simulate the posterior distribution of $\theta$ in spite of the
censoring.

b. Next, suppose that the censoring works as follows:
For $i = 1, \ldots, m$, we learn only
that $X_{n+i} \ge c$, but not the precise value of $X_{n+i}$. Set
up a Gibbs sampling algorithm that will allow us to
simulate the posterior distribution of $\theta$ in spite of the
censoring.

16. Suppose that the time to complete a task is the sum
of two parts $X$ and $Y$. Let $(X_i, Y_i)$ for $i = 1, \ldots, n$ be a
random sample of the times to complete the two parts of
the task. However, for some observations, we get to
observe only $Z_i = X_i + Y_i$. To be precise, suppose that we
observe $(X_i, Y_i)$ for $i = 1, \ldots, k$ and we observe $Z_i$ for
$i = k + 1, \ldots, n$. Suppose that all $X_i$ and $Y_i$ are independent
with every $X_i$ having the exponential distribution
with parameter $\lambda$ and every $Y_i$ having the exponential
distribution with parameter $\mu$.

a. Prove that the conditional distribution of $X_i$ given
$Z_i = z$ has the c.d.f.

\[
G(x \mid z) = \frac{1 - \exp(-x[\lambda - \mu])}{1 - \exp(-z[\lambda - \mu])},
\quad \text{for } 0 < x < z.
\]

b. Suppose that the prior distribution of $(\lambda, \mu)$ is as
follows: The two parameters are independent with
$\lambda$ having the gamma distribution with parameters $a$
and $b$, and $\mu$ having the gamma distribution with
parameters $c$ and $d$. The four numbers $a$, $b$, $c$, and
$d$ are all known constants. Set up a Gibbs sampling
algorithm that allows us to simulate the posterior
distribution of $(\lambda, \mu)$.


12.6 The Bootstrap 


The parametric and nonparametric bootstraps are methods for replacing an
unknown distribution $F$ with a known distribution in a probability calculation. If
we have a sample of data from the distribution $F$, we first approximate $F$ by $\hat{F}$
and then perform the desired calculation. If $\hat{F}$ is a good approximation to $F$, the
bootstrap can be successful. If the desired calculation is sufficiently difficult, we
typically resort to simulation.


Introduction


Assume that we have a sample $X = (X_1, \ldots, X_n)$ of data from some unknown
distribution $F$. Suppose that we are interested in some quantity that involves both $F$
and $X$, for example, the bias of a statistic $g(X)$ as an estimator of the median of $F$.
The main idea behind bootstrap analysis in the simplest cases is the following: First,
replace the unknown distribution $F$ with a known distribution $\hat{F}$. Next, let $X^*$ be a
sample from the distribution $\hat{F}$. Finally, compute the quantity of interest based on $\hat{F}$
and $X^*$, for example, the bias of $g(X^*)$ as an estimator of the median of $\hat{F}$. Consider
the following overly simple example.


Example 12.6.1

The Variance of the Sample Mean. Let $X = (X_1, \ldots, X_n)$ be a random sample from
a distribution with a continuous c.d.f. $F$. For the moment, we shall assume nothing
more about $F$ than that it has finite mean $\mu$ and finite variance $\sigma^2$. Suppose that we are
interested in the variance of the sample mean $\bar{X}$. We already know that this variance
equals $\sigma^2/n$, but we do not know $\sigma^2$. In order to estimate $\sigma^2/n$, the bootstrap replaces
the unknown distribution $F$ with a known distribution $\hat{F}$, which also has finite mean
$\hat{\mu}$ and finite variance $\hat{\sigma}^2$. If $X^* = (X_1^*, \ldots, X_n^*)$ is a random sample from $\hat{F}$, then
the variance of the sample mean $\bar{X}^* = \frac{1}{n}\sum_{i=1}^{n} X_i^*$ is $\hat{\sigma}^2/n$. Since the distribution $\hat{F}$
is known, we should be able to compute $\hat{\sigma}^2/n$, and we can then use this value to
estimate $\sigma^2/n$.

One popular choice of known distribution $\hat{F}$ is the sample c.d.f. $F_n$ defined
in Sec. 10.6. This sample c.d.f. is the discrete c.d.f. that has jumps of size $1/n$ at
each of the observed values $x_1, \ldots, x_n$ of the random sample $X_1, \ldots, X_n$. So, if
$X^* = (X_1^*, \ldots, X_n^*)$ is a random sample from $\hat{F}$, then each $X_i^*$ is a discrete random
variable with the p.f.

\[
\hat{f}(x) =
\begin{cases}
\frac{1}{n} & \text{if } x \in \{x_1, \ldots, x_n\}, \\
0 & \text{otherwise.}
\end{cases}
\]

It is relatively simple to compute the variance $\hat{\sigma}^2$ of a random variable $X_i^*$ whose p.f.
is $\hat{f}$. The variance is

\[
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\]

where $\bar{x}$ is the average of the observed values $x_1, \ldots, x_n$. Thus, our bootstrap estimate
of the variance of $\bar{X}$ is $\hat{\sigma}^2/n$. ◀
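As a small check on this example, the closed-form bootstrap estimate $\hat{\sigma}^2/n$ can be compared with a simulation that actually draws samples from the sample c.d.f. The sketch below does this with Python's standard library; the data values and the function name `bootstrap_var_of_mean` are made up for illustration.

```python
import random
from statistics import fmean, pvariance

def bootstrap_var_of_mean(x, n_boot, rng):
    """Variance of the mean of n_boot resamples drawn from the sample c.d.f. of x."""
    n = len(x)
    # rng.choices samples with replacement, giving each observed value probability 1/n
    means = [fmean(rng.choices(x, k=n)) for _ in range(n_boot)]
    return pvariance(means)

# Made-up observations, used only to illustrate the calculation.
x = [9.1, 7.4, 12.3, 10.8, 8.6, 11.5, 9.9, 13.2, 7.9, 10.2]

exact = pvariance(x) / len(x)            # sigma-hat^2 / n, with divisor n in sigma-hat^2
approx = bootstrap_var_of_mean(x, 5000, random.Random(0))
```

The two numbers should agree up to simulation error, which is why simulation is unnecessary in this case and becomes useful only when no closed form is available.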


The key step in a bootstrap analysis is the choice of the known distribution $\hat{F}$.
The particular choice made in Example 12.6.1, namely, the sample c.d.f., leads to what
is commonly called the nonparametric bootstrap. The reason for this name is that we
do not assume that the distribution belongs to a parametric family when choosing
$\hat{F} = F_n$. If we are willing to assume that $F$ belongs to a parametric family, then we
can choose $\hat{F}$ to be a member of that family and perform a parametric bootstrap
analysis as illustrated next.


The Variance of the Sample Mean. Let $X = (X_1, \ldots, X_n)$ be a random sample from the
normal distribution with mean $\mu$ and variance $\sigma^2$. Suppose, as in Example 12.6.1, that
we are interested in estimating $\sigma^2/n$, the variance of the sample mean $\bar{X}$. To apply
the parametric bootstrap, we replace $F$ by $\hat{F}$, a member of the family of normal
distributions. For this example, we shall choose $\hat{F}$ to be the normal distribution
with mean and variance equal to the M.L.E.'s $\bar{x}$ and $\hat{\sigma}^2$, respectively, although other
choices could be made. We then estimate $\sigma^2/n$ by the variance of the sample mean $\bar{X}^*$
of a random sample from the distribution $\hat{F}$. The variance of $\bar{X}^*$ is easily computed
as $\hat{\sigma}^2/n$. In this case, the parametric bootstrap yields precisely the same answer as
the nonparametric bootstrap. ◀


In Examples 12.6.1 and 12.6.2, it was very simple to compute the variance of the
sample mean of a random sample from the distribution $\hat{F}$. In typical applications of
the bootstrap, it is not so simple to compute the quantity of interest. For example,
there is no simple formula for the variance of the sample median of a sample $X^*$ from
$\hat{F}$ in Examples 12.6.1 and 12.6.2. In such cases, one resorts to simulation techniques in
order to approximate the desired calculation. Before presenting examples of the use
of simulation in the bootstrap, we shall first describe the general class of situations
in which bootstrap analysis is used.

Table 12.10 Correspondence between statistical model and bootstrap analysis

                        Statistical model                   Bootstrap

Distribution            Unknown $F$                         Known $\hat{F}$
Data                    i.i.d. sample $X$ from $F$          i.i.d. sample $X^*$ from $\hat{F}$
Function of interest    $\eta(X, F)$                        $\eta(X^*, \hat{F})$
Parameter/estimate      Mean, median, variance, etc.        Mean, median, variance, etc.
                        of $\eta(X, F)$                     of $\eta(X^*, \hat{F})$


Example 
12.6.3 


The Bootstrap in General 


Let $\eta(X, F)$ be a quantity of interest that possibly depends on both a distribution $F$
and a sample $X$ drawn from $F$. For example, if the distribution $F$ has the p.d.f. $f$, we
might be interested in

\[
\eta(X, F) = \left[\frac{1}{n}\sum_{i=1}^{n} X_i - \int x f(x)\,dx\right]^2.
\tag{12.6.1}
\]

In Examples 12.6.1 and 12.6.2, we wanted the variance of the sample average, which
equals the mean of the quantity in Eq. (12.6.1). In general, we might wish to estimate
the mean or a quantile or some other probabilistic feature of $\eta(X, F)$. The bootstrap
estimates the mean or a quantile or some other feature of $\eta(X, F)$ by the mean or
quantile or the other feature of $\eta(X^*, \hat{F})$, where $X^*$ is a random sample drawn from
the distribution $\hat{F}$, and $\hat{F}$ is some distribution that we hope is close to $F$. Table 12.10
shows the correspondence between the original statistical model for the data and the
quantities that are involved in a bootstrap analysis. The function $\eta$ of interest must be
something that exists for all distributions under consideration and all samples from
those distributions. Other quantities that might be of interest include the quantiles
of the distribution of a statistic, the M.A.E. or M.S.E. of an estimator, the bias of an
estimator, probabilities that statistics lie in various intervals, and the like.

In the simple examples considered so far, the distribution of $\eta(X^*, \hat F)$ was both
known and easy to compute. It will often be the case that the distribution of $\eta(X^*, \hat F)$
is too complicated to allow analytic computation of its features. In such cases, one
approximates the bootstrap estimate using simulation. First, draw a large number
(say, $v$) of random samples $X^{*(1)}, \ldots, X^{*(v)}$ from the distribution $\hat F$ and then compute
$T^{(i)} = \eta(X^{*(i)}, \hat F)$ for $i = 1, \ldots, v$. Finally, compute the desired feature of the sample
c.d.f. of the values $T^{(1)}, \ldots, T^{(v)}$.
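
The simulation recipe just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: $\hat F$ is taken to be the sample c.d.f. $F_n$ (so sampling from $\hat F$ means sampling the data with replacement), the data values are hypothetical, and the names `bootstrap_values` and `squared_error_of_mean` are ours. The function $\eta$ used here is the one in Eq. (12.6.1), for which the exact bootstrap answer $\hat\sigma^2/n$ is available as a check.

```python
import random
import statistics


def bootstrap_values(data, eta, v=2000, seed=0):
    """Return T^(1), ..., T^(v), where T^(i) = eta(X*(i), data).

    Assumption of this sketch: F-hat is the sample c.d.f. F_n of `data`,
    so each bootstrap sample is drawn from `data` with replacement.
    """
    rng = random.Random(seed)
    n = len(data)
    values = []
    for _ in range(v):
        resample = rng.choices(data, k=n)  # i.i.d. sample of size n from F_n
        values.append(eta(resample, data))
    return values


def squared_error_of_mean(resample, data):
    # Eq. (12.6.1) with F replaced by F_n: the mean of F_n is the data average.
    return (statistics.fmean(resample) - statistics.fmean(data)) ** 2


data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]  # hypothetical observations
t_values = bootstrap_values(data, squared_error_of_mean)

# Feature of interest: the mean of the T values (variance of the sample mean).
estimate = statistics.fmean(t_values)
# Exact bootstrap answer for this eta: (variance of F_n) / n.
exact = statistics.pvariance(data) / len(data)
print(estimate, exact)
```

Any other feature (a quantile, a probability) would be computed from `t_values` in the same way; only the last step changes.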


Example 12.6.3. The M.S.E. of the Sample Median. Suppose that we model our data
$X = (X_1, \ldots, X_n)$ as coming from some continuous distribution with the c.d.f. $F$
having median $\theta$. Suppose also that we are interested in using the sample median $M$
as an estimator of $\theta$. We would like to estimate the M.S.E. of $M$ as an estimator of
$\theta$. That is, let $\eta(X, F) = (M - \theta)^2$, and try to estimate the mean of $\eta(X, F)$. Let $\hat F$ be
a known distribution that we hope is similar to $F$, and let $X^*$ be a random sample of
size $n$ from $\hat F$. Regardless of what distribution $\hat F$ we choose, it is very difficult to
compute the bootstrap estimate, the mean of $\eta(X^*, \hat F)$. Instead, we would simulate a
large number $v$ of samples $X^{*(1)}, \ldots, X^{*(v)}$ with the distribution $\hat F$ and then compute
the sample median of each sample, $M^{(1)}, \ldots, M^{(v)}$. Then we compute
$T^{(i)} = (M^{(i)} - \hat\theta)^2$ for $i = 1, \ldots, v$, where $\hat\theta$ is the median of the distribution $\hat F$.
Our simulation approximation to the bootstrap estimate is then the average of the
values $T^{(1)}, \ldots, T^{(v)}$.

Figure 12.11 Sample medians of 10,000 bootstrap samples in Example 12.6.3.
(Histogram; horizontal axis: sample median; vertical axis: count.)

As an example, suppose that our sample consists of the $n = 25$ values $y_1, \ldots, y_{25}$
listed in Table 10.33 on page 662. For a nonparametric bootstrap analysis, we would
use $\hat F = F_n$, which is also listed in Table 10.33. Notice that the median of the
distribution $\hat F$ is the sample median of the original sample, $\hat\theta = 0.40$. Next, we simulate
$v = 10{,}000$ random samples of size 25 from the distribution $\hat F$. This is done by
selecting 25 numbers with replacement from the $y_i$ values and repeating for a total of 10,000
samples of size 25. (Solve Exercise 2 to show why this provides the desired samples
$X^{*(1)}, \ldots, X^{*(v)}$.) For example, here is one of the 10,000 bootstrap samples:

 1.64   0.88   0.70  -1.23  -0.15   1.40   0.07   2.46   2.46   0.10
-0.15   1.62   0.27   0.44  -0.42  -2.46   1.40  -0.10   0.88   0.44
-1.23   1.07   0.81  -0.02   1.62


If we sort the numbers in this sample, we find that the sample median is 0.27. In
fact, there were 1485 bootstrap samples out of 10,000 that had sample median equal
to 0.27. Figure 12.11 contains a histogram of all 10,000 sample medians from the
bootstrap samples. The four largest and four smallest observations in the original
sample never appeared as sample medians in the 10,000 bootstrap samples. For
each of the 10,000 bootstrap samples $i$, we compute the sample median $M^{(i)}$ and
its squared error $T^{(i)} = (M^{(i)} - \hat\theta)^2$, where $\hat\theta = 0.40$ is the median of the distribution
$\hat F$. We then average all of these values over the 10,000 samples and obtain the value
0.0887. This is our simulation approximation to the nonparametric bootstrap estimate
of the M.S.E. of the sample median. The sample variance of the simulated $T^{(i)}$
values is $\hat\sigma^2 = 0.0135$, and the simulation standard error of the bootstrap estimate
is $\hat\sigma/\sqrt{10{,}000} = 1.163 \times 10^{-3}$. ◄
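
As a rough sketch of the calculation in this example, the following Python code approximates the nonparametric bootstrap estimate of the M.S.E. of the sample median together with its simulation standard error. The 25 data values are generated here purely for illustration (Table 10.33 is not reproduced in this section), so the output will not match the numbers above; the function name `bootstrap_mse_of_median` is hypothetical.

```python
import math
import random
import statistics


def bootstrap_mse_of_median(data, v=10_000, seed=1):
    """Approximate the bootstrap estimate of E[(M - theta)^2] by simulation."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = statistics.median(data)  # median of F_n (n is odd here)
    t_values = []
    for _ in range(v):
        m_star = statistics.median(rng.choices(data, k=n))
        t_values.append((m_star - theta_hat) ** 2)
    mse = statistics.fmean(t_values)
    # Simulation standard error: sample s.d. of the T values over sqrt(v).
    sim_se = statistics.pstdev(t_values) / math.sqrt(v)
    return mse, sim_se


# Hypothetical data standing in for the 25 observations of the example.
data_rng = random.Random(0)
data = [round(data_rng.gauss(0.4, 1.0), 2) for _ in range(25)]

mse, sim_se = bootstrap_mse_of_median(data)
print(mse, sim_se)
```

The simulation standard error describes only how well the 10,000 draws approximate the bootstrap estimate; it says nothing about how close the bootstrap estimate is to the true M.S.E.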


Note: Simulation Approximation of Bootstrap Estimates. The bootstrap is an es- 
timation technique. As such, it produces estimates of parameters of interest. When 
a bootstrap estimate is too difficult to compute, we resort to simulation. Simulation 
provides an estimator of the bootstrap estimate. In this text, we shall refer to the 
simulation estimator of a bootstrap estimate as an approximation. We do this merely 
to avoid having to refer to estimators of estimates.

The bootstrap was introduced by Efron (1979), and there have been many applications
since then. Readers interested in more detail about the bootstrap should
see Efron and Tibshirani (1993) or Davison and Hinkley (1997). Young (1994) gives 
a review of much of the literature on the bootstrap and contains many useful ref- 
erences. In the remainder of this section, we shall present several examples of both 
the parametric and nonparametric bootstraps and illustrate how simulation is used 
to approximate the desired bootstrap estimates. 


The Nonparametric Bootstrap 


Example 12.6.4. Confidence Interval for the Interquartile Range. The interquartile
range (IQR) of a distribution was introduced in Definition 4.3.2. It is defined to be
the difference between the upper and lower quartiles, the 0.75 and 0.25 quantiles. The
central 50 percent of the distribution lies between the lower and upper quartiles, so the
IQR is the length of the interval that contains the middle half of the distribution.
For example, if $F$ is the normal distribution with variance $\sigma^2$, then the IQR is $1.35\sigma$.

Suppose that we desire a 90 percent confidence interval for the IQR $\theta$ of the
unknown distribution $F$ from which we have a random sample $X_1, \ldots, X_n$. There are
many ways to form confidence intervals, so we shall restrict attention to those that are
based on the relationship between $\theta$ and the sample IQR $\hat\theta$. Since the IQR is a scale
feature, it might be reasonable to base our confidence interval on the distribution of
$\hat\theta/\theta$. That is, let the 0.05 and 0.95 quantiles of the distribution of $\hat\theta/\theta$ be $a$ and $b$, so
that

$$\Pr\left(a < \frac{\hat\theta}{\theta} < b\right) = 0.9.$$

Because $a < \hat\theta/\theta < b$ is equivalent to $\hat\theta/b < \theta < \hat\theta/a$, we conclude that $(\hat\theta/b, \hat\theta/a)$ is
a 90 percent confidence interval for $\theta$. The nonparametric bootstrap can be used
to estimate the quantiles $a$ and $b$ as follows: Let $\eta(X, F) = \hat\theta/\theta$ be the ratio of the
sample IQR of the sample $X$ to the IQR of the distribution $F$. Let $\hat F = F_n$, and notice
that the IQR of $\hat F$ is $\hat\theta$, the sample IQR. Next, let $X^*$ be a sample of size $n$ from $\hat F$.
Let $\hat\theta^*$ be the sample IQR calculated from $X^*$, so that $\eta(X^*, \hat F) = \hat\theta^*/\hat\theta$. The 0.05
and 0.95 quantiles of the distribution of $\eta(X, F)$ are estimated by the 0.05 and 0.95
quantiles of the distribution of $\eta(X^*, \hat F)$. These last quantiles, in turn, are typically
approximated by simulation. We simulate a large number, say, $v$, of bootstrap samples
$X^{*(i)}$ for $i = 1, \ldots, v$. For each bootstrap sample $i$, we compute the sample IQR $\hat\theta^{*(i)}$
and divide it by $\hat\theta$. Call the ratio $T^{(i)}$. The $q$ quantile of $\hat\theta^*/\hat\theta$ is approximated by the
sample $q$ quantile of the sample $T^{(1)}, \ldots, T^{(v)}$. The confidence interval constructed
by this method is called a percentile bootstrap confidence interval.

We can illustrate this with the data in Table 10.33 on page 662. The IQR of the
distribution $F_n$ is 1.46, the difference between the 19th and 6th observations. We
simulate 10,000 random samples of size 25 from the distribution $F_n$. For the $i$th
sample, we compute the sample IQR $\hat\theta^{*(i)}$ and divide it by 1.46 to obtain $T^{(i)}$. The 500th
and 9500th ordered values from $T^{(1)}, \ldots, T^{(10{,}000)}$ are 0.5822 and 1.6301. We then
compute the percentile bootstrap confidence interval $(1.46/1.6301, 1.46/0.5822) =
(0.8956, 2.5077)$. ◄
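
A minimal sketch of the percentile bootstrap interval for the IQR follows. Two things here are assumptions rather than part of the example: the data are simulated stand-ins, and the helper `quantile` uses one particular convention (the smallest order statistic at which the sample c.d.f. reaches $p$); other quantile conventions would give slightly different endpoints.

```python
import math
import random


def quantile(values, p):
    """Smallest x with sample c.d.f. at x >= p (one of several conventions)."""
    ordered = sorted(values)
    k = math.ceil(round(p * len(ordered), 9))  # round() guards float noise
    k = min(len(ordered), max(1, k))
    return ordered[k - 1]


def iqr(values):
    return quantile(values, 0.75) - quantile(values, 0.25)


def percentile_ci_for_iqr(data, level=0.90, v=5000, seed=2):
    rng = random.Random(seed)
    n = len(data)
    theta_hat = iqr(data)  # IQR of F_n
    # T^(i) = (IQR of the i-th bootstrap sample) / theta_hat
    ratios = [iqr(rng.choices(data, k=n)) / theta_hat for _ in range(v)]
    tail = (1 - level) / 2
    a = quantile(ratios, tail)
    b = quantile(ratios, 1 - tail)
    # a < theta_hat/theta < b  is equivalent to  theta_hat/b < theta < theta_hat/a
    return theta_hat / b, theta_hat / a


# Hypothetical sample of size 25.
data_rng = random.Random(0)
data = [data_rng.gauss(0.0, 1.0) for _ in range(25)]

lower, upper = percentile_ci_for_iqr(data)
print(lower, upper)
```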


Example 12.6.5. Confidence Interval for a Location Parameter. Let $X_1, \ldots, X_n$ be a
random sample from the distribution $F$. Suppose that we want a confidence interval
for the median $\theta$ of $F$. We can base a confidence interval on the sample median $M$.
For example, our interval could be of the form $[M - c_1, M + c_2]$. Since
$M - c_1 \le \theta \le M + c_2$ is equivalent to $-c_2 \le M - \theta \le c_1$, we might want $-c_2$ and $c_1$
to be quantiles of the distribution of $M - \theta$. Without making assumptions about the
distribution $F$, it might be very difficult to approximate quantiles of the distribution
of $M - \theta$. To compute a percentile bootstrap confidence interval, let
$\eta(X, F) = M - \theta$ and then approximate quantiles (such as $\alpha_0/2$ and $1 - \alpha_0/2$) of the
distribution of $\eta(X, F)$ by the corresponding quantiles of $\eta(X^*, \hat F)$. Here, $\hat F$ is the
sample c.d.f., $F_n$, whose median is $M$, and $X^*$ is a random sample from $\hat F$. We then
choose a large number $v$ and simulate many samples $X^{*(i)}$ for $i = 1, \ldots, v$. For each
sample, we compute the sample median $M^{*(i)}$ and then find the sample quantiles of
the values $M^{*(i)} - M$ for $i = 1, \ldots, v$. ◄


How well the percentile bootstrap interval performs in Example 12.6.5 depends
on how closely the distribution of $M^* - M$ approximates the distribution of
$M - \theta$. (Here, $M^*$ is the median of a sample $X^*$ of size $n$ from $\hat F$.) The situation of
Example 12.6.5 is one in which there is a possible improvement to the approximation.
One thing that can make the distribution of $M^* - M$ different from the distribution
of $M - \theta$ is that one of these distributions is more or less spread out than the other.
We can use a different bootstrap approximation that suffers less from differences in
spread. Instead of constructing an interval of the form $[M - c_1, M + c_2]$, we could
let our interval be $[M - d_1 Y, M + d_2 Y]$, where $Y$ is a statistic that measures the
spread of the data. One possibility for $Y$ is the sample IQR. Another possible spread
measure is the sample median absolute deviation (the sample median of the values
$|X_1 - M|, \ldots, |X_n - M|$). Now, we see that $M - d_1 Y \le \theta \le M + d_2 Y$ is equivalent to

$$-d_2 \le \frac{M - \theta}{Y} \le d_1.$$

So, we want $-d_2$ and $d_1$ to be quantiles of the distribution of $(M - \theta)/Y$. This type of
interval resembles the $t$ confidence interval developed in Sec. 8.5. Indeed, the interval
we are constructing is called a percentile-t bootstrap confidence interval. To construct
the percentile-t bootstrap confidence interval, we would use each bootstrap sample
$X^*$ as follows: Compute the sample median $M^*$ and the scale statistic $Y^*$ from the
bootstrap sample $X^*$. Then calculate $T = (M^* - M)/Y^*$. Repeat this procedure many
times producing $T^{(1)}, \ldots, T^{(v)}$ from a large number $v$ of bootstrap samples. Then let
$-d_2$ and $d_1$ be sample quantiles (such as $\alpha_0/2$ and $1 - \alpha_0/2$) of the $T$ values.


Example 12.6.6. Percentile-t Confidence Interval for a Median. Consider the $n = 10$
lactic acid concentrations in cheese from Example 8.5.4. We shall do $v = 10{,}000$
bootstrap simulations to find a coefficient $1 - \alpha_0 = 0.90$ confidence interval for the
median lactic acid concentration $\theta$. The median of the sample values is $M = 1.41$,
and the median absolute deviation is $Y = 0.245$. The 0.05 and 0.95 sample quantiles
of the $(M^{*(i)} - M)/Y^{*(i)}$ values are $-2.133$ and $1.581$. This makes the percentile-t
bootstrap confidence interval $(1.41 - 1.581 \times 0.245, 1.41 + 2.133 \times 0.245) = (1.023, 1.933)$.
For comparison, the 0.05 and 0.95 sample quantiles of the values of $M^{*(i)} - M$ are
$-0.32$ and $0.16$, respectively. This makes the percentile bootstrap interval equal to
$(1.41 - 0.16, 1.41 + 0.32) = (1.25, 1.73)$. ◄
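
The percentile-t construction can be sketched as follows. Several choices here are assumptions made for illustration: the ten data values are illustrative numbers chosen so that $M = 1.41$ and $Y = 0.245$ (the data of Example 8.5.4 are not reproduced in this section), the quantile convention is the order-statistic one, and bootstrap samples whose median absolute deviation $Y^*$ equals 0 are simply skipped, since $T$ is undefined for them. The text does not say how such degenerate samples were treated.

```python
import math
import random
import statistics


def mad(values):
    """Sample median absolute deviation about the sample median."""
    values = list(values)
    m = statistics.median(values)
    return statistics.median([abs(x - m) for x in values])


def order_stat_quantile(values, p):
    ordered = sorted(values)
    k = math.ceil(round(p * len(ordered), 9))
    return ordered[min(len(ordered), max(1, k)) - 1]


def percentile_t_ci_for_median(data, alpha0=0.10, v=10_000, seed=3):
    rng = random.Random(seed)
    n = len(data)
    m, y = statistics.median(data), mad(data)
    t_values = []
    for _ in range(v):
        resample = rng.choices(data, k=n)
        y_star = mad(resample)
        if y_star == 0:
            continue  # degenerate bootstrap sample; skipped (an assumption)
        t_values.append((statistics.median(resample) - m) / y_star)
    neg_d2 = order_stat_quantile(t_values, alpha0 / 2)   # this is -d_2
    d1 = order_stat_quantile(t_values, 1 - alpha0 / 2)
    # Interval [M - d_1 Y, M + d_2 Y]
    return m - d1 * y, m - neg_d2 * y


# Illustrative values only (chosen to give M = 1.41 and Y = 0.245).
data = [0.86, 1.53, 1.57, 1.81, 0.99, 1.09, 1.29, 1.78, 1.29, 1.58]

lo, hi = percentile_t_ci_for_median(data)
print(statistics.median(data), mad(data), (lo, hi))
```

The endpoints depend on the seed, the quantile convention, and the handling of degenerate samples, so they should not be expected to reproduce the interval reported in the example exactly.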


The percentile-t interval in Example 12.6.6 is considerably wider than the percentile
interval. This reflects the fact that the $Y^*$ values from the bootstrap samples
are quite spread out. This in turn suggests that the spread that we should expect to
see in a sample has substantial variability. Hence, it is probably not a good idea to
assume that the spread of the distribution of $M^* - M$ is the same as the spread of
the distribution of $M - \theta$. The percentile-t bootstrap interval is generally preferred
to the percentile bootstrap interval when both are available. This is due to the fact
that the distribution of $(M^* - M)/Y^*$ depends less on $\hat F$ than does the distribution
of $M^* - M$. In particular, $(M^* - M)/Y^*$ does not depend on any scale parameter of
the distribution $\hat F$. For this reason, we expect more similarity between the distributions
of $(M^* - M)/Y^*$ and $(M - \theta)/Y$ than we expect between the distributions of
$M^* - M$ and $M - \theta$.


Example 12.6.7. Features of the Distribution of a Sample Correlation. Let $(X, Y)$ have
a bivariate joint distribution $F$ with finite variances for both coordinates, so that it
makes sense to talk about correlation. Suppose that we observe a random sample
$(X_1, Y_1), \ldots, (X_n, Y_n)$ from the distribution $F$. Suppose further that we are interested
in the distribution of the sample correlation:

$$R = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\left[\sum_{i=1}^{n}(X_i - \bar X)^2\right]^{1/2}\left[\sum_{i=1}^{n}(Y_i - \bar Y)^2\right]^{1/2}}. \tag{12.6.2}$$

We might be interested in the variance of $R$, or the bias of $R$, or some other feature of
$R$ as an estimator of the correlation $\rho$ between $X$ and $Y$. Whatever our goal is, we can
make use of the nonparametric bootstrap. For example, consider the bias of $R$ as an
estimator of $\rho$. This bias equals the mean of $\eta(X, F) = R - \rho$. We begin by replacing
the joint distribution $F$ by the sample distribution $F_n$ of the observed pairs. This $F_n$
is a discrete joint distribution on pairs of real numbers, and it assigns probability $1/n$
to each of the $n$ observed sample pairs. If $(X^*, Y^*)$ has the distribution $F_n$, it is easy to
check (see Exercise 8) that the correlation between $X^*$ and $Y^*$ is $R$. We then choose
a large number $v$ and simulate $v$ samples of size $n$ from $F_n$. For each $i$, we compute
the sample correlation $R^{(i)}$ by plugging the $i$th bootstrap sample into Eq. (12.6.2).
For each $i$, we compute $T^{(i)} = R^{(i)} - R$, and we estimate the mean of $R - \rho$ by the
average $\frac{1}{v}\sum_{i=1}^{v} T^{(i)}$.

As a numerical example, consider the flea beetle data from Example 5.10.2. The
sample correlation is $R = 0.6401$. We sample $v = 10{,}000$ bootstrap samples of size
$n = 31$. The average sample correlation in the 10,000 bootstrap samples is 0.6354
with a simulation standard error of 0.001. We then estimate the bias of the sample
correlation to be $0.6354 - 0.6401 = -0.0047$. ◄
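
The following sketch shows the same bias calculation in Python. The pairs are simulated stand-ins (the flea beetle measurements are not reproduced here), and the function names are hypothetical. The key point is that whole pairs are resampled, so that each bootstrap sample is an i.i.d. sample from the joint sample distribution $F_n$.

```python
import math
import random
import statistics


def sample_correlation(pairs):
    """Sample correlation R of Eq. (12.6.2)."""
    xbar = statistics.fmean(x for x, _ in pairs)
    ybar = statistics.fmean(y for _, y in pairs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in pairs)
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    syy = sum((y - ybar) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_bias_of_correlation(pairs, v=5000, seed=4):
    rng = random.Random(seed)
    n = len(pairs)
    r = sample_correlation(pairs)  # correlation of F_n
    # Resample whole (x, y) pairs with replacement.
    r_stars = [sample_correlation(rng.choices(pairs, k=n)) for _ in range(v)]
    bias = statistics.fmean(r_stars) - r  # average of T^(i) = R^(i) - R
    sim_se = statistics.pstdev(r_stars) / math.sqrt(v)
    return bias, sim_se


# Hypothetical sample of 31 correlated pairs.
data_rng = random.Random(0)
pairs = []
for _ in range(31):
    x = data_rng.gauss(0.0, 1.0)
    pairs.append((x, 0.6 * x + 0.8 * data_rng.gauss(0.0, 1.0)))

r = sample_correlation(pairs)
bias, sim_se = bootstrap_bias_of_correlation(pairs)
print(r, bias, sim_se)
```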


The Parametric Bootstrap 


Example 12.6.8. Correcting the Bias in the Coefficient of Variation. The coefficient of
variation of a distribution is the ratio of the standard deviation to the mean. (Typically,
people only compute the coefficient of variation for distributions of positive random
variables.) If we believe that our data $X_1, \ldots, X_n$ come from a lognormal distribution
with parameters $\mu$ and $\sigma^2$, then the coefficient of variation is $\theta = (e^{\sigma^2} - 1)^{1/2}$. The
M.L.E. of the coefficient of variation is $\hat\theta = (e^{\hat\sigma^2} - 1)^{1/2}$, where $\hat\sigma$ is the M.L.E. of
$\sigma$. We expect the M.L.E. of the coefficient of variation to be a biased estimator because
it is so nonlinear. Computing the bias is a difficult task. However, we can use the
parametric bootstrap to estimate the bias. The M.L.E. $\hat\sigma$ of $\sigma$ is the square root of the
sample variance of $\log(X_1), \ldots, \log(X_n)$. The M.L.E. $\hat\mu$ of $\mu$ is the sample average of
$\log(X_1), \ldots, \log(X_n)$. We can simulate a large number of random samples of size $n$
from the lognormal distribution with parameters $\hat\mu$ and $\hat\sigma^2$. For each $i$, we compute
$\hat\sigma^{*(i)}$, the sample standard deviation of the logarithms of the $i$th bootstrap sample.
We estimate the bias of $\hat\theta$ by the sample average of the values
$T^{(i)} = (e^{[\hat\sigma^{*(i)}]^2} - 1)^{1/2} - \hat\theta$.

As an example, consider the failure times of ball bearings introduced in
Example 5.6.9. If we model these data as lognormal, the M.L.E.'s of the parameters
are $\hat\mu = 4.150$ and $\hat\sigma = 0.5217$. The M.L.E. of $\theta$ is $\hat\theta = 0.5593$. We could draw 10,000
random samples of size 23 from a lognormal distribution and compute the sample
variances of the logarithms. However, there is an easier way to do this simulation.
The distribution of $[\hat\sigma^{*(i)}]^2$ is that of a $\chi^2$ random variable with 22 degrees of
freedom times $0.5217^2/23$. Hence, we shall just sample 10,000 $\chi^2$ random variables with
22 degrees of freedom, multiply each one by $0.5217^2/23$, and call the $i$th one $[\hat\sigma^{*(i)}]^2$.
After doing this, the sample average of the 10,000 $T^{(i)}$ values is $-0.01825$, which is
our parametric bootstrap estimate of the bias of $\hat\theta$. (The simulation standard error is
$9.47 \times 10^{-4}$.) Because our estimate of the bias is negative, this means that we expect
$\hat\theta$ to be smaller than $\theta$. To "correct" the bias, we could add 0.01825 to our original
estimate $\hat\theta$ and produce the new estimate $0.5593 + 0.01825 = 0.5776$. ◄
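
The $\chi^2$ shortcut described above is easy to code. The sketch below uses the values $\hat\sigma = 0.5217$ and $n = 23$ from the example and draws each $\chi^2$ variable as a gamma variable with shape $(n-1)/2$ and scale 2. The function name is hypothetical, and with a different seed or generator the result will differ from the reported $-0.01825$ by an amount on the order of the simulation standard error.

```python
import math
import random
import statistics


def cv_bias_parametric(sigma_hat, n, v=10_000, seed=5):
    """Parametric bootstrap estimate of the bias of theta-hat (lognormal model)."""
    rng = random.Random(seed)
    theta_hat = math.sqrt(math.exp(sigma_hat ** 2) - 1)
    t_values = []
    for _ in range(v):
        # chi-square with n-1 d.f. = gamma with shape (n-1)/2 and scale 2
        chi2 = rng.gammavariate((n - 1) / 2, 2.0)
        sigma2_star = chi2 * sigma_hat ** 2 / n  # [sigma-hat*(i)]^2
        t_values.append(math.sqrt(math.exp(sigma2_star) - 1) - theta_hat)
    bias = statistics.fmean(t_values)
    sim_se = statistics.pstdev(t_values) / math.sqrt(v)
    return theta_hat, bias, sim_se


theta_hat, bias, sim_se = cv_bias_parametric(0.5217, 23)
corrected = theta_hat - bias  # subtracting a negative bias raises the estimate
print(theta_hat, bias, sim_se, corrected)
```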


Example 12.6.9. Estimating the Standard Deviation of a Statistic. Suppose that
$X_1, \ldots, X_n$ is a random sample from the normal distribution with mean $\mu$ and variance
$\sigma^2$. We are interested in the probability that a random variable having this same
distribution is at most $c$. That is, we are interested in estimating
$\theta = \Phi([c - \mu]/\sigma)$. The M.L.E. of $\theta$ is $\hat\theta = \Phi([c - \bar X]/\hat\sigma)$. It is not easy to calculate
the standard deviation of $\hat\theta$ in closed form. However, we can draw many, say, $v$,
bootstrap samples of size $n$ from the normal distribution with mean $\bar x$ and variance
$\hat\sigma^2$. For the $i$th bootstrap sample, we compute a sample average $\bar x^{*(i)}$, a sample
standard deviation $\hat\sigma^{*(i)}$, and, finally, $\hat\theta^{*(i)} = \Phi([c - \bar x^{*(i)}]/\hat\sigma^{*(i)})$. We estimate the
mean of $\hat\theta$ by

$$\bar\theta = \frac{1}{v}\sum_{i=1}^{v} \hat\theta^{*(i)}.$$

(This can also be used, as in Example 12.6.8, to estimate the bias of $\hat\theta$.) The standard
deviation of $\hat\theta$ can then be estimated by the sample standard deviation of the $\hat\theta^{*(i)}$
values,

$$Z = \left(\frac{1}{v}\sum_{i=1}^{v}\left(\hat\theta^{*(i)} - \bar\theta\right)^2\right)^{1/2}.$$

For example, we can use the nursing home data from Sec. 8.6. There are $n = 18$
observations, and we might be interested in $\Phi([200 - \mu]/\sigma)$. The M.L.E.'s of $\mu$ and $\sigma$
are $\hat\mu = 182.17$ and $\hat\sigma = 72.22$. The observed value of $\hat\theta$ is $\Phi([200 - 182.17]/72.22) =
0.5975$. We simulate 10,000 samples of size 18 from the normal distribution with
mean 182.17 and variance $(72.22)^2$. For the $i$th sample, we find the value $\hat\theta^{*(i)}$ for
$i = 1, \ldots, 10{,}000$, and the average of these is $\bar\theta = 0.6020$ with sample standard
deviation $Z = 0.09768$.
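
A sketch of this parametric bootstrap, using the M.L.E.'s quoted above, is given below. It assumes that the M.L.E. of $\sigma$ for each bootstrap sample is the standard deviation with divisor $n$ (`statistics.pstdev`), and the function name is hypothetical. Results will differ slightly from the reported values because of simulation variability.

```python
import random
import statistics
from statistics import NormalDist


def bootstrap_sd_of_normal_prob(mu_hat, sigma_hat, n, c, v=10_000, seed=6):
    """Return (theta-bar, Z) for theta-hat = Phi((c - xbar) / sigma-hat)."""
    rng = random.Random(seed)
    phi = NormalDist().cdf  # standard normal c.d.f.
    theta_stars = []
    for _ in range(v):
        sample = [rng.gauss(mu_hat, sigma_hat) for _ in range(n)]
        xbar_star = statistics.fmean(sample)
        sigma_star = statistics.pstdev(sample)  # M.L.E.: divisor n
        theta_stars.append(phi((c - xbar_star) / sigma_star))
    theta_bar = statistics.fmean(theta_stars)
    z = statistics.pstdev(theta_stars)
    return theta_bar, z


theta_hat = NormalDist().cdf((200.0 - 182.17) / 72.22)
theta_bar, z = bootstrap_sd_of_normal_prob(182.17, 72.22, 18, 200.0)
print(theta_hat, theta_bar, z)
```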

We can compute the simulation standard error of the approximation to the
bootstrap estimate in two steps. First, apply the method of Example 12.2.10. This gives
the simulation standard error of $Z^2$, the sample variance of the $\hat\theta^{*(i)}$'s. In our example,
this yields the value $1.365 \times 10^{-4}$. Second, use the delta method, as in Example 12.2.8,
to find the simulation standard error of the square root of $Z^2$. In our example, this
second step yields the value $6.986 \times 10^{-4}$. ◄

Example 12.6.10. Comparing Means When Variances Are Unequal. Suppose that we have
two samples $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ from two possibly different normal distributions.
That is, $X_1, \ldots, X_m$ are i.i.d. from the normal distribution with mean $\mu_1$ and variance
$\sigma_1^2$, while $Y_1, \ldots, Y_n$ are i.i.d. from the normal distribution with mean $\mu_2$ and variance
$\sigma_2^2$. In Sec. 9.6, we saw how to test the null hypothesis $H_0: \mu_1 = \mu_2$ versus the
alternative hypothesis $H_1: \mu_1 \ne \mu_2$ if we are willing to assume that we know the ratio
$k = \sigma_2^2/\sigma_1^2$. If we are not willing to assume that we know the ratio $k$, we have seen
only approximate tests.

Suppose that we choose to use the usual two-sample $t$ test even though we do not
claim to know $k$. That is, suppose that we choose to reject $H_0$ when $|U| > c$, where $U$
is the statistic defined in Eq. (9.6.3) and $c$ is the $1 - \alpha_0/2$ quantile of the $t$ distribution
with $m + n - 2$ degrees of freedom. This test will not necessarily have level $\alpha_0$ if $k \ne 1$.
We can use the parametric bootstrap to try to compute the level of this test. In fact,
we can use the parametric bootstrap to help us choose a different critical value $c^*$ for
the test so that we at least estimate the type I error probability to be $\alpha_0$.

As an example, we shall use the data from Example 9.6.5 again. The M.L.E.'s of
the variances of the two distributions were $\hat\sigma_1^2 = 0.04$ (for the $X$ data) and $\hat\sigma_2^2 = 0.022$
(for the $Y$ data). The probability of type I error is the probability of rejecting the null
hypothesis given that the null hypothesis is true, that is, given that $\mu_1 = \mu_2$. Hence,
we must simulate bootstrap samples in which the $X$ data and $Y$ data have the same
mean. Since the sample averages of the $X$ and $Y$ data are subtracted from each other
in the calculation of $U$, it will not matter what common mean we choose for the two
samples.
So, the parametric bootstrap can proceed as follows: First, choose a large number
$v$, and for $i = 1, \ldots, v$, simulate $(\bar X^{*(i)}, \bar Y^{*(i)}, S_X^{2*(i)}, S_Y^{2*(i)})$, where all four random
variables are independent with the following distributions:

• $\bar X^{*(i)}$ has the normal distribution with mean 0 and variance $\hat\sigma_1^2/m$.

• $\bar Y^{*(i)}$ has the normal distribution with mean 0 and variance $\hat\sigma_2^2/n$.

• $S_X^{2*(i)}$ is $\hat\sigma_1^2$ times a random variable having the $\chi^2$ distribution with $m - 1$
  degrees of freedom.

• $S_Y^{2*(i)}$ is $\hat\sigma_2^2$ times a random variable having the $\chi^2$ distribution with $n - 1$
  degrees of freedom.

Then compute

$$U^{(i)} = \frac{(m + n - 2)^{1/2}\left(\bar X^{*(i)} - \bar Y^{*(i)}\right)}{\left(\frac{1}{m} + \frac{1}{n}\right)^{1/2}\left(S_X^{2*(i)} + S_Y^{2*(i)}\right)^{1/2}}$$

for each $i$. Our simulation approximation to the bootstrap estimate of the probability
of type I error for the usual two-sample $t$ test would be the proportion of simulations
in which $|U^{(i)}| > c$.

With $v = 10{,}000$, we shall perform the analysis described above for several
different $c$ values. We set $c$ equal to the $1 - \alpha_0/2$ quantile of the $t$ distribution
with 16 degrees of freedom with $\alpha_0 = j/1000$ for each $j = 1, \ldots, 999$. Figure 12.12
shows a plot of the simulation approximation to the bootstrap estimate of the type I
error probability against the nominal level $\alpha_0$ of the test. There is remarkably close
agreement between the two, although the bootstrap estimate is generally slightly
larger. For example, when $\alpha_0 = 0.05$, the bootstrap estimate is 0.065.

Figure 12.12 Plots of bootstrap estimated type I error probability of $t$ test versus
nominal type I error probability in Example 12.6.10. The dashed line is the diagonal
along which the two error probabilities would be equal.

Next, we use the bootstrap analysis to correct the level of the two-sample $t$ test
in this example. To do this, let $Z$ be the sample $1 - \alpha_0$ quantile of our simulated $|U|$
values. If we want a level $\alpha_0$ test, we can replace the critical value $c$ in the two-sample
$t$ test with $Z$ and reject the null hypothesis if $|U| > Z$. For example, with $\alpha_0 = 0.05$,
the 0.975 quantile of the $t$ distribution is 2.12, while in our simulation $Z = 2.277$. The
simulation standard error of $Z$ (based on splitting the 10,000 bootstrap samples into
10 subsamples of 1000 each) is 0.0089. ◄
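
The simulation in this example can be sketched as follows. The variance estimates 0.04 and 0.022 and the critical value 2.12 come from the text; the sample sizes $m = 8$ and $n = 10$ are an assumption of this sketch (the text states only that $m + n - 2 = 16$). The function name is hypothetical, and the exact numbers produced depend on the seed and the generator.

```python
import math
import random


def t_test_bootstrap(var1, var2, m, n, c, alpha0=0.05, v=10_000, seed=7):
    """Return (estimated type I error of |U| > c, sample 1 - alpha0 quantile of |U|)."""
    rng = random.Random(seed)
    abs_u = []
    for _ in range(v):
        xbar = rng.gauss(0.0, math.sqrt(var1 / m))
        ybar = rng.gauss(0.0, math.sqrt(var2 / n))
        # chi-square with k d.f. = gamma with shape k/2 and scale 2
        sx2 = var1 * rng.gammavariate((m - 1) / 2, 2.0)
        sy2 = var2 * rng.gammavariate((n - 1) / 2, 2.0)
        u = (math.sqrt(m + n - 2) * (xbar - ybar)
             / (math.sqrt(1 / m + 1 / n) * math.sqrt(sx2 + sy2)))
        abs_u.append(abs(u))
    rate = sum(u > c for u in abs_u) / v
    abs_u.sort()
    k = math.ceil(round((1 - alpha0) * v, 9))
    z = abs_u[k - 1]  # corrected critical value
    return rate, z


# m = 8 and n = 10 are assumed sample sizes, consistent with 16 d.f.
rate, z = t_test_bootstrap(0.04, 0.022, m=8, n=10, c=2.12)
print(rate, z)
```

Repeating the call with other values of `c` and `alpha0` would trace out a curve of the kind shown in Figure 12.12.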


Example 12.6.11. The Bias of the Sample Correlation. In Example 12.6.7, we made no
assumptions about the distribution $F$ of $(X, Y)$ except that $X$ and $Y$ have finite
variances. Now suppose that we also assume that $(X, Y)$ has a bivariate normal
distribution. We can compute the M.L.E.'s of all of the parameters as in Exercise 24 in
Sec. 7.6. We could then simulate $v$ samples of size $n$ from the bivariate normal
distribution with parameters equal to the M.L.E.'s, as in Example 12.3.6. For sample $i$
for $i = 1, \ldots, v$, we could compute the sample correlation $R^{(i)}$ by substituting the $i$th
sample into Eq. (12.6.2). Our estimate of the bias would be $\bar R - \hat\rho$, where $\bar R$ is the
average of $R^{(1)}, \ldots, R^{(v)}$. Note that $\hat\rho$, the M.L.E. of $\rho$, is the same as $R$.

As a numerical example, consider the flea beetle data from Example 5.10.2. The
sample correlation is $R = 0.6401$. We construct $v = 10{,}000$ samples of size $n = 31$ from
a bivariate normal distribution with correlation 0.6401. The means and variances do
not affect the distribution of $R$. (See Exercise 12.) The average sample correlation
in the 10,000 bootstrap samples is 0.6352 with a simulation standard error of 0.001.
We then estimate the bias of the sample correlation to be $0.6352 - 0.6401 = -0.0049$.
This is pretty much the same as we obtained using the nonparametric bootstrap in
Example 12.6.7. ◄


Summary 


The bootstrap is a method for estimating probabilistic features of a function $\eta$ of our
data $X$ and their unknown distribution $F$. That is, suppose that we are interested in
the mean, a quantile, or some other feature of $\eta(X, F)$. The first step in the bootstrap
is to replace $F$ by a known distribution $\hat F$ that is like $F$ in some way. Next, replace
$X$ by data $X^*$ sampled from $\hat F$. Finally, compute the mean, quantile, or other feature
of $\eta(X^*, \hat F)$ as the bootstrap estimate. This last step generally requires simulation
except in the simplest examples. There are two varieties of bootstrap that differ by
how $\hat F$ is chosen. In the nonparametric bootstrap, the sample c.d.f. is used as $\hat F$. In
the parametric bootstrap, $F$ is assumed to be a member of some parametric family
and $\hat F$ is chosen by replacing the unknown parameter by its M.L.E. or some other
estimate.


Exercises 


1. Suppose that $X_1, \ldots, X_n$ form a random sample from
an exponential distribution with parameter $\theta$. Explain
how to use the parametric bootstrap to estimate the variance
of the sample average $\bar X$. (No simulation is required.)


2. Let $x_1, \ldots, x_n$ be the observed values of a random
sample $X = (X_1, \ldots, X_n)$. Let $F_n$ be the sample c.d.f. Let
$J_1, \ldots, J_n$ be a random sample with replacement from
the numbers $\{1, \ldots, n\}$. Define $X_i^* = x_{J_i}$ for $i = 1, \ldots, n$.
Show that $X^* = (X_1^*, \ldots, X_n^*)$ is an i.i.d. sample from the
distribution $F_n$.


3. Let $n$ be odd, and let $X = (X_1, \ldots, X_n)$ be a sample
of size $n$ from some distribution. Suppose that we wish to
use the nonparametric bootstrap to estimate some feature 
of the sample median. Compute the probability that the 
sample median of a nonparametric bootstrap sample will 
be the smallest observation from the original data X. 


4. Use the data in the first column of Table 11.5 on 
page 699. These data give the boiling points of water at 17 
different locations from Forbes’ experiment. Let F be the 
distribution from which these boiling points were drawn. 
We might not be willing to make many assumptions about 
F. Suppose that we are interested in the bias of the sample 
median as an estimator of the median of the distribution 
F. Use the nonparametric bootstrap to estimate this bias. 
First, do a pilot run to compute the simulation standard 
error of the simulation approximation, and then see how 
many bootstrap samples you need in order for your bias
estimate (for distribution $\hat F$) to be within 0.02 of the true
bias (for distribution $\hat F$) with probability at least 0.9.


5. Use the data in Table 10.6 on page 640. We are inter- 
ested in the bias of the sample median as an estimator of 
the median of the distribution. 


a. Use the nonparametric bootstrap to estimate this 
bias. 


b. How many bootstrap samples does it appear that you 
need in order to estimate the bias to within .05 with 
probability 0.99? 


6. Use the data in Exercise 16 of Sec. 10.7. 


a. Use the nonparametric bootstrap to estimate the
variance of the sample median.

b. How many bootstrap samples does it appear that you
need in order to estimate the variance to within .005
with probability 0.95?


7. Use the blood pressure data in Table 9.2 that was de- 
scribed in Exercise 10 of Sec. 9.6. Suppose now that we are 
not confident that the variances are the same for the two 
treatment groups. Perform a parametric bootstrap analy- 
sis of the sort done in Example 12.6.10. Use v =10,000 
bootstrap simulations. 


a. Estimate the probability of type I error for a two-sample
$t$ test whose nominal level is $\alpha_0 = 0.1$.


b. Correct the level of the two-sample t test by comput- 
ing the appropriate quantile of the bootstrap distri- 
bution of |U|. 

c. Compute the simulation standard error for the quan- 
tile in part (b). 


8. In Example 12.6.7, let $(X^*, Y^*)$ be a random draw from
the sample distribution $F_n$. Prove that the correlation between
$X^*$ and $Y^*$ is $R$ in Eq. (12.6.2).


9. Use the data on fish prices in Table 11.6 on page 707. 
Suppose that we assume only that the distribution of fish 
prices in 1970 and 1980 is a continuous joint distribution 
with finite variances. We are interested in the properties 
of the sample correlation coefficient. Construct 1000 non- 
parametric bootstrap samples for solving this exercise. 


a. Approximate the bootstrap estimate of the variance 
of the sample correlation. 


b. Approximate the bootstrap estimate of the bias of 
the sample correlation. 


c. Compute simulation standard errors of each of the 
above bootstrap estimates. 


10. Use the beef hot dog data in Exercise 7 of Sec. 8.5. 
Form 10,000 nonparametric bootstrap samples to solve 
this exercise. 


a. Approximate a 90 percent percentile bootstrap con- 
fidence interval for the median calorie count in beef 
hot dogs. 


b. Approximate a 90 percent percentile-t bootstrap 
confidence interval for the median calorie count in 
beef hot dogs. 


c. Compare these intervals to the 90 percent interval 
formed using the assumption that the data came from 
a normal distribution.


11. The skewness of a random variable was defined in
Definition 4.4.1. Suppose that $X_1, \ldots, X_n$ form a random
sample from a distribution $F$. The sample skewness is
defined as

$$M_3 = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)^2\right]^{3/2}}.$$

One might use $M_3$ as an estimator of the skewness $\theta$ of
the distribution $F$. The bootstrap can be used to estimate
the bias and standard deviation of the sample skewness as
an estimator of $\theta$.


a. Prove that $M_3$ is the skewness of the sample distribution
$F_n$.

b. Use the 1970 fish price data in Table 11.6 on page 707. 
Compute the sample skewness, and then simulate 
1000 bootstrap samples. Use the bootstrap samples 
to estimate the bias and standard deviation of the 
sample skewness. 


12. Suppose that $(X_1, Y_1), \ldots, (X_n, Y_n)$ form a random
sample from a bivariate normal distribution with means
$\mu_1$ and $\mu_2$, variances $\sigma_1^2$ and $\sigma_2^2$, and correlation $\rho$. Let $R$
be the sample correlation. Prove that the distribution of
$R$ depends only on $\rho$, not on $\mu_1$, $\mu_2$, $\sigma_1^2$, or $\sigma_2^2$.


1. Test the standard normal pseudo-random number generator
on your computer by generating a sample of size
10,000 and drawing a normal quantile plot. How straight
does the plot appear to be?


2. Test the gamma pseudo-random number generator on
your computer. Simulate 10,000 gamma pseudo-random
variables with parameters $\alpha$ and 1 for $\alpha = 0.5, 1, 1.5, 2, 5,
10$. Then draw gamma quantile plots.


3. Test the $t$ pseudo-random number generator on your
computer. Simulate 10,000 $t$ pseudo-random variables
with $m$ degrees of freedom for $m = 1, 2, 5, 10, 20$. Then
draw $t$ quantile plots.


4. Let $X$ and $Y$ be independent random variables with $X$
having the $t$ distribution with five degrees of freedom and
$Y$ having the $t$ distribution with three degrees of freedom.
We are interested in $E(|X - Y|)$.


a. Simulate 1000 pairs of (X;, Y;) each with the above 
joint distribution and estimate E(|X — Y|). 


b. Use your 1000 simulated pairs to estimate the vari- 
ance of |X — Y| also. 


c. Based on your estimated variance, how many sim- 
ulations would you need to be 99 percent confident 
that your estimator of E(|X — Y|) is within 0.01 of the 
actual mean? 


5. Consider the power calculation done in Example 9.5.5. 


a. Simulate $v_0 = 1000$ i.i.d. noncentral $t$ pseudo-random
variables with 14 degrees of freedom and noncentrality
parameter 1.936.


b. Estimate the probability that a noncentral t random 
variable with 14 degrees of freedom and noncentral- 
ity parameter 1.936 is at least 1.761. Also, compute 
the simulation standard error. 


c. Suppose that we want our estimator of the noncentral
$t$ probability in part (b) to be closer than 0.01 to
the true value with probability 0.99. How many noncentral
$t$ random variables do we need to simulate?


6. The $\chi^2$ goodness-of-fit test (see Chapter 10) is based
on an asymptotic approximation to the distribution of the 
test statistic. For small to medium samples, the asymptotic 
approximation might not be very good. Simulation can 
be used to assess how good the approximation is. Simu- 
lation can also be used to estimate the power function of 
a goodness-of-fit test. For this exercise, assume that we are 
performing the test that was done in Example 10.1.6. The 
idea illustrated in this exercise applies in all such problems. 


a. Simulate v =10,000 samples of size n = 23 from the 
normal distribution with mean 3.912 and variance 
0.25. For each sample, compute the x? goodness- 
of-fit statistic Q using the same four intervals that 
were used in Example 10.1.6. Use the simulations 
to estimate the probability that Q is greater than or 
equal to the 0.9, 0.95, and 0.99 quantiles of the x? 
distribution with three degrees of freedom. 


b. Suppose that we are interested in the power function 
of a x* goodness-of-fit test when the actual distribu- 
tion of the data is the normal distribution with mean 
4.2 and variance 0.8. Use simulation to estimate the 
power function of the level 0.1, 0.05, and 0.01 tests at 
the alternative specified. 


7. In Sec. 10.2, we discussed $\chi^2$ goodness-of-fit tests for
composite hypotheses. These tests required computing
M.L.E.’s based on the numbers of observations that fell
into the different intervals used for the test. Suppose
instead that we use the M.L.E.’s based on the original
observations. In this case, we claimed that the asymptotic
distribution of the $\chi^2$ test statistic was somewhere
between two different $\chi^2$ distributions. We can use simulation
to better approximate the distribution of the test
statistic. In this exercise, assume that we are trying to test
the same hypotheses as in Example 10.2.5, although the
methods will apply in all such cases.


a. Simulate $v = 1000$ samples of size $n = 23$ from each
of 10 different normal distributions. Let the normal
distributions have means of 3.8, 3.9, 4.0, 4.1, and 4.2.
Let the distributions have variances of 0.25 and 0.8.
Use all 10 combinations of mean and variance. For
each simulated sample, compute the $\chi^2$ statistic $Q$
using the usual M.L.E.’s of $\mu$ and $\sigma^2$. For each of the
10 normal distributions, estimate the 0.9, 0.95, and
0.99 quantiles of the distribution of $Q$.


b. Do the quantiles change much as the distribution of
the data changes?


c. Consider the test that rejects the null hypothesis if
$Q > 5.2$. Use simulation to estimate the power function
of this test at the following alternative: For each
$i$, $(X_i - 3.912)/0.5$ has the $t$ distribution with five
degrees of freedom.


8. In Example 12.5.6, we used a hierarchical model. In
that model, the parameters $\mu_1, \ldots, \mu_p$ were independent
random variables with $\mu_i$ having the normal distribution
with mean $\psi$ and precision $\lambda_0 \tau_i$ conditional on $\psi$ and
$\tau_1, \ldots, \tau_p$. To make the model more general, we could
also replace $\lambda_0$ by an unknown parameter $\lambda$. That is, let
the $\mu_i$'s be independent with $\mu_i$ having the normal
distribution with mean $\psi$ and precision $\lambda \tau_i$ conditional on
$\psi$, $\lambda$, and $\tau_1, \ldots, \tau_p$. Let $\lambda$ have the gamma distribution
with parameters $\gamma_0$ and $\delta_0$, and let $\lambda$ be independent of $\psi$
and $\tau_1, \ldots, \tau_p$. The remaining parameters have the prior
distributions stated in Example 12.5.6.


a. Write the product of the likelihood and the prior as
a function of the parameters $\mu_1, \ldots, \mu_p$, $\tau_1, \ldots, \tau_p$,
$\psi$, and $\lambda$.


b. Find the conditional distributions of each parameter
given all of the others. Hint: For all the parameters
besides $\lambda$, the distributions should be almost identical
to those given in Example 12.5.6. Wherever $\lambda_0$
appears, of course, something will have to change.


c. Use a prior distribution in which $\alpha_0 = 1$, $\beta_0 = 0.1$,
$u_0 = 0.001$, $\gamma_0 = \delta_0 = 1$, and $\psi_0 = 170$. Fit the model
to the hot dog calorie data from Example 11.6.2.
Compute the posterior means of the four $\mu_i$'s and
$1/\tau_i$'s.


9. In Example 12.5.6, we modeled the parameters $\tau_1, \ldots,
\tau_p$ as i.i.d. having the gamma distribution with parameters
$\alpha_0$ and $\beta_0$. We could have added a level to the hierarchical
model that would allow the $\tau_i$'s to come from a distribution
with an unknown parameter. For example, suppose that
we model the $\tau_i$'s as conditionally independent having the
gamma distribution with parameters $\alpha_0$ and $\beta$ given $\beta$. Let
$\beta$ be independent of $\psi$ and $\mu_1, \ldots, \mu_p$ with $\beta$ having the
gamma distribution with parameters $\epsilon_0$ and $\phi_0$. The rest of
the prior distributions are as specified in Example 12.5.6.


a. Write the product of the likelihood and the prior as
a function of the parameters $\mu_1, \ldots, \mu_p$, $\tau_1, \ldots, \tau_p$,
$\psi$, and $\beta$.


b. Find the conditional distributions of each parameter
given all of the others. Hint: For all the parameters
besides $\beta$, the distributions should be almost identical
to those given in Example 12.5.6. Wherever $\beta_0$
appears, of course, something will have to change.


c. Use a prior distribution in which $\alpha_0 = \lambda_0 = 1$, $u_0 =
0.001$, $\epsilon_0 = 0.3$, $\phi_0 = 3.0$, and $\psi_0 = 170$. Fit the model
to the hot dog calorie data from Example 11.6.2.
Compute the posterior means of the four $\mu_i$'s and
$1/\tau_i$'s.


10. Let $X_1, \ldots, X_k$ be independent random variables
such that $X_i$ has the binomial distribution with parameters
$n_i$ and $p_i$. We wish to test the null hypothesis $H_0$:
$p_1 = \cdots = p_k$ versus the alternative hypothesis $H_1$ that $H_0$
is false. Assume that the numbers $n_1, \ldots, n_k$ are known
constants.


a. Show that the likelihood ratio test procedure is to
reject $H_0$ if the following statistic is greater than or
equal to some constant $c$:

\[
\frac{\prod_{i=1}^{k} X_i^{X_i} (n_i - X_i)^{n_i - X_i}}
{\left(\sum_{j=1}^{k} X_j\right)^{\sum_{j=1}^{k} X_j}
 \left(\sum_{j=1}^{k} (n_j - X_j)\right)^{\sum_{j=1}^{k} (n_j - X_j)}}.
\]


b. Describe how you could use simulation techniques
to estimate the constant $c$ in order to make the likelihood
ratio test have a desired level of significance
$\alpha_0$. (Assume that you can simulate as many binomial
pseudo-random variables as you wish.)


c. Consider the depression study in Example 2.1.4. Let
$p_i$ stand for the probability of success (no relapse)
for the subjects in group $i$ of Table 2.1 on page 57,
where $i = 1$ means imipramine, $i = 2$ means lithium,
$i = 3$ means combination, and $i = 4$ means placebo.
Test the null hypothesis that $p_1 = p_2 = p_3 = p_4$ by
computing the $p$-value for the likelihood ratio test.
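
The kind of simulation that part (b) of Exercise 10 has in mind might look like the Python sketch below. It is a rough illustration under several assumptions that are not part of the exercise: the statistic is handled on the log scale, that is, $\sum_i [X_i \log X_i + (n_i - X_i) \log(n_i - X_i)] - S \log S - F \log F$ with $S = \sum_j X_j$, $F = \sum_j (n_j - X_j)$, and $0 \log 0 = 0$; the null distribution is simulated at a single hypothetical common success probability `p_common`; and the group sizes in `ns` are made up for illustration rather than taken from Table 2.1.

```python
import math
import random


def xlogx(x):
    """Return x * log(x), with the convention 0 * log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def log_lr_statistic(xs, ns):
    """Logarithm of the statistic in part (a) for counts xs and sizes ns."""
    successes = sum(xs)
    failures = sum(ns) - successes
    numerator = sum(xlogx(x) + xlogx(n - x) for x, n in zip(xs, ns))
    return numerator - xlogx(successes) - xlogx(failures)


def simulated_log_critical_value(ns, p_common, alpha0, n_sims, rng):
    """Approximate 1 - alpha0 quantile of the log statistic when all p_i = p_common."""
    stats = []
    for _ in range(n_sims):
        # Each binomial count is a sum of independent Bernoulli trials.
        xs = [sum(rng.random() < p_common for _ in range(n)) for n in ns]
        stats.append(log_lr_statistic(xs, ns))
    stats.sort()
    index = min(n_sims - 1, math.ceil((1.0 - alpha0) * n_sims))
    return stats[index]


rng = random.Random(7)          # fixed seed so the run is reproducible
ns = [30, 40, 35, 25]           # hypothetical group sizes, not the Table 2.1 data
log_c = simulated_log_critical_value(ns, 0.4, 0.05, 2000, rng)

print(log_c)  # reject H0 when the observed log statistic is at least log_c
```

Because the null hypothesis does not specify the common value of the $p_i$, the critical value found this way depends on the value chosen for `p_common`; repeating the simulation over a range of values shows how sensitive it is.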


11. Consider the problem of testing the equality of two
normal means when the variances are unequal. This problem
was introduced on page 593 in Sec. 9.6. The data are
two independent samples $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$. The
$X_i$'s are i.i.d. having the normal distribution with mean $\mu_1$
and variance $\sigma_1^2$, while the $Y_j$'s are i.i.d. having the normal
distribution with mean $\mu_2$ and variance $\sigma_2^2$.


a. Assume that $\mu_1 = \mu_2$. Prove that the random variable
$V$ in Eq. (9.6.14) has a distribution that depends on
the parameters only through the ratio $\sigma_2/\sigma_1$.


b. Let $\nu$ be the approximate degrees of freedom for
Welch’s procedure from Eq. (9.6.17). Prove that the
distribution of $\nu$ depends on the parameters only
through the ratio $\sigma_2/\sigma_1$.


c. Use simulation to assess the approximation in
Welch’s procedure. In particular, set the ratio $\sigma_2/\sigma_1$
equal to each of the numbers 1, 1.5, 2, 3, 5, and 10
in succession. For each value of the ratio, simulate
10,000 samples of sizes $n = 11$ and $m = 10$ (or the
appropriate summary statistics). For each simulated
sample, compute the test statistic $V$ and the 0.9, 0.95,
and 0.99 quantiles of the approximate $t$ distribution
that corresponds to the data in that simulation. Keep
track of the proportion of simulations in which $V$
is greater than each of the three quantiles. How do
these proportions compare to the nominal values 0.1,
0.05, and 0.01?


12. Consider again the situation described in Exercise 11.
This time, use simulation to assess the performance of the
usual two-sample $t$ test. That is, use the same simulations
as in part (c) of Exercise 11 (or ones just like them if you
do not have the same simulations). This time, for each
simulated sample compute the statistic $U$ in Eq. (9.6.3)
and keep track of the proportion of simulations in which $U$
is greater than each of the nominal $t$ quantiles, $T_{m+n-2}^{-1}(1 - \alpha_0)$
for $\alpha_0$ = 0.1, 0.05, and 0.01. How do these proportions
compare to the nominal $\alpha_0$ values?


13. Suppose that our data comprise a set of pairs $(Y_i, x_i)$,
for $i = 1, \ldots, n$. Here, each $Y_i$ is a random variable and
each $x_i$ is a known constant. Suppose that we use a simple
linear regression model in which $E(Y_i) = \beta_0 + \beta_1 x_i$. Let
$\hat{\beta}_1$ stand for the least squares estimator of $\beta_1$. Suppose,
however, that the $Y_i$'s are actually random variables with
translated and scaled $t$ distributions. In particular, suppose
that $(Y_i - \beta_0 - \beta_1 x_i)/\sigma$ are i.i.d. having the $t$ distribution
with $k \ge 5$ degrees of freedom for $i = 1, \ldots, n$. We can
use simulation to estimate the standard deviation of the
sampling distribution of $\hat{\beta}_1$.


a. Prove that the variance of the sampling distribution
of $\hat{\beta}_1$ does not depend on the values of the parameters
$\beta_0$ and $\beta_1$.


b. Prove that the variance of the sampling distribution
of $\hat{\beta}_1$ is equal to $v\sigma^2$, where $v$ does not depend on any
of the parameters $\beta_0$, $\beta_1$, and $\sigma$.


c. Describe a simulation scheme to estimate the value
$v$ from part (b).
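
A minimal Python sketch of one scheme of the kind requested in part (c) follows. It is an illustration under stated assumptions rather than the intended answer: because parts (a) and (b) say that $v$ does not depend on $\beta_0$, $\beta_1$, or $\sigma$, the sketch simulates with $\beta_0 = \beta_1 = 0$ and $\sigma = 1$, computes the least squares slope for each simulated sample, and uses the sample variance of the slopes as the estimate of $v$. The function names and the design points `xs` are hypothetical.

```python
import math
import random
from statistics import variance


def t_variate(df, rng):
    """Draw one t pseudo-random variable as Z / sqrt(W / df)."""
    z = rng.gauss(0.0, 1.0)
    w = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return z / math.sqrt(w / df)


def estimate_v(xs, df, n_sims, rng):
    """Sample variance of simulated least squares slopes with sigma = 1."""
    xbar = sum(xs) / len(xs)
    centered = [x - xbar for x in xs]
    sxx = sum(c * c for c in centered)
    slopes = []
    for _ in range(n_sims):
        # With beta0 = beta1 = 0 and sigma = 1, each Y_i is just a t variate.
        ys = [t_variate(df, rng) for _ in xs]
        slopes.append(sum(c * y for c, y in zip(centered, ys)) / sxx)
    return variance(slopes)


rng = random.Random(7)                    # fixed seed so the run is reproducible
xs = [float(i) for i in range(1, 11)]     # hypothetical design points 1, ..., 10
v_hat = estimate_v(xs, 5, 10000, rng)

print(v_hat)
```

Since the variance of the $t$ distribution with $k > 2$ degrees of freedom is $k/(k - 2)$, the simulated value can be compared with $k / [(k - 2) \sum_i (x_i - \bar{x})^2]$ as a check on the code.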


14. Use the simulation scheme developed in Exercise 13
and the data in Table 11.5 on page 699. Suppose that we
think that the logarithms of pressure are linearly related
to boiling point, but that the logarithms of pressure have
translated and scaled $t$ distributions with $k = 5$ degrees of
freedom. Estimate the value $v$ from part (b) of Exercise 13
using simulation.


15. In Sec. 7.4, we introduced Bayes estimators. For simple
loss functions, such as squared error and absolute error,
we were able to derive general forms for Bayes estimators.
In many real problems, loss functions are not
so simple. Simulation can often be used to approximate
Bayes estimators. Suppose that we are able to simulate
a sample $\theta^{(1)}, \ldots, \theta^{(v)}$ (either directly or by Gibbs
sampling) from the posterior distribution of some parameter
$\theta$ given some observed data $X = x$. Here, $\theta$ can be real
valued or multidimensional. Suppose that we have a loss
function $L(\theta, a)$, and we want to choose $a$ so as to minimize
the posterior mean $E[L(\theta, a) \mid x]$.


a. Describe a general method for approximating the
Bayes estimate in the situation described above.


b. Suppose that the simulation variance of the approximation
to the Bayes estimate is proportional to 1
over the size of the simulation. How could one compute
a simulation standard error for the approximation
to the Bayes estimate?


16. In Example 12.5.2, suppose that the State of New
Mexico wishes to estimate the mean number $\mu$ of medical
in-patient days in nonrural nursing homes. The parameter
is $\theta = (\mu, \tau)$. The loss function will be asymmetric to
reflect different costs of underestimating and overestimating.
Suppose that the loss function is

\[
L(\theta, a) =
\begin{cases}
30(a - \mu) & \text{if } a > \mu, \\
(\mu - a)^2 & \text{if } \mu > a.
\end{cases}
\]

Use the method developed in your solution to Exercise 15
to approximate the Bayes estimate and compute a simulation
standard error.
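
One simple approach to Exercises 15 and 16 is to replace the posterior mean of the loss by its average over the simulated parameter values and to minimize that average over a grid of candidate actions. The Python sketch below illustrates the idea only. The names `asymmetric_loss` and `approx_bayes_estimate` are hypothetical, the grid of candidates is arbitrary, and the "posterior draws" are a made-up normal sample standing in for the output of a real posterior simulation; they are not the posterior from Example 12.5.2.

```python
import random
from statistics import fmean


def asymmetric_loss(mu, a):
    """Loss of Exercise 16: linear when a exceeds mu, quadratic otherwise."""
    return 30.0 * (a - mu) if a > mu else (mu - a) ** 2


def squared_error_loss(mu, a):
    """Squared error loss, used here only as a sanity check."""
    return (mu - a) ** 2


def approx_bayes_estimate(draws, loss, candidates):
    """Candidate action with the smallest average loss over the draws."""
    return min(candidates, key=lambda a: fmean(loss(th, a) for th in draws))


rng = random.Random(0)  # fixed seed so the run is reproducible
# Hypothetical stand-in for simulated posterior values of mu.
draws = [rng.gauss(180.0, 10.0) for _ in range(2000)]
candidates = [150.0 + 0.5 * j for j in range(121)]  # grid from 150 to 210

estimate = approx_bayes_estimate(draws, asymmetric_loss, candidates)
check = approx_bayes_estimate(draws, squared_error_loss, candidates)

print(estimate, check, fmean(draws))
```

Under squared error loss the grid minimizer should land next to the average of the draws, which gives a quick check of the code. A simulation standard error could then be obtained, for instance, by repeating the whole calculation on several independent sets of draws and examining the spread of the resulting estimates.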
Tables

Table of Binomial Probabilities

\[
\Pr(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}
\]

 n   k   p = 0.1  p = 0.2  p = 0.3  p = 0.4  p = 0.5
 2   0    .8100    .6400    .4900    .3600    .2500
     1    .1800    .3200    .4200    .4800    .5000
     2    .0100    .0400    .0900    .1600    .2500
 3   0    .7290    .5120    .3430    .2160    .1250
     1    .2430    .3840    .4410    .4320    .3750
     2    .0270    .0960    .1890    .2880    .3750
     3    .0010    .0080    .0270    .0640    .1250
 4   0    .6561    .4096    .2401    .1296    .0625
     1    .2916    .4096    .4116    .3456    .2500
     2    .0486    .1536    .2646    .3456    .3750
     3    .0036    .0256    .0756    .1536    .2500
     4    .0001    .0016    .0081    .0256    .0625
 5   0    .5905    .3277    .1681    .0778    .0312
     1    .3280    .4096    .3602    .2592    .1562
     2    .0729    .2048    .3087    .3456    .3125
     3    .0081    .0512    .1323    .2304    .3125
     4    .0005    .0064    .0284    .0768    .1562
     5    .0000    .0003    .0024    .0102    .0312
 6   0    .5314    .2621    .1176    .0467    .0156
     1    .3543    .3932    .3025    .1866    .0938
     2    .0984    .2458    .3241    .3110    .2344
     3    .0146    .0819    .1852    .2765    .3125
     4    .0012    .0154    .0595    .1382    .2344
     5    .0001    .0015    .0102    .0369    .0938
     6    .0000    .0001    .0007    .0041    .0156
 7   0    .4783    .2097    .0824    .0280    .0078
     1    .3720    .3670    .2471    .1306    .0547
     2    .1240    .2753    .3176    .2613    .1641
     3    .0230    .1147    .2269    .2903    .2734
     4    .0026    .0287    .0972    .1935    .2734
     5    .0002    .0043    .0250    .0774    .1641
     6    .0000    .0004    .0036    .0172    .0547
     7    .0000    .0000    .0002    .0016    .0078


(continued) 
Table of Binomial Probabilities (continued)


 n   k   p = 0.1  p = 0.2  p = 0.3  p = 0.4  p = 0.5
 8   0    .4305    .1678    .0576    .0168    .0039
     1    .3826    .3355    .1977    .0896    .0312
     2    .1488    .2936    .2965    .2090    .1094
     3    .0331    .1468    .2541    .2787    .2188
     4    .0046    .0459    .1361    .2322    .2734
     5    .0004    .0092    .0467    .1239    .2188
     6    .0000    .0011    .0100    .0413    .1094
     7    .0000    .0001    .0012    .0079    .0312
     8    .0000    .0000    .0001    .0007    .0039
 9   0    .3874    .1342    .0404    .0101    .0020
     1    .3874    .3020    .1556    .0605    .0176
     2    .1722    .3020    .2668    .1612    .0703
     3    .0446    .1762    .2668    .2508    .1641
     4    .0074    .0661    .1715    .2508    .2461
     5    .0008    .0165    .0735    .1672    .2461
     6    .0001    .0028    .0210    .0743    .1641
     7    .0000    .0003    .0039    .0212    .0703
     8    .0000    .0000    .0004    .0035    .0176
     9    .0000    .0000    .0000    .0003    .0020
10   0    .3487    .1074    .0282    .0060    .0010
     1    .3874    .2684    .1211    .0403    .0098
     2    .1937    .3020    .2335    .1209    .0439
     3    .0574    .2013    .2668    .2150    .1172
     4    .0112    .0881    .2001    .2508    .2051
     5    .0015    .0264    .1029    .2007    .2461
     6    .0001    .0055    .0368    .1115    .2051
     7    .0000    .0008    .0090    .0425    .1172
     8    .0000    .0001    .0014    .0106    .0439
     9    .0000    .0000    .0001    .0016    .0098
    10    .0000    .0000    .0000    .0001    .0010
(continued) 
Table of Binomial Probabilities (continued)


 n   k   p = 0.1  p = 0.2  p = 0.3  p = 0.4  p = 0.5
15   0    .2059    .0352    .0047    .0005    .0000
     1    .3432    .1319    .0305    .0047    .0005
     2    .2669    .2309    .0916    .0219    .0032
     3    .1285    .2501    .1700    .0634    .0139
     4    .0428    .1876    .2186    .1268    .0417
     5    .0105    .1032    .2061    .1859    .0916
     6    .0019    .0430    .1472    .2066    .1527
     7    .0003    .0138    .0811    .1771    .1964
     8    .0000    .0035    .0348    .1181    .1964
     9    .0000    .0007    .0116    .0612    .1527
    10    .0000    .0001    .0030    .0245    .0916
    11    .0000    .0000    .0006    .0074    .0417
    12    .0000    .0000    .0001    .0016    .0139
    13    .0000    .0000    .0000    .0003    .0032
    14    .0000    .0000    .0000    .0000    .0005
    15    .0000    .0000    .0000    .0000    .0000
20   0    .1216    .0115    .0008    .0000    .0000
     1    .2701    .0576    .0068    .0005    .0000
     2    .2852    .1369    .0278    .0031    .0002
     3    .1901    .2054    .0716    .0123    .0011
     4    .0898    .2182    .1304    .0350    .0046
     5    .0319    .1746    .1789    .0746    .0148
     6    .0089    .1091    .1916    .1244    .0370
     7    .0020    .0545    .1643    .1659    .0739
     8    .0003    .0222    .1144    .1797    .1201
     9    .0001    .0074    .0654    .1597    .1602
    10    .0000    .0020    .0308    .1171    .1762
    11    .0000    .0005    .0120    .0710    .1602
    12    .0000    .0001    .0039    .0355    .1201
    13    .0000    .0000    .0010    .0146    .0739
    14    .0000    .0000    .0002    .0049    .0370
    15    .0000    .0000    .0000    .0013    .0148
    16    .0000    .0000    .0000    .0003    .0046
    17    .0000    .0000    .0000    .0000    .0011
    18    .0000    .0000    .0000    .0000    .0002
    19    .0000    .0000    .0000    .0000    .0000
    20    .0000    .0000    .0000    .0000    .0000
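
For values of $n$ and $p$ that the binomial table does not cover, an entry can be computed directly from the binomial probability function $\Pr(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$. The short Python sketch below shows one way to do so; the function name `binomial_pmf` is illustrative only.

```python
import math


def binomial_pmf(n, k, p):
    """Pr(X = k) for the binomial distribution with parameters n and p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


# Reproduce a few table entries, rounded to four decimal places.
for n, k, p in [(2, 0, 0.1), (15, 7, 0.4), (20, 10, 0.4)]:
    print(n, k, p, round(binomial_pmf(n, k, p), 4))
```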


Table of Poisson Probabilities

\[
\Pr(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}
\]

 k   λ = .1    .2     .3     .4     .5     .6     .7     .8     .9    1.0
 0    .9048  .8187  .7408  .6703  .6065  .5488  .4966  .4493  .4066  .3679
 1    .0905  .1637  .2222  .2681  .3033  .3293  .3476  .3595  .3659  .3679
 2    .0045  .0164  .0333  .0536  .0758  .0988  .1217  .1438  .1647  .1839
 3    .0002  .0011  .0033  .0072  .0126  .0198  .0284  .0383  .0494  .0613
 4    .0000  .0001  .0003  .0007  .0016  .0030  .0050  .0077  .0111  .0153
 5    .0000  .0000  .0000  .0001  .0002  .0004  .0007  .0012  .0020  .0031
 6    .0000  .0000  .0000  .0000  .0000  .0000  .0001  .0002  .0003  .0005
 7    .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0001
 8    .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000

 k   λ = 1.5    2      3      4      5      6      7      8      9     10
 0    .2231  .1353  .0498  .0183  .0067  .0025  .0009  .0003  .0001  .0000
 1    .3347  .2707  .1494  .0733  .0337  .0149  .0064  .0027  .0011  .0005
 2    .2510  .2707  .2240  .1465  .0842  .0446  .0223  .0107  .0050  .0023
 3    .1255  .1804  .2240  .1954  .1404  .0892  .0521  .0286  .0150  .0076
 4    .0471  .0902  .1680  .1954  .1755  .1339  .0912  .0573  .0337  .0189
 5    .0141  .0361  .1008  .1563  .1755  .1606  .1277  .0916  .0607  .0378
 6    .0035  .0120  .0504  .1042  .1462  .1606  .1490  .1221  .0911  .0631
 7    .0008  .0034  .0216  .0595  .1044  .1377  .1490  .1396  .1171  .0901
 8    .0001  .0009  .0081  .0298  .0653  .1033  .1304  .1396  .1318  .1126
 9    .0000  .0002  .0027  .0132  .0363  .0688  .1014  .1241  .1318  .1251
10    .0000  .0000  .0008  .0053  .0181  .0413  .0710  .0993  .1186  .1251
11    .0000  .0000  .0002  .0019  .0082  .0225  .0452  .0722  .0970  .1137
12    .0000  .0000  .0001  .0006  .0034  .0113  .0264  .0481  .0728  .0948
13    .0000  .0000  .0000  .0002  .0013  .0052  .0142  .0296  .0504  .0729
14    .0000  .0000  .0000  .0001  .0005  .0022  .0071  .0169  .0324  .0521
15    .0000  .0000  .0000  .0000  .0002  .0009  .0033  .0090  .0194  .0347
16    .0000  .0000  .0000  .0000  .0000  .0003  .0014  .0045  .0109  .0217
17    .0000  .0000  .0000  .0000  .0000  .0001  .0006  .0021  .0058  .0128
18    .0000  .0000  .0000  .0000  .0000  .0000  .0002  .0009  .0029  .0071
19    .0000  .0000  .0000  .0000  .0000  .0000  .0001  .0004  .0014  .0037
20    .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0002  .0006  .0019
21    .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0001  .0003  .0009
22    .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0001  .0004
23    .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0002
24    .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0001
25    .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000
Table of the $\chi^2$ Distribution


If $X$ has a $\chi^2$ distribution with $m$ degrees of freedom, this table gives the value of $x$
such that $\Pr(X \le x) = p$, the $p$ quantile of $X$.


                                      p
  m    .005    .01    .025    .05    .10    .20    .25    .30    .40
  1   .0000  .0002  .0010  .0039  .0158  .0642  .1015  .1484  .2750
  2   .0100  .0201  .0506  .1026  .2107  .4463  .5754  .7133  1.022
  3   .0717  .1148  .2158  .3518  .5844  1.005  1.213  1.424  1.869
  4   .2070  .2971  .4844  .7107  1.064  1.649  1.923  2.195  2.753
  5   .4117  .5543  .8312  1.145  1.610  2.343  2.675  3.000  3.655
  6   .6757  .8721  1.237  1.635  2.204  3.070  3.455  3.828  4.570
  7   .9893  1.239  1.690  2.167  2.833  3.822  4.255  4.671  5.493
  8   1.344  1.647  2.180  2.732  3.490  4.594  5.071  5.527  6.423
  9   1.735  2.088  2.700  3.325  4.168  5.380  5.899  6.393  7.357
 10   2.156  2.558  3.247  3.940  4.865  6.179  6.737  7.267  8.295
 11   2.603  3.053  3.816  4.575  5.578  6.989  7.584  8.148  9.237
 12   3.074  3.571  4.404  5.226  6.304  7.807  8.438  9.034  10.18
 13   3.565  4.107  5.009  5.892  7.042  8.634  9.299  9.926  11.13
 14   4.075  4.660  5.629  6.571  7.790  9.467  10.17  10.82  12.08
 15   4.601  5.229  6.262  7.261  8.547  10.31  11.04  11.72  13.03
 16   5.142  5.812  6.908  7.962  9.312  11.15  11.91  12.62  13.98
 17   5.697  6.408  7.564  8.672  10.09  12.00  12.79  13.53  14.94
 18   6.265  7.015  8.231  9.390  10.86  12.86  13.68  14.43  15.89
 19   6.844  7.633  8.907  10.12  11.65  13.72  14.56  15.35  16.85
 20   7.434  8.260  9.591  10.85  12.44  14.58  15.45  16.27  17.81
 21   8.034  8.897  10.28  11.59  13.24  15.44  16.34  17.18  18.77
 22   8.643  9.542  10.98  12.34  14.04  16.31  17.24  18.10  19.73
 23   9.260  10.20  11.69  13.09  14.85  17.19  18.14  19.02  20.69
 24   9.886  10.86  12.40  13.85  15.66  18.06  19.04  19.94  21.65
 25   10.52  11.52  13.12  14.61  16.47  18.94  19.94  20.87  22.62
 30   13.79  14.95  16.79  18.49  20.60  23.36  24.48  25.51  27.44
 40   20.71  22.16  24.43  26.51  29.05  32.34  33.66  34.87  37.13
 50   27.99  29.71  32.36  34.76  37.69  41.45  42.94  44.31  46.86
 60   35.53  37.48  40.48  43.19  46.46  50.64  52.29  53.81  56.62
 70   43.27  45.44  48.76  51.74  55.33  59.90  61.70  63.35  66.40
 80   51.17  53.54  57.15  60.39  64.28  69.21  71.14  72.92  76.19
 90   59.20  61.75  65.65  69.13  73.29  78.56  80.62  82.51  85.99
100   67.33  70.06  74.22  77.93  82.36  87.95  90.13  92.13  95.81


“Table of the $\chi^2$ Distribution” adapted in part from “A new table of percentage points of the chi-square
distribution” by H. Leon Harter. From BIOMETRIKA, vol 51(1964), pp. 231-239.

“Table of the $\chi^2$ Distribution” adapted in part from the BIOMETRIKA TABLES FOR STATISTICIANS,
Vol. 1, 3rd ed., Cambridge University Press, © 1966, edited by E.S. Pearson and H.O. Hartley.
Table of the $\chi^2$ Distribution (continued)


                                         p
  m     .50    .60    .70    .75    .80    .90    .95   .975    .99   .995
  1   .4549  .7083  1.074  1.323  1.642  2.706  3.841  5.024  6.635  7.879
  2   1.386  1.833  2.408  2.773  3.219  4.605  5.991  7.378  9.210  10.60
  3   2.366  2.946  3.665  4.108  4.642  6.251  7.815  9.348  11.34  12.84
  4   3.357  4.045  4.878  5.385  5.989  7.779  9.488  11.14  13.28  14.86
  5   4.351  5.132  6.064  6.626  7.289  9.236  11.07  12.83  15.09  16.75
  6   5.348  6.211  7.231  7.841  8.558  10.64  12.59  14.45  16.81  18.55
  7   6.346  7.283  8.383  9.037  9.803  12.02  14.07  16.01  18.48  20.28
  8   7.344  8.351  9.524  10.22  11.03  13.36  15.51  17.53  20.09  21.95
  9   8.343  9.414  10.66  11.39  12.24  14.68  16.92  19.02  21.67  23.59
 10   9.342  10.47  11.78  12.55  13.44  15.99  18.31  20.48  23.21  25.19
 11   10.34  11.53  12.90  13.70  14.63  17.27  19.68  21.92  24.72  26.76
 12   11.34  12.58  14.01  14.85  15.81  18.55  21.03  23.34  26.22  28.30
 13   12.34  13.64  15.12  15.98  16.98  19.81  22.36  24.74  27.69  29.82
 14   13.34  14.69  16.22  17.12  18.15  21.06  23.68  26.12  29.14  31.32
 15   14.34  15.73  17.32  18.25  19.31  22.31  25.00  27.49  30.58  32.80
 16   15.34  16.78  18.42  19.37  20.47  23.54  26.30  28.85  32.00  34.27
 17   16.34  17.82  19.51  20.49  21.61  24.77  27.59  30.19  33.41  35.72
 18   17.34  18.87  20.60  21.60  22.76  25.99  28.87  31.53  34.81  37.16
 19   18.34  19.91  21.69  22.72  23.90  27.20  30.14  32.85  36.19  38.58
 20   19.34  20.95  22.77  23.83  25.04  28.41  31.41  34.17  37.57  40.00
 21   20.34  21.99  23.86  24.93  26.17  29.62  32.67  35.48  38.93  41.40
 22   21.34  23.03  24.94  26.04  27.30  30.81  33.92  36.78  40.29  42.80
 23   22.34  24.07  26.02  27.14  28.43  32.01  35.17  38.08  41.64  44.18
 24   23.34  25.11  27.10  28.24  29.55  33.20  36.42  39.36  42.98  45.56
 25   24.34  26.14  28.17  29.34  30.68  34.38  37.65  40.65  44.31  46.93
 30   29.34  31.32  33.53  34.80  36.25  40.26  43.77  46.98  50.89  53.67
 40   39.34  41.62  44.16  45.62  47.27  51.81  55.76  59.34  63.69  66.77
 50   49.33  51.89  54.72  56.33  58.16  63.17  67.51  71.42  76.15  79.49
 60   59.33  62.13  65.23  66.98  68.97  74.40  79.08  83.30  88.38  91.95
 70   69.33  72.36  75.69  77.58  79.71  85.53  90.53  95.02  100.4  104.2
 80   79.33  82.57  86.12  88.13  90.41  96.58  101.9  106.6  112.3  116.3
 90   89.33  92.76  96.52  98.65  101.1  107.6  113.1  118.1  124.1  128.3
100   99.33  102.9  106.9  109.1  111.7  118.5  124.3  129.6  135.8  140.2
Table of the $t$ Distribution


If $X$ has a $t$ distribution with $m$ degrees of freedom, the table gives the value of $x$
such that $\Pr(X \le x) = p$.


  m  p = .55   .60    .65    .70    .75    .80    .85    .90    .95    .975     .99    .995
  1    .158   .325   .510   .727  1.000  1.376  1.963  3.078  6.314  12.706  31.821  63.657
  2    .142   .289   .445   .617   .816  1.061  1.386  1.886  2.920   4.303   6.965   9.925
  3    .137   .277   .424   .584   .765   .978  1.250  1.638  2.353   3.182   4.541   5.841
  4    .134   .271   .414   .569   .741   .941  1.190  1.533  2.132   2.776   3.747   4.604
  5    .132   .267   .408   .559   .727   .920  1.156  1.476  2.015   2.571   3.365   4.032
  6    .131   .265   .404   .553   .718   .906  1.134  1.440  1.943   2.447   3.143   3.707
  7    .130   .263   .402   .549   .711   .896  1.119  1.415  1.895   2.365   2.998   3.499
  8    .130   .262   .399   .546   .706   .889  1.108  1.397  1.860   2.306   2.896   3.355
  9    .129   .261   .398   .543   .703   .883  1.100  1.383  1.833   2.262   2.821   3.250
 10    .129   .260   .397   .542   .700   .879  1.093  1.372  1.812   2.228   2.764   3.169
 11    .129   .260   .396   .540   .697   .876  1.088  1.363  1.796   2.201   2.718   3.106
 12    .128   .259   .395   .539   .695   .873  1.083  1.356  1.782   2.179   2.681   3.055
 13    .128   .259   .394   .538   .694   .870  1.079  1.350  1.771   2.160   2.650   3.012
 14    .128   .258   .393   .537   .692   .868  1.076  1.345  1.761   2.145   2.624   2.977
 15    .128   .258   .393   .536   .691   .866  1.074  1.341  1.753   2.131   2.602   2.947
 16    .128   .258   .392   .535   .690   .865  1.071  1.337  1.746   2.120   2.583   2.921
 17    .128   .257   .392   .534   .689   .863  1.069  1.333  1.740   2.110   2.567   2.898
 18    .127   .257   .392   .534   .688   .862  1.067  1.330  1.734   2.101   2.552   2.878
 19    .127   .257   .391   .533   .688   .861  1.066  1.328  1.729   2.093   2.539   2.861
 20    .127   .257   .391   .533   .687   .860  1.064  1.325  1.725   2.086   2.528   2.845
 21    .127   .257   .391   .532   .686   .859  1.063  1.323  1.721   2.080   2.518   2.831
 22    .127   .256   .390   .532   .686   .858  1.061  1.321  1.717   2.074   2.508   2.819
 23    .127   .256   .390   .532   .685   .858  1.060  1.319  1.714   2.069   2.500   2.807
 24    .127   .256   .390   .531   .685   .857  1.059  1.318  1.711   2.064   2.492   2.797
 25    .127   .256   .390   .531   .684   .856  1.058  1.316  1.708   2.060   2.485   2.787
 26    .127   .256   .390   .531   .684   .856  1.058  1.315  1.706   2.056   2.479   2.779
 27    .127   .256   .389   .531   .684   .855  1.057  1.314  1.703   2.052   2.473   2.771
 28    .127   .256   .389   .530   .683   .855  1.056  1.313  1.701   2.048   2.467   2.763
 29    .127   .256   .389   .530   .683   .854  1.055  1.311  1.699   2.045   2.462   2.756
 30    .127   .256   .389   .530   .683   .854  1.055  1.310  1.697   2.042   2.457   2.750
 40    .126   .255   .388   .529   .681   .851  1.050  1.303  1.684   2.021   2.423   2.704
 60    .126   .254   .387   .527   .679   .848  1.046  1.296  1.671   2.000   2.390   2.660
120    .126   .254   .386   .526   .677   .845  1.041  1.289  1.658   1.980   2.358   2.617
  ∞    .126   .253   .385   .524   .674   .842  1.036  1.282  1.645   1.960   2.326   2.576

Table III, “Table of the t Distribution” from STATISTICAL TABLES FOR BIOLOGICAL, AGRICULTURAL,
AND MEDICAL RESEARCH by R.A. Fisher and F. Yates. © 1963 by Pearson Education,
Ltd.


Table of the Standard Normal Distribution Function 


\[
\Phi(x) = \int_{-\infty}^{x} \frac{1}{(2\pi)^{1/2}} \exp\left(-\frac{u^2}{2}\right) du
\]


   x    Φ(x)       x    Φ(x)       x    Φ(x)       x    Φ(x)       x    Φ(x)
0.00  0.5000    0.60  0.7257    1.20  0.8849    1.80  0.9641    2.40  0.9918
0.01  0.5040    0.61  0.7291    1.21  0.8869    1.81  0.9649    2.41  0.9920
0.02  0.5080    0.62  0.7324    1.22  0.8888    1.82  0.9656    2.42  0.9922
0.03  0.5120    0.63  0.7357    1.23  0.8907    1.83  0.9664    2.43  0.9925
0.04  0.5160    0.64  0.7389    1.24  0.8925    1.84  0.9671    2.44  0.9927
0.05  0.5199    0.65  0.7422    1.25  0.8944    1.85  0.9678    2.45  0.9929
0.06  0.5239    0.66  0.7454    1.26  0.8962    1.86  0.9686    2.46  0.9931
0.07  0.5279    0.67  0.7486    1.27  0.8980    1.87  0.9693    2.47  0.9932
0.08  0.5319    0.68  0.7517    1.28  0.8997    1.88  0.9699    2.48  0.9934
0.09  0.5359    0.69  0.7549    1.29  0.9015    1.89  0.9706    2.49  0.9936
0.10  0.5398    0.70  0.7580    1.30  0.9032    1.90  0.9713    2.50  0.9938
0.11  0.5438    0.71  0.7611    1.31  0.9049    1.91  0.9719    2.52  0.9941
0.12  0.5478    0.72  0.7642    1.32  0.9066    1.92  0.9726    2.54  0.9945
0.13  0.5517    0.73  0.7673    1.33  0.9082    1.93  0.9732    2.56  0.9948
0.14  0.5557    0.74  0.7704    1.34  0.9099    1.94  0.9738    2.58  0.9951
0.15  0.5596    0.75  0.7734    1.35  0.9115    1.95  0.9744    2.60  0.9953
0.16  0.5636    0.76  0.7764    1.36  0.9131    1.96  0.9750    2.62  0.9956
0.17  0.5675    0.77  0.7794    1.37  0.9147    1.97  0.9756    2.64  0.9959
0.18  0.5714    0.78  0.7823    1.38  0.9162    1.98  0.9761    2.66  0.9961
0.19  0.5753    0.79  0.7852    1.39  0.9177    1.99  0.9767    2.68  0.9963
0.20  0.5793    0.80  0.7881    1.40  0.9192    2.00  0.9773    2.70  0.9965
0.21  0.5832    0.81  0.7910    1.41  0.9207    2.01  0.9778    2.72  0.9967
0.22  0.5871    0.82  0.7939    1.42  0.9222    2.02  0.9783    2.74  0.9969
0.23  0.5910    0.83  0.7967    1.43  0.9236    2.03  0.9788    2.76  0.9971
0.24  0.5948    0.84  0.7995    1.44  0.9251    2.04  0.9793    2.78  0.9973
0.25  0.5987    0.85  0.8023    1.45  0.9265    2.05  0.9798    2.80  0.9974
0.26  0.6026    0.86  0.8051    1.46  0.9279    2.06  0.9803    2.82  0.9976
0.27  0.6064    0.87  0.8079    1.47  0.9292    2.07  0.9808    2.84  0.9977
0.28  0.6103    0.88  0.8106    1.48  0.9306    2.08  0.9812    2.86  0.9979
0.29  0.6141    0.89  0.8133    1.49  0.9319    2.09  0.9817    2.88  0.9980
0.30  0.6179    0.90  0.8159    1.50  0.9332    2.10  0.9821    2.90  0.9981
0.31  0.6217    0.91  0.8186    1.51  0.9345    2.11  0.9826    2.92  0.9983
0.32  0.6255    0.92  0.8212    1.52  0.9357    2.12  0.9830    2.94  0.9984
0.33  0.6293    0.93  0.8238    1.53  0.9370    2.13  0.9834    2.96  0.9985
0.34  0.6331    0.94  0.8264    1.54  0.9382    2.14  0.9838    2.98  0.9986
0.35  0.6368    0.95  0.8289    1.55  0.9394    2.15  0.9842    3.00  0.9987
0.36  0.6406    0.96  0.8315    1.56  0.9406    2.16  0.9846    3.05  0.9989
0.37  0.6443    0.97  0.8340    1.57  0.9418    2.17  0.9850    3.10  0.9990
0.38  0.6480    0.98  0.8365    1.58  0.9429    2.18  0.9854    3.15  0.9992
0.39  0.6517    0.99  0.8389    1.59  0.9441    2.19  0.9857    3.20  0.9993
0.40  0.6554    1.00  0.8413    1.60  0.9452    2.20  0.9861    3.25  0.9994
0.41  0.6591    1.01  0.8437    1.61  0.9463    2.21  0.9864    3.30  0.9995
0.42  0.6628    1.02  0.8461    1.62  0.9474    2.22  0.9868    3.35  0.9996
0.43  0.6664    1.03  0.8485    1.63  0.9485    2.23  0.9871    3.40  0.9997
0.44  0.6700    1.04  0.8508    1.64  0.9495    2.24  0.9875    3.45  0.9997
0.45  0.6736    1.05  0.8531    1.65  0.9505    2.25  0.9878    3.50  0.9998
0.46  0.6772    1.06  0.8554    1.66  0.9515    2.26  0.9881    3.55  0.9998
0.47  0.6808    1.07  0.8577    1.67  0.9525    2.27  0.9884    3.60  0.9998
0.48  0.6844    1.08  0.8599    1.68  0.9535    2.28  0.9887    3.65  0.9999
0.49  0.6879    1.09  0.8621    1.69  0.9545    2.29  0.9890    3.70  0.9999
0.50  0.6915    1.10  0.8643    1.70  0.9554    2.30  0.9893    3.75  0.9999
0.51  0.6950    1.11  0.8665    1.71  0.9564    2.31  0.9896    3.80  0.9999
0.52  0.6985    1.12  0.8686    1.72  0.9573    2.32  0.9898    3.85  0.9999
0.53  0.7019    1.13  0.8708    1.73  0.9582    2.33  0.9901    3.90  1.0000
0.54  0.7054    1.14  0.8729    1.74  0.9591    2.34  0.9904    3.95  1.0000
0.55  0.7088    1.15  0.8749    1.75  0.9599    2.35  0.9906    4.00  1.0000
0.56  0.7123    1.16  0.8770    1.76  0.9608    2.36  0.9909
0.57  0.7157    1.17  0.8790    1.77  0.9616    2.37  0.9911
0.58  0.7190    1.18  0.8810    1.78  0.9625    2.38  0.9913
0.59  0.7224    1.19  0.8830    1.79  0.9633    2.39  0.9916


“Table of the Standard Normal Distribution Function” from HANDBOOK OF STATISTICAL TABLES 
by Donald B. Owen. © 1962 by Addison-Wesley. 
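
Values of the standard normal distribution function between or beyond the tabulated points can be computed from the identity $\Phi(x) = \frac{1}{2}[1 + \operatorname{erf}(x/\sqrt{2})]$, where erf is the error function. The Python sketch below is one way to do this with the standard library; the function name `standard_normal_cdf` is illustrative only.

```python
import math


def standard_normal_cdf(x):
    """Phi(x) computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Compare with a few table entries, rounded to four decimal places.
for x in (0.0, 1.0, 1.96, 3.0):
    print(x, round(standard_normal_cdf(x), 4))
```

For negative arguments the symmetry $\Phi(-x) = 1 - \Phi(x)$ holds, which is why the table lists only $x \ge 0$.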
Probability and Statistics


ANSWERS TO ODD-NUMBERED EXERCISES 


Note: Answers are not provided for exercises that request a proof, a derivation, or a graph. 


Chapter 1


Section 1.4 

7. (a) {x:x <1lorx > 5}; (b) {x: 1 <x <7}; (c) B; (d) {4:0 <x <1lorx > 7}; (e) &. 11. (a) S={(x, y) :0<x< 
Sand 0 <y <5}. ). (b) A= {Or yeS:x+y>6}, B={@, y)eS:x=y}, C={@, y) €S:x>y}, D={@, y)eS:5< 
x+y <6}. (c) ASN D°NB. (d) ASN BENC.. 

Section 1.5 

1.2. 3.(a)5; (BE; (3. 5.04. 7.04ifACBandO1ifPr(AUB)=1. — 11.(a)1— 3; (b) #; (©) ¥; (AO. 
Section 1.6 

15. 3.3. 5.9. 7. Pr(Aa) = Pr(aa) = 5. 


Section 1.7 
5 20! GI)? 
1. 14. 3. 5h. 5. 18° Te 8120!2° 9. ies 
Section 1.8 
di: Ga): 3. They are equal. 5. This number is eae and therefore it must be an integer. 7. oy: 9. roa 
k n 
20 13 
11, 043, GIG 47, 4G) gy. (20544), 
12) (i Cy ) 
Section 1.9 
1. (2! 3. (. 300 5. 2 n 7 (6.2.4) (4.6.3) 9 4! 
. (7:7,7)- . (5,8, 287) : Biche adie " 25 : : 32 ° 
(oo, 8, 7 (13, 13,13; 13) 
Section 1.10 
48 
1. 3G 3) 32 G 0) 3. 45 percent. 5. 2. 7.1 “ {[¢9) ee) CS) 4 (3) | = 
(5) (5,54 Gs) 
70), (60) , (50) , (50) , (40) , (30 40) _, (30) , (20 (62) : 
[(3) + (i5) + (is) + Gs) + Gs) + (| + (9) + (j5) + (3) 9.n=10. i, (i) , where r= 5 and 
x=0,2,...,10. : 
Section 1.12 
250) 100 Ne 3. n—k+1 
1No. 3, GG) 5.93120 0 ate. 9, DS) where k=2j —2and j=2,3,4,5. 13. (d) ch i. 
Gy) ( 7 ) (5) 
Chapter 2 
Section 2.1 
(r+k)(r+2k)b 1 3. 3 
1.Pr(A)/Pr(B). 3. Pr(A). 5. aap EE ay S73 (a) 5 (b)3. 13.00.44. 15.047. 
Section 2.2 
log(0.2 
1.Pr(A%). 5.1- 7G. —7.(a) 0.92; (b) 0.8696 9.7. 11. (a) 0.2617. 13. 10(0.01)(0.99)?. 15. > ions 
17. 4. 19. [(0.8)!° + (0.7)!9] — [(0.2)!9 + (0.3)!9], 23. (a) 0.2215; (b) 0.0234. 
Section 2.3 


3.0301. 5.2. 7. (a) 0, 4h, & > aps (BD) 2: © F- 11.1/4. 13. (a) 1/9; (b) 1. 15. 0.274. 
Section 2.4
3.Condition (a). 5.1>198. 9. 3. 


Section 2.5 
50 
11 1 1 49 : 4 

3. Seay Te Always 9.3. IL 1 - (33) . 13. (a) 0.93; (b) 0.38. 15. &. 17. 0.067. 
19. py + p2 + P3 — P1P2 — P2P3 — P1P3 + P1p2P3, Where 

6 6 6 

~ 78y’ ~ 78)’ ~ 78)" 
(3) (1) (5) 
-(4)"" 

21. Pr(A wins) = a Pr(B wins) = a Pr(C wins) = 23. 0.372. 25. (a) 0.659; (b) 0.051. 27. Z ; 


IW 
io) 
w 

— 
fee) 
wm 
oO 


48 8: 3) (48 3) (48 48 
29, (a) 52021, where py = 3 and p; = “G2. (b) 1— py. Also, BGV+QG0I*() 9.5612 33. 2. 


13) (3) (ia) 


13. 2 
second condition; (b) The first condition; (c) Equal probability under both conditions. 


Chapter 3 
Section 3.1 
6 1 5 2 1 1 1 (D2) for x = 2, 3, 4,5 
1.7. 3.fO=%5, f/D=—,/O=5 f/O=% (HM =5,/O) = B- 5. f(x) = Cy ae arta PS 
(an otherwise 


7. 0.806. 9.1/2. 
Section 3.2 


1 
1. 4/9. 3. (a) 5; (b) 3; (c) a. 5. (a) t =2; (b) t= V8. Tf(x= mg. MOE =2 S*5 8 and probability is 
0 otherwise, 
7 
0° 13. 0.0045. 
Section 3.3 
0 for x < —2, 
5. f(x) = (2/9)x for 0 < x <3; f(x) =0 otherwise. 7.F(x)=} §@+2) for-2<x <8, 11. F~!(p) = 3p!/?. 
1 for x > 9. 


13. 10.2. 15. F(x) =x? for0 <x <1. 


Section 3.4 

1. (a) 05; (6)0.75. Z2ag@OpOA@M) 5a? 6): © BO. — 7. (a) 0.55; (b) 0.8. 
9.0.63505. 11. (a) 0.273; (b) 0.513. 

Section 3.5 


1 
1. Uniform on the interval [a, b] and uniform on the interval [c, d]. 3. (a) fix) = | z for0<x <2, 
0 otherwise 
3y? forO<y<1 PxPy forx=0,1,2,3and y=0, 1, 2, 3, 
=~" =~? (b) Yes; (c) Yes. 5.(a y= y : 
{3 otherwise (b) (c) (a) Fy) i otherwise 


ho)= 
(b) 0.3; (c) 0.35. 


1 
7.Yes. 9. @) Fon =] Fore VES, 
0 otherwise 


1 1 

fi@ = zg tor0StS2, ZG)= 3 forlsys4 wm) yes E15. (b) A(z) = 1/3 for 1 <x <3, 
0 otherwise 0 otherwise 

f_1(x) = 1/6 for 6 < x < 8, and f_1(x) = 0 otherwise; f_2(y) = 1 for 0 < y < 1 and f_2(y) = 0 otherwise.

Section 3.6 

145x201 — y2) "for — — y?)'7 <x <1 — y*)!”, 


1. For -1< y<1, gy(x|y) = 
otherwise. 
1 2 ae _— 1)2
3. (a) For -2 <x <4, gy(y|x) = 9—@—p2 772 for (y +2) <9-(x —1)*, 
0 otherwise. 


=i 
(b) 2=¥2. 5. (a) For0<y <1, gialy) = C=aytog=sy fOTO<¥<y (hy $7. (a) For 0 <x <2, gr(ylx) = 
0 otherwise 


4—2x-y _ 1 

| cg-ee TE URYSe— ee yd Daas | ge any Tore ret, ys. 13. gi Cit) = 05506, 
0 otherwise 0 otherwise 

g_1(1|2) = 0.6561, g_1(1|3) = 0.4229, g_1(1|4) = 0.2952. g_1(0|y) = 1 − g_1(1|y) for y = 1, 2, 3, 4.


Section 3.7 
1. (a) 1/3; (b) (x_1 + 3x_3 + 1)/3 for 0 < x_i < 1 (i = 1, 3); (c) 5/13. 3. (a) 6;


—(x4+3x3) : fod 
(6) fates 3e— 193) for x; > OG = 1, 3), (c)1— 1 
0 otherwise 


5. (a) [14 pis (b) 1- TUyd-p). = 7. Oy ("pid — py", where p= f? f(x) dx 


Section 3.8 
a wh ll2 il 
Loti | 3(1— y)/*/2 forO =) <1, 3. G(y) =1—-(1—y)!/? for0 < y <1; g(y) = Tayi for 0 <2 <1, 
0 otherwise. 0 otherwise 
1,-1/2 1),,)-2/3 a 2y ford 1, 
Ts (a) co={ ay poe vets (b) co=| 3/yI for ae aa (c) co=| y can 
0 otherwise 0 otherwise 0 otherwise 


9.¥=2x"3, 13. f(t) = | 2e7/"/1? fort > 0, 
0 otherwise. 


17. (a) r(x) = 0 for x < 100, r(x) = x − 100 for 100 < x < 5100,
r(x) = 5000 for x > 5100; (b) G(y) = 0 for y < 0, G(y) = 1 − 1/(y + 101) for 0 < y < 5000, G(y) = 1 for y > 5000.


Section 3.9 

y for0<y <1, 4 
Lg(y)=}2-y forl<y <2, 3. e(o1 92.99) = | S802) po SOR SA: 

: 0 otherwise. 

0 otherwise 

g(zt+1) for0<z<1, —_ see, Bae 
S.@(2)=) Let forz>1, 7. g(y) =4e7P I for —co<y<oo. 9.(0.8)"— 7)". IL. (3) a (3) 

0 for z < 0. 

1 : n—2 5 
13. f(z) = a (:) (1 2 i) for—3<z<5, 19. ye~ for y > 0. 
0 otherwise 


Section 3.10 


1. (a) (1/2,1/2); (b) ee 519) 3. (a) 0.667; (b) 0.666. 5. (a) 0.38; (b) 0.338; (c) 0.3338. 7. (a) 0.632; (b) 0.605. 


9.(a) g; (b) $11. (a) BP; (b) H. 
13. 
      HHH  HHT  HTH  THH  TTH  THT  HTT  TTT
HHH    0    1    0    0    0    0    0    0
HHT    0    0   1/2   0    0    0   1/2   0
HTH    0    0    0   1/2   0   1/2   0    0
THH   1/2  1/2   0    0    0    0    0    0
TTH    0    0    0   1/2   0   1/2   0    0
THT    0    0   1/2   0    0    0   1/2   0
HTT    0    0    0    0   1/2   0    0   1/2
TTT    0    0    0    0    1    0    0    0
17. (a) {Aa, Aa} has probability 1; (b) {Aa, Aa}, {Aa, aa}, and {aa, aa} have, respectively, probabilities 0.04, 0.32, and
0.64. 19. (2/3, 1/3). 


Section 3.11 


2 for0<x <1, 
> i 1 1 1 2x 
a pon=| forl <x <2, 5S. 7: Tod = Sot aie 9. i9- 11. Y=5(1—e°*“) or Y= 
0 otherwise 
Se24, 13. The sets (c) and (d). 15. 0.3715. 17. fo(y) = —9y” log y for O< y <1. gi(aly) = — Slogy 
for 0 < y < x < 1. 19. f_1(x) = 3(1 − x)² for 0 < x < 1, f_2(y) = 6y(1 − y) for 0 < y < 1, f_3(z) = 3z² for 0 < z < 1.


—v -y —Ypnyn—2 ,—y 
ve forO <u <1,v>0, (b) Yes. 3 hays WIN (e Peay en oe nee 


21. (a) glu, 0) = 
0 otherwise 
25. (a) 2efo(y); (b) 2e J", f(s, yds. 


27.                    Players in game n + 1
                      (A, B)   (A, C)   (B, C)
Players in    (A, B)    0       0.3      0.7
game n        (A, C)   0.6       0       0.4
              (B, C)   0.8      0.2       0


29. (0.4220, 0.2018, 0.3761). 


Chapter 4 
Section 4.1 
1. (a + b)/2. 3.18.92. 5.4867. 9.3. 11.1, and =3,. 13. $11.61. 15. $25.
Section 4.2 
1 b 5 nh 
1.95. 3.4. Sn f? f(x) dx. 7.¢(3) .  9n2p—1). — 11.2k. 
Section 4.3 
1. 1/12. 3. (1/12)(b − a)². 7. (a) 6; (b) 39. 9. (n² − 1)/12. 11. 0.5. 13. 1.


Section 4.4 
a1 2 = 3 st ns = 2 2 — 1. a 2 ee) 
1.0. 3.1. 7. [L= 75,0 =7: 9. E(Y) =cy; Var(Y) = c(o + UL Ye 11. fO=5;f/M=§; f®=s5. 17. 2. 


Section 4.5 
3. m=log 2. 5. (a) 5 (uf + fg); (b) Any number m such that 1 <m <2. 7. (a) b: (b) (V5 — 1). 
9. (a) 0.1; (b) 1. 11. Y. 
Section 4.6 
1. 0. 11. The value of ρ(X, Y) would be less than −1. 13. (a) 11; (b) 51. 15.n+ ae).
Section 4.7 
F 1 _ 3X42. _1 1 

1. 0.00576, 7% of the marginal M.S.E. Sak 7. E(Y|X) = 3ht25; Var 1X) = & [3 = as. 

log 3 Z = 
a B=. 54a ye. 
Section 4.8 
Leet, 30% 5.3 7p. 9. a = 1 if p > 1/2; a = 0 if p < 1/2; a can be chosen arbitrarily if p = 1/2.
11. b = 0 if p ≤ 1/2; b = (2p − 1)A if p > 1/2. 13. b = A if p > 1/2; b = 0 if p < 1/2; b can be chosen arbitrarily if p = 1/2.
15. x9 > mae 17. Continue to promote.
Section 4.9


5.a= £e, b=-—aw. Ts 2. 11. Order an amount s such that fj f(x) dx = eee 13. (a) and (b) E(Z) = 29; 


Var(Z) = 109. (c) E(Z) = 29; Var(Z)=94. 17.1. 24-4. 25. (a) 0.1333. (b) 0.1414. 29.a = pm. 


Chapter 5 
Section 5.2 
1. Bernoulli with parameter i. 3. 0.377. 5. 0.5000. 7. ie. 9. x 
11. n(n — 1)p?. 13.0.4957 15. 1110, 4.64 x 107271, 
Section 5.3 
1.8.39 x 10-8. 3. E(X) = 4 Var(X) = ar 5. rH or rH if T is odd, and + if T is even. 
Cio )+0.37 C5") 10 9 
7. (a) a: aa (b) (0.7)°* + 10(0.3)(0.7)”. 9. 3/128. 
10 
Section 5.4 
“. (n eT Ai * Le en = Se 3030" 
1. 0.5940. 3. 0.0166. >, ( ) 7 > 7 : as = 9. Poisson 
zom WS Nice ixo | ga. 
distribution with mean pd. 11. If A is not an integer, mode is the greatest integer less than i. If A is an integer, modes 


are A anda — 1. 13. 0.3476. 15. 9e7*, for A > 0. 


Section 5.5 
1. (a) 0.0001; (b) 0.01. 3. (a) 150; (b) 4350. 9. Geometric distribution with parameter p = 1—]]}_, qi. 


Section 5.6 

1. 0.0, —0.6745, 0.6745, —1.282, 1.282. 3. Normal with 4» = 20 and o = 0 5. 0.996. 7. (0.1360). 9. 0.6827. 
exp{—}(x—25)?} 

exp{ 5 (x 25)?} +9 exp{ 5 (x 20)} 

opis exp | oh (log x »| for x > 0, and f(x) =0 for x <0. 19. f(u) = eat exp | Ly 8} for 

S5<yp <\15. 


11. n = 1083. 13. 0.3812. 15. (a) 


5 (b) x > 22.54 } log9. 17. f(x)= 


21. The lognormal distribution with parameters 4.6 and 10.5. 23. The lognormal distribution with parameters 3.149 
and 2. 


Section 5.7 
T.1—[l—exp(-pyP. 9.2. (R44 4+b)z Ble 5, 


15. e79/4 17.1-3-5---(2n — 10". 


Section 5.8 

-1/y) — pl a(at+l)---(@t+r—1)B(B+1)---(B+s—) _ = 
1p) =p, 5. STGP tps | 2 = 1/17, B = 19/17. 
Section 5.9 


3 ae. 5. 0.0501. 


Section 5.10 


1. 70.57. 3. 0.1562. 5.90 and 36. 7. wy =4, Up = —2, 0, = 1, op =2, p = -0.3. 13. p = —0.5c/(ab)!/2, 
a? = 2b/d, 02 =2a/d, 4 = (cg — 2be)/d, 7 = (ce — 2ag)/d, where d = 4ab — c?. 
Section 5.11
1. f(x) =1/(n +1) for x =0,..., 7. 3. 0.0404. 7, 3pt0? + pe. 9. 0.8152. 11, . 13. 0.2202. 
15. (a) Exponential, parameter 6 = 5; (b) Gamma, parameters a =k and 6 =5; (c) e*-Y/3, 23. (a) p(X, Xj) = 


_ \ 1/2 
-( 2 : 7) ' , where p; is the proportion of students in class i; (b) i =1, j =2; (c)i=3, j =4. 25. Normal 
i J 


with . = —3 and o? = 16; p(X, Y) = 4. 


Chapter 6 


Section 6.1 

1. 4x if 0 <x < 1/2, 4 — 4x if 1/2 < x <1, and 0 otherwise; 0.36; 0.2; look at where each p.d.f. is higher than the other. 
3. 0.9964. The probability looks like it might be increasing to 1. 

Section 6.2 

5. 25. 13. (a) Yes; (b) No. 17. (b) np(1 — p) and knp(1 — p/k). 21. (a) [u exp(1 — u)]"; (b) Useless bound. 


Section 6.3 

1. 0.001 3. 0.9938 5.n > 542. 7. 0.7385. 9. (a) 0.36; (b) 0.7887. 11. 0.9938. 13. Normal with 
mean 6? and variance ee 15. (c) n(¥? — 6) /[20] has approximately c.d.f. F*. 

Section 6.4 


1. 0.8169. 3. 0.0012. 5. 0.9938. 7. 0.7539. 

Section 6.5 

1. 8.00. 3. Without continuity correction, 0.473; with continuity correction, 0.571; exact probability, 0.571. 
5. arcsin(√X̄_n). 9. 0.1587. 11. (b) Normal with mean n/3 and variance n/9.


Chapter 7 

Section 7.1 

1. X_1, X_2, ..., P; the X_i are i.i.d. Bernoulli with parameter p given P = p. 3. Z_1, Z_2, ... times of hits, parameter β,
Y_k = Z_k − Z_{k−1} for k ≥ 2. 5. (X̄_n − 0.98, X̄_n + 0.98) has probability 0.95 of containing μ. 7. Y Poisson with mean
λt, parameters λ and p, X_1, ..., X_y i.i.d. Bernoulli with parameter p given Y = y, X = X_1 + ··· + X_y (observable).
Section 7.2 

1. 0.4516. 3. ξ(1.0 | X = 3) = 0.2456; ξ(1.5 | X = 3) = 0.7544. 5. The p.d.f. of the beta distribution with parameters
α = 3 and β = 6. 7. Beta distribution with parameters α = 4 and β = 7. 9. Beta distribution with parameters
α = 4 and β = 6. 11. Uniform distribution on the interval [11.2, 11.4].


Section 7.3 

1. 120. 3. Beta distribution with parameters α = 5 and β = 297. 5. Gamma distribution with parameters α = 16
and β = 6. 7. Normal distribution with mean 69.07 and variance 0.286. 9. Normal distribution with mean 0 and
6(8°) 

variance i 13.n>100. 17.€(@|x)= } “67 for 6 > 8, 19. ia and + 5.  21.Gamma 
0 ford <8. B- Vine (B—S-%_, logx;) 


distribution with parameters n and n x̄_n.


Section 7.4 


1.2/3 and 2-1/2, 3. (a) 12 or 13; (b) 0. 5.8. 9.n > 396. 13, 4+", max(xo, X1,..., Xn)- 
Section 7.5

2 , | a A : A 
3. 3° 5. (a) 6 = Epos cde VIE 11. 6, =min(X,,..., X,)3 64 = max(Xq,..., X,). 
13. f= Xs f2= 7. 
Section 7.6 
1, Xp". 3. m@ =X, log2. 5. @= A [min{Xy,..., X,} + max{X),..., Xp} 71.0=0 (4). 
9.X,. 15. (i =6.75. 17, ps8. eae a x2), B=[ —%_)@_ — x20) V/[x2, — ¥2), 


28. ft) =Xy, 07 = 1X — Ky) a =a + Bis, oF pe + Bo?, p= Bai, where B = D4 (¥; — Yn OG — 
Xn—e/ WINK — Xn—0)?, & = Vn — BA, and of, = AE WIT; — & — BX). 

Section 7.8 

9. Yes. 11. No. 13. Yes. 15. Yes. 17. Yes. 

Section 7.9 

3. R(O, 5) = - §.ct="t2 7. (a) R(B, 8) =(B—3)?. 11.6 =p. 13. i) 15. exp(X,, + 0.125), 
c =0.125(1 — 3/n). 

Section 7.10 


1. (a) Beta distribution with parameters 11 and 16; (b) 11/27. 3. s. 5. a 
Fahy to 1B, 

7. (a) 3(X1 + 5X + 3X3); (b) Gamma distribution, parameters a +3 and 6 + x, + 5X2 + 5X3. 9. (a) x +1. 
(b) x + log 2. 11. p =26 - 4), where 

X thete} 

é=44 if%<], 

4 if% >}. 

13. 21/5, 15. min(X;,..., X;). 17. Xp = min(Xy,..., X,), and @ = ( jE] log x; — log fo) 19. The 


smallest integer greater than : —1.If a — 1is itself an integer, both ; — land + 5 are M.L.E.’s. 21. 16. 


Chapter 8 


Section 8.1 

1. n > 29. 3. n > 255. 5. n = 10. 7. n > 16. 9. 1 − G(n/t), where G(·) is the c.d.f. of the gamma distribution
with parameters n and 6.

Section 8.2 


1. 0.1278. 5. 0.20. 9. χ² distribution with one degree of freedom. 11. 2^{1/2}Γ[(m + 1)/2]/Γ(m/2).
Section 8.3 
7. (a)n =21; (b)n = 13. 9. The same for both samples. 


Section 8.4 


3.0=4/372. 5. 0.70. 

Section 8.5 

3. (a) 6.16σ²; (b) 2.05σ²; (c) 0.56σ²; (d) 1.80σ²; (e) 2.80σ²; (f) 6.12σ².

7. (148.1, 165.6). 9. (a) (4.7, 5.3); (b) (4.8, 5.2); (d) 0.6; (e) 0.5. 

11. Endpoints are sin” (aresin fin np ly + yV/2)), unless one of the numbers aresin /x, +n7/2@-'({1 + 
y |/2) lies outside of the interval [0, π/2].


Section 8.6 
5. μ_0 = −5; λ_0 = 4; α_0 = 2; β_0 = 4. 7. The conditions imply that aj = i and E(w) exists only for ag > 5.
9. (a) (157.83, 210.07); (b) (152.55, 211.79). 11. (0.446, 1.530). 13. (0.724, 3.336). 15. (a) α_1 = 7.5, β_1 = 22.73,
λ_1 = 13, μ_1 = 6.631; (b) (5.602, 7.660). 17. α_1 = 4.5, β_1 = 0.4831, λ_1 = 10, μ_1 = 1.379. 19. (a) Normal-gamma
with hyperparameters α_1 = 11, β_1 = 4885.7, λ_1 = 20.5, μ_1 = 156.7; (b) (148.7, 164.7).
Section 8.7 
1.(a) g@)=6; (b)X,. 3.5 Dy XP? - A YK - X,Y. 5.6(X) = 2%. 11. (a) All values; (b)a = 5". 
15. (c) cg = 4(1 + ). 
Section 8.8 
3. 16y= 4. 5. 1(02) = = 9. /x/2|X|, (1/2 — No?. 
Section 8.9 
= ge 
7. (a) For a(m — 1) +2B(n-D=1.(b)a= ais. p= CRE 9. arn 72° 11.X,—c Ln , where 
] 
c is the 0.99 quantile of the t distribution with n − 1 degrees of freedom. 13. (a) (μ_1 − 1.96v_1, μ_1 + 1.96v_1), where
μ_1 and v_1 are given by Eqs. (7.3.1) and (7.3.2). 15. Normal with mean θ and variance θ²/n. 21. (c) Normal with
mean 1/6 and variance 1/[n@°]. 


Chapter 9 


Section 9.1 

1. (a) 2(B|5) =eF; (b) eH}. 

3. (a) π(0) = 1, π(0.1) = 0.3941, π(0.2) = 0.1558, π(0.3) = 0.3996, π(0.4) = 0.7505, π(0.5) = 0.9423, π(0.6) = 0.9935,
π(0.7) = 0.9998, π(0.8) = 1.0000, π(0.9) = 1.0000, π(1) = 1.0000; (b) 0.1558. 5. (a) Simple; (b) Composite; (c) Composite;
(d) Composite. 9. T = μ_0 − X̄_n. 11. (a) cy < 0, co = 6; (b) 0.0994. 13.3 15.1—x,if0<x <1;
O,ifx>1. 19. (00, X, +o/n VT (1 — a9)).


Section 9.2 

1. Reject H_0 if X = 1; don’t reject H_0 if X = 0. 3. (b) 1. 5. (a) Reject H_0 when X̄_n > 5 − 1.645n^{-1/2}; (b) α(δ) = 0.0877.
7. (b) c = 31.02. 9. B(5) = (3)" 11. (a) 0.6170; (b) 0.3173; (c) 0.0455; (d) 0.0027. 13. (a) Reject H_0 if
exp(—T/2)/4 <4/(2+T)°; (b) Do not reject H_0; (d) Reject H_0 if T > 13.28; (e) Do not reject H_0.


Section 9.3 

7. The power function is 0.05 for every value of θ. 9. c = 36.62. 13. (a) Reject H_0 if X̄_n < 9.359; (b) 0.7636;
(c) 0.9995. 

Section 9.4 

1. c_1 = μ_0 − 1.645n^{-1/2} and c_2 = μ_0 + 1.645n^{-1/2}. 3. n = 11. 5. c_1 = −0.424 and c_2 = 0.531. 11. c_1 =
μ_0 − 1.645n^{-1/2} and c_2 = μ_0 + 1.645n^{-1/2}.

Section 9.5 

1. (a) Don’t reject H_0; (b) 0.0591. 3. U = −1.809; do not reject the claim. 5. Don’t reject H_0. 9. Since


2 
Sp < 16.92, don’t reject Ho. 13.U= 28. the corresponding tail area is very small. 15.U= B, the corresponding 
tail area is very small. 
Section 9.6


1. Don’t reject H_0. 3. c_1 = −1.782 and c_2 = 1.782; H_0 will not be rejected. 5. Since U = −1.672, reject H_0.
7. −0.320 < μ_1 − μ_2 < 0.008. 11. (a) Do not reject H_0; (b) Do not reject H_0.


Section 9.7 


1. Reject the null hypothesis. 3. c= 1.228. aye 7. (a) at = 7.625 and ¢5 = 3.96; (b) Don’t reject Hp. 
9. c_1 = 0.321 and c_2 = 3.77. 11. 0.265V < r < 3.12V. 15. 0.8971. 19. (a) 0.0503; (b) 0.0498.


Section 9.8 
1. X > 50.653. 3. Decide that failure was caused by a major defect if )~"_, X; > net
11. (a) For the first choice, w_0 = w′, w_1 = w″, d_0 = d′, d_1 = d″, Ω_0 = Ω′, and Ω_1 = Ω″. Switch them all for the other case.


Section 9.9 


1. (a) c= 1.96. 3. 0.0013. 5. (a) 1.681, 0.3021, 0.25; (b) 0.0464, 0.00126, 
3x 10788. 


Section 9.10 

1. Reject H_0 if X > 2, α(δ) = 0.5, β(δ) = 0.1563. 3. Reject H_0 for X < 6. 5. Reject H_0 for X > 1 − α^{1/2},
β(δ) = (1 − α^{1/2})². 7. Reject H_0 for X < 5[(1.4)¥2 — 1}. 9. Reject H_0 for X < 0.01 or X > 1; power is 0.6627.
11. 0.0093. 17. (a) 1; (b) = 23. (a) Reject H_0 if the measurement is at least 5 + 0.1 x variance x log(woéy/[w1q]).


Chapter 10 


Section 10.1 


7. Q =11.5; reject the hypothesis. 9. (a) Q =5.4 and corresponding tail area is 0.25; (b) Q = 8.8 and corresponding 
tail area is between 0.4 and 0.5. 


Section 10.2 
1. The results will depend on how one divides the real line into intervals, but the p-values for part (b) should be noticeably 
larger than the p-values for part (a). 3. (a) 6; = aN tN 4tNS and 6 = aN EN, ans (b) Q = 4.37 and corresponding tail 


area is 0.226. 5.4 =1.5 and Q = 7.56; corresponding tail area lies between 0.1 and 0.2. 


Section 10.3 
1. Q = 21.5; corresponding tail area is 2.2 × 10^{-5}. 5. Q = 8.6; corresponding tail area lies between 0.025 and 0.05.


Section 10.4 


1. Q = 18.8; corresponding tail area is 8.5 × 10^{-4}. 3. Q = 18.9; corresponding
tail area is between 0.1 and 0.05. 5. Correct value of Q is 7.2, for which the
corresponding tail area is less than 0.05.


Section 10.5 


7. (b)
                   Proportion helped
               Older subjects   Younger subjects
Treatment I        0.433            0.700
Treatment II       0.400            0.667
(c)
                   Proportion helped
               All subjects
Treatment I        0.500
Treatment II       0.600


Section 10.6 

3. D_n* = 0.25; corresponding tail area is 0.11. 5. D_n* = 0.15; corresponding tail area is 0.63. 7. D_n* = 0.065;
corresponding tail area is approximately 0.98. 9. D_mn = 0.27; corresponding tail area is 0.39. 11. D_mn = 0.50;
corresponding tail area is 0.008.

Section 10.7 


1. (a) 22.17; (b) 20.57, 22.02, 22.00, 22.00; (c) 22.10; (d) 22.00. 3. 0.575. 5. M.S.E. (X,,) = 0.025 and M.S.E. 
(X,,) = 0.028. 13. 1. 17. Normal, with mean equal to the IQR of the distribution (63/4 — 61/4) and variance 


[4nf 1/4)". 


Section 10.8 
3. U = 3.447; corresponding (two-sided) tail area is 0.003. 5. D_mn = 0.5333; corresponding tail area is 0.010.


Section 10.9 

1. (141, 175). 3. Any level greater than 0.005, the smallest probability given in the table in this book. 5. Do 
1/0 

not reject the hypothesis. 9. |a| > 5 (6.635n)!/ - 15. Normal, with mean (5) : and variance oa 


17. (a) 0.031 <a < 0.994. (b) σ < 0.447 or σ > 2.237. 19. Uniform on the interval [y,, y3].


Chapter 11


Section 11.1

5. y = −1.670 + 1.064x. 7. (a) y = 40.893 + 0.548x; (b) y = 38.483 + 3.440x − 0.643x². 9. y = 3.7148 + 1.1013x_1 +
1.8517x_2. 11. The sum of the squares of the deviations of the observed values from the fitted curve is smaller in
Exercise 10. 


Section 11.2 


7. (a) −0.7861, 0.6850, 0.9377; (b) 0.2505σ², 0.0277σ²; (c) −0.775. 9. c = 3x̄_n = 6.99. 11. x = x̄_n = 2.33.
13. −0.891. 15. c_1 = −x̄_n = −2.25. 17. x = x̄_n = 2.25.


Section 11.3 


1. Since U_0 = −6.695, reject H_0. 3. Since U_1 = −6.894, reject H_0. 5. Since |U_01| = 0.664, don’t reject H_0. 9. Since
U² = 24.48, reject H_0. 11. 0.246 < β_1 < 0.624. 13. 0.284 < y < 0.880. 17. 10(β_0 − 0.147)² + 10.16(β_1 − 0.435)² +
8.4(β_0 − 0.147)(β_1 − 0.435) < 0.503. 19. C = 1/(n − 2).
— 91/2
25. (a) Bo + Pixs + T_1,(1 — 09 /4)o" E + a]


(b) a(x) = 


XQ-X41" 


Section 11.4 


5. (a) 12.21(β_1 − 0.4352) has the t distribution with eight degrees of freedom;
(b) 11.25(β_0 + β_1 − 0.5824) has the t distribution with eight degrees of freedom.




Section 11.5 
5. β̂ = 5.126, σ̂² = 16.994, and Var(β̂) = 0.0150σ². 7. β̂_0 = −0.744, β̂_1 = 0.616, β̂_2 = 0.013, σ̂² = 0.937. 9.U3 = 0.095;
corresponding tail area is greater than 0.90. 11. R² = 0.644. 13. Var(β̂_0) = 222.7σ², Var(β̂_1) = 0.1355σ²,
Var(β̂_2) = 0.0582σ², Cov(β̂_0, β̂_1) = 4.832σ², Cov(β̂_0, β̂_2) = −3.598σ², Cov(β̂_1, β̂_2) = −0.0792σ². 15. U, = 4.319;
corresponding tail area is less than 0.01. 21. The value of the F statistic with two and seven degrees of freedom is
1.615; corresponding tail area is greater than 0.05. 25. 87. 29. 0.893.


Section 11.6 
5. U² = 13.09; corresponding tail area is less than 0.025.


Section 11.7 
5. μ = 3.25, α_1 = −2, α_2 = 3, α_3 = −1, β_1 = 1.75, β_2 = −2.25, β_3 = −1.25, β_4 = 1.75. 13.67 = 1.9647. 15. U2 = 4.664;
corresponding tail area is between 0.05 and 0.025. 


Section 11.8 


3. (a) w=9, oy = —3, & = 3, By = -15, fo = 15, 1 = yy = 5. Yi = V1 = —43 (b) M=S, 0 = —F, a = 4, By = -3, 
Bo = 3, Yu = "12 = Yn = Ya = 0; (c) m= 34, a = —2, ay = 3, 03 = —1, By = 13, By = -24, B3 = -1}, By = 13, yj; = 0 for 


all values of i and j; (d) 4 = 5,01 =—23, a7 =0, 03 = 24, By = —3, Bp = -1, 63 = 1, Bu =3, M1 = 14, 1 = 5. 113 = - 4 
vig = 13, ¥21 = Yn = 123 = 24 =0, ¥31 = —15, 32 = 5. 133 = 9, 134 = 15. 11. U7, = 0.7047; corresponding tail 
area is much larger than 0.05. 13.U Z = 9.0657; corresponding tail area is less than 0.025. 15. The value of 

the appropriate statistic having the t distribution with 12 degrees of freedom is 2.8673; the corresponding tail area is

between 0.01 and 0.005. 19. a9 + (1 — a9) Bo. 


Section 11.9 


1i— Dy 2 
1. (a) (0.01996, 0.02129); (b) Reject the null hypothesis; (c) (25.35, 26.16). 3.E(T) = oe Var(T) = Soa 
~ 4 (Xi-Xp, 
I I I I 2 2 me - 
Shore [EO] HCL 0) | 
7. Bo = 5 yn xy » B} =Yn — BoXn, where x =x; —X, and Yj = y; — Yn. Either 
= 141 


the plus sign or the minus sign in £» should be used, depending on whether the optimal line has a positive or a negative 
slope. 9. - = Nj [»? + (X%j4 - ¥4)]. 


1 > TI (K-1)(S3 +8344 
11. TI(K—-) Vi, je Vizk = Vij) . 13. Let U= a 


Be aged 
G72 . Reject Hp if U >c. Under Ho, U has an F 


Resid 


distribution with JJ — 1 and 1/(K — 1) degrees of freedom. 15. 6; = iY + Yo) + 5¥3, 6) = iY + Yo) - 5¥3, 
62 = 41% _ 6 _ 65)" + (Y 6 65)? + (¥3 6 + 65)", where Y, = Wi, Yo = Wo 5, Y3= 5 W3; (1, 65) and 62 are 
3 


3 _1 
independent; (6, 0) has a bivariate normal distribution with mean vector (61, 62) and covariance matrix : 3 |e? 
8 


8 


an 2 —_ a 
se has the x? distribution with one degree of freedom. 17. Var(é;) = E A aE 3 fo 19.u=0+V; 
j= n 


T J 
= — = é j0; —_ 5p MO PVG 
a; = 0; —6,and Bj =; — w, where 6 = Diet Yi and yy = ie 


v4 w+ 
a. :

w=O444, 

aA =6,,,-0 

i = F144 — 9444: 

a? = 8454-8 +3 

ay =8 rk F444 

AB 

Bij = Fij+ — Ping — F454 FO 444, 

BAS =O itn — Fin4 — O44 + Ong, 

BC 

Big = 9+ jk — P+ j4+ —F4+e +O +44, 

Vije = Cie — Oi — Cie — Os ge FO et O44 FO OO 
Chapter 12 


Note: Answers to exercises that involve simulation are themselves only simulation approximations. Your answers will be different.
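
The note above can be illustrated with a minimal sketch. It is not taken from any exercise in the book: the target probability Pr(Z > 1.645) for a standard normal Z, the function name `simulate_tail_probability`, the sample size, and the seeds are all assumptions chosen for illustration. The sketch shows that two runs give slightly different approximations and that each comes with a simulation standard error, which is how the "sim. std. err." values in the answers below should be read.

```python
import math
import random


def simulate_tail_probability(n_draws, threshold, seed):
    """Monte Carlo estimate of Pr(Z > threshold) for a standard normal Z,
    returned together with its simulation standard error."""
    rng = random.Random(seed)  # a private generator, so each seed gives a reproducible run
    hits = sum(1 for _ in range(n_draws) if rng.gauss(0.0, 1.0) > threshold)
    estimate = hits / n_draws
    # Standard error of a sample proportion based on n_draws independent draws.
    std_err = math.sqrt(estimate * (1.0 - estimate) / n_draws)
    return estimate, std_err


# Two runs that differ only in the seed (hypothetical values).
est_a, se_a = simulate_tail_probability(20000, 1.645, seed=1)
est_b, se_b = simulate_tail_probability(20000, 1.645, seed=2)
print(f"seed 1: {est_a:.4f} (sim. std. err. {se_a:.4f})")
print(f"seed 2: {est_b:.4f} (sim. std. err. {se_b:.4f})")
```

Both approximations should land near 0.05, and they would typically differ from each other by an amount comparable to the reported standard error.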


Section 12.1 
5. (c) f(x, y) = 0.4%x exp(—0.4[x + y) for x, y > 0, [5° [°° 0.4%x exp(—0.4[x + yPdydx. 


Section 12.2 


5. (b) The k = 2 trimmed mean probably has the smallest M.S.E. 9. 0.2599. 11. (4,404 / Bei)? (uz — ta) has 
the ¢ distribution with 2a,; degrees of freedom, and similarly for jy. 15. (a) r — [log W()]/u. 


Section 12.3 


1. (a) Approximation = 0.0475, sim. std. err. = 0.0018; (b) v = 484. 11. χ² distribution with n − p degrees of freedom
divided by S²_Resid.


Section 12.4 


7. (a) Z = 0.8343, sim. std. err. = 0.00372; (b) Z’ = 0.8386, sim. std. err. = 0.00003. 17. Look at Exercises 3, 4, 6, and 
10. 


Section 12.5 


5. Approximation = 0.2542, sim. std. err. = 4.71 x 10-4. 7. 826.8, 843.3, 783.3. 9. Means: —0.965, 
0.02059, 1.199 x 107; std. devs.: 2.448 x 107%, 1.207 x 1074, 8.381 x 107°. 11. 0.33, 0.29, 0.30, 0.31, 0.34, 0.30, 
0.62, 0.51, 0.98, 0.83. 13. (b) Both ap and fy must be the same in both priors. In addition, bo = Bp/Ag and 
ag =a; (d) Approximation = (154.67, 215.79), sim. std. err. of endpoints = 10.8v~!/? (based on 10 Markov chains 
of length v each). 15. (a) Conditional on everything else X,,,; has the d.f. F(x) = [1 — ee"). for 
0 <x <c; (b) Conditional on everything else X,,,; has the df. F(x) =1—e~°@-, for x >c. 


Section 12.6 


n 
3: 2 (") (€/n)" 1 - £/ny"—, where ¢ is the number of observations in the original sample that equal the 
i=(n+1)/2 
smallest value. 5. (a) —1.684; (b) About 50,000. 7. (a) 0.107; (b) 1.763; (c) 0.0183. 9. (a) 4.868 x 
10-4; (b) —0.0023; (c) 2.423 x 10-> and 6.920 x 10-4. 11. (b) —0.2694 and 0.5458. 
Section 12.7
5. (b) Approximation = 0.581, sim. std. err. = 0.0156; (c) 16,200. 7. (a) 0.9 quantiles around 4.0, 0.95 quantiles 
around 5.2, 0.99 quantiles around 8; (b) The differences are on the same order of magnitude as Monte Carlo 


AD r Legend 
variability; (c) 0.123. 9, (a) exp( Bd uo vo)” ee E , rid Vi) ama ¥) }) 


x pPeoteo—! | a a +[nj+1]/2 1 (b) 6 has a gamma distribution with parameters pay + €g and ¢py + aan T;; (c) Very 


close to the values in Table 12.6. 11. (c) The proportions are rather close to the nominal values. 
Exact confidence set, 541
Expectation, 208, 209 
conditional, 256 
does not exist, 208, 210 
exists, 208, 210 
of a function, 213, 215 
of linear function, 217 
Expected value, 208. See also 
Expectation 
Experiment, 5 
augmented, 61-63 
Experimental design, 381, 705 
Exponential distribution, 321 
conjugate prior for, 402 
m.g.f., 322 
mean, 321 
memoryless property, 322 
p.d.f., 321 
variance, 321 
Exponential family, 407, 566 
k-parameter, 455 
Extrapolation, 704 


Factorization criterion, 445, 449 
Factors, 763 

effects, 765 

sum of squares, 767, 776 
Factor sum of squares, 767, 776 
Failure rate, 326 

decreasing, 326 

increasing, 326 
F distribution, 598
one-way layout, 722
p.d.f., 598
relation to t distribution, 598
two-way layout, 769
with replications, 777
Federal Reserve Board, 736 
Feller, W., 2, 31 
Ferguson, T. S., 384 
Fermat, P., 1 
Finite population correction, 284 
Finkelstein, M. O., 70 
Fisher, N., 500 
Fisher, R. A., 417, 443, 444, 481, 635, 799
Fisher information, 515 
for function of parameter, 527 
information inequality, 519 
in a random variable, 515 
in a sample, 517 
for vector parameter, 525 
Fisher information matrix, 525 
Fitted values, 717 
Folks, L., 2 
Forbes, J. D., 698, 718, 719 
Frank, D., xi 
Fraser, D. A. S., 2 
Frequency interpretation of 
probability, 2-3 
Frequentist, 384 
Frey, M., 834 
Friedland, L. R., 396 
Frisby, J. P., 597 
F test, 599 
level, 600 
as likelihood ratio test, 602 
one-way layout, 722 
power function, 600 
two-way layout, 769 
with replications, 777 
Function 

of continuous random variable 
distribution, 168, 172 

of continuous random variables 
distribution, 182 

of discrete random variable 
distribution, 168 

of discrete random variables 
distribution, 175 


Gadidov, A., xi, xii
Galileo Galilei, 1 
Galton, F., 707 
Gambler’s ruin problem, 86-89, 200 
Gamma distribution, 319 
as conjugate prior, 397, 402 
m.g.f., 320 


moments, 320 
p.d.f., 319 
relation to Poisson distribution, 

346 

Gamma function, 317 

Gay, A., xii 

Geiger, H., 640 

Geisler, L., xi 

Geisser, S., 334 

Gelfand, A. E., 823 

Gelman, A., 826, 836 


Geman, D., 825 
Geman, S., 825 
Gene, 23 


General linear model, 738 
assumptions, 736 
covariance matrix of estimators, 
743 
hypothesis testing, 745-747 
joint distribution of estimators, 
745 
M.L.E., 740 
mean of estimators, 743 
Genotype, 23 
Gentle, J. E., 172 
Geometric distribution, 298
m.g.f., 299 
mean, 299 
memoryless property, 300 
p.f., 298 
variance, 299 
Gibbs sampling, 825 
Gleser, L. J., 2 
Glivenko-Cantelli lemma, 659 
Goel, P., xi
Goldberg, L., xii 
Goodness-of-fit test 
χ², 626
for composite null hypothesis, 635 
Gore, A., 785 
Gosset, W. S., 480 
Gram-Schmidt method, 478, 708 
Grand mean, 764, 774 
Graybill, F. A., 2, 738
Greenhouse, J., xii 
Group testing, 278 
Grunbaum, B. W., 334 


Halmos, P. R., 444 
Hampel, F. R., 674, 720 
Hartpence, K., xii 


Hastings, W. K., 836 
Hazard function, 326 
Heavenrich, R. M., 694 
Hellman, K. H., 694 
Herring, S., xi 

Heska, S., 400 
Heymsfield, S. B., 400 
Hinkley, D. V., 843 
Histogram, 165 
Hitczenko, P., xi


Hoel, P. G., 2 
Hogg, R. V., 2 
Hsu, L., xi 


Huang, W.-M., xi 
Huber, P. J., 667, 672, 674 
Hypergeometric distribution, 282 
binomial approximation, 284 
mean, 283 
Poisson approximation, 292 
variance, 283 
Hyperparameters, 395 
Hypothesis 
alternative, 531 
composite, 532 
null, 531 
one-sided, 532 
simple, 532, 550-557 
two-sided, 532 
Hypothesis testing, 381, 530 
general linear model, 745-747 
one-way layout, 759-760 
two-way layout, 768-770 
with replications, 776-780 
Hypothetically observable random 
variables, 377, 378 


i.i.d., 158
Image of function, 172 
Importance function, 817 
Importance sampling, 817 
stratified, 820-821 
Improper prior, 387, 403, 502 
simple linear regression, 729 
Inadmissible estimator, 458 
Increasing failure rate, 326 
Independence 
of events 
complements, 68 
conditional, 73 
and conditional probability, 71 
definition, 66, 68 
meaning of, 71 


mutual, 68 
pairwise, 69 
of random variables 
conditional, 163, 164 
definition, 135, 158 
and marginal distributions, 135,
158 
meaning of, 136 
Independent events, 66, 68 
conditionally, 73 
Independent random variables, 135, 
158, 164 
conditional, 163 
Induction, 42 
Information inequality, 518 
Initial distribution, 196 
Initial probability vector, 196 
Initial state, 188 
In parallel, 167 
In series, 167 
Interactions, 774 
Interaction sum of squares, 776 
Interquartile range, 233 
Intersection, 10, 11 
Interval null hypothesis, 571 
Invariance property of M.L.E., 426 
Inverse gamma distribution, 406 
IQR, 233
Iyer, H. K., 738 


Jacobian, 183 

Jenkins, G. M., 751 

Jensen’s inequality, 220 

Joint c.d.f., 125, 153 

Joint cumulative distribution 

function, 125 

Joint distribution, 118, 153, 154 
continuous, 120 
discrete, 118 

Joint distribution function, 125 

Jointly sufficient statistics, 449 
minimal, 452 

Joint p.d.f., 154 

Joint p.f., 119, 153 

Joint p.f./p.d.f., 124, 155 

Joint probability function, 119 


Kempthorne, O., 2 

Kirmani, S., xi 

Kolmogorov, A. N., 660

Kolmogorov-Smirnov test, 661 
two-sample, 664 


Koopman-Darmois family, 407 
k-parameter, 455 

Kronmal, R. A., 813 

Kuh, E., 718 


Laplace distribution, 671 
Larsen, R. J., 2 
Larson, H. J., 2 
Lavine, M., xi 
Lawless, J. F., 312 
Law of large numbers, 352 
strong, 355 
weak, 355 
Law of total probability, 60 
conditional version, 61 
for expectations, 258 
multivariate, 162 
for random variables, 148 
for variances, 261 
Least squares, 692 
Least-squares estimators, 700 
distribution, 702 
general linear model, 740 
simple linear regression, 701 
two-way layout, 765 
with replications, 774 
Least-squares line, 692 
Lehmann, E. L., 384, 619, 637 
Lehoczky, J., xii 
Lepre, C., xii 
Leroy, A. M., 720 
Level of significance, 536, 546 
observed, 539 
relation to sample size, 617 
Level of test, 536 
Levels of factors, 763 
Levin, B., 70 
Levine, R., xi 
Lieblein, J., 312 
Likelihood function, 390, 418 
Likelihood ratio, 552 
Likelihood ratio statistic, 544 
Likelihood ratio test, 544, 583, 594 
F test, 602 
large-sample, 545 
for proportions, 630 
two-sample t test, 592
Lindgren, B. W., 2 
Linear function 
of bivariate normal random 
vector, 342 
covariance matrix, 742 


distribution, 169, 178 
of independent normal random 
variables, 310 
mean of, 217 
moment generating function of, 
237 
of normal random variable, 306 
standard deviation, 229 
variance, 229, 253, 703 
Linear regression 
general linear model, 736 
multiple, 738 
simple, 700 
Linear transformation 
p.d.f. of, 186 
Liukkonen, J., xi 
Loch, S., xi 
Lockwood, J. R., 834 
Lognormal distribution, 312 
Lorenzen, T. J., 302 
Loss function, 409, 412, 415 
absolute error, 411 
hypothesis testing, 606, 607 
squared error, 410 
Lower quartile, 115 
Lubischew, A. A., 339 
Lyle, R. M., 596 


M.A.E., 245 
m.g.f., 236. See also Moment 
generating function 
M.L.E. See Maximum likelihood 
estimator 
M.S.E. See Mean squared error 
Main effects of factors, 774 
Manly, B. F. J., 531 
Mann, H. B., 680 
Marginal c.d.f., 131
Marginal distribution, 130
of Markov chain, 197
Marginal p.d.f., 131
Marginal p.f., 131 
Markov chain, 188, 825 
convergence, 199, 825 
initial distribution, 196 
stationary distribution, 198, 199 
transition distribution, 190 
stationary, 190 
transition matrix, 191 
Markov chain Monte Carlo, 825 
Markov inequality, 349 
Markowitz, H., 231 
Marx, M. L., 2
Matching problem, 49 
Matzkin, R., xi 
Maximum likelihood estimate, 418 
Maximum likelihood estimator, 418 
asymptotic distribution, 523 
consistency, 428 
of a function, 427 
general linear model, 740 
invariance property, 426 
limitations of, 422 
relation to Bayes estimator, 432 
relation to sampling plan, 439 
relation to sufficient statistic, 
453 
simple linear regression, 701 
two-way layout, 765 
with replications, 774 
Maximum of random sample, 180 
McCabe, G. P., 471, 487, 707, 754 
McConnell, T., xi 
Mean, 208, 209 
conditional, 256, 257 
does not exist, 208, 210 
exists, 208, 210 
of a function, 213, 215 
infinite, 208-209 
of linear function, 217 
sample, 310, 474 
Mean absolute error, 245 
Mean square, 758, 767 
Mean squared error, 244 
and bias, 507 
prediction, 704 
Mean vector, 741 
Median, 115-116, 241 
sample, 458, 667 
Median absolute deviation, 670 
Memoryless property 
of exponential distribution, 322 
of geometric distribution, 300 
Mendenhall, W., 2 
M-estimator, 672 
Method of moments estimator, 430, 
431 
Metropolis, N., 823, 836 
Metropolis algorithm, 836 
Meyer, P. L., 2 
Miller, L., 2 
Miller, M., 2 
Minimal jointly sufficient statistic, 
452 
Minimal sufficient statistic, 452, 453,
454 
Minimum of random sample, 180 
Minimum variance unbiased 
estimator, 522 
MLR. See Monotone likelihood ratio
Mode, 280 
Moment, 234 
central, 235 
sample, 430 
Moment generating function, 236 
uniqueness, 238 
Monotone likelihood ratio, 560 
and uniformly most powerful test, 
562 
Monte Carlo analysis, 791 
Mood, A. M., 2 
Moore, D. S., 471, 487, 707, 754 
Morrison, D. F., 343 
Mueller, H.-G., xi 
Müller, M. E., 805
Multinomial coefficient, 43 
Multinomial distribution, 334 
covariance, 336 
mean, 336 
p.d.f., 334 
relation to binomial distribution, 
335 
relation to Poisson distribution, 
337 
variance, 336 
Multinomial theorem, 43, 46 
Multiple linear regression, 738 
Multiple R². See R²
Multiple step transition matrix, 
194 
Multiplication rule 
for conditional probabilities, 
58-59 
for counting, 26-27 
for distributions, 147 
Multivariate Bayes’ theorem, 162 
Multivariate law of total probability, 
162 
Multivariate normal distribution,
741
Mutually exclusive events, 11, 72 
Mutually independent events, 68, 72 
Myers, R., xi 


Name of distribution, 99 


Negative binomial distribution, 298 
extended definition, 301 
m.g.f., 299 
mean, 299 
p.f., 297
relation to binomial distribution, 

345 
variance, 299 

Negative binomial distribution 

Poisson approximation, 302 

Negatively correlated, 251 

Newton’s method, 429 

Neyman, J., 444, 553 

Neyman-Pearson lemma, 553 

Nickless, G., 590 

Nocedal, J., 430 

Noncentrality parameter, 579, 580 

Noncentral t distribution, 579

Nonparametric bootstrap, 840, 

843-845 

Nonparametric methods, 625 

Nonparametric problems, 625 

Normal distribution, 303 
as conjugate prior, 398 
conjugate prior for, 398 
m.g.f., 304 
mean, 305 
p.d.f., 303 
standard, 307 
variance, 305 

Normal equations, 692, 693 

Normal-gamma distribution, 497 

Normalizing constant, 105, 391 

Null hypothesis, 531 
interval, 571 


Observable random variables, 377, 
378 

Observed level of significance, 539 

Olkin, I., 2 

Olsen, A., 473 

One-sided alternative, 562

One-sided hypothesis, 532 

One-way layout, 755 

Bayesian analysis, 831 

Ordered sampling with replacement, 
35 

Order statistics, 451 

Ore, O., 2 

Orthogonal matrix, 476-478 

Outcome, 6-7 

Outlier, 674, 718, 719 


Overall mean, 764, 774 


p.d.f., 101 
conditional, 144, 146, 160 
joint, 154 
marginal, 131 
nonuniqueness of, 102 
p.f., 96
conditional, 142, 146, 160
joint, 119, 153
marginal, 131
p.f./p.d.f., 124
conditional, 160 
joint, 155 
Parallel, 167 
Parameter, 377, 378 
as limit of random variables, 383 
Parameter space, 378 
Parametric bootstrap, 840, 845-848 
Pareto distribution, 326 
as conjugate prior, 407 
Parker, A. J., 590 
Partition, 60 
Pascal, B., 1 
Pearson, E. S., 553 
Pearson, K., 626 
Percentile, 112 
Percentile bootstrap confidence 
interval, 843 
Percentile t bootstrap confidence 
interval, 844 
Permutations, 28 
Peruggia, M., xi 
Peterson, A. V., 813 
Piland, N. F., 500 
Pivotal, 489 
Placebo, 57 
Poisson approximation 
to binomial distribution, 291 
to hypergeometric distribution, 
292 
to negative binomial distribution, 
302 
Poisson distribution, 288 
conjugate prior for, 397 
m.g.f., 290 
mean, 289 
relation to gamma distribution, 
346 
variance, 290 
Poisson process, 293 
assumptions, 294 


inter-arrival times, 324 
Port, S., 2 
Positively correlated, 251 
Posterior distribution, 387 

approximate normality, 524 
Posterior hyperparameters, 395 
Posterior probability, 80 
Power function, 534 

ANOVA, 760 

χ² goodness-of-fit test, 850

F test, 600 

general linear model, 747 

sign test, 680 

t test, 579, 582, 811 

two-sample t test, 590


Wilcoxon-Mann-Whitney test, 682


Precision, 495 
Prediction, 380 

general linear model, 747-748 
Prediction interval, 717 

Bayesian inference, 732 
Predictor, 699 
Presidential election (2000), 785 
Prien, R. F., 57 
Prior distribution, 385 

conjugate family, 395, 496 

improper, 403, 502 
Prior hyperparameters, 395 
Prior probability, 80 
Probability, 17 

conditional, 56 
Probability density function, 101 
Probability function, 96 

joint, 119 
Probability integral transformation, 

170, 804 

Probability measure, 17 
Probability vector, 196 
Pseudo-random numbers, 170 
p-value, 539 

Bernoulli parameter, 540 

F test, 600 

and posterior probability, 616 

and test statistic, 539 

t test, 578 

two-sided, 583 
two-sample, 589, 591 


Q-Q plot, 720 
Quantile, 112 

sample, 670 
Quantile function, 112 


Quantile plot, 720 

Quartile, 115 
lower, 115 
upper, 115 

Quetelet, A., 412 


R², 748, 753
Ralescu, S., xi 
Randall-Maciver, R., 531 
Randomized response, 462 
Randomized test, 556 
Random number generator, 170 
Random process, 188 
Random sample, 158 
Random variables, 93 
continuous, 101 
conditional distribution, 144 
expectation, 209 
function of, 168, 172 
joint distribution, 120 
discrete, 95 
conditional distribution, 142 
expectation, 208 
function of, 168 
joint distribution, 118 
distribution, 94 
marginal, 130 
expectation of function, 213, 215 
expectation of product, 251 
independent, 135 
negatively correlated, 251 
positively correlated, 251 
standard deviation, 226 
uncorrelated, 251 
variance, 226 
of sum, 253 
Random vector, 153 
Range of random sample, 181 
Rank test 
paired observations, 684-685 
power function, 682 
Wilcoxon-Mann-Whitney, 681 
Rao, C. R., 384, 457, 518 
Ravishankar, K., xi 
Regression. See Linear regression 
Regression coefficients, 699 
confidence interval, 715-716 
hypothesis testing, 712-715 
joint confidence set, 722 
simultaneous inference, 721-726 
Reinsel, G. C., 751 
Reject hypothesis, 531, 545
Rejection region, 533, 546

Reliability, 167 

Replications, 773 

Residual mean square, 758 

Residuals, 717, 749, 760 

Residual sum of squares, 757, 767, 
776 

Response, 699 

Rice, J. A., 2 

Risk-neutral price, 215 

Robust estimator, 460, 666, 667 

Robust linear regression, 837 

Rohatgi, V. K., 384 

Rohlf, F. J., 640 

Rousseauw, P. J., 720 

Rubenstein, R. Y., 172 

Rutherford, E., 640 


Sample c.d.f., 658 
Sample distribution function, 658 
Sample mean, 310, 474 
Sample median, 458, 667 
Sample moment, 430 
Sample quantile, 670 
asymptotic distribution, 677 
Sample size, 158 
Sample space, 6, 7 
simple, 23 
Sample variance, 421, 474 
Sampling distribution, 465 
Sampling without replacement, 27 
Sampling with replacement, 29 
ordered, 35 
unordered, 35 
Saphire, D., xi 
Savage, L. J., 444 
Scale parameter, 670 
Schaeffer, R. L., 2 
Scheffé, H., 723, 760, 781 
Schervish, M. J., 383, 384, 428, 432, 
504, 523, 524, 610, 635, 677 
Scholes, M., 313, 799 
Schwarz inequality, 250 
Sensitivity analysis, 387, 460 
Sepanski, S., xi 
Sepulveda, D., 400 
Serial dependence, 750 
Series, 167 
Sestrich, H., xii 
Set, 6 
Set theory, 7-13 
Sharpe, R. H., 611
Sign test, 679
power function, 680 
Simple hypothesis, 532, 550-557 
Simple linear regression, 700 
assumptions, 700 
Bayes test, 733 
distribution of estimators, 709 
improper prior, 729 
M.L.E., 701 
posterior distribution, 729-731 
prediction interval, 716 
robust, 837 
Simple sample space, 23 
Simpson, J., 473 
Simpson’s paradox, 653-656 
Simulation, 170, 787, 788 
discrete random variables, 
812-814 
notation, 792 
probability integral transforma- 
tion, 170, 804 
Simulation distribution, 794 
Simulation size, 798 
Simulation standard error, 796 
of an average, 796 
of a sample quantile, 797 
of a smooth function, 796 
Simulation variance, 795, 796 
Size of test, 536, 546 
Skewness, 235 
Smirnov, N. V., 660 
Smith, A. F. M., 823
Smith, H. L., 500, 718, 738, 750 
Sokal, R. R., 640 
Squared error loss, 410 
Standard deviation, 226 
infinite, 226 
Standard normal distribution, 307 
State of process, 188 
initial, 188 
Stationary distribution, 198, 199 
Stationary transition distribution, 
190 
Statistic, 382 
χ², 626
sufficient, 444, 449 


Statistical decision problem, 269, 381 


Statistical inference, 378 
Statistical model, 377 
Statistical significance 
relation to practical significance, 
619 


Stein, C., 511 
Stigler, S. M., 2, 412 
Stirling’s formula, 31, 318 
Stochastically larger, 683 
Stochastic matrix, 191 
Stochastic process, 188 
Stone, C. L., 2 
Stratified importance sampling, 
820-821 
Strong convergence, 355 
Strong law of large numbers, 355 
Student, 480 
Subjective interpretation of 
probability, 3-4 
Subset, 6, 7 
Sufficient statistic, 444, 449 
limitations of, 459 
minimal, 452, 453, 454 
Sum of squares 
between, 757 
factor, 767, 776 
interaction, 776 
residual, 757, 767, 776 
total, 757, 766, 775 
Support, 96, 101, 121 


Tail area, 539. See also p-value 
Tan, H., xii 
Tanis, E. A., 2 
Taylor’s theorem, 225 
two-dimensional, 803 
t distribution, 480 
moments, 480-481 
p.d.f., 480 
relation to F distribution, 598
relation to normal distribution, 
481 
variance, 481 
Test, 531 
Bayes, 606 
and confidence sets, 540 
randomized, 556 
UMP, 560 
unbiased, 573 


Testing hypotheses. See Hypothesis testing
Test procedure. See Test 
Test statistic, 533 
Thiru, K., xii 
Thomson, A., 531 
Tibshirani, R., 843 
Tierney, L., 825 


Todhunter, I., 2 
Total sum of squares, 757, 766, 775 
Transition distribution, 190 


stationary, 190 


Transition matrix, 191 


multiple step, 194 


Trigamma function, 430 
Trimmed mean, 670 
Troske, K., xii 

t test, 577 


level, 577 

as a likelihood ratio test, 583, 592 
power function, 579, 582, 590, 811 
p-value, 578, 583, 589 
two-sample, 588 

unbiased, 577 


Tubb, A., 590 

Tukey, J. W., 667 
Twain, M., 51 
Two-sample t test, 588


p-value, 589 


Two-sided alternative, 565, 568-574 
Two-sided hypothesis, 532 
Two-stage test, 778, 807 

Two-way layout, 763 


with replications, 773 
unequal numbers, 780 


Type I error, 535 
Type II error, 535


UMP test. See Uniformly most powerful test

Unbiased estimator, 507, 511
with minimum variance, 522

Unbiased test, 573

Uncorrelated, 251

Uncountable, 8, 13

Uniform distribution on integers, 97

Uniform distribution on interval, 103
conjugate prior for, 407

Uniformly most powerful test, 560
and monotone likelihood ratio, 562

Union, 9
probability of, 19, 46-48

Unordered sampling with replacement, 35

Upper quartile, 115

Utility function, 265, 415


Value at risk, 113 


Van Middelem, C. H., 611 
Van Ness, J., xii 
Vardi, Y., xii
Variance, 226 
conditional, 260 
does not exist, 226 
infinite, 226 
sample, 421, 474 
of sample mean, 350 
simulation, 795, 796 
of sum of independent random 
variables, 230 
of sum of random variables, 253 
Variance stabilizing transformation, 
365 
Vaynberg, Y., xii 


Vector notation, 153 
Venn diagram, 9 
Ventura, V., xii 
Verducci, J., xi
Vezveai, M., xii 
Vidakovic, B., xii 
Vorwerk, K., xii 


Wackerly, D. D., 2 

Walker, A. J., 813 

Warren, B., xii 

Weak convergence, 355 

Weak law of large numbers, 355 
Weibull distribution, 326 

Weisberg, S., 699, 718, 720, 738, 750
Welch, B. L., 593
Welsch, R. E., 718 
Whitney, D. R., 680 
Wilcoxon, F., 685 
Wilcoxon-Mann-Whitney ranks test, 
680 
power function, 682 
ties, 682 
Williams, C. L., xii 
Winsor, C. P., 404 
Wolff, L., xii 
Wright, S., 430 


Young, G. A., 843 


Zelen, M., 312 


Discrete Distributions 


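Bernoulli with parameter $p$
p.f.      $f(x) = p^x (1-p)^{1-x}$, for $x = 0, 1$
Mean      $p$
Variance  $p(1-p)$
m.g.f.    $\psi(t) = pe^t + 1 - p$

Binomial with parameters $n$ and $p$
p.f.      $f(x) = \binom{n}{x} p^x (1-p)^{n-x}$, for $x = 0, \ldots, n$
Mean      $np$
Variance  $np(1-p)$
m.g.f.    $\psi(t) = (pe^t + 1 - p)^n$

Uniform on the integers $a, \ldots, b$
p.f.      $f(x) = \frac{1}{b-a+1}$, for $x = a, \ldots, b$
Mean      $\frac{b+a}{2}$
Variance  $\frac{(b-a)(b-a+2)}{12}$
m.g.f.    $\psi(t) = \frac{e^{(b+1)t} - e^{at}}{(e^t - 1)(b-a+1)}$

Hypergeometric with parameters $A$, $B$, and $n$
p.f.      $f(x) = \frac{\binom{A}{x}\binom{B}{n-x}}{\binom{A+B}{n}}$, for $x = \max\{0, n-B\}, \ldots, \min\{n, A\}$
Mean      $\frac{nA}{A+B}$
Variance  $\frac{nAB}{(A+B)^2} \cdot \frac{A+B-n}{A+B-1}$
m.g.f.    Nothing simpler than $\psi(t) = \sum_x f(x)e^{tx}$

Geometric with parameter $p$
p.f.      $f(x) = p(1-p)^x$, for $x = 0, 1, \ldots$
Mean      $\frac{1-p}{p}$
Variance  $\frac{1-p}{p^2}$
m.g.f.    $\psi(t) = \frac{p}{1-(1-p)e^t}$, for $t < \log(1/[1-p])$

Negative binomial with parameters $r$ and $p$
p.f.      $f(x) = \binom{r+x-1}{x} p^r (1-p)^x$, for $x = 0, 1, \ldots$
Mean      $\frac{r(1-p)}{p}$
Variance  $\frac{r(1-p)}{p^2}$
m.g.f.    $\psi(t) = \left(\frac{p}{1-(1-p)e^t}\right)^r$, for $t < \log(1/[1-p])$

Poisson with mean $\lambda$
p.f.      $f(x) = e^{-\lambda}\frac{\lambda^x}{x!}$, for $x = 0, 1, \ldots$
Mean      $\lambda$
Variance  $\lambda$
m.g.f.    $\psi(t) = e^{\lambda(e^t - 1)}$

Multinomial with parameters $n$ and $(p_1, \ldots, p_k)$
p.f.      $f(x_1, \ldots, x_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}$, for $x_1 + \cdots + x_k = n$ and all $x_i \ge 0$
Mean      $E(X_i) = np_i$, for $i = 1, \ldots, k$
Variance  $\mathrm{Var}(X_i) = np_i(1-p_i)$, $\mathrm{Cov}(X_i, X_j) = -np_ip_j$, for $i, j = 1, \ldots, k$ with $i \ne j$
m.g.f.    Multivariate m.g.f. can be defined, but is not defined in this text.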
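
As a sanity check on the discrete-distribution table above, the following minimal Python sketch computes means and variances by exact summation over the p.f. and compares them with the tabulated closed forms for three of the entries. The helper names (`moments`, `binomial_pmf`, and so on) and the parameter values are illustrative assumptions, not anything defined in the text.

```python
from fractions import Fraction
from math import comb


def moments(pmf):
    """Return (mean, variance) of a p.f. given as a dict {x: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, var


def binomial_pmf(n, p):
    # f(x) = C(n, x) p^x (1 - p)^(n - x), x = 0, ..., n
    return {x: comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)}


def hypergeometric_pmf(A, B, n):
    # f(x) = C(A, x) C(B, n - x) / C(A + B, n), x = max{0, n - B}, ..., min{n, A}
    lo, hi = max(0, n - B), min(n, A)
    return {x: Fraction(comb(A, x) * comb(B, n - x), comb(A + B, n))
            for x in range(lo, hi + 1)}


def uniform_integers_pmf(a, b):
    # f(x) = 1 / (b - a + 1), x = a, ..., b
    return {x: Fraction(1, b - a + 1) for x in range(a, b + 1)}


# Arbitrary illustrative parameter values; exact rational arithmetic throughout.
n, p = 7, Fraction(2, 5)
assert moments(binomial_pmf(n, p)) == (n * p, n * p * (1 - p))

A, B, m = 5, 7, 4
expected_var = Fraction(m * A * B, (A + B) ** 2) * Fraction(A + B - m, A + B - 1)
assert moments(hypergeometric_pmf(A, B, m)) == (Fraction(m * A, A + B), expected_var)

a, b = 3, 9
assert moments(uniform_integers_pmf(a, b)) == (Fraction(a + b, 2),
                                               Fraction((b - a) * (b - a + 2), 12))
```

Exact fractions are used rather than floating point so that the comparison with the closed forms is an equality test and not a tolerance check.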


Continuous Distributions 


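Beta with parameters $\alpha$ and $\beta$
p.d.f.    $f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}$, for $0 < x < 1$
Mean      $\frac{\alpha}{\alpha+\beta}$
Variance  $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
m.g.f.    Not available in simple form

Uniform on the interval $[a, b]$
p.d.f.    $f(x) = \frac{1}{b-a}$, for $a < x < b$
Mean      $\frac{a+b}{2}$
Variance  $\frac{(b-a)^2}{12}$
m.g.f.    $\psi(t) = \frac{e^{bt} - e^{at}}{t(b-a)}$

Exponential with parameter $\beta$
p.d.f.    $f(x) = \beta e^{-\beta x}$, for $x > 0$
Mean      $\frac{1}{\beta}$
Variance  $\frac{1}{\beta^2}$
m.g.f.    $\psi(t) = \frac{\beta}{\beta - t}$, for $t < \beta$

Gamma with parameters $\alpha$ and $\beta$
p.d.f.    $f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}$, for $x > 0$
Mean      $\frac{\alpha}{\beta}$
Variance  $\frac{\alpha}{\beta^2}$
m.g.f.    $\psi(t) = \left(\frac{\beta}{\beta - t}\right)^{\alpha}$, for $t < \beta$

Normal with mean $\mu$ and variance $\sigma^2$
p.d.f.    $f(x) = \frac{1}{(2\pi)^{1/2}\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
Mean      $\mu$
Variance  $\sigma^2$
m.g.f.    $\psi(t) = \exp\left(\mu t + \frac{t^2\sigma^2}{2}\right)$

Bivariate normal with means $\mu_1$ and $\mu_2$, variances $\sigma_1^2$ and $\sigma_2^2$, and correlation $\rho$
p.d.f.    Formula is too large to print here. See Eq. (5.10.2) on page 338.
Mean      $E(X_i) = \mu_i$, for $i = 1, 2$
Variance  Covariance matrix: $\begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$
m.g.f.    Bivariate m.g.f. can be defined, but is not defined in this text.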


Continuous Distributions 


Lognormal with parameters $\mu$ and $\sigma^2$


F with m and n degrees of freedom
Not finite for $t \neq 0$


Corrections to Probability and Statistics (Fourth Edition) 


This file was last updated December 10, 2010 


• p. 401, last displayed equation: “0.1154)” should read “0.1154” (11/15/10)
• p. 401, last displayed equation: “(1.12)” should read “(0.4116)” (11/15/10)


• p. 536, next line after (9.1.8): “Example 6.5.15” should read “Example
7.5.7” (11/15/10)


• p. 549, Exercise 11: “6.” should read “6” at the ends of parts b and c.
• p. 591, Example 9.6.4, line 2: “w =” should read “|w| =”. (11/15/10)


• p. 612, 2 and 4 lines after (9.8.16): In both places, “T7’,(1— a)” should
read “—T7*,(1 — ag)” (11/15/10)


• p. 682, first line of “Ties” section: “signed ranks test” should read “ranks
test”. (11/15/10)


• p. 683, Eq. (10.8.6): The correct formula is


$$\mathrm{Var}(S) = mn\bigl[\Pr(X_1 > Y_1) - (m+n-1)\Pr(X_1 > Y_1)^2 + (n-1)\Pr(X_1 > Y_1, X_1 > Y_2) + (m-1)\Pr(X_1 > Y_1, X_2 > Y_1)\bigr].$$


• p. 685, Exercise 15: There are a few errors in the statement of the problem.
The corrected exercise is given here:


15. Consider again the conditions of Exercise 1. This time, let $D_i = X_i - Y_i$.
Wilcoxon (1945) developed the following test of the hypotheses
(10.8.7). Order the absolute values $|D_1|, \ldots, |D_n|$ from smallest to
largest, and assign ranks from 1 to $n$ to the values. Then $S_W$ is set
equal to the sum of all the ranks of those $|D_i|$ such that $D_i > 0$.
If the distribution of $D_i$ is symmetric around 0, then the mean and
variance of $S_W$ are


$$E(S_W) = \frac{n(n+1)}{4}, \qquad (10.8.8)$$
$$\mathrm{Var}(S_W) = \frac{n(n+1)(2n+1)}{24}. \qquad (10.8.9)$$


The test rejects $H_0$ if $S_W \ge c$, where $c$ is chosen to make the test
have level of significance $\alpha_0$. This test is called the Wilcoxon signed
ranks test. If $n$ is large, a normal distribution approximation allows
us to use $c = E(S_W) + \Phi^{-1}(1 - \alpha_0)\,\mathrm{Var}(S_W)^{1/2}$.


a. Let $W_i = 1$ if the $|D_j|$ that gets rank $i$ has $D_j > 0$ and $W_i = 0$
if not. Show that $S_W = \sum_{i=1}^{n} iW_i$.


b. Prove that $E(S_W)$ is as stated in Eq. (10.8.8) under the assumption
that the distribution of $D_i$ is symmetric around 0. Hint:
You may wish to use Eq. (4.7.13).

c. Prove that $\mathrm{Var}(S_W)$ is as stated in Eq. (10.8.9) under the assumption
that the distribution of $D_i$ is symmetric around 0.
Hint: You may wish to use Eq. (4.7.14).


e p. 703, 3rd displayed equation: “0.382” should read “—0.382” (11/15/10) 

e p. 704, Example 11.2.5, line 3: “—81.049” should read “—81.06” (11/15/10) 

* p. 708, first line after (11.3.2): “S),_, a1j@2;” should read “7"_, a1ja2;”. 
(11/15/10) 

e p. 722, Eq. (11.3.30), end of first line: “3%,” should read “97”. (11/15/10) 


e p. 728, Exercise 22: The problem should have asked for the regression 
of logarithm of 1980 price on the logarithm of 1970 price. It can be 
solved either as stated here or as stated in the text, but the regression on 
logarithm of 1970 price makes more sense. (11/15/10) 


se a a 29 
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e p. 731, end of first displayed equation: “— =,” should read “— ars , 
T iat Fj Dia Tj 


(11/15/10) 
e p. 731, two lines after (11.4.6): “—7?/2” should read “—7/2”. (11/15/10) 
° p. 732, first line after (11.4.7): “15” should read “14”. (11/15/03) 
e p. 734, Example 11.4.4, line 6: “7.191” should read “7.181”. (11/15/10) 


e p. 741, Example 11.5.3, end of first displayed equation: “144.1” should 
read “172.3”. (11/15/10) 


e p. 743, first line after the end of Theorem 11.5.3: “7 =1,...,n” should 
read “j =0,...,p—1”. (11/15/10) 


e p. 749, first displayed equation: “zi089 — ++: — Zip—1Gp—1” should read 
“zi Bo — +++ — Zip-1Bp—1”. (11/15/10) 


e p. 752, second line of “Summary” section: Bots Sipe: Baar should 
read “Zio Po ete Sle Zip—1Bp—1 . (11/15/10) 


• p. 752, Exercise 2, line 2: “$S^2$ has the $\chi^2$” should be “$S^2/\sigma^2$ has the $\chi^2$”.
(11/15/10)


• p. 758, line 5: “Eq. (11.6.8) has the” should read “Eq. (11.6.8), when
divided by $\sigma^2$, has the”. (11/15/10)


e p. 761, Exercise 2, displayed equation: Both places where o? appears in 
a denominator should be a. (11/15/10) 


• p. 763, Exercise 14(a): “$\sum_{i=1}^{p} \alpha_i$” should be “$\sum_{i=1}^{p} n_i\alpha_i$” (11/15/10)


• p. 805, Example 12.3.2, first row of last displayed equation: “$f(\exp[(y_1^2 + y_2^2)/2],$”
should read “$f(\exp[-(y_1^2 + y_2^2)/2],$”. (11/15/10)


e p. 812, Example 12.3.10, last line: “is” should read “if”. (11/15/10) 


p. 832-834, the numerical part of Example 12.5.6 suffered from an error in 
the input data. The correct analysis changes the conclusions stated in the 
text. Corrected text can be found by following the link to more extensive 
corrections at the top of the web page. 


p. 838, Exercise 12, line 5: “variance yo” should read “precision yo” 
(11/15/10) 


e p. 838, Exercise 13(a): The “1/2” exponent should be “—1/2” in the 
displayed formula. (11/15/10) 


• p. 838-839, Exercise 14: Add the text “The prior hyperparameters are
$\alpha_0 = 0.5$, $\mu_0 = 0$, $\lambda_0 = 1$, and $\beta_0 = 0.5$.” (11/15/10)


• p. 839, Exercise 15 part a line 2: Delete the text “if $X_{n+1} < c$, then”
(11/15/10)
• p. 839, Exercise 15 part b line 2: Delete the text “if $X_{n+1} \ge c$, then”


(11/15/10) 
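
For the corrected Exercise 15 (p. 685) listed above, Eqs. (10.8.8) and (10.8.9) can be checked by brute force for small $n$. The sketch below is only an illustration, not part of the errata. It assumes the standard fact that, when the distribution of $D_i$ is continuous and symmetric around 0, the indicators $W_1, \ldots, W_n$ of part (a) are independent and each equals 1 with probability 1/2, so that all $2^n$ sign patterns are equally likely. The function name `signed_rank_moments` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def signed_rank_moments(n):
    """Exact mean and variance of S_W = sum_i i * W_i over all 2^n sign patterns.

    Assumes (see the lead-in) that each W_i is 0 or 1 with probability 1/2,
    independently, which is what symmetry around 0 gives when there are no ties.
    """
    values = [
        sum(rank * w for rank, w in zip(range(1, n + 1), pattern))
        for pattern in product((0, 1), repeat=n)
    ]
    count = len(values)
    mean = Fraction(sum(values), count)
    variance = Fraction(sum(v * v for v in values), count) - mean ** 2
    return mean, variance


for n in (3, 5, 8):
    mean, variance = signed_rank_moments(n)
    assert mean == Fraction(n * (n + 1), 4)                  # Eq. (10.8.8)
    assert variance == Fraction(n * (n + 1) * (2 * n + 1), 24)  # Eq. (10.8.9)
```

Enumeration is feasible only for small $n$; it is meant to confirm the formulas, not to replace the proofs asked for in parts (b) and (c).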
Chapter 12 Simulation 


As an example, use the hot dog calorie data from Example 11.6.2. In this
example, $p = 4$. We shall use a prior distribution in which $\lambda_0 = \alpha_0 = 1$, $\beta_0 = 0.1$,
$u_0 = 0.001$, and $\psi_0 = 170$. We use $k = 6$ Markov chains and do $m = 100$ burn-in
simulations, which turn out to be more than enough to make the maximum of all
nine $F$ statistics less than $1 + 0.44m$. We then run each of the six Markov chains
another 10,000 iterations. The samples from the posterior distribution allow us
to answer any questions that we might have about the parameters, including
some that we would not have been able to answer using the analysis done in
Chapter 11. For example, the posterior means and standard deviations of some
of the parameters are listed in Table 12.6. To see how different the variances are,
we can estimate the probability that the variance of one group is at least 2.25
times as high as that of another group by computing the fraction of iterations
$\ell$ in which at least one $\tau_i^{(\ell)}/\tau_j^{(\ell)} > 2.25$. The result is 0.4, indicating that there
is some chance that at least some of the variances are different. If the variances
are different, the ANOVA calculations in Chapter 11 are not justified.
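
The fraction described in the preceding paragraph can be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `fraction_with_large_ratio` is hypothetical, and the precision draws shown are made-up illustrative numbers, not output from the Gibbs sampler in the example. It uses the observation that at least one ratio $\tau_i/\tau_j$ exceeds the threshold exactly when the largest precision divided by the smallest does.

```python
def fraction_with_large_ratio(tau_draws, threshold=2.25):
    """Fraction of iterations in which some tau_i / tau_j exceeds `threshold`.

    `tau_draws` is a sequence of iterations, each a sequence of positive
    precisions (one per group). Some ratio exceeds the threshold if and only
    if max(tau) / min(tau) does, so only that one ratio is checked.
    """
    hits = sum(1 for taus in tau_draws if max(taus) / min(taus) > threshold)
    return hits / len(tau_draws)


# Made-up precision draws for four groups (hypothetical, for illustration only).
example_draws = [
    (0.0020, 0.0018, 0.0019, 0.0021),  # largest ratio about 1.17
    (0.0040, 0.0010, 0.0020, 0.0020),  # largest ratio 4.0
    (0.0015, 0.0030, 0.0016, 0.0014),  # largest ratio about 2.14
    (0.0050, 0.0020, 0.0021, 0.0022),  # largest ratio 2.5
]
print(fraction_with_large_ratio(example_draws))
```

Because each variance is the reciprocal of the corresponding precision, a ratio of precisions above 2.25 for one ordered pair is a ratio of variances above 2.25 for the reversed pair, so the same function answers the question about variances.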

We can also address the question of how much difference there is between
the $\mu_i$’s. For comparison, we shall do the same calculations that we did in
Example 12.3.7. In 99 percent of the 60,000 simulations, at least one
$|\mu_i^{(\ell)} - \mu_j^{(\ell)}| > 26.35$. In about one-half of the simulations, all
$|\mu_i^{(\ell)} - \mu_j^{(\ell)}| > 2.224$. And in
99 percent of the simulations, the average of the differences was at least 13.78.
Figure 12.9 contains a plot of the sample c.d.f.’s of the largest, smallest, and
average of the six $|\mu_i - \mu_j|$ differences. Careful examination of the results in this
example shows that the four $\mu_i$’s appear to be closer together than we would have
thought after the analysis of Example 12.3.7. This is typical of what occurs when
we use a proper prior in a hierarchical model. In Example 12.3.7, the $\mu_i$’s were
all independent, and they did not have a common unknown mean in the prior.
In Example 12.5.6, the $\mu_i$’s all have a common prior distribution with mean $\psi$,
which is an additional unknown parameter. The estimation of this additional
parameter allows the posterior distributions of the $\mu_i$’s to be pulled toward a


Table 12.6 Posterior means and standard deviations for
some parameters in Example 12.5.6


Type                              Beef     Meat     Poultry  Specialty
$i$                               1        2        3        4
$E(\mu_i|y)$                      156.6    158.3    120.5    159.6
$(\mathrm{Var}(\mu_i|y))^{1/2}$   4.893    5.825    5.521    7.615
$E(1/\tau_i|y)$                   495.6    608.5    542.9    568.2
$(\mathrm{Var}(1/\tau_i|y))^{1/2}$ 166.0   221.2    201.6    307.4


$E(\psi|y) = 151.0$    $(\mathrm{Var}(\psi|y))^{1/2} = 11.16$


Example 12.5.7 


12.5 Markov Chain Monte Carlo 833 


Figure 12.9 Sample c.d.f.’s of the maximum, average, and minimum
of the six $|\mu_i - \mu_j|$ differences for Example 12.5.6.
(Vertical axis: sample d.f.; horizontal axis: difference.)


location that is near the average of all of the samples. With these data, the
overall sample average is 147.60. ◁


Prediction 


All of the calculations done in the examples of this section have concerned 
functions of the parameters. The sample from the posterior distribution that 
we obtain from Gibbs sampling can also be used to make predictions and form 
prediction intervals for future observations. The most straightforward way to 
make predictions is to simulate the future data conditional on each value of the 
parameter from the posterior sample. Although there are more efficient methods 
for predicting, this method is easy to describe and evaluate. 


Calories in Hot Dogs. In Example 12.5.6, we might be concerned with how
different we should expect the calorie counts of two hot dogs to be. For example,
let $Y_1$ and $Y_3$ be future calorie counts for hot dogs of the beef and poultry
varieties, respectively. We can form a prediction interval for $D = Y_1 - Y_3$ as
follows. For each iteration $\ell$, let the simulated parameter vector be


£ £ £ £ L £ £ £ 
9) = ( a is’ ys Z ie’ rf : rr), TY, w, Bo). 


For each $\ell$, simulate a beef hot dog calorie count $y_1^{(\ell)}$ having the normal
distribution with mean $\mu_1^{(\ell)}$ and variance $1/\tau_1^{(\ell)}$. Also simulate a poultry hot
dog calorie count $y_3^{(\ell)}$ having the normal distribution with mean $\mu_3^{(\ell)}$ and
variance $1/\tau_3^{(\ell)}$. Then compute $D^{(\ell)} = y_1^{(\ell)} - y_3^{(\ell)}$. Sample quantiles of the values
$D^{(1)}, \ldots, D^{(60{,}000)}$ can be used to estimate quantiles of the distribution of $D$.




For example, suppose that we want a 90 percent prediction interval for $D$.
We simulate 60,000 $D^{(\ell)}$ values as above and find the 0.05 and 0.95 sample
quantiles to be −18.49 and 90.63, which are then the endpoints of our prediction
interval. To assess how close the simulation estimators are to the actual quantiles
of the distribution of $D$, we compute the simulation standard errors of the two
endpoints. For the samples from each of the $k = 6$ Markov chains, we can compute
the sample 0.05 quantiles of our $D$ values. We can then use these values as
$Z_1, \ldots, Z_6$ in Eq. (12.5.1) to compute a value $S$. Our simulation standard error
is then $S/6^{1/2}$. We can then repeat this for the sample 0.95 quantiles. For the two
endpoints of our interval, the simulation standard errors are 0.2228 and 0.4346,
respectively. These simulation standard errors are fairly small compared to the
length of the prediction interval. ◁
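
The prediction procedure just described can be sketched in Python as follows. Everything here is an illustration under stated assumptions. The posterior draws come from a hypothetical stand-in (`fake_chain`), not from the Gibbs sampler of Example 12.5.6, so the printed numbers will not match those in the text. The sample quantile uses one common order-statistic convention. The value $S$ is taken to be the sample standard deviation of the per-chain quantiles with divisor $k - 1$, which is an assumption about the form of Eq. (12.5.1).

```python
import math
import random
import statistics


def sample_quantile(values, q):
    """Order statistic at 1-based position ceil(q * n); one common convention."""
    ordered = sorted(values)
    position = math.ceil(q * len(ordered))
    return ordered[min(len(ordered), max(1, position)) - 1]


def predictive_differences(draws, rng):
    """For each posterior draw (mu1, tau1, mu3, tau3), simulate D = Y1 - Y3."""
    differences = []
    for mu1, tau1, mu3, tau3 in draws:
        y1 = rng.gauss(mu1, math.sqrt(1.0 / tau1))  # normal with variance 1/tau1
        y3 = rng.gauss(mu3, math.sqrt(1.0 / tau3))  # normal with variance 1/tau3
        differences.append(y1 - y3)
    return differences


def simulation_standard_error(per_chain_values):
    """S / k^(1/2), with S assumed to be the s.d. (divisor k - 1) of the k values."""
    k = len(per_chain_values)
    return statistics.stdev(per_chain_values) / math.sqrt(k)


def fake_chain(n_draws, rng):
    """HYPOTHETICAL stand-in for one Markov chain's output (not the real posterior)."""
    return [
        (rng.gauss(156.0, 5.0), rng.gammavariate(10.0, 1.0 / 5000.0),
         rng.gauss(120.0, 5.0), rng.gammavariate(10.0, 1.0 / 5000.0))
        for _ in range(n_draws)
    ]


rng = random.Random(0)
chains = [predictive_differences(fake_chain(2000, rng), rng) for _ in range(6)]
pooled = [d for chain in chains for d in chain]

# 90 percent prediction interval from the pooled 0.05 and 0.95 sample quantiles.
interval = (sample_quantile(pooled, 0.05), sample_quantile(pooled, 0.95))

# Simulation standard error of each endpoint from the six per-chain quantiles.
errors = tuple(
    simulation_standard_error([sample_quantile(chain, q) for chain in chains])
    for q in (0.05, 0.95)
)
print(interval, errors)
```

Computing the quantile separately within each chain and then looking at the spread across chains treats the chains as independent replications, which is why the standard error divides by the square root of the number of chains.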
Preface 


This manual contains solutions to all of the exercises in Probability and Statistics, 4th edition, by Morris 
DeGroot and myself. I have preserved most of the solutions to the exercises that existed in the 3rd edition. 
Certainly errors have been introduced, and I will post any errors brought to my attention on my web page 
http://www.stat.cmu.edu/~mark/ along with errors in the text itself. Feel free to send me comments. 

For instructors who are familiar with earlier editions, I hope that you will find the 4th edition at least as 
useful. Some new material has been added, and little has been removed. Assuming that you will be spending 
the same amount of time using the text as before, something will have to be skipped. I have tried to arrange 
the material so that instructors can choose what to cover and what not to cover based on the type of course 
they want. This manual contains commentary on specific sections right before the solutions for those sections. 
This commentary is intended to explain special features of those sections and help instructors decide which 
parts they want to require of their students. Special attention is given to more challenging material and how 
the remainder of the text does or does not depend upon it. 

To teach a mathematical statistics course for students with a strong calculus background, one could safely 
cover all of the material for which one could find time. The Bayesian sections include 4.8, 7.2, 7.3, 7.4, 8.6, 
9.8, and 11.4. One can choose to skip some or all of this material if one desires, but that would be ignoring 
one of the unique features of the text. The more challenging material in Sections 7.7-7.9 and 9.2-9.4 is really 
only suitable for a mathematical statistics course. One should try to make time for some of the material in 
Sections 12.1-12.3 even if it means cutting back on some of the nonparametrics and two-way ANOVA. To teach 
a more modern statistics course, one could skip Sections 7.7-7.9, 9.2-9.4, 10.8, and 11.7-11.8. This would 
leave time to discuss robust estimation (Section 10.7) and simulation (Chapter 12). Section 3.10 on Markov 
chains is not actually necessary even if one wishes to introduce Markov chain Monte Carlo (Section 12.5), 
although it is helpful for understanding what this topic is about. 


Using Statistical Software 


The text was written without reference to any particular statistical or mathematical software. However, 
there are several places throughout the text where references are made to what general statistical software 
might be able to do. This is done for at least two reasons. One is that different instructors who wish to use 
statistical software while teaching will generally choose different programs. I didn’t want the text to be tied 
to a particular program to the exclusion of others. A second reason is that there are still many instructors 
of mathematical probability and statistics courses who prefer not to use any software at all. 

Given how pervasive computing is becoming in the use of statistics, the second reason above is becoming 
less compelling. Given the free and multiplatform availability and the versatility of the environment R, even 
the first reason is becoming less compelling. Throughout this manual, I have inserted pointers to which R 
functions will perform many of the calculations that would formerly have been done by hand when using this 
text. The software can be downloaded for Unix, Windows, or Mac OS from 
http://www.r-project.org/ 

That site also has manuals for installation and use. Help is also available directly from within the R envi- 
ronment. 

Many tutorials for getting started with R are available online. At the official R site there is the detailed 
manual: http://cran.r-project.org/doc/manuals/R-intro.html 
that starts simple and has a good table of contents and lots of examples. However, reading it from start to 
finish is not an efficient way to get started. The sample sessions should be most helpful. 

One major issue with using an environment like R is that it is essentially programming. That is, students 
who have never programmed seriously before are going to have a steep learning curve. Without going into 
the philosophy of whether students should learn statistics without programming, the field is moving in the 
direction of requiring programming skills. People who want only to understand what a statistical analysis 
is about can still learn that without being able to program. But anyone who actually wants to do statistics 
as part of their job will be seriously handicapped without programming ability. At the end of this manual 
is a series of heavily commented R programs that illustrate many of the features of R in the context of a 
specific example from the text. 


Mark J. Schervish 
Chapter 1 


Introduction to Probability 


1.2 Interpretations of Probability 


Commentary 


It is interesting to have the students determine some of their own subjective probabilities. For example, let 
$X$ denote the temperature at noon tomorrow outside the building in which the class is being held. Have each 
student determine a number $x_1$ such that the student considers the following two possible outcomes to be 
equally likely: $X < x_1$ and $X > x_1$. Also, have each student determine numbers $x_2$ and $x_3$ (with $x_2 < x_3$) such 
that the student considers the following three possible outcomes to be equally likely: $X < x_2$, $x_2 < X < x_3$, 
and $X > x_3$. Determinations of more than three outcomes that are considered to be equally likely can also 
be made. The different values of $x_1$ determined by different members of the class should be discussed, and 
also the possibility of getting the class to agree on a common value of $x_1$. 

Similar determinations of equally likely outcomes can be made by the students in the class for quantities 
such as the following ones which were found in the 1973 World Almanac and Book of Facts: the number 
of freight cars that were in use by American railways in 1960 (1,690,396), the number of banks in the 
United States which closed temporarily or permanently in 1931 on account of financial difficulties (2,294), 
and the total number of telephones which were in service in South America in 1971 (6,137,000). 


1.4 Set Theory 


Solutions to Exercises 


1. Assume that $x \in B^c$. We need to show that $x \in A^c$. We shall show this indirectly. Assume, to the 
contrary, that $x \in A$. Then $x \in B$ because $A \subset B$. This contradicts $x \in B^c$. Hence $x \in A$ is false and 
$x \in A^c$. 


2. First, show that $A \cap (B \cup C) \subset (A \cap B) \cup (A \cap C)$. Let $x \in A \cap (B \cup C)$. Then $x \in A$ and $x \in B \cup C$. 
That is, $x \in A$ and either $x \in B$ or $x \in C$ (or both). So either ($x \in A$ and $x \in B$) or ($x \in A$ 
and $x \in C$) or both. That is, either $x \in A \cap B$ or $x \in A \cap C$. This is what it means to say that 
$x \in (A \cap B) \cup (A \cap C)$. Thus $A \cap (B \cup C) \subset (A \cap B) \cup (A \cap C)$. Basically, running these steps backwards 
shows that $(A \cap B) \cup (A \cap C) \subset A \cap (B \cup C)$. 


3. To prove the first result, let $x \in (A \cup B)^c$. This means that $x$ is not in $A \cup B$. In other words, $x$ is 
neither in $A$ nor in $B$. Hence $x \in A^c$ and $x \in B^c$. So $x \in A^c \cap B^c$. This proves that $(A \cup B)^c \subset A^c \cap B^c$. 
Next, suppose that $x \in A^c \cap B^c$. Then $x \in A^c$ and $x \in B^c$. So $x$ is neither in $A$ nor in $B$, so it can’t be 
in $A \cup B$. Hence $x \in (A \cup B)^c$. This shows that $A^c \cap B^c \subset (A \cup B)^c$. The second result follows from 
the first by applying the first result to $A^c$ and $B^c$ and then taking complements of both sides. 


4. To see that $A \cap B$ and $A \cap B^c$ are disjoint, let $x \in A \cap B$. Then $x \in B$, hence $x \notin B^c$ and so $x \notin A \cap B^c$. So 
no element of $A \cap B$ is in $A \cap B^c$, hence the two events are disjoint. To prove that $A = (A \cap B) \cup (A \cap B^c)$, 
we shall show that each side is a subset of the other side. First, let $x \in A$. Either $x \in B$ or $x \in B^c$. If 
$x \in B$, then $x \in A \cap B$. If $x \in B^c$, then $x \in A \cap B^c$. Either way, $x \in (A \cap B) \cup (A \cap B^c)$. So every 
element of $A$ is an element of $(A \cap B) \cup (A \cap B^c)$ and we conclude that $A \subset (A \cap B) \cup (A \cap B^c)$. Finally, 
let $x \in (A \cap B) \cup (A \cap B^c)$. Then either $x \in A \cap B$, in which case $x \in A$, or $x \in A \cap B^c$, in which 
case $x \in A$. Either way $x \in A$, so every element of $(A \cap B) \cup (A \cap B^c)$ is also an element of $A$ and 
$(A \cap B) \cup (A \cap B^c) \subset A$. 


5. To prove the first result, let $x \in (\cup_i A_i)^c$. This means that $x$ is not in $\cup_i A_i$. In other words, for every 
$i \in I$, $x$ is not in $A_i$. Hence for every $i \in I$, $x \in A_i^c$. So $x \in \cap_i A_i^c$. This proves that $(\cup_i A_i)^c \subset \cap_i A_i^c$. 
Next, suppose that $x \in \cap_i A_i^c$. Then $x \in A_i^c$ for every $i \in I$. So for every $i \in I$, $x$ is not in $A_i$. So $x$ 
can’t be in $\cup_i A_i$. Hence $x \in (\cup_i A_i)^c$. This shows that $\cap_i A_i^c \subset (\cup_i A_i)^c$. The second result follows from 
the first by applying the first result to $A_i^c$ for $i \in I$ and then taking complements of both sides. 


6. (a) Blue card numbered 2 or 4. 
(b) Blue card numbered 5, 6, 7, 8, 9, or 10. 
(c) Any blue card or a red card numbered 1, 2, 3, 4, 6, 8, or 10. 
(d) Blue card numbered 2, 4, 6, 8, or 10, or red card numbered 2 or 4. 
(e) Red card numbered 5, 7, or 9. 


7. (a) These are the points not in $A$, hence they must be either below 1 or above 5. That is, $A^c = \{x : 
x < 1 \text{ or } x > 5\}$. 

(b) These are the points in either $A$ or $B$ or both. So they must be between 1 and 5 or between 3 and 
7. That is, $A \cup B = \{x : 1 \le x \le 7\}$. 

(c) These are the points in $B$ but not in $C$. That is, $BC^c = \{x : 3 < x \le 7\}$. (Note that $B \subset C^c$.) 

(d) These are the points in none of the three sets, namely $A^cB^cC^c = \{x : 0 < x < 1 \text{ or } x > 7\}$. 

(e) These are the points in the answer to part (b) and in $C$. There are no such values and $(A \cup B)C = \emptyset$. 


8. Blood type A reacts only with anti-A, so type A blood corresponds to $A \cap B^c$. Type B blood reacts
only with anti-B, so type B blood corresponds to $A^c B$. Type AB blood reacts with both, so $A \cap B$
characterizes type AB blood. Finally, type O reacts with neither antigen, so type O blood corresponds
to the event $A^c B^c$.


9. (a) For each $n$, $B_n = B_{n+1} \cup A_n$, hence $B_n \supset B_{n+1}$ for all $n$. For each $n$, $C_{n+1} \cap A_n = C_n$, so
$C_n \subset C_{n+1}$.

(b) Suppose that $x \in \bigcap_{n=1}^{\infty} B_n$. Then $x \in B_n$ for all $n$. That is, $x \in \bigcup_{i=n}^{\infty} A_i$ for all $n$. For $n = 1$, there
exists $i \ge n$ such that $x \in A_i$. Assume to the contrary that there are at most finitely many $i$ such
that $x \in A_i$. Let $m$ be the largest such $i$. For $n = m + 1$, we know that there is $i \ge n$ such that
$x \in A_i$. This contradicts $m$ being the largest $i$ such that $x \in A_i$. Hence, $x$ is in infinitely many
$A_i$. For the other direction, assume that $x$ is in infinitely many $A_i$. Then, for every $n$, there is a
value of $j > n$ such that $x \in A_j$, hence $x \in \bigcup_{i=n}^{\infty} A_i = B_n$ for every $n$ and $x \in \bigcap_{n=1}^{\infty} B_n$.

(c) Suppose that $x \in \bigcup_{n=1}^{\infty} C_n$. That is, there exists $n$ such that $x \in C_n = \bigcap_{i=n}^{\infty} A_i$, so $x \in A_i$ for
all $i \ge n$. So, there are at most finitely many $i$ (a subset of $1, \ldots, n-1$) such that $x \notin A_i$. Finally,
suppose that $x \in A_i$ for all but finitely many $i$. Let $k$ be the last $i$ such that $x \notin A_i$. Then $x \in A_i$
for all $i \ge k + 1$, hence $x \in \bigcap_{i=k+1}^{\infty} A_i = C_{k+1}$. Hence $x \in \bigcup_{n=1}^{\infty} C_n$.
10. (a) All three dice show even numbers if and only if all three of $A$, $B$, and $C$ occur. So, the event is
$A \cap B \cap C$.

(b) None of the three dice show even numbers if and only if all three of $A^c$, $B^c$, and $C^c$ occur. So, the
event is $A^c \cap B^c \cap C^c$.

(c) At least one die shows an odd number if and only if at least one of $A^c$, $B^c$, and $C^c$ occur. So, the
event is $A^c \cup B^c \cup C^c$.

(d) At most two dice show odd numbers if and only if at least one die shows an even number, so
the event is $A \cup B \cup C$. This can also be expressed as the union of the three events of the form
$A \cap B \cap C^c$ where exactly one die shows odd together with the three events of the form $A \cap B^c \cap C^c$
where exactly two dice show odd together with the event $A \cap B \cap C$ where no dice show odd.

(e) We can enumerate all the sums that are no greater than 5: 1+1+1, 2+1+1, 1+2+1, 1+1+2,
2+2+1, 2+1+2, and 1+2+2. The first of these corresponds to the event $A_1 \cap B_1 \cap C_1$, the
second to $A_2 \cap B_1 \cap C_1$, etc. The union of the seven such events is what is requested, namely

$(A_1 \cap B_1 \cap C_1) \cup (A_2 \cap B_1 \cap C_1) \cup (A_1 \cap B_2 \cap C_1) \cup (A_1 \cap B_1 \cap C_2) \cup (A_2 \cap B_2 \cap C_1) \cup (A_2 \cap B_1 \cap C_2) \cup (A_1 \cap B_2 \cap C_2)$.


11. (a) All of the events mentioned can be determined by knowing the voltages of the two subcells. Hence
the following set can serve as a sample space

$$S = \{(x, y) : 0 \le x \le 5 \text{ and } 0 \le y \le 5\},$$

where the first coordinate is the voltage of the first subcell and the second coordinate is the voltage
of the second subcell. Any more complicated set from which these two voltages can be determined
could serve as the sample space, so long as each outcome could at least hypothetically be learned.

(b) The power cell is functional if and only if the sum of the voltages is at least 6. Hence, $A = \{(x, y) \in
S : x + y \ge 6\}$. It is clear that $B = \{(x, y) \in S : x = y\}$ and $C = \{(x, y) \in S : x > y\}$. The
power cell is not functional if and only if the sum of the voltages is less than 6. It needs less than
one volt to be functional if and only if the sum of the voltages is greater than 5. The intersection
of these two is the event $D = \{(x, y) \in S : 5 < x + y < 6\}$. The restriction "$\in S$" that appears
in each of these descriptions guarantees that the set is a subset of $S$. One could leave off this
restriction and add the two restrictions $0 \le x \le 5$ and $0 \le y \le 5$ to each set.

(c) The description can be worded as "the power cell is not functional, and needs at least one more
volt to be functional, and both subcells have the same voltage." This is the intersection of $A^c$, $D^c$,
and $B$. That is, $A^c \cap D^c \cap B$. The part of $D^c$ in which $x + y \ge 6$ is not part of this set because of
the intersection with $A^c$.

(d) We need the intersection of $A^c$ (not functional) with $C^c$ (second subcell at least as big as first) and
with $B^c$ (subcells are not the same). In particular, $C^c \cap B^c$ is the event that the second subcell is
strictly higher than the first. So, the event is $A^c \cap B^c \cap C^c$.


1.5 The Definition of Probability 


Solutions to Exercises 
1. Define the following events:
A = {the selected ball is red},
B = {the selected ball is white},
C = {the selected ball is either blue, yellow, or green}.
We are asked to find $\Pr(C)$. The three events $A$, $B$, and $C$ are disjoint and $A \cup B \cup C = S$. So
$1 = \Pr(A) + \Pr(B) + \Pr(C)$. We are told that $\Pr(A) = 1/5$ and $\Pr(B) = 2/5$. It follows that
$\Pr(C) = 2/5$.


2. Let $B$ be the event that a boy is selected, and let $G$ be the event that a girl is selected. We are told
that $B \cup G = S$, so $G = B^c$. Since $\Pr(B) = 0.3$, it follows that $\Pr(G) = 0.7$.


3. (a) If $A$ and $B$ are disjoint then $B \subset A^c$ and $BA^c = B$, so $\Pr(BA^c) = \Pr(B) = 1/2$.

(b) If $A \subset B$, then $B = A \cup (BA^c)$ with $A$ and $BA^c$ disjoint. So $\Pr(B) = \Pr(A) + \Pr(BA^c)$. That is,
$1/2 = 1/3 + \Pr(BA^c)$, so $\Pr(BA^c) = 1/6$.

(c) According to Theorem 1.4.11, $B = (BA) \cup (BA^c)$. Also, $BA$ and $BA^c$ are disjoint so $\Pr(B) =
\Pr(BA) + \Pr(BA^c)$. That is, $1/2 = 1/8 + \Pr(BA^c)$, so $\Pr(BA^c) = 3/8$.


4. Let $E_1$ be the event that student A fails and let $E_2$ be the event that student B fails. We want
$\Pr(E_1 \cup E_2)$. We are told that $\Pr(E_1) = 0.5$, $\Pr(E_2) = 0.2$, and $\Pr(E_1 E_2) = 0.1$. According to
Theorem 1.5.7, $\Pr(E_1 \cup E_2) = 0.5 + 0.2 - 0.1 = 0.6$.


5. Using the same notation as in Exercise 4, we now want $\Pr(E_1^c \cap E_2^c)$. According to Theorems 1.4.9
and 1.5.3, this equals $1 - \Pr(E_1 \cup E_2) = 0.4$.


6. Using the same notation as in Exercise 4, we now want $\Pr([E_1 \cap E_2^c] \cup [E_1^c \cap E_2])$. These two events are
disjoint, so
$$\Pr([E_1 \cap E_2^c] \cup [E_1^c \cap E_2]) = \Pr(E_1 \cap E_2^c) + \Pr(E_1^c \cap E_2).$$
Use the reasoning from part (c) of Exercise 3 above to conclude that
$$\Pr(E_1 \cap E_2^c) = \Pr(E_1) - \Pr(E_1 \cap E_2) = 0.4,$$
$$\Pr(E_1^c \cap E_2) = \Pr(E_2) - \Pr(E_1 \cap E_2) = 0.1.$$
It follows that the probability we want is 0.5.


7. Rearranging terms in Eq. (1.5.1) of the text, we get

$$\Pr(A \cap B) = \Pr(A) + \Pr(B) - \Pr(A \cup B) = 0.4 + 0.7 - \Pr(A \cup B) = 1.1 - \Pr(A \cup B).$$

So $\Pr(A \cap B)$ is largest when $\Pr(A \cup B)$ is smallest and vice versa. The smallest possible value for
$\Pr(A \cup B)$ occurs when one of the events is a subset of the other. In the present exercise this could only
happen if $A \subset B$, in which case $\Pr(A \cup B) = \Pr(B) = 0.7$, and $\Pr(A \cap B) = 0.4$. The largest possible
value of $\Pr(A \cup B)$ occurs when either $A$ and $B$ are disjoint or when $A \cup B = S$. The former is not
possible since the probabilities are too large, but the latter is possible. In this case $\Pr(A \cup B) = 1$ and
$\Pr(A \cap B) = 0.1$.
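The reasoning in Exercise 7 gives general bounds on the probability of an intersection: it can be no larger than the smaller of the two probabilities, and no smaller than $\Pr(A) + \Pr(B) - 1$ (or 0 when that quantity is negative). A minimal sketch follows; the function name `intersection_bounds` is an illustrative choice, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def intersection_bounds(p_a, p_b):
    """Return (smallest, largest) possible values of Pr(A n B).

    The largest value arises when one event contains the other;
    the smallest arises when Pr(A u B) is as large as possible (at most 1).
    """
    smallest = max(Fraction(0), p_a + p_b - 1)
    largest = min(p_a, p_b)
    return smallest, largest
```

With `Fraction(2, 5)` and `Fraction(7, 10)` as inputs, the function returns the two extreme values found in the solution, 1/10 and 2/5.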


8. Let $A$ be the event that a randomly selected family subscribes to the morning paper, and let $B$ be the
event that a randomly selected family subscribes to the afternoon paper. We are told that $\Pr(A) = 0.5$,
$\Pr(B) = 0.65$, and $\Pr(A \cup B) = 0.85$. We are asked to find $\Pr(A \cap B)$. Using Theorem 1.5.7 in the text
we obtain

$$\Pr(A \cap B) = \Pr(A) + \Pr(B) - \Pr(A \cup B) = 0.5 + 0.65 - 0.85 = 0.3.$$
9. The required probability is

$$\Pr(A \cap B^c) + \Pr(A^c B) = [\Pr(A) - \Pr(A \cap B)] + [\Pr(B) - \Pr(A \cap B)] = \Pr(A) + \Pr(B) - 2\Pr(A \cap B).$$


10. Theorem 1.4.11 says that $A = (A \cap B) \cup (A \cap B^c)$. Clearly the two events $A \cap B$ and $A \cap B^c$ are disjoint.
It follows from Theorem 1.5.6 that $\Pr(A) = \Pr(A \cap B) + \Pr(A \cap B^c)$.


11. (a) The set of points for which $(x - 1/2)^2 + (y - 1/2)^2 < 1/4$ is the interior of a circle that is contained
in the unit square. (Its center is $(1/2, 1/2)$ and its radius is $1/2$.) The area of this circle is $\pi/4$, so
the area of the remaining region (what we want) is $1 - \pi/4$.

(b) We need the area of the region between the two lines $y = 1/2 - x$ and $y = 3/2 - x$. The remaining
area is the union of two right triangles with base and height both equal to 1/2. Each triangle has
area 1/8, so the region between the two lines has area $1 - 2/8 = 3/4$.

(c) We can use calculus to do this. We want the area under the curve $y = 1 - x^2$ between $x = 0$ and
$x = 1$. This equals

$$\int_0^1 (1 - x^2)\,dx = \left[ x - \frac{x^3}{3} \right]_{x=0}^{1} = \frac{2}{3}.$$

(d) The area of a line is 0, so the probability of a line segment is 0.


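12. The events $B_1, B_2, \ldots$ are disjoint, because the event $B_1$ contains the points in $A_1$, the event $B_2$ contains
the points in $A_2$ but not in $A_1$, the event $B_3$ contains the points in $A_3$ but not in $A_1$ or $A_2$, etc. By
this same reasoning, it is seen that $\bigcup_{i=1}^{n} A_i = \bigcup_{i=1}^{n} B_i$ and $\bigcup_{i=1}^{\infty} A_i = \bigcup_{i=1}^{\infty} B_i$. Therefore,

$$\Pr\left(\bigcup_{i=1}^{n} A_i\right) = \Pr\left(\bigcup_{i=1}^{n} B_i\right)$$

and

$$\Pr\left(\bigcup_{i=1}^{\infty} A_i\right) = \Pr\left(\bigcup_{i=1}^{\infty} B_i\right).$$

However, since the events $B_1, B_2, \ldots$ are disjoint,

$$\Pr\left(\bigcup_{i=1}^{n} B_i\right) = \sum_{i=1}^{n} \Pr(B_i)$$

and

$$\Pr\left(\bigcup_{i=1}^{\infty} B_i\right) = \sum_{i=1}^{\infty} \Pr(B_i).$$

Furthermore, from the definition of the events $B_1, \ldots, B_n$ it is seen that $B_i \subset A_i$ for $i = 1, \ldots, n$.
Therefore, by Theorem 1.5.4, $\Pr(B_i) \le \Pr(A_i)$ for $i = 1, \ldots, n$. It now follows that

$$\Pr\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} \Pr(B_i) \le \sum_{i=1}^{n} \Pr(A_i).$$

(Of course, if the events $A_1, \ldots, A_n$ are disjoint, there is equality in this relation.)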


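For the second part, apply the first part with $A_i$ replaced by $A_i^c$ for $i = 1, \ldots, n$. We get

$$\Pr\left(\bigcup_{i=1}^{n} A_i^c\right) \le \sum_{i=1}^{n} \Pr(A_i^c). \qquad \text{(S.1.1)}$$

Exercise 5 in Sec. 1.4 says that the left side of (S.1.1) is $\Pr\left(\left[\bigcap_{i=1}^{n} A_i\right]^c\right)$. Theorem 1.5.3 says that this last
probability is $1 - \Pr\left(\bigcap_{i=1}^{n} A_i\right)$. Hence, we can rewrite (S.1.1) as

$$1 - \Pr\left(\bigcap_{i=1}^{n} A_i\right) \le \sum_{i=1}^{n} \Pr(A_i^c).$$

Finally, taking one minus both sides of the above inequality (which reverses the inequality) produces
the desired result.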


14. First, note that the probability of type AB blood is $1 - (0.5 + 0.34 + 0.12) = 0.04$ by using Theorems 1.5.2
and 1.5.3.


(a) The probability of blood reacting to anti-A is the probability that the blood is either type A or 
type AB. Since these are disjoint events, the probability is the sum of the two probabilities, namely 
0.34 + 0.04 = 0.38. Similarly, the probability of reacting with anti-B is the probability of being 
either type B or type AB, 0.12 + 0.04 = 0.16. 


(b) The probability that both antigens react is the probability of type AB blood, namely 0.04. 


1.6 Finite Sample Spaces 


Solutions to Exercises 


1. The safe way to obtain the answer at this stage of our development is to count that 18 of the 36
outcomes in the sample space yield an odd sum. Another way to solve the problem is to note that
regardless of what number appears on the first die, there are three numbers on the second die that will
yield an odd sum and three numbers that will yield an even sum. Either way the probability is 1/2.


2. The event whose probability we want is the complement of the event in Exercise 1, so the probability
is also 1/2.


3. The only differences greater than or equal to 3 that are available are 3, 4, and 5. These large differences
only occur for the six outcomes in the upper right and the six outcomes in the lower left of the array
in Example 1.6.5 of the text. So the probability we want is $1 - 12/36 = 2/3$.
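The counts used in Exercises 1 to 3 can be confirmed by listing the 36 equally likely outcomes directly. This is a small illustrative sketch; the helper name `probability` and the three variable names are chosen here and do not come from the text.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes (first die, second die), assumed equally likely.
OUTCOMES = list(product(range(1, 7), repeat=2))


def probability(event):
    """Probability of `event`, a function mapping an outcome to True/False."""
    favorable = sum(1 for outcome in OUTCOMES if event(outcome))
    return Fraction(favorable, len(OUTCOMES))


p_odd_sum = probability(lambda o: (o[0] + o[1]) % 2 == 1)    # Exercise 1
p_even_sum = probability(lambda o: (o[0] + o[1]) % 2 == 0)   # Exercise 2
p_small_diff = probability(lambda o: abs(o[0] - o[1]) < 3)   # Exercise 3
```

The same helper works for any event that can be written as a condition on the pair of dice.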


4. Let $x$ be the proportion of the school in grade 3 (the same as grades 2-6). Then $2x$ is the proportion in
grade 1 and $1 = 2x + 5x = 7x$. So $x = 1/7$, which is the probability that a randomly selected student
will be in grade 3.


5. The probability of being in an odd-numbered grade is $2x + x + x = 4x = 4/7$.


6. Assume that all eight possible combinations of faces are equally likely. Only two of those combinations
have all three faces the same, so the probability is 1/4.


7. The possible genotypes of the offspring are aa and Aa, since one parent will definitely contribute an
a, while the other can contribute either A or a. Since the parent who is Aa contributes each possible
allele with probability 1/2 each, the probabilities of the two possible offspring are each 1/2 as well.


8. (a) The sample space contains 12 outcomes: (Head, 1), (Tail, 1), (Head, 2), (Tail, 2), etc.

(b) Assume that all 12 outcomes are equally likely. Three of the outcomes have Head and an odd
number, so the probability is 1/4.


1.7 Counting Methods 


Commentary 


If you wish to stress computer evaluation of probabilities, then there are programs for computing factorials 
and log-factorials. For example, in the statistical software R, there are functions factorial and lfactorial 
that compute these. If you cover Stirling’s formula (Theorem 1.7.5), you can use these functions to illustrate 
the closeness of the approximation. 
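For readers working outside R, a comparable illustration can be sketched in Python, where `math.lgamma(n + 1)` plays the role of `lfactorial(n)`. The function names below are illustrative choices, and the approximation used is the standard form of Stirling's formula, $(2\pi)^{1/2} n^{n+1/2} e^{-n}$.

```python
import math


def log_factorial(n):
    """Natural log of n!, computed through the log-gamma function."""
    return math.lgamma(n + 1)


def log_stirling(n):
    """Natural log of Stirling's approximation (2*pi)^(1/2) * n^(n+1/2) * e^(-n)."""
    return 0.5 * math.log(2 * math.pi) + (n + 0.5) * math.log(n) - n


def stirling_ratio(n):
    """Ratio of Stirling's approximation to n!; it approaches 1 as n grows."""
    return math.exp(log_stirling(n) - log_factorial(n))
```

Working on the log scale avoids overflow, so the ratio can be examined even for values of n whose factorial is far too large to store directly.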


Solutions to Exercises 


1. Each pair of starting day and leap year/no leap year designation determines a calendar, and each
calendar corresponds to exactly one such pair. Since there are seven days and two designations, there
are a total of $7 \times 2 = 14$ different calendars.


2. There are 20 ways to choose the student from the first class, and no matter which is chosen, there are 18
ways to choose the student from the second class. No matter which two students are chosen from the first
two classes, there are 25 ways to choose the student from the third class. The multiplication rule can be
applied to conclude that the total number of ways to choose the three members is $20 \times 18 \times 25 = 9000$.


3. This is a simple matter of permutations of five distinct items, so there are $5! = 120$ ways.


4. There are six different possible shirts, and no matter what shirt is picked, there are four different slacks.
So there are 24 different combinations.


5. Let the sample space consist of all four-tuples of dice rolls. There are $6^4 = 1296$ possible outcomes.
The outcomes with all four rolls different consist of all of the permutations of six items taken four at a
time. There are $P_{6,4} = 360$ of these outcomes. So the probability we want is $360/1296 = 5/18$.


6. With six rolls, there are $6^6 = 46656$ possible outcomes. The outcomes with all different rolls are
the permutations of six distinct items. There are $6! = 720$ outcomes in the event of interest, so the
probability is $720/46656 = 0.01543$.


7. There are $20^{12}$ possible outcomes in the sample space. If the 12 balls are to be thrown into different
boxes, the first ball can be thrown into any one of the 20 boxes, the second ball can then be thrown
into any one of the other 19 boxes, etc. Thus, there are $20 \cdot 19 \cdot 18 \cdots 9$ possible outcomes in the event.
So the probability is $20!/[8!\,20^{12}]$.
8. There are $7^5$ possible outcomes in the sample space. If the five passengers are to get off at different
floors, the first passenger can get off at any one of the seven floors, the second passenger can then get
off at any one of the other six floors, etc. Thus, the probability is

$$\frac{7 \cdot 6 \cdot 5 \cdot 4 \cdot 3}{7^5} = \frac{360}{2401}.$$
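Exercises 5 through 8 all follow one pattern: the probability that k independent, equally likely choices among n categories are all different is $P_{n,k}/n^k$. A short sketch of that pattern is given below; the function name `prob_all_distinct` is a hypothetical label, and `math.perm` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import factorial, perm


def prob_all_distinct(n_categories, n_trials):
    """Probability that n_trials equally likely choices among n_categories all differ.

    The numerator is the number of permutations of n_categories items taken
    n_trials at a time; the denominator counts all ordered outcomes.
    """
    return Fraction(perm(n_categories, n_trials), n_categories ** n_trials)
```

For instance, `prob_all_distinct(6, 4)` corresponds to Exercise 5 and `prob_all_distinct(7, 5)` to Exercise 8.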


9. There are 6! possible arrangements in which the six runners can finish the race. If the three runners
from team A finish in the first three positions, there are 3! arrangements of these three runners among
these three positions and there are also 3! arrangements of the three runners from team B among the
last three positions. Therefore, there are $3! \times 3!$ arrangements in which the runners from team A
finish in the first three positions and the runners from team B finish in the last three positions. Thus,
the probability is $(3!\,3!)/6! = 1/20$.


10. We can imagine that the 100 balls are randomly ordered in a list, and then drawn in that order. Thus,
the required probability in part (a), (b), or (c) of this exercise is simply the probability that the first,
fiftieth, or last ball in the list is red. Each of these probabilities is the same, $r/100$, because of the random
order of the list.


11. In terms of factorials, $P_{n,k} = n!/[k!(n - k)!]$. Since we are assuming that $n$ and $n - k$ are large, we
can use Stirling's formula to approximate both of them. The approximation to $n!$ is $(2\pi)^{1/2} n^{n+1/2} e^{-n}$,
and the approximation to $(n - k)!$ is $(2\pi)^{1/2} (n - k)^{n-k+1/2} e^{-n+k}$. The approximation to the ratio is
the ratio of the approximations because the ratio of each approximation to its corresponding factorial
converges to 1. That is,

$$\frac{n!}{k!(n-k)!} \approx \frac{(2\pi)^{1/2} n^{n+1/2} e^{-n}}{k!\,(2\pi)^{1/2} (n-k)^{n-k+1/2} e^{-n+k}} = \frac{e^{-k} n^k}{k!} \left(1 - \frac{k}{n}\right)^{-n+k-1/2}.$$

Further simplification is available if one assumes that $k$ is small compared to $n$, that is, $k/n \approx 0$. In this
case, the last factor is approximately $e^k$, and the whole approximation simplifies to $n^k/k!$. This makes
sense because, if $n/(n - k)$ is essentially 1, then the product of the $k$ largest factors in $n!$ is essentially
$n^k$.


1.8 Combinatorial Methods 


Commentary 


This section ends with an extended example called “The Tennis Tournament”. This is an application of 
combinatorics that uses a slightly subtle line of reasoning. 


Solutions to Exercises 


1. We have to assign 10 houses to one pollster, and the other pollster will get to canvass the other 10
houses. Hence, the number of assignments is the number of combinations of 20 items taken 10 at a
time, $\binom{20}{10} = 184{,}756$.


2. The ratio of $\binom{93}{30}$ to $\binom{93}{31}$ is $31/63 < 1$, so $\binom{93}{31}$ is larger.
3. Since 93 = 63 + 30, the two numbers are the same.


4. Let the sample space consist of all subsets (not ordered tuples) of four of the 24 bulbs in the box. There are
$\binom{24}{4} = 10626$ such subsets. There is only one subset that has all four defectives, so the probability we
want is 1/10626.


5. The number is $\frac{4251!}{97!\,4154!} = \binom{4251}{97}$, an integer.


6. There are $\binom{n}{2}$ possible pairs of seats that A and B can occupy. Of these pairs, $n - 1$ pairs comprise
two adjacent seats. Therefore, the probability is $\frac{n-1}{\binom{n}{2}} = \frac{2}{n}$.


7. There are $\binom{n}{k}$ possible sets of $k$ seats to be occupied, and they are all equally likely. There are $n - k + 1$
sets of $k$ adjacent seats, so the probability we want is

$$\frac{n-k+1}{\binom{n}{k}} = \frac{(n-k+1)!\,k!}{n!}.$$


8. There are $\binom{n}{k}$ possible sets of $k$ seats to be occupied, and they are all equally likely. Because the circle
has no start or end, there are $n$ sets of $k$ adjacent seats, so the probability we want is

$$\frac{n}{\binom{n}{k}} = \frac{(n-k)!\,k!}{(n-1)!}.$$


9. This problem is slightly tricky. The total number of ways of choosing the $n$ seats that will be occupied
by the $n$ people is $\binom{2n}{n}$. Offhand, it would seem that there are only two ways of choosing these seats
so that no two adjacent seats are occupied, namely:
XOXO...XO and OXOX...OX

Upon further consideration, however, $n - 1$ more ways can be found, namely:
XOOXOX...OX, XOXOOXOX...OX, etc.

Therefore, the total number of ways of choosing the seats so that no two adjacent seats are occupied is
$n + 1$. The probability is $(n + 1)/\binom{2n}{n}$.
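Because the count of $n + 1$ in Exercise 9 is easy to get wrong, a brute-force check for small $n$ may be reassuring. The sketch below assumes the $2n$ seats are in a row and simply enumerates every choice of $n$ seats; the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def count_nonadjacent(n):
    """Count choices of n seats out of 2n in a row with no two chosen seats adjacent."""
    count = 0
    for seats in combinations(range(2 * n), n):
        # `seats` is in increasing order, so only consecutive entries need comparing.
        if all(b - a >= 2 for a, b in zip(seats, seats[1:])):
            count += 1
    return count


def prob_nonadjacent(n):
    """Probability that no two of the n occupied seats are adjacent."""
    return Fraction(count_nonadjacent(n), comb(2 * n, n))
```

The enumeration grows quickly with $n$, so it is meant only for small cases.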
10. We shall let the sample space consist of all subsets (unordered) of 10 out of the 24 light bulbs in the
box. There are $\binom{24}{10}$ such subsets. The number of subsets that contain the two defective bulbs is the
number of subsets of size 8 out of the other 22 bulbs, $\binom{22}{8}$, so the probability we want is

$$\frac{\binom{22}{8}}{\binom{24}{10}} = \frac{10 \times 9}{24 \times 23} = 0.1630.$$


11. This exercise is similar to Exercise 10. Let the sample space consist of all subsets (unordered) of 12 out
of the 100 people in the group. There are $\binom{100}{12}$ such subsets. The number of subsets that contain A
and B is the number of subsets of size 10 out of the other 98 people, $\binom{98}{10}$, so the probability we want
is

$$\frac{\binom{98}{10}}{\binom{100}{12}} = \frac{12 \times 11}{100 \times 99} = 0.01333.$$
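Exercises 10 and 11 share the same structure: a random subset of a given size must contain a fixed set of designated items. A brief sketch of that calculation follows; the function name and argument names are illustrative rather than taken from the text.

```python
from fractions import Fraction
from math import comb


def prob_contains_designated(population, sample_size, designated):
    """Probability that a random subset of `sample_size` items out of `population`
    contains all of `designated` specified items.

    Once the designated items are included, the rest of the subset is chosen
    from the remaining population - designated items.
    """
    favorable = comb(population - designated, sample_size - designated)
    return Fraction(favorable, comb(population, sample_size))
```

Here `prob_contains_designated(24, 10, 2)` reproduces Exercise 10 and `prob_contains_designated(100, 12, 2)` reproduces Exercise 11.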


12. There are $\binom{35}{10}$ ways of dividing the group into the two teams. As in Exercise 11, the number of ways
of choosing the 10 players for the first team so as to include both A and B is $\binom{33}{8}$. The number of
ways of choosing the 10 players for this team so as not to include either A or B (A and B will then be
together on the other team) is $\binom{33}{10}$. The probability we want is then

$$\frac{\binom{33}{8} + \binom{33}{10}}{\binom{35}{10}} = \frac{10 \times 9 + 25 \times 24}{35 \times 34} = 0.5798.$$


13. This exercise is similar to Exercise 12. Here, we want four designated bulbs to be in the same group. 
The probability is 


bn) 


14.
$$\binom{n}{k} + \binom{n}{k-1} = \frac{n!}{k!(n-k)!} + \frac{n!}{(k-1)!(n-k+1)!} = \frac{n!}{(k-1)!(n-k)!} \left( \frac{1}{k} + \frac{1}{n-k+1} \right)$$
$$= \frac{n!}{(k-1)!(n-k)!} \cdot \frac{n+1}{k(n-k+1)} = \frac{(n+1)!}{k!(n-k+1)!} = \binom{n+1}{k}.$$


15. (a) If we express $2^n$ as $(1 + 1)^n$ and expand $(1 + 1)^n$ by the binomial theorem, we obtain the desired
result.

(b) If we express 0 as $(1 - 1)^n$ and expand $(1 - 1)^n$ by the binomial theorem, we obtain the desired
result.

16. (a) It is easier to calculate first the probability that the committee will not contain either of the two
senators from the designated state. This probability is $\binom{98}{8} / \binom{100}{8}$. Thus, the final answer is

$$1 - \frac{\binom{98}{8}}{\binom{100}{8}} \approx 1 - 0.8457 = 0.1543.$$

(b) There are $\binom{100}{50}$ combinations that might be chosen. If the group is to contain one senator from
each state, then there are two possible choices for each of the fifty states. Hence, the number of
possible combinations containing one senator from each state is $2^{50}$.


17. Call the four players A, B, C, and D. The number of ways of choosing the positions in the deck that
will be occupied by the four aces is $\binom{52}{4}$. Since player A will receive 13 cards, the number of ways
of choosing the positions in the deck for the four aces so that all of them will be received by player
A is $\binom{13}{4}$. Similarly, since player B will receive 13 other cards, the number of ways of choosing the
positions for the four aces so that all of them will be received by player B is $\binom{13}{4}$. A similar result is
true for each of the other players. Therefore, the total number of ways of choosing the positions in the
deck for the four aces so that all of them will be received by the same player is $4\binom{13}{4}$. Thus, the final
probability is $4\binom{13}{4} / \binom{52}{4}$.


18. There are $\binom{100}{10}$ ways of choosing ten mathematics students. There are $\binom{20}{2}$ ways of choosing two
students from a given class of 20 students. Therefore, there are $\binom{20}{2}^5$ ways of choosing two students
from each of the five classes. So, the final answer is $\binom{20}{2}^5 / \binom{100}{10} \approx 0.0143$.


19. From the description of what counts as a collection of customer choices, we see that each collection
consists of a tuple $(m_1, \ldots, m_n)$, where $m_i$ is the number of customers who choose item $i$ for $i = 1, \ldots, n$.
Each $m_i$ must be between 0 and $k$ and $m_1 + \cdots + m_n = k$. Each such tuple is equivalent to a sequence
of $n + k - 1$ 0's and 1's as follows. The first $m_1$ terms are 0 followed by a 1. The next $m_2$ terms are 0
followed by a 1, and so on up to $m_{n-1}$ 0's followed by a 1 and finally $m_n$ 0's. Since $m_1 + \cdots + m_n = k$
and since we are putting exactly $n - 1$ 1's into the sequence, each such sequence has exactly $n + k - 1$
terms. Also, it is clear that each such sequence corresponds to exactly one tuple of customer choices.
The numbers of 0's between successive 1's give the numbers of customers who choose that item, and
the 1's indicate where we switch from one item to the next. So, the number of combinations of choices
is the number of such sequences: $\binom{n+k-1}{k}$.
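The correspondence between tuples of counts and sequences of 0's and 1's can be checked numerically for small cases. The sketch below compares a brute-force count of the tuples with the binomial coefficient $\binom{n+k-1}{k}$; both function names are illustrative.

```python
from itertools import product
from math import comb


def count_choice_tuples(n_items, k_customers):
    """Brute-force count of tuples (m_1, ..., m_n) of nonnegative integers summing to k."""
    return sum(1
               for m in product(range(k_customers + 1), repeat=n_items)
               if sum(m) == k_customers)


def stars_and_bars(n_items, k_customers):
    """Closed-form count: sequences of k 0's and n - 1 1's of length n + k - 1."""
    return comb(n_items + k_customers - 1, k_customers)
```

The brute-force version examines $(k+1)^n$ tuples, so it is practical only for small $n$ and $k$.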


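20. We shall use induction. For $n = 1$, we must prove that

$$x + y = \binom{1}{0} x^0 y^1 + \binom{1}{1} x^1 y^0.$$

Since the right side of this equation is $x + y$, the theorem is true for $n = 1$. Now assume that the
theorem is true for each $n = 1, \ldots, n_0$ for $n_0 \ge 1$. For $n = n_0 + 1$, the theorem says

$$(x + y)^{n_0+1} = \sum_{k=0}^{n_0+1} \binom{n_0+1}{k} x^k y^{n_0+1-k}. \qquad \text{(S.1.2)}$$

Since we have assumed that the theorem is true for $n = n_0$, we know that

$$(x + y)^{n_0} = \sum_{k=0}^{n_0} \binom{n_0}{k} x^k y^{n_0-k}. \qquad \text{(S.1.3)}$$

We shall multiply both sides of (S.1.3) by $x + y$. We then need to prove that $x + y$ times the right side
of (S.1.3) equals the right side of (S.1.2).

$$\begin{aligned}
(x + y)(x + y)^{n_0} &= (x + y) \sum_{k=0}^{n_0} \binom{n_0}{k} x^k y^{n_0-k} \\
&= \sum_{k=0}^{n_0} \binom{n_0}{k} x^{k+1} y^{n_0-k} + \sum_{k=0}^{n_0} \binom{n_0}{k} x^k y^{n_0+1-k} \\
&= \sum_{k=1}^{n_0+1} \binom{n_0}{k-1} x^k y^{n_0+1-k} + \sum_{k=0}^{n_0} \binom{n_0}{k} x^k y^{n_0+1-k} \\
&= y^{n_0+1} + \sum_{k=1}^{n_0} \left[ \binom{n_0}{k-1} + \binom{n_0}{k} \right] x^k y^{n_0+1-k} + x^{n_0+1}.
\end{aligned}$$

Now, apply the result in Exercise 14 to conclude that

$$\binom{n_0}{k-1} + \binom{n_0}{k} = \binom{n_0+1}{k}.$$

This makes the final summation above equal to the right side of (S.1.2).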


21. We are asked for the number of unordered samples with replacement, as constructed in Exercise 19.
Here, $n = 365$, so there are $\binom{365+k-1}{k}$ different unordered sets of $k$ birthdays chosen with replacement
from 1, ..., 365.


22. The approximation to $n!$ is $(2\pi)^{1/2} n^{n+1/2} e^{-n}$, and the approximation to $(n/2)!$ is $(2\pi)^{1/2} (n/2)^{n/2+1/2} e^{-n/2}$.
Then

$$\frac{n!}{[(n/2)!]^2} \approx \frac{(2\pi)^{1/2} n^{n+1/2} e^{-n}}{[(2\pi)^{1/2} (n/2)^{n/2+1/2} e^{-n/2}]^2} = (2\pi)^{-1/2}\, 2^{n+1}\, n^{-1/2}.$$

With $n = 500$, the approximation is $e^{343.24}$, too large to represent on a calculator with only two-digit
exponents. The actual number is about 1/20 of 1% smaller.
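Since the value in Exercise 22 overflows many calculators, it is natural to compare the exact coefficient and its Stirling approximation on the log scale. The sketch below does this with `math.lgamma`; the function names are illustrative, and the closed form used is $(2\pi)^{-1/2} 2^{n+1} n^{-1/2}$.

```python
import math


def log_central_binomial(n):
    """Natural log of n! / [(n/2)!]^2 for even n, computed via log-gamma."""
    return math.lgamma(n + 1) - 2 * math.lgamma(n // 2 + 1)


def log_stirling_central(n):
    """Natural log of the approximation (2*pi)^(-1/2) * 2^(n+1) * n^(-1/2)."""
    return (-0.5 * math.log(2 * math.pi)
            + (n + 1) * math.log(2)
            - 0.5 * math.log(n))
```

Evaluating both functions at `n = 500` and subtracting shows how close the two logarithms are.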


1.9 Multinomial Coefficients 


Commentary 


Multinomial coefficients are useful as a counting method, and they are needed for the definition of the 
multinomial distribution in Sec. 5.9. They are not used much elsewhere in the text. Although this section 
does not have an asterisk, it could be skipped (together with Sec. 5.9) if one were not interested in the 
multinomial distribution or the types of counting arguments that rely on multinomial coefficients. 


Solutions to Exercises 


1. We have three types of elements that need to be assigned to 21 houses so that exactly seven of each
type are assigned. The number of ways to do this is the multinomial coefficient

$$\binom{21}{7, 7, 7} = 399{,}072{,}960.$$


2. We are asked for the number of arrangements of four distinct types of objects with 18 of one type, 12
of the next, 8 of the next and 12 of the last. This is the multinomial coefficient $\binom{50}{18, 12, 8, 12}$.


3. We need to divide the 300 members of the organization into three subsets: the 5 in one committee, the
8 in the second committee, and the 287 in neither committee. There are $\binom{300}{5, 8, 287}$ ways to do this.


4. There are $\binom{10}{3, 3, 2, 1, 1}$ arrangements of the 10 letters of five distinct types. All of them are equally
likely, and only one spells statistics. So, the probability is $1 / \binom{10}{3, 3, 2, 1, 1} = 1/50400$.


5. There are $\binom{n}{n_1, n_2, n_3, n_4, n_5, n_6}$ many ways to arrange $n_j$ $j$'s (for $j = 1, \ldots, 6$) among the $n$ rolls. The
number of possible equally likely rolls is $6^n$. So, the probability we want is $\frac{1}{6^n} \binom{n}{n_1, n_2, n_3, n_4, n_5, n_6}$.


6. There are $6^7$ possible outcomes for the seven dice. If each of the six numbers is to appear at least once
among the seven dice, then one number must appear twice and each of the other five numbers must
appear once. Suppose first that the number 1 appears twice and each of the other numbers appears
once. The number of outcomes of this type in the sample space is equal to the number of different
arrangements of the symbols 1, 1, 2, 3, 4, 5, 6, which is $\frac{7!}{2!\,(1!)^5} = \frac{7!}{2}$. There is an equal number of
outcomes for each of the other five numbers which might appear twice among the seven dice. Therefore,
the total number of outcomes in which each number appears at least once is $\frac{6(7!)}{2}$, and the probability
of this event is

$$\frac{6(7!)}{2(6^7)} = \frac{7!}{2(6^6)}.$$


7. There are $\binom{25}{10, 8, 7}$ ways of distributing the 25 cards to the three players. There are $\binom{12}{6, 2, 4}$ ways
of distributing the 12 red cards to the players so that each receives the designated number of red
cards. There are then $\binom{13}{4, 6, 3}$ ways of distributing the other 13 cards to the players, so that each
receives the designated total number of cards. The product of these last two numbers of ways is,
therefore, the number of ways of distributing the 25 cards to the players so that each receives the
designated number of red cards and the designated total number of cards. So, the final probability is

$$\binom{12}{6, 2, 4} \binom{13}{4, 6, 3} \bigg/ \binom{25}{10, 8, 7}.$$


8. There are $\binom{52}{13, 13, 13, 13}$ ways of distributing the cards to the four players. There are $\binom{12}{3, 3, 3, 3}$
ways of distributing the 12 picture cards so that each player gets three. No matter which of these ways
we choose, there are $\binom{40}{10, 10, 10, 10}$ ways to distribute the remaining 40 nonpicture cards so that each
player gets 10. So, the probability we need is

$$\frac{\binom{12}{3, 3, 3, 3} \binom{40}{10, 10, 10, 10}}{\binom{52}{13, 13, 13, 13}} = \frac{\dfrac{12!}{(3!)^4} \cdot \dfrac{40!}{(10!)^4}}{\dfrac{52!}{(13!)^4}} = 0.0324.$$
9. There are $\binom{52}{13, 13, 13, 13}$ ways of distributing the cards to the four players. Call these four players A,
B, C, and D. There is only one way of distributing the cards so that player A receives all red cards,
player B receives all yellow cards, player C receives all blue cards, and player D receives all green cards.
However, there are 4! ways of assigning the four colors to the four players and therefore there are 4!
ways of distributing the cards so that each player receives 13 cards of the same color. So, the probability
we need is

$$\frac{4!}{\binom{52}{13, 13, 13, 13}} = \frac{4!\,(13!)^4}{52!} \approx 4.474 \times 10^{-28}.$$



10. If we do not distinguish among boys with the same last name, then there are $\binom{9}{2, 3, 4}$ possible arrangements
of the nine boys. We are interested in the probability of a particular one of these arrangements.
So, the probability we need is

$$\frac{1}{\binom{9}{2, 3, 4}} = \frac{2!\,3!\,4!}{9!} \approx 7.937 \times 10^{-4}.$$


11. We shall use induction. Since we have already proven the binomial theorem, we know that the conclusion
to the multinomial theorem is true for every $n$ if $k = 2$. We shall use induction again, but this time
using $k$ instead of $n$. For $k = 2$, we already know the result is true. Suppose that the result is true for
all $k \le k_0$ and for all $n$. For $k = k_0 + 1$ and arbitrary $n$ we must show that

$$(x_1 + \cdots + x_{k_0+1})^n = \sum \binom{n}{n_1, \ldots, n_{k_0+1}} x_1^{n_1} \cdots x_{k_0+1}^{n_{k_0+1}}, \qquad \text{(S.1.4)}$$

where the summation is over all $n_1, \ldots, n_{k_0+1}$ such that $n_1 + \cdots + n_{k_0+1} = n$. Let $y_i = x_i$ for
$i = 1, \ldots, k_0 - 1$ and let $y_{k_0} = x_{k_0} + x_{k_0+1}$. We then have

$$(x_1 + \cdots + x_{k_0+1})^n = (y_1 + \cdots + y_{k_0})^n.$$

Since we have assumed that the theorem is true for $k = k_0$, we know that

$$(y_1 + \cdots + y_{k_0})^n = \sum \binom{n}{m_1, \ldots, m_{k_0}} y_1^{m_1} \cdots y_{k_0}^{m_{k_0}}, \qquad \text{(S.1.5)}$$

where the summation is over all $m_1, \ldots, m_{k_0}$ such that $m_1 + \cdots + m_{k_0} = n$. On the right side of (S.1.5),
substitute $x_{k_0} + x_{k_0+1}$ for $y_{k_0}$ and apply the binomial theorem to obtain

$$\sum \binom{n}{m_1, \ldots, m_{k_0}} x_1^{m_1} \cdots x_{k_0-1}^{m_{k_0-1}} \sum_{i=0}^{m_{k_0}} \binom{m_{k_0}}{i} x_{k_0}^{i}\, x_{k_0+1}^{m_{k_0}-i}. \qquad \text{(S.1.6)}$$

In (S.1.6), let $n_i = m_i$ for $i = 1, \ldots, k_0 - 1$, let $n_{k_0} = i$, and let $n_{k_0+1} = m_{k_0} - i$. Then, in the summation
in (S.1.6), $n_1 + \cdots + n_{k_0+1} = n$ if and only if $m_1 + \cdots + m_{k_0} = n$. Also, note that

$$\binom{n}{m_1, \ldots, m_{k_0}} \binom{m_{k_0}}{i} = \binom{n}{n_1, \ldots, n_{k_0+1}}.$$

So, (S.1.6) becomes

$$\sum \binom{n}{n_1, \ldots, n_{k_0+1}} x_1^{n_1} \cdots x_{k_0+1}^{n_{k_0+1}},$$

where this last sum is over all $n_1, \ldots, n_{k_0+1}$ such that $n_1 + \cdots + n_{k_0+1} = n$.
12. For each element $s'$ of $S'$, the elements of $S$ that lead to boxful $s'$ are all the different sequences of
elements of $s'$. That is, think of each $s'$ as an unordered set of 12 numbers chosen with replacement
from 1 to 7. For example, $\{1,1,2,3,3,3,5,6,7,7,7,7\}$ is one such set. The following are some of
the elements of $S$ that lead to the same set $s'$: $(1,1,2,3,3,3,5,6,7,7,7,7)$, $(1,2,3,5,6,7,1,3,7,3,7,7)$,
$(7,1,7,2,3,5,7,1,6,3,7,3)$. This problem is pretty much the same as that which leads to the definition
of multinomial coefficients. We are looking for the number of orderings of 12 digits chosen from the
numbers 1 to 7 that have two of 1, one of 2, three of 3, none of 4, one of 5, one of 6, and four of 7. This
is just $\binom{12}{2,1,3,0,1,1,4}$. For a general $s'$, for $i = 1, \ldots, 7$, let $n_i(s')$ be the number of $i$'s in the box $s'$. Then
$n_1(s') + \cdots + n_7(s') = 12$, and the number of orderings of these numbers is
\[
N(s') = \binom{12}{n_1(s'), \ldots, n_7(s')}.
\]
The multinomial theorem tells us that
\[
\sum_{\text{all } s'} N(s') = \sum \binom{12}{n_1, \ldots, n_7} 1^{n_1} \cdots 1^{n_7} = 7^{12},
\]
where the sum is over all possible combinations of nonnegative integers $n_1, \ldots, n_7$ that add to 12. This
matches the number of outcomes in $S$.
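
The count above can be checked numerically. The following is a minimal sketch in Python (the helper names `compositions` and `multinomial` are illustrative and not part of the text): it enumerates every way of writing 12 as an ordered sum of 7 nonnegative integers, adds up the corresponding multinomial coefficients, and compares the total with $7^{12}$.

```python
from itertools import combinations
from math import factorial


def compositions(total, parts):
    """Yield every tuple of `parts` nonnegative integers that sums to `total`."""
    # Stars and bars: choose positions of the (parts - 1) bars among
    # (total + parts - 1) slots; the gaps between bars are the counts.
    slots = total + parts - 1
    for bars in combinations(range(slots), parts - 1):
        counts = []
        prev = -1
        for b in bars:
            counts.append(b - prev - 1)
            prev = b
        counts.append(slots - prev - 1)
        yield tuple(counts)


def multinomial(counts):
    """Multinomial coefficient (sum(counts))! / (counts[0]! * counts[1]! * ...)."""
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)
    return result


# Orderings of the example boxful {1,1,2,3,3,3,5,6,7,7,7,7}.
example = multinomial((2, 1, 3, 0, 1, 1, 4))

# Sum of N(s') over all boxfuls s'.
total_orderings = sum(multinomial(c) for c in compositions(12, 7))
```

Under these assumptions `total_orderings` should agree with `7 ** 12`, the number of ordered sequences in $S$.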


1.10 The Probability of a Union of Events 


Commentary 


This section ends with an example of the matching problem. This is an application of the formula for the 
probability of a union of an arbitrary number of events. It requires a long line of argument and contains an 
interesting limiting result. The example will be interesting to students with good mathematics backgrounds, 
but it might be too challenging for students who have struggled to master combinatorics. One can use 
statistical software, such as R, to help illustrate how close the approximation is. The formula (1.10.10) can 
be computed as 

ints=1:n

sum(exp(-lfactorial(ints))*(-1)^(ints+1)),

where n has previously been assigned the value of $n$ for which one wishes to compute $p_n$.
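
For readers who prefer exact arithmetic, the same sum can be sketched in Python. This is only an illustration, assuming that formula (1.10.10) is the alternating sum $p_n = \sum_{i=1}^{n} (-1)^{i+1}/i!$ computed by the R expression above; the function name `matching_probability` is hypothetical.

```python
from fractions import Fraction
from math import exp, factorial


def matching_probability(n):
    """Exact value of p_n = sum_{i=1}^{n} (-1)^(i+1) / i! as a Fraction."""
    return sum(Fraction((-1) ** (i + 1), factorial(i)) for i in range(1, n + 1))


# Distance between p_n and the limiting value 1 - 1/e for a few n.
limit = 1 - exp(-1)
gaps = {n: abs(float(matching_probability(n)) - limit) for n in (3, 4, 10)}
```

The dictionary `gaps` shows how quickly $p_n$ approaches $1 - 1/e \approx 0.63212$ as $n$ grows.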


Solutions to Exercises 


1. Let $A_i$ be the event that person $i$ receives exactly two aces for $i = 1, 2, 3$. We want $\Pr(\cup_{i=1}^{3} A_i)$. We
shall apply Theorem 1.10.1 directly. Let the sample space consist of all permutations of the 52 cards
where the first five cards are dealt to person 1, the second five to person 2, and the third five to person
3. A permutation of 52 cards that leads to the occurrence of event $A_i$ can be constructed as follows.
First, choose which of person $i$'s five locations will receive the two aces. There are $C_{5,2}$ ways to do
this. Next, for each such choice, choose the two aces that will go in these locations, distinguishing the
order in which they are placed. There are $P_{4,2}$ ways to do this. Next, for each of the preceding choices,
choose the locations for the other two aces from among the 47 locations that are not dealt to person $i$,
distinguishing order. There are $P_{47,2}$ ways to do this. Finally, for each of the preceding choices, choose
a permutation of the remaining 48 cards among the remaining 48 locations. There are 48! ways to do
this. Since there are 52! equally likely permutations in the sample space, we have
\[
\Pr(A_i) = \frac{C_{5,2}\, P_{4,2}\, P_{47,2}\, 48!}{52!} = \frac{5!\,4!\,47!\,48!}{2!\,3!\,2!\,45!\,52!} \approx 0.0399.
\]
Careful examination of the expression for $\Pr(A_i)$ reveals that it can also be expressed as
\[
\Pr(A_i) = \frac{\binom{4}{2}\binom{48}{3}}{\binom{52}{5}}.
\]
This expression corresponds to a different, but equally correct, way of describing the sample space in
terms of equally likely outcomes. In particular, the sample space would consist of the different possible
five-card sets that person $i$ could receive without regard to order.


Next, compute $\Pr(A_i A_j)$ for $i \ne j$. There are still $C_{5,2}$ ways to choose the locations for person $i$'s aces
amongst the five cards and for each such choice, there are $P_{4,2}$ ways to choose the two aces in order.
For each of the preceding choices, there are $C_{5,2}$ ways to choose the locations for person $j$'s aces and 2
ways to order the remaining two aces amongst the two locations. For each combination of the preceding
choices, there are 48! ways to arrange the remaining 48 cards in the 48 unassigned locations. Then,
$\Pr(A_i A_j)$ is
\[
\Pr(A_i A_j) = \frac{2\, C_{5,2}^2\, P_{4,2}\, 48!}{52!} = \frac{2 (5!)^2\, 4!\, 48!}{(2!)^3 (3!)^2\, 52!} \approx 3.694 \times 10^{-4}.
\]
Once again, we can rewrite the expression for $\Pr(A_i A_j)$ as
\[
\Pr(A_i A_j) = \frac{\binom{4}{2}\binom{48}{3,3,42}}{\binom{52}{5,5,42}}.
\]
This corresponds to treating the sample space as the set of all pairs of five-card subsets.

Next, notice that it is impossible for all three players to receive two aces, so $\Pr(A_1 A_2 A_3) = 0$. Applying
Theorem 1.10.1, we obtain
\[
\Pr\left(\cup_{i=1}^{3} A_i\right) = 3 \times 0.0399 - 3 \times 3.694 \times 10^{-4} = 0.1186.
\]
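
The arithmetic in this solution can be reproduced with exact fractions. The sketch below is illustrative only: it uses the subset-based expressions for $\Pr(A_i)$ and $\Pr(A_i A_j)$ and a small helper `multinomial` that is not part of the text.

```python
from fractions import Fraction
from math import comb, factorial


def multinomial(n, parts):
    """Multinomial coefficient n! / (parts[0]! * parts[1]! * ...)."""
    assert sum(parts) == n
    result = factorial(n)
    for p in parts:
        result //= factorial(p)
    return result


# Pr(A_i): two of the four aces and three of the 48 other cards in one hand.
pr_single = Fraction(comb(4, 2) * comb(48, 3), comb(52, 5))

# Pr(A_i A_j): choose which two aces go to person i (the rest go to person j),
# then split the 48 non-aces into groups of sizes 3, 3, and 42.
pr_pair = Fraction(comb(4, 2) * multinomial(48, (3, 3, 42)),
                   multinomial(52, (5, 5, 42)))

# Theorem 1.10.1 with Pr(A_1 A_2 A_3) = 0.
pr_union = 3 * pr_single - 3 * pr_pair
```

Evaluating `float(pr_union)` gives a value near 0.1187, in line with the rounded figure obtained above from the rounded intermediate probabilities.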


2. Let $A$, $B$, and $C$ stand for the events that a randomly selected family subscribes to the newspaper
with the same name. Then $\Pr(A \cup B \cup C)$ is the proportion of families that subscribe to at least one
newspaper. According to Theorem 1.10.1, we can express this probability as
\[
\Pr(A) + \Pr(B) + \Pr(C) - \Pr(A \cap B) - \Pr(A \cap C) - \Pr(B \cap C) + \Pr(A \cap B \cap C).
\]
The probabilities in this expression are the proportions of families that subscribe to the various
combinations. These proportions are all stated in the exercise, so the formula yields
\[
\Pr(A \cup B \cup C) = 0.6 + 0.4 + 0.3 - 0.2 - 0.1 - 0.2 + 0.05 = 0.85.
\]


3. As seen from Fig. S.1.1, the required percentage is $P_1 + P_2 + P_3$. From the given values, we have, in
percentages,

Figure S.1.1: Figure for Exercise 3 of Sec. 1.10.

\[
\begin{aligned}
P_7 &= 5, \\
P_4 &= 20 - P_7 = 15, \\
P_5 &= 20 - P_7 = 15, \\
P_6 &= 10 - P_7 = 5, \\
P_1 &= 60 - P_4 - P_6 - P_7 = 35, \\
P_2 &= 40 - P_4 - P_5 - P_7 = 5, \\
P_3 &= 30 - P_5 - P_6 - P_7 = 5.
\end{aligned}
\]
Therefore, $P_1 + P_2 + P_3 = 45$.


4. This is a case of the matching problem with $n = 3$. We are asked to find $p_3$. By Eq. (1.10.10) in the
text, this equals
\[
p_3 = 1 - \frac{1}{2!} + \frac{1}{3!} = \frac{2}{3}.
\]


5. Determine first the probability that at least one guest will receive the proper hat. This probability is
the value $p_n$ specified in the matching problem, with $n = 4$, namely
\[
p_4 = 1 - \frac{1}{2!} + \frac{1}{3!} - \frac{1}{4!} = 1 - \frac{1}{2} + \frac{1}{6} - \frac{1}{24} = \frac{5}{8}.
\]
So, the probability that no guest receives the proper hat is $1 - 5/8 = 3/8$.


6. Let $A_1$ denote the event that no red balls are selected, let $A_2$ denote the event that no white balls
are selected, and let $A_3$ denote the event that no blue balls are selected. The desired probability is
$\Pr(A_1 \cup A_2 \cup A_3)$ and we shall apply Theorem 1.10.1. The event $A_1$ will occur if and only if the ten
selected balls are either white or blue. Since there are 60 white and blue balls, out of a total of 90 balls,
we have $\Pr(A_1) = \binom{60}{10} / \binom{90}{10}$. Similarly, $\Pr(A_2)$ and $\Pr(A_3)$ have the same value. The event $A_1 A_2$
will occur if and only if all ten selected balls are blue. Therefore, $\Pr(A_1 A_2) = \binom{30}{10} / \binom{90}{10}$. Similarly,
$\Pr(A_2 A_3)$ and $\Pr(A_1 A_3)$ have the same value. Finally, the event $A_1 A_2 A_3$ will occur if and only if all
three colors are missing, which is obviously impossible. Therefore, $\Pr(A_1 A_2 A_3) = 0$. When these values
are substituted into Eq. (1.10.1), we obtain the desired probability,
\[
\Pr(A_1 \cup A_2 \cup A_3) = 3\,\frac{\binom{60}{10}}{\binom{90}{10}} - 3\,\frac{\binom{30}{10}}{\binom{90}{10}}.
\]


7. Let $A_1$ denote the event that no student from the freshman class is selected, and let $A_2$, $A_3$, and
$A_4$ denote the corresponding events for the sophomore, junior, and senior classes, respectively. The
probability that at least one student will be selected from each of the four classes is equal to
$1 - \Pr(A_1 \cup A_2 \cup A_3 \cup A_4)$. We shall evaluate $\Pr(A_1 \cup A_2 \cup A_3 \cup A_4)$ by applying Theorem 1.10.2. The event $A_1$
will occur if and only if the 15 selected students are sophomores, juniors, or seniors. Since there are
90 such students out of a total of 100 students, we have $\Pr(A_1) = \binom{90}{15} / \binom{100}{15}$. The values of $\Pr(A_i)$
for $i = 2, 3, 4$ can be obtained in a similar fashion. Next, the event $A_1 A_2$ will occur if and only if the
15 selected students are juniors or seniors. Since there are a total of 70 juniors and seniors, we have
$\Pr(A_1 A_2) = \binom{70}{15} / \binom{100}{15}$. The probability of each of the six events of the form $A_i A_j$ for $i < j$ can be
obtained in this way. Next the event $A_1 A_2 A_3$ will occur if and only if all 15 selected students are seniors.
Therefore, $\Pr(A_1 A_2 A_3) = \binom{40}{15} / \binom{100}{15}$. The probabilities of the events $A_1 A_2 A_4$ and $A_1 A_3 A_4$ can also
be obtained in this way. It should be noted, however, that $\Pr(A_2 A_3 A_4) = 0$ since it is impossible that
all 15 selected students will be freshmen. Finally, the event $A_1 A_2 A_3 A_4$ is also obviously impossible, so
$\Pr(A_1 A_2 A_3 A_4) = 0$. So, the probability we want is
\[
1 - \frac{\binom{90}{15} + \binom{80}{15} + \binom{70}{15} + \binom{60}{15}}{\binom{100}{15}}
+ \frac{\binom{70}{15} + \binom{60}{15} + \binom{50}{15} + \binom{50}{15} + \binom{40}{15} + \binom{30}{15}}{\binom{100}{15}}
- \frac{\binom{40}{15} + \binom{30}{15} + \binom{20}{15}}{\binom{100}{15}}.
\]


8. It is impossible to place exactly $n - 1$ letters in the correct envelopes, because if $n - 1$ letters are placed
correctly, then the $n$th letter must also be placed correctly.

9. Let $p_n = 1 - q_n$. As discussed in the text, $p_{10} < p_{300} < 0.63212 < p_{53} < p_{21}$. Since $p_n$ is smallest for
$n = 10$, then $q_n$ is largest for $n = 10$.


10. There is exactly one outcome in which only letter 1 is placed in the correct envelope, namely the
outcome in which letter 1 is correctly placed, letter 2 is placed in envelope 3, and letter 3 is placed in 
envelope 2. Similarly there is exactly one outcome in which only letter 2 is placed correctly, and one 
in which only letter 3 is placed correctly. Hence, of the 3! = 6 possible outcomes, 3 outcomes yield the 
result that exactly one letter is placed correctly. So, the probability is 3/6 = 1/2. 


11. Consider choosing 5 envelopes at random into which the 5 red letters will be placed. If there are exactly
$r$ red envelopes among the five selected envelopes ($r = 0, 1, \ldots, 5$), then exactly $x = 2r$ envelopes will
contain a card with a matching color. Hence, the only possible values of $x$ are $0, 2, 4, \ldots, 10$. Thus,
for $x = 0, 2, \ldots, 10$ and $r = x/2$, the desired probability is the probability that there are exactly $r$ red
envelopes among the five selected envelopes, which is
\[
\frac{\binom{5}{r}\binom{5}{5-r}}{\binom{10}{5}}.
\]

12. It was shown in the solution of Exercise 12 of Sec. 1.5 that
\[
\Pr\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \Pr(B_i) = \lim_{n\to\infty} \sum_{i=1}^{n} \Pr(B_i) = \lim_{n\to\infty} \Pr\left(\bigcup_{i=1}^{n} B_i\right) = \lim_{n\to\infty} \Pr\left(\bigcup_{i=1}^{n} A_i\right).
\]
However, since $A_1 \subset A_2 \subset \cdots \subset A_n$, it follows that $\bigcup_{i=1}^{n} A_i = A_n$. Hence,
\[
\Pr\left(\bigcup_{i=1}^{\infty} A_i\right) = \lim_{n\to\infty} \Pr(A_n).
\]


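13. We know that
\[
\bigcap_{i=1}^{\infty} A_i = \left(\bigcup_{i=1}^{\infty} A_i^c\right)^c.
\]
Hence,
\[
\Pr\left(\bigcap_{i=1}^{\infty} A_i\right) = 1 - \Pr\left(\bigcup_{i=1}^{\infty} A_i^c\right).
\]
However, since $A_1 \supset A_2 \supset \cdots$, then $A_1^c \subset A_2^c \subset \cdots$. Therefore, by Exercise 12,
\[
\Pr\left(\bigcup_{i=1}^{\infty} A_i^c\right) = \lim_{n\to\infty} \Pr(A_n^c) = \lim_{n\to\infty} [1 - \Pr(A_n)] = 1 - \lim_{n\to\infty} \Pr(A_n).
\]
It now follows that
\[
\Pr\left(\bigcap_{i=1}^{\infty} A_i\right) = \lim_{n\to\infty} \Pr(A_n).
\]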


1.12 Supplementary Exercises 


Solutions to Exercises 


1. No, since both $A$ and $B$ might occur.

2. $\Pr(A^c \cap B^c \cap D^c) = \Pr[(A \cup B \cup D)^c] = 0.3$.

3. \[
\frac{\binom{250}{18}\binom{100}{12}}{\binom{350}{30}}.
\]

4. There are $\binom{20}{10}$ ways of choosing 10 cards from the deck. For $j = 1, \ldots, 5$, there are $\binom{4}{2}$ ways of choosing
two cards with the number $j$. Hence, the answer is
\[
\frac{\binom{4}{2}^5}{\binom{20}{10}}.
\]


5. The region where total utility demand is at least 215 is shaded in Fig. S.1.2. The area of the shaded
region is
\[
\frac{1}{2} \times 135 \times 135 = 9112.5.
\]
The probability is then $9112.5/29204 = 0.3120$.

Figure S.1.2: Region where total utility demand is at least 215 in Exercise 5 of Sec. 1.12.


6. (a) There are $\binom{r+w}{r}$ possible positions that the red balls could occupy in the ordering as they are
drawn. Therefore, the probability that they will be in the first $r$ positions is $1 / \binom{r+w}{r}$.

(b) There are $\binom{r+1}{r}$ ways that the red balls can occupy the first $r + 1$ positions in the ordering.
Therefore, the probability is $\binom{r+1}{r} / \binom{r+w}{r} = (r+1) / \binom{r+w}{r}$.


7. The presence of the blue balls is irrelevant in this problem, since whenever a blue ball is drawn it is 
ignored. Hence, the answer is the same as in part (a) of Exercise 6. 


8. There are $\binom{10}{7}$ ways of choosing the seven envelopes into which the red cards will be placed. There
are $\binom{7}{j}\binom{3}{7-j}$ ways of choosing exactly $j$ red envelopes and $7 - j$ green envelopes. Therefore, the
probability that exactly $j$ red envelopes will contain red cards is
\[
\frac{\binom{7}{j}\binom{3}{7-j}}{\binom{10}{7}} \quad \text{for } j = 4, 5, 6, 7.
\]
But if $j$ red envelopes contain red cards, then $j - 4$ green envelopes must also contain green cards.
Hence, this is also the probability of exactly $k = j + (j - 4) = 2j - 4$ matches.

9. There are $\binom{10}{5}$ ways of choosing the five envelopes into which the red cards will be placed. There
are $\binom{7}{j}\binom{3}{5-j}$ ways of choosing exactly $j$ red envelopes and $5 - j$ green envelopes. Therefore the
probability that exactly $j$ red envelopes will contain red cards is
\[
\frac{\binom{7}{j}\binom{3}{5-j}}{\binom{10}{5}} \quad \text{for } j = 2, 3, 4, 5.
\]
But if $j$ red envelopes contain red cards, then $j - 2$ green envelopes must also contain green cards.
Hence, this is also the probability of exactly $k = j + (j - 2) = 2j - 2$ matches.


10. If there is a point $x$ that belongs to neither $A$ nor $B$, then $x$ belongs to both $A^c$ and $B^c$. Hence, $A^c$
and $B^c$ are not disjoint. Therefore, $A^c$ and $B^c$ will be disjoint if and only if $A \cup B = S$.

11. We can use Fig. S.1.1 by relabeling the events $A$, $B$, and $C$ in the figure as $A_1$, $A_2$, and $A_3$ respectively.
It is now easy to see that the probability that exactly one of the three events occurs is $p_1 + p_2 + p_3$.
Also,
\[
\begin{aligned}
\Pr(A_1) &= p_1 + p_4 + p_6 + p_7, \\
\Pr(A_1 \cap A_2) &= p_4 + p_7, \text{ etc.}
\end{aligned}
\]
By breaking down each probability in the given expression in this way, we obtain the desired result.


12. The proof can be done in a manner similar to that of Theorem 1.10.2. Here is an alternative argument.
Consider first a point that belongs to exactly one of the events $A_1, \ldots, A_n$. Then this point will be
counted in exactly one of the $\Pr(A_i)$ terms in the given expression, and in none of the intersections.
Hence, it will be counted exactly once in the given expression, as required. Now consider a point that
belongs to exactly $r$ of the events $A_1, \ldots, A_n$ ($r \ge 2$). Then it will be counted in exactly $r$ of the $\Pr(A_i)$
terms, exactly $\binom{r}{2}$ of the $\Pr(A_i A_j)$ terms, exactly $\binom{r}{3}$ of the $\Pr(A_i A_j A_k)$ terms, etc. Hence, in the
given expression it will be counted the following number of times:
\[
r - 2\binom{r}{2} + 3\binom{r}{3} - \cdots \pm r\binom{r}{r}
= r\left[\binom{r-1}{0} - \binom{r-1}{1} + \binom{r-1}{2} - \cdots \pm \binom{r-1}{r-1}\right] = 0,
\]
by Exercise b of Sec. 1.8. Hence, a point will be counted in the given expression if and only if it belongs
to exactly one of the events $A_1, \ldots, A_n$, and then it will be counted exactly once.
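
The counting identity used in this argument is easy to check for small $r$. The following Python sketch (the function name `times_counted` is illustrative) evaluates the alternating sum $\sum_{j=1}^{r} (-1)^{j+1} j \binom{r}{j}$ directly.

```python
from math import comb


def times_counted(r):
    """Net number of times a point lying in exactly r events is counted."""
    return sum((-1) ** (j + 1) * j * comb(r, j) for j in range(1, r + 1))


# Net counts for r = 1, ..., 10.
counts = [times_counted(r) for r in range(1, 11)]
```

The list `counts` should begin with 1 (a point in exactly one event is counted once) and be 0 for every $r \ge 2$.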


13. (a) In order for the winning combination to have no consecutive numbers, between every pair of
numbers in the winning combination there must be at least one number not in the winning
combination. That is, there must be at least $k - 1$ numbers not in the winning combination to be
in between the pairs of numbers in the winning combination. Since there are $k$ numbers in the
winning combination, there must be at least $k + k - 1 = 2k - 1$ numbers available in order for it
to be possible to have no consecutive numbers in the winning combination. So, $n$ must be at least
$2k - 1$ to allow a combination with no consecutive numbers.

(b) Let $i_1, \ldots, i_k$ and $j_1, \ldots, j_k$ be as described in the problem. For one direction, suppose that
$i_1, \ldots, i_k$ contains at least one pair of consecutive integers, say $i_{a+1} = i_a + 1$. Then
\[
j_{a+1} = i_{a+1} - a = i_a + 1 - a = i_a - (a - 1) = j_a.
\]
So, $j_1, \ldots, j_k$ contains repeats. For the other direction, suppose that $j_1, \ldots, j_k$ contains repeats,
say $j_{a+1} = j_a$. Then
\[
i_{a+1} = j_{a+1} + a = j_a + a = i_a + 1.
\]
So $i_1, \ldots, i_k$ contains a pair of consecutive numbers.

(c) Since $i_1 < i_2 < \cdots < i_k$, we know that $i_a + 1 \le i_{a+1}$, so that $j_a = i_a - a + 1 \le i_{a+1} - a = j_{a+1}$ for
each $a = 1, \ldots, k - 1$. Since $i_k \le n$, $j_k = i_k - k + 1 \le n - k + 1$. The number of all $(j_1, \ldots, j_k)$ with
$1 \le j_1 < \cdots < j_k \le n - k + 1$ is just the number of combinations of $n - k + 1$ items taken $k$ at a
time, that is $\binom{n-k+1}{k}$.

(d) By part (b), there are no pairs of consecutive integers in the winning combination $(i_1, \ldots, i_k)$ if
and only if $(j_1, \ldots, j_k)$ has no repeats. The total number of winning combinations is $\binom{n}{k}$. In part
(c), we computed the number of winning combinations with no repeats among $(j_1, \ldots, j_k)$ to be
$\binom{n-k+1}{k}$. So, the probability of no consecutive integers is
\[
\frac{\binom{n-k+1}{k}}{\binom{n}{k}} = \frac{(n-k)!\,(n-k+1)!}{n!\,(n-2k+1)!}.
\]

(e) The probability of at least one pair of consecutive integers is one minus the answer to part (d).
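
The count in part (c) can be confirmed by brute force for small $n$ and $k$. This is a sketch only, with an illustrative helper name `count_no_consecutive`: it enumerates all $k$-subsets of $\{1, \ldots, n\}$ and counts those that contain no two consecutive integers.

```python
from itertools import combinations
from math import comb


def count_no_consecutive(n, k):
    """Number of k-subsets of {1, ..., n} with no two consecutive integers."""
    return sum(
        1
        for combo in combinations(range(1, n + 1), k)
        if all(b - a > 1 for a, b in zip(combo, combo[1:]))
    )


# Compare the enumeration with the closed form for all small cases.
mismatches = [
    (n, k)
    for n in range(1, 11)
    for k in range(1, n + 1)
    if count_no_consecutive(n, k) != comb(n - k + 1, k)
]
```

An empty `mismatches` list means the enumeration agrees with $\binom{n-k+1}{k}$ in every case tried, including those with $n < 2k - 1$, where both counts are zero.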
Chapter 2


Conditional Probability 


2.1 The Definition of Conditional Probability 


Commentary 


It is useful to stress the point raised in the note on page 59. That is, conditional probabilities behave just 
like probabilities. This will come up again in Sec. 3.6 where conditional distributions are introduced. 

This section ends with an extended example called “The Game of Craps”. This example helps to reinforce 
a subtle line of reasoning about conditional probability that was introduced in Example 2.1.5. In particular, 
it uses the idea that conditional probabilities given an event B can be calculated as if we knew ahead of time 
that B had to occur. 


Solutions to Exercises 
1. If $A \subset B$, then $A \cap B = A$ and $\Pr(A \cap B) = \Pr(A)$. So $\Pr(A \mid B) = \Pr(A)/\Pr(B)$.
2. Since $A \cap B = \emptyset$, it follows that $\Pr(A \cap B) = 0$. Therefore, $\Pr(A \mid B) = 0$.
3. Since $A \cap S = A$ and $\Pr(S) = 1$, it follows that $\Pr(A \mid S) = \Pr(A)$.


4. Let $A_i$ stand for the event that the shopper purchases brand $A$ on his $i$th purchase, for $i = 1, 2, \ldots$.
Similarly, let $B_i$ be the event that he purchases brand $B$ on the $i$th purchase. Then
\[
\begin{aligned}
\Pr(A_1) &= \frac{1}{2}, \\
\Pr(A_2 \mid A_1) &= \frac{1}{3}, \\
\Pr(B_3 \mid A_1 \cap A_2) &= \frac{2}{3}, \\
\Pr(B_4 \mid A_1 \cap A_2 \cap B_3) &= \frac{1}{3}.
\end{aligned}
\]
The desired probability is the product of these four probabilities, namely 1/27.


5. Let $R_i$ be the event that a red ball is drawn on the $i$th draw, and let $B_i$ be the event that a blue ball
is drawn on the $i$th draw for $i = 1, \ldots, 4$. Then
\[
\begin{aligned}
\Pr(R_1) &= \frac{r}{r+b}, \\
\Pr(R_2 \mid R_1) &= \frac{r+k}{r+b+k}, \\
\Pr(R_3 \mid R_1 \cap R_2) &= \frac{r+2k}{r+b+2k}, \\
\Pr(B_4 \mid R_1 \cap R_2 \cap R_3) &= \frac{b}{r+b+3k}.
\end{aligned}
\]
The desired probability is the product of these four probabilities, namely
\[
\frac{r(r+k)(r+2k)b}{(r+b)(r+b+k)(r+b+2k)(r+b+3k)}.
\]


6. This problem illustrates the importance of relying on the rules of conditional probability rather than
on intuition to obtain the answer. Intuitively, but incorrectly, it might seem that since the observed
side is green, and since the other side might be either red or green, the probability that it will be
green is 1/2. The correct analysis is as follows: Let $A$ be the event that the selected card is green on
both sides, and let $B$ be the event that the observed side is green. Since each of the three cards is
equally likely to be selected, $\Pr(A) = \Pr(A \cap B) = 1/3$. Also, $\Pr(B) = 1/2$. The desired probability is
\[
\Pr(A \mid B) = \left(\frac{1}{3}\right) \Big/ \left(\frac{1}{2}\right) = \frac{2}{3}.
\]
7. We know that $\Pr(A) = 0.6$ and $\Pr(A \cap B) = 0.2$. Therefore, $\Pr(B \mid A) = \dfrac{0.2}{0.6} = \dfrac{1}{3}$.

8. In Exercise 2 in Sec. 1.10 it was found that $\Pr(A \cup B \cup C) = 0.85$. Since $\Pr(A) = 0.6$, it follows that
\[
\Pr(A \mid A \cup B \cup C) = \frac{0.60}{0.85} = \frac{12}{17}.
\]

9. (a) If card A has been selected, each of the other four cards is equally likely to be the other selected
card. Since three of these four cards are red, the required probability is 3/4.

(b) We know, without being told, that at least one red card must be selected, so this information does
not affect the probabilities of any events. We have
\[
\Pr(\text{both cards red}) = \Pr(R_1)\Pr(R_2 \mid R_1) = \frac{4}{5} \cdot \frac{3}{4} = \frac{3}{5}.
\]
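10. As in the text, let $\pi_0$ stand for the probability that the sum on the first roll is either 7 or 11, and let
$\pi_i$ be the probability that the sum on the first roll is $i$ and the player goes on to win, for $i = 2, \ldots, 12$.
In this version of the game of craps, we have
\[
\begin{aligned}
\pi_0 &= \frac{2}{9}, \\
\pi_4 = \pi_{10} &= \frac{3}{36} \cdot \frac{3/36}{3/36 + 6/36 + 2/36} = \frac{1}{44}, \\
\pi_5 = \pi_9 &= \frac{4}{36} \cdot \frac{4/36}{4/36 + 6/36 + 2/36} = \frac{1}{27}, \\
\pi_6 = \pi_8 &= \frac{5}{36} \cdot \frac{5/36}{5/36 + 6/36 + 2/36} = \frac{25}{468}.
\end{aligned}
\]
The probability of winning, which is the sum of these probabilities, is 0.448.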
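
The sum can be recomputed with exact fractions. The sketch below rests on an assumption about the rules of this version of the game, inferred from the probabilities in the solution: the player wins at once on 7 or 11, and after an opening roll of 4, 5, 6, 8, 9, or 10 the dice are rolled until either that value recurs (a win) or a 7 or 11 appears (a loss). The names `roll` and `win` are illustrative.

```python
from fractions import Fraction

# Number of ways two fair dice can total s, for s = 2, ..., 12.
ways = {s: 6 - abs(s - 7) for s in range(2, 13)}


def roll(s):
    """Probability that a single roll of two dice sums to s."""
    return Fraction(ways[s], 36)


# Immediate win on the first roll (assumed rule: 7 or 11).
pi0 = roll(7) + roll(11)

# For each point, win if the point recurs before a 7 or 11 (assumed rule).
win = pi0
for point in (4, 5, 6, 8, 9, 10):
    win += roll(point) * roll(point) / (roll(point) + pi0)
```

Under these assumed rules `float(win)` is about 0.4486, which matches the value 0.448 reported above to the three digits shown.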
11. This is the conditional version of Theorem 1.5.3. From the definition of conditional probability, we
have
\[
1 - \Pr(A \mid B) = 1 - \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\Pr(B) - \Pr(A \cap B)}{\Pr(B)}. \tag{S.2.1}
\]
According to Theorem 1.5.6 (switching the names $A$ and $B$), $\Pr(B) - \Pr(A \cap B) = \Pr(A^c \cap B)$.
Combining this with (S.2.1) yields $1 - \Pr(A \mid B) = \Pr(A^c \mid B)$.


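12. This is the conditional version of Theorem 1.5.7. Let $A_1 = A \cap D$ and $A_2 = B \cap D$. Then $A_1 \cup A_2 =
(A \cup B) \cap D$ and $A_1 \cap A_2 = A \cap B \cap D$. Now apply Theorem 1.5.7 to determine $\Pr(A_1 \cup A_2)$.
\[
\begin{aligned}
\Pr([A \cup B] \cap D) = \Pr(A_1 \cup A_2) &= \Pr(A_1) + \Pr(A_2) - \Pr(A_1 \cap A_2) \\
&= \Pr(A \cap D) + \Pr(B \cap D) - \Pr(A \cap B \cap D).
\end{aligned}
\]
Now, divide the extreme left and right ends of this string of equalities by $\Pr(D)$ to obtain
\[
\begin{aligned}
\Pr(A \cup B \mid D) &= \frac{\Pr(A \cap D)}{\Pr(D)} + \frac{\Pr(B \cap D)}{\Pr(D)} - \frac{\Pr(A \cap B \cap D)}{\Pr(D)} \\
&= \Pr(A \mid D) + \Pr(B \mid D) - \Pr(A \cap B \mid D).
\end{aligned}
\]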


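13. Let $A_1$ denote the event that the selected coin has a head on each side, let $A_2$ denote the event that it
has a tail on each side, let $A_3$ denote the event that it is fair, and let $B$ denote the event that a head
is obtained. Then
\[
\begin{aligned}
\Pr(A_1) &= \frac{3}{9}, & \Pr(A_2) &= \frac{4}{9}, & \Pr(A_3) &= \frac{2}{9}, \\
\Pr(B \mid A_1) &= 1, & \Pr(B \mid A_2) &= 0, & \Pr(B \mid A_3) &= \frac{1}{2}.
\end{aligned}
\]
Hence,
\[
\Pr(B) = \sum_{i=1}^{3} \Pr(A_i)\Pr(B \mid A_i) = \frac{3}{9} + 0 + \frac{1}{9} = \frac{4}{9}.
\]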


14. We partition the space of possibilities into three events $B_1, B_2, B_3$ as follows. Let $B_1$ be the event that
the machine is in good working order. Let $B_2$ be the event that the machine is wearing down. Let $B_3$ be
the event that it needs maintenance. We are told that $\Pr(B_1) = 0.8$ and $\Pr(B_2) = \Pr(B_3) = 0.1$. Let
$A$ be the event that a part is defective. We are asked to find $\Pr(A)$. We are told that $\Pr(A \mid B_1) = 0.02$,
$\Pr(A \mid B_2) = 0.1$, and $\Pr(A \mid B_3) = 0.3$. The law of total probability allows us to compute $\Pr(A)$ as
follows:
\[
\Pr(A) = \sum_{j=1}^{3} \Pr(B_j)\Pr(A \mid B_j) = 0.8 \times 0.02 + 0.1 \times 0.1 + 0.1 \times 0.3 = 0.056.
\]
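
The law of total probability used here amounts to a weighted sum, which the following Python sketch spells out. The function name `total_probability` is illustrative; exact fractions are used so that the result is not affected by floating-point rounding.

```python
from fractions import Fraction


def total_probability(priors, conditionals):
    """Return sum_j Pr(B_j) * Pr(A | B_j) for a partition B_1, ..., B_k."""
    if sum(priors) != 1:
        raise ValueError("the prior probabilities of a partition must sum to 1")
    return sum(p * c for p, c in zip(priors, conditionals))


# Machine states: good working order, wearing down, needs maintenance.
priors = [Fraction(8, 10), Fraction(1, 10), Fraction(1, 10)]
defect_rates = [Fraction(2, 100), Fraction(1, 10), Fraction(3, 10)]

pr_defective = total_probability(priors, defect_rates)
```

Here `pr_defective` equals 7/125, that is, 0.056.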
15. The analysis is similar to that given in the previous exercise, and the probability is 0.47.


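16. In the usual notation, we have
\[
\begin{aligned}
\Pr(B_2) &= \Pr(A_1 \cap B_2) + \Pr(B_1 \cap B_2) = \Pr(A_1)\Pr(B_2 \mid A_1) + \Pr(B_1)\Pr(B_2 \mid B_1) \\
&= \frac{1}{4} \cdot \frac{2}{3} + \frac{3}{4} \cdot \frac{1}{3} = \frac{5}{12}.
\end{aligned}
\]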


17. Clearly, we must assume that $\Pr(B_j \cap C) > 0$ for all $j$, otherwise (2.1.5) is undefined. By applying the
definition of conditional probability to each term, the right side of (2.1.5) can be rewritten as
\[
\sum_{j=1}^{k} \frac{\Pr(B_j \cap C)}{\Pr(C)} \cdot \frac{\Pr(A \cap B_j \cap C)}{\Pr(B_j \cap C)} = \frac{1}{\Pr(C)} \sum_{j=1}^{k} \Pr(A \cap B_j \cap C).
\]
According to the law of total probability, the last sum above is $\Pr(A \cap C)$, hence the ratio is $\Pr(A \mid C)$.


2.2 Independent Events 


Commentary 


Near the end of this section, we introduce conditionally independent events. This is a prelude to conditionally 
independent and conditionally i.i.d. random variables that are introduced in Sec. 3.7. Conditional
independence has become more popular in statistical modeling with the introduction of latent-variable models and
expert systems. Although these models are not introduced in this text, students who will encounter them in 
the future would do well to study conditional independence early and often. 

Conditional independence is also useful for illustrating how learning data can change the distribution of 
an unknown value. The first examples of this come in Sec. 2.3 after Bayes’ theorem. The assumption that 
a sample of random variables is conditionally i.i.d. given an unknown parameter is the analog in Bayesian 
inference to the assumption that the random sample is i.i.d. marginally. Instructors who are not going to 
cover Bayesian topics might wish to bypass this material, even though it can also be useful in its own right. 
If you decide to not discuss conditional independence, then there is some material later in the book that you 
might wish to bypass as well: 


• Exercise 23 in this section.

• The discussion of conditionally independent events on pages 81-84 in Sec. 2.3.

• Exercises 12, 14 and 15 in Sec. 2.3.

• The discussion of conditionally independent random variables that starts on page 163.

• Exercises 13 and 14 in Sec. 3.7.

• Virtually all of the Bayesian material.


This section ends with an extended example called “The Collector’s Problem”. This example combines 
methods from Chapters 1 and 2 to solve an easily stated but challenging problem. 
Solutions to Exercises


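1. If $\Pr(B) < 1$, then $\Pr(B^c) = 1 - \Pr(B) > 0$. We then compute
\[
\begin{aligned}
\Pr(A^c \mid B^c) &= \frac{\Pr(A^c \cap B^c)}{\Pr(B^c)} \\
&= \frac{1 - \Pr(A \cup B)}{1 - \Pr(B)} \\
&= \frac{1 - \Pr(A) - \Pr(B) + \Pr(A \cap B)}{1 - \Pr(B)} \\
&= \frac{1 - \Pr(A) - \Pr(B) + \Pr(A)\Pr(B)}{1 - \Pr(B)} \\
&= \frac{[1 - \Pr(A)][1 - \Pr(B)]}{1 - \Pr(B)} \\
&= 1 - \Pr(A) = \Pr(A^c).
\end{aligned}
\]

2. \[
\begin{aligned}
\Pr(A^c B^c) &= \Pr[(A \cup B)^c] = 1 - \Pr(A \cup B) \\
&= 1 - [\Pr(A) + \Pr(B) - \Pr(A \cap B)] \\
&= 1 - \Pr(A) - \Pr(B) + \Pr(A)\Pr(B) \\
&= [1 - \Pr(A)][1 - \Pr(B)] \\
&= \Pr(A^c)\Pr(B^c).
\end{aligned}
\]

3. Since the event $A \cap B$ is a subset of the event $A$, and $\Pr(A) = 0$, it follows that $\Pr(A \cap B) = 0$. Hence,
$\Pr(A \cap B) = 0 = \Pr(A)\Pr(B)$.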


4. The probability that the sum will be seven on any given roll of the dice is 1/6. The probability that
this event will occur on three successive rolls is therefore $(1/6)^3$.


5. The probability that both systems will malfunction is $(0.001)^2 = 10^{-6}$. The probability that at least
one of the systems will function is therefore $1 - 10^{-6}$.


6. The probability that the man will win the first lottery is 100/10000 = 0.01, and the probability that 
he will win the second lottery is 100/5000 = 0.02. The probability that he will win at least one lottery 
is, therefore, 


0.01 + 0.02 — (0.01)(0.02) = 0.0298. 
7. Let $E_1$ be the event that A is in class, and let $E_2$ be the event that B is in class. Let $C$ be the event
that at least one of the students is in class. That is, $C = E_1 \cup E_2$.

(a) We want $\Pr(C)$. We shall use Theorem 1.5.7 to compute the probability. Since $E_1$ and $E_2$ are
independent, we have $\Pr(E_1 \cap E_2) = \Pr(E_1)\Pr(E_2)$. Hence
\[
\Pr(C) = \Pr(E_1) + \Pr(E_2) - \Pr(E_1 \cap E_2) = 0.8 + 0.6 - 0.8 \times 0.6 = 0.92.
\]

(b) We want $\Pr(E_1 \mid C)$. We computed $\Pr(C) = 0.92$ in part (a). Since $E_1 \subset C$, $\Pr(E_1 \cap C) = \Pr(E_1) =
0.8$. So, $\Pr(E_1 \mid C) = 0.8/0.92 = 0.8696$.

8. The probability that all three numbers will be equal to a specified value is $1/6^3$. Therefore, the
probability that all three numbers will be equal to any one of the six possible values is $6/6^3 = 1/36$.


9. The probability that exactly $n$ tosses will be required on a given performance is $1/2^n$. Therefore, the
probability that exactly $n$ tosses will be required on all three performances is $(1/2^n)^3 = 1/8^n$. The
probability that the same number of tosses will be required on all three performances is
\[
\sum_{n=1}^{\infty} \frac{1}{8^n} = \frac{1}{7}.
\]

10. The probability $p_j$ that exactly $j$ children will have blue eyes is
\[
p_j = \binom{5}{j} \left(\frac{1}{4}\right)^j \left(\frac{3}{4}\right)^{5-j} \quad \text{for } j = 0, 1, \ldots, 5.
\]
The desired probability is
\[
\frac{p_3 + p_4 + p_5}{p_1 + p_2 + p_3 + p_4 + p_5}.
\]


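11. (a) We must determine the probability that at least two of the four oldest children will have blue eyes.
The probability $p_j$ that exactly $j$ of these four children will have blue eyes is
\[
p_j = \binom{4}{j} \left(\frac{1}{4}\right)^j \left(\frac{3}{4}\right)^{4-j}.
\]
The desired probability is therefore $p_2 + p_3 + p_4$.

(b) The two different types of information provided in Exercise 10 and part (a) are similar to the two
different types of information provided in part (a) and part (b) of Exercise 9 of Sec. 2.1.

12. (a) $\Pr(A^c \cap B^c \cap C^c) = \Pr(A^c)\Pr(B^c)\Pr(C^c) = \dfrac{3}{4} \cdot \dfrac{2}{3} \cdot \dfrac{1}{2} = \dfrac{1}{4}$.

(b) The desired probability is
\[
\Pr(A \cap B^c \cap C^c) + \Pr(A^c \cap B \cap C^c) + \Pr(A^c \cap B^c \cap C) = \frac{1}{4} \cdot \frac{2}{3} \cdot \frac{1}{2} + \frac{3}{4} \cdot \frac{1}{3} \cdot \frac{1}{2} + \frac{3}{4} \cdot \frac{2}{3} \cdot \frac{1}{2} = \frac{11}{24}.
\]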


13. The probability of obtaining a particular sequence of ten particles in which one particle penetrates the
shield and nine particles do not is $(0.01)(0.99)^9$. Since there are 10 such sequences in the sample space,
the desired probability is $10(0.01)(0.99)^9$.


14. The probability that none of the ten particles will penetrate the shield is $(0.99)^{10}$. Therefore, the
probability that at least one particle will penetrate the shield is $1 - (0.99)^{10}$.


15. If $n$ particles are emitted, the probability that at least one particle will penetrate the shield is $1 - (0.99)^n$.
In order for this value to be at least 0.8 we must have
\[
\begin{aligned}
1 - (0.99)^n &\ge 0.8, \\
(0.99)^n &\le 0.2, \\
n \log(0.99) &\le \log(0.2).
\end{aligned}
\]
Since $\log(0.99)$ is negative, this final relation is equivalent to the relation
\[
n \ge \frac{\log(0.2)}{\log(0.99)} \approx 160.1.
\]
So 161 or more particles are needed.
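
The threshold can be confirmed numerically. The following Python sketch (the function name `min_particles` is illustrative) computes the bound from the logarithms and then checks the two neighboring values of $n$ directly.

```python
import math


def min_particles(p_penetrate, target):
    """Smallest n with 1 - (1 - p_penetrate)**n >= target."""
    bound = math.log(1 - target) / math.log(1 - p_penetrate)
    return math.ceil(bound)


n_needed = min_particles(0.01, 0.8)

# Direct check on either side of the threshold.
prob_at_threshold = 1 - 0.99 ** n_needed
prob_just_below = 1 - 0.99 ** (n_needed - 1)
```

With these inputs `n_needed` is 161; `prob_at_threshold` is at least 0.8 while `prob_just_below` falls slightly short of it.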


16. To determine the probability that team A will win the World Series, we shall calculate the probabilities
that A will win in exactly four, five, six, and seven games, and then sum these probabilities. The
probability that A will win four straight games is $(1/3)^4$. The probability that A will win in five games
is equal to the probability that the fourth victory of team A will occur in the fifth game. As explained
in Example 2.2.8, this probability is $\binom{4}{3}\left(\frac{1}{3}\right)^4\left(\frac{2}{3}\right)$. Similarly, the probabilities that A will win in six
games and in seven games are $\binom{5}{3}\left(\frac{1}{3}\right)^4\left(\frac{2}{3}\right)^2$ and $\binom{6}{3}\left(\frac{1}{3}\right)^4\left(\frac{2}{3}\right)^3$, respectively. By summing these
probabilities, we obtain the result
\[
\sum_{i=3}^{6} \binom{i}{3} \left(\frac{1}{3}\right)^4 \left(\frac{2}{3}\right)^{i-3},
\]
which equals 379/2187.

A second way to solve this problem is to pretend that all seven games are going to be played, regardless
of whether one team has won four games before the seventh game. From this point of view, of the
seven games that are played, the team that wins the World Series might win four, five, six, or seven
games. Therefore, the probability that team A will win the series can be determined by calculating the
probabilities that team A will win exactly four, five, six, and seven games, and then summing these
probabilities. In this way, we obtain the result
\[
\sum_{i=4}^{7} \binom{7}{i} \left(\frac{1}{3}\right)^i \left(\frac{2}{3}\right)^{7-i}.
\]
It can be shown that this answer is equal to the answer that we obtained first.
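
The equality of the two answers can be verified with exact arithmetic. This Python sketch assumes, as the solution does, that team A wins each game independently with probability 1/3; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

p = Fraction(1, 3)  # probability that team A wins any single game
q = 1 - p

# First method: A's fourth win comes in game i + 1, for i = 3, ..., 6.
first_method = sum(comb(i, 3) * p ** 4 * q ** (i - 3) for i in range(3, 7))

# Second method: pretend all seven games are played; A wins at least four.
second_method = sum(comb(7, i) * p ** i * q ** (7 - i) for i in range(4, 8))
```

Both `first_method` and `second_method` evaluate to 379/2187.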


17. In order for the target to be hit for the first time on the third throw of boy A, all five of the following
independent events must occur: (1) A misses on his first throw, (2) B misses on his first throw, (3)
A misses on his second throw, (4) B misses on his second throw, (5) A hits on his third throw. The
probability of all five events occurring is
\[
\frac{2}{3} \cdot \frac{3}{4} \cdot \frac{2}{3} \cdot \frac{3}{4} \cdot \frac{1}{3} = \frac{1}{12}.
\]


18. Let $E$ denote the event that boy A hits the target before boy B. There are two methods of solving
this problem. The first method is to note that the event $E$ can occur in two different ways: (i) If A
hits the target on the first throw. This event occurs with probability 1/3. (ii) If both A and B miss the
target on their first throws, and then subsequently A hits the target before B. The probability that
A and B will both miss on their first throws is $\frac{2}{3} \cdot \frac{3}{4} = \frac{1}{2}$. When they do miss, the conditions of the
game become exactly the same as they were at the beginning of the game. In effect, it is as if the boys
were starting a new game all over again, and so the probability that A will subsequently hit the target
before B is again $\Pr(E)$. Therefore, by considering these two ways in which the event $E$ can occur, we
obtain the relation
\[
\Pr(E) = \frac{1}{3} + \frac{1}{2}\Pr(E).
\]
The solution is $\Pr(E) = \dfrac{2}{3}$.

The second method of solving the problem is to calculate the probabilities that the target will be hit
for the first time on boy A's first throw, on his second throw, on his third throw, etc., and then to sum
these probabilities. For the target to be hit for the first time on his $j$th throw, both A and B must
miss on each of their first $j - 1$ throws, and then A must hit on his next throw. The probability of this
event is
\[
\left(\frac{2}{3}\right)^{j-1} \left(\frac{3}{4}\right)^{j-1} \left(\frac{1}{3}\right) = \left(\frac{1}{2}\right)^{j-1} \left(\frac{1}{3}\right).
\]
Summing over $j = 1, 2, \ldots$ gives $\frac{1}{3} \sum_{j=1}^{\infty} \left(\frac{1}{2}\right)^{j-1} = \frac{2}{3}$, in agreement with the first method.


19. Let $A_1$ denote the event that no red balls are selected, let $A_2$ denote the event that no white balls are
selected, and let $A_3$ denote the event that no blue balls are selected. We must determine the value
of $\Pr(A_1 \cup A_2 \cup A_3)$. We shall apply Theorem 1.10.1. The event $A_1$ will occur if and only if all ten
selected balls are white or blue. Since there is probability 0.8 that any given selected ball will be white
or blue, we have $\Pr(A_1) = (0.8)^{10}$. Similarly, $\Pr(A_2) = (0.7)^{10}$ and $\Pr(A_3) = (0.5)^{10}$. The event $A_1 \cap A_2$
will occur if and only if all ten selected balls are blue. Therefore $\Pr(A_1 \cap A_2) = (0.5)^{10}$. Similarly,
$\Pr(A_2 \cap A_3) = (0.2)^{10}$ and $\Pr(A_1 \cap A_3) = (0.3)^{10}$. Finally, the event $A_1 \cap A_2 \cap A_3$ cannot possibly occur,
so $\Pr(A_1 \cap A_2 \cap A_3) = 0$. So, the desired probability is
\[
(0.8)^{10} + (0.7)^{10} + (0.5)^{10} - (0.5)^{10} - (0.2)^{10} - (0.3)^{10} \approx 0.1356.
\]
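
The numerical value follows directly from Theorem 1.10.1, as this short Python sketch shows. It assumes the color proportions implied by the solution (red 0.2, white 0.3, blue 0.5); the variable names are illustrative.

```python
# Proportion of each color, as implied by the probabilities in the solution.
p_red, p_white, p_blue = 0.2, 0.3, 0.5
draws = 10

# Pr(A_i): a given color is missing from all ten draws.
singles = (1 - p_red) ** draws + (1 - p_white) ** draws + (1 - p_blue) ** draws

# Pr(A_i A_j): two colors are missing, so every draw shows the third color.
pairs = p_blue ** draws + p_red ** draws + p_white ** draws

# Pr(A_1 A_2 A_3) = 0, so the union has probability singles - pairs.
pr_some_color_missing = singles - pairs
```

The result `pr_some_color_missing` is approximately 0.1356.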


20. To prove that $B_1, \ldots, B_k$ are independent events, we must prove that for every subset of $r$ of these
events ($r = 1, \ldots, k$), we have
\[
\Pr(B_{i_1} \cap \cdots \cap B_{i_r}) = \Pr(B_{i_1}) \cdots \Pr(B_{i_r}).
\]
We shall simplify the notation by writing simply $B_1, \ldots, B_r$ instead of $B_{i_1}, \ldots, B_{i_r}$. Hence, we must
show that
\[
\Pr(B_1 \cap \cdots \cap B_r) = \Pr(B_1) \cdots \Pr(B_r). \tag{S.2.2}
\]
Suppose that the relation (S.2.2) is satisfied whenever $B_j = A_j^c$ for $m$ or fewer values of $j$ and $B_j = A_j$
for the other $k - m$ or more values of $j$. We shall show that (S.2.2) is also satisfied whenever $B_j = A_j^c$
for $m + 1$ values of $j$. Without loss of generality, we shall assume that $j = r$ is one of these $m + 1$
values, so that $B_r = A_r^c$. It is always true that
\[
\Pr(B_1 \cap \cdots \cap B_r) = \Pr(B_1 \cap \cdots \cap B_{r-1}) - \Pr(B_1 \cap \cdots \cap B_{r-1} \cap B_r^c).
\]
Since among the events $B_1, \ldots, B_{r-1}$ there are $m$ or fewer values of $j$ such that $B_j = A_j^c$, it follows
from the induction hypothesis that
\[
\Pr(B_1 \cap \cdots \cap B_{r-1}) = \Pr(B_1) \cdots \Pr(B_{r-1}).
\]
Furthermore, since $B_r^c = A_r$, the same induction hypothesis implies that
\[
\Pr(B_1 \cap \cdots \cap B_{r-1} \cap B_r^c) = \Pr(B_1) \cdots \Pr(B_{r-1})\Pr(B_r^c).
\]
It now follows that
\[
\Pr(B_1 \cap \cdots \cap B_r) = \Pr(B_1) \cdots \Pr(B_{r-1})[1 - \Pr(B_r^c)] = \Pr(B_1) \cdots \Pr(B_r).
\]
Thus, we have shown that if the events $B_1, \ldots, B_k$ are independent whenever there are $m$ or fewer
values of $j$ such that $B_j = A_j^c$, then the events $B_1, \ldots, B_k$ are also independent whenever there are
$m + 1$ values of $j$ such that $B_j = A_j^c$. Since $B_1, \ldots, B_k$ are obviously independent whenever there are
zero values of $j$ such that $B_j = A_j^c$ (i.e., whenever $B_j = A_j$ for $j = 1, \ldots, k$), the induction argument is
complete. Therefore, the events $B_1, \ldots, B_k$ are independent regardless of whether $B_j = A_j$ or $B_j = A_j^c$
for each value of $j$.


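21. For the “only if” direction, we need to prove that if $A_1,\ldots,A_k$ are independent then

\[ \Pr(A_{i_1} \cap \cdots \cap A_{i_m} \mid A_{j_1} \cap \cdots \cap A_{j_\ell}) = \Pr(A_{i_1} \cap \cdots \cap A_{i_m}), \]

for all disjoint subsets $\{i_1,\ldots,i_m\}$ and $\{j_1,\ldots,j_\ell\}$ of $\{1,\ldots,k\}$. If $A_1,\ldots,A_k$ are independent, then

\[ \Pr(A_{i_1} \cap \cdots \cap A_{i_m} \cap A_{j_1} \cap \cdots \cap A_{j_\ell}) = \Pr(A_{i_1} \cap \cdots \cap A_{i_m}) \Pr(A_{j_1} \cap \cdots \cap A_{j_\ell}), \]

hence it follows that

\[ \Pr(A_{i_1} \cap \cdots \cap A_{i_m} \mid A_{j_1} \cap \cdots \cap A_{j_\ell}) = \frac{\Pr(A_{i_1} \cap \cdots \cap A_{i_m} \cap A_{j_1} \cap \cdots \cap A_{j_\ell})}{\Pr(A_{j_1} \cap \cdots \cap A_{j_\ell})} = \Pr(A_{i_1} \cap \cdots \cap A_{i_m}). \]

For the “if” direction, assume that $\Pr(A_{i_1} \cap \cdots \cap A_{i_m} \mid A_{j_1} \cap \cdots \cap A_{j_\ell}) = \Pr(A_{i_1} \cap \cdots \cap A_{i_m})$ for
all disjoint subsets $\{i_1,\ldots,i_m\}$ and $\{j_1,\ldots,j_\ell\}$ of $\{1,\ldots,k\}$. We must prove that $A_1,\ldots,A_k$ are
independent. That is, we must prove that for every subset $\{s_1,\ldots,s_n\}$ of $\{1,\ldots,k\}$, $\Pr(A_{s_1} \cap \cdots \cap A_{s_n}) =
\Pr(A_{s_1}) \cdots \Pr(A_{s_n})$. We shall do this by induction on $n$. For $n = 1$, we have that $\Pr(A_{s_1}) = \Pr(A_{s_1})$
for each subset $\{s_1\}$ of $\{1,\ldots,k\}$. Now, assume that for all $n \le n_0$ and for all subsets $\{s_1,\ldots,s_n\}$ of
$\{1,\ldots,k\}$ it is true that $\Pr(A_{s_1} \cap \cdots \cap A_{s_n}) = \Pr(A_{s_1}) \cdots \Pr(A_{s_n})$. We need to prove that for every
subset $\{t_1,\ldots,t_{n_0+1}\}$ of $\{1,\ldots,k\}$

\[ \Pr(A_{t_1} \cap \cdots \cap A_{t_{n_0+1}}) = \Pr(A_{t_1}) \cdots \Pr(A_{t_{n_0+1}}). \tag{S.2.3} \]

It is clear that

\[ \Pr(A_{t_1} \cap \cdots \cap A_{t_{n_0+1}}) = \Pr(A_{t_1} \cap \cdots \cap A_{t_{n_0}} \mid A_{t_{n_0+1}}) \Pr(A_{t_{n_0+1}}). \tag{S.2.4} \]

We have assumed that $\Pr(A_{t_1} \cap \cdots \cap A_{t_{n_0}} \mid A_{t_{n_0+1}}) = \Pr(A_{t_1} \cap \cdots \cap A_{t_{n_0}})$ for all disjoint subsets $\{t_1,\ldots,t_{n_0}\}$
and $\{t_{n_0+1}\}$ of $\{1,\ldots,k\}$. Since the right side of this last equation is the probability of the intersection
of only $n_0$ events, we know that

\[ \Pr(A_{t_1} \cap \cdots \cap A_{t_{n_0}}) = \Pr(A_{t_1}) \cdots \Pr(A_{t_{n_0}}). \]

Combining this with Eq. (S.2.4) implies that (S.2.3) holds.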


22. For the “only if” direction, we assume that $A_1$ and $A_2$ are conditionally independent given $B$ and we
must prove that $\Pr(A_2 \mid A_1 \cap B) = \Pr(A_2 \mid B)$. Since $A_1$ and $A_2$ are conditionally independent given $B$,
$\Pr(A_1 \cap A_2 \mid B) = \Pr(A_1 \mid B) \Pr(A_2 \mid B)$. This implies that

\[ \Pr(A_2 \mid B) = \frac{\Pr(A_1 \cap A_2 \mid B)}{\Pr(A_1 \mid B)}. \]

Also,

\[ \Pr(A_2 \mid A_1 \cap B) = \frac{\Pr(A_1 \cap A_2 \cap B)}{\Pr(A_1 \cap B)} = \frac{\Pr(A_1 \cap A_2 \cap B)/\Pr(B)}{\Pr(A_1 \cap B)/\Pr(B)} = \frac{\Pr(A_1 \cap A_2 \mid B)}{\Pr(A_1 \mid B)}. \]

Hence, $\Pr(A_2 \mid A_1 \cap B) = \Pr(A_2 \mid B)$.

For the “if” direction, we assume that $\Pr(A_2 \mid A_1 \cap B) = \Pr(A_2 \mid B)$, and we must prove that $A_1$ and $A_2$ are
conditionally independent given $B$. That is, we must prove that $\Pr(A_1 \cap A_2 \mid B) = \Pr(A_1 \mid B) \Pr(A_2 \mid B)$.
We know that

\[ \Pr(A_1 \cap A_2 \mid B) \Pr(B) = \Pr(A_2 \mid A_1 \cap B) \Pr(A_1 \cap B), \]

since both sides are equal to $\Pr(A_1 \cap A_2 \cap B)$. Divide both sides of this equation by $\Pr(B)$ and use the
assumption $\Pr(A_2 \mid A_1 \cap B) = \Pr(A_2 \mid B)$ together with $\Pr(A_1 \cap B)/\Pr(B) = \Pr(A_1 \mid B)$ to obtain

\[ \Pr(A_1 \cap A_2 \mid B) = \Pr(A_2 \mid B) \Pr(A_1 \mid B). \]


23. (a) Conditional on $B$, the events $A_1,\ldots,A_{11}$ are independent with probability 0.8 each. The
conditional probability that a particular collection of eight programs out of the 11 will compile is
$0.8^8\,0.2^3 = 0.001342$. There are $\binom{11}{8} = 165$ different such collections of eight programs out of the
11, so the probability that exactly eight programs will compile is $165 \times 0.001342 = 0.2215$.


(b) Conditional on $B^c$, the events $A_1,\ldots,A_{11}$ are independent with probability 0.4 each. The
conditional probability that a particular collection of eight programs out of the 11 will compile is
$0.4^8\,0.6^3 = 0.0001416$. There are $\binom{11}{8} = 165$ different such collections of eight programs out of
the 11, so the probability that exactly eight programs will compile is $165 \times 0.0001416 = 0.02335$.
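Both parts are binomial probabilities with $n = 11$ and $k = 8$, so they can be reproduced in a few lines. This is a minimal sketch; the helper name `binom_pmf` is an illustrative choice, not something from the text.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Part (a): each program compiles with probability 0.8 given B.
print(round(binom_pmf(8, 11, 0.8), 4))
# Part (b): each program compiles with probability 0.4 given B^c.
print(round(binom_pmf(8, 11, 0.4), 5))
```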


24. Let $n > 1$, and assume that $A_1,\ldots,A_n$ are mutually exclusive. For the “if” direction, assume that at
most one of the events has strictly positive probability. Then, the intersection of every collection of size 
2 or more has probability 0. Also, the product of every collection of 2 or more probabilities is 0, so the 
events satisfy Definition 2.2.2 and are mutually independent. For the “only if” direction, assume that 
the events are mutually independent. The intersection of every collection of size 2 or more is empty 
and must have probability 0. Hence the product of the probabilities of every collection of size 2 or more 
must be 0. This means that at least one factor from every product of at least 2 probabilities must itself 
be 0. Hence there can be no more than one of the probabilities greater than 0, otherwise the product 
of the two nonzero probabilities would be nonzero. 


2.3. Bayes’ Theorem 


Commentary 


This section ends with two extended discussions on how Bayes’ theorem is applied. The first involves a 
sequence of simple updates to the probability of a specific event. It illustrates how conditional independence 
allows one to use posterior probabilities after observing some events as prior probabilities before observing 
later events. This idea is subtle, but very useful in Bayesian inference. The second discussion builds upon this 
idea and illustrates the type of reasoning that can be used in real inference problems. Examples 2.3.7 and 2.3.8 
are particularly useful in this regard. They show how data can bring very different prior probabilities into
closer posterior agreement. Exercise 12 illustrates the effect of the size of a sample on the degree to which
the data can reduce differences in subjective probabilities.

Statistical software like R can be used to facilitate calculations like those that occur in the above-mentioned
examples. For example, suppose that the 11 prior probabilities are assigned to the vector prior and that
the data consist of s successes and f failures. Then the posterior probabilities can be computed by
ints=1:11
post=prior*((ints-1)/10)^s*(1-(ints-1)/10)^f
post=post/sum(post)


Solutions to Exercises 
1. It must be true that $\sum_{i=1}^{k} \Pr(B_i) = 1$ and $\sum_{i=1}^{k} \Pr(B_i \mid A) = 1$. However, if $\Pr(B_1 \mid A) < \Pr(B_1)$ and
$\Pr(B_i \mid A) \le \Pr(B_i)$ for $i = 2,\ldots,k$, we would have $\sum_{i=1}^{k} \Pr(B_i \mid A) < \sum_{i=1}^{k} \Pr(B_i)$, a contradiction.
Therefore, it must be true that $\Pr(B_i \mid A) > \Pr(B_i)$ for at least one value of $i$ ($i = 2,\ldots,k$).


2. It was shown in the text that $\Pr(A_2 \mid B) = 0.26 < \Pr(A_2) = 0.3$. Similarly,

\[ \Pr(A_1 \mid B) = \frac{(0.2)(0.01)}{(0.2)(0.01) + (0.3)(0.02) + (0.5)(0.03)} = 0.09. \]

Since $\Pr(A_1) = 0.2$, we have $\Pr(A_1 \mid B) < \Pr(A_1)$. Furthermore,

\[ \Pr(A_3 \mid B) = \frac{(0.5)(0.03)}{(0.2)(0.01) + (0.3)(0.02) + (0.5)(0.03)} = 0.65. \]

Since $\Pr(A_3) = 0.5$, we have $\Pr(A_3 \mid B) > \Pr(A_3)$.
3. Let $C$ denote the event that the selected item is nondefective. Then

\[ \Pr(A_2 \mid C) = \frac{(0.3)(0.98)}{(0.2)(0.99) + (0.3)(0.98) + (0.5)(0.97)} = 0.301. \]

Commentary: It should be noted that if the selected item is observed to be defective, the probability
that the item was produced by machine $M_2$ is decreased from the prior value of 0.3 to the posterior
value of 0.26. However, if the selected item is observed to be nondefective, this probability changes
very little, from a prior value of 0.3 to a posterior value of 0.301. In this example, therefore, obtaining
a defective is more informative than obtaining a nondefective, but it is much more probable that a
nondefective will be obtained.
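The posterior probabilities in Exercises 2 and 3 come from the same Bayes' theorem calculation with different likelihoods. The sketch below assumes the priors (0.2, 0.3, 0.5) and defect rates (0.01, 0.02, 0.03) that appear in the two solutions; the function name `posterior` is hypothetical.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem for a finite partition: normalize prior * likelihood."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # the denominator, Pr(data)
    return [j / total for j in joint]


priors = [0.2, 0.3, 0.5]           # Pr(A_1), Pr(A_2), Pr(A_3)
defect_rates = [0.01, 0.02, 0.03]  # Pr(defective | A_i)

given_defective = posterior(priors, defect_rates)
given_nondefective = posterior(priors, [1 - d for d in defect_rates])

print([round(x, 3) for x in given_defective])
print([round(x, 3) for x in given_nondefective])
```

The second machine's probability moves from 0.3 to about 0.26 after a defective item but barely changes after a nondefective one, which is the point of the commentary.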


4. The desired probability Pr(Cancer | Positive) can be calculated as follows:

\[ \frac{\Pr(\text{Cancer}) \Pr(\text{Positive} \mid \text{Cancer})}{\Pr(\text{Cancer}) \Pr(\text{Positive} \mid \text{Cancer}) + \Pr(\text{No Cancer}) \Pr(\text{Positive} \mid \text{No Cancer})}
= \frac{(0.00001)(0.95)}{(0.00001)(0.95) + (0.99999)(0.05)} = 0.00019. \]

Commentary: It should be noted that even though this test provides a correct diagnosis 95 percent of
the time, the probability that a person has this type of cancer, given a positive reaction to
the test, is not 0.95. In fact, as this exercise shows, even though a person has a positive reaction, the
probability that the person has this type of cancer is still only 0.00019. In other words, the probability that the
person has this type of cancer is 19 times larger than it was before taking the test ($0.00019/0.00001 = 19$),
but it is still very small because the disease is so rare in the population.


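5. The desired probability Pr(Lib. | NoVote) can be calculated as follows:

\[ \frac{\Pr(\text{Lib.}) \Pr(\text{NoVote} \mid \text{Lib.})}{\Pr(\text{Cons.}) \Pr(\text{NoVote} \mid \text{Cons.}) + \Pr(\text{Lib.}) \Pr(\text{NoVote} \mid \text{Lib.}) + \Pr(\text{Ind.}) \Pr(\text{NoVote} \mid \text{Ind.})}
= \frac{(0.5)(0.18)}{(0.3)(0.35) + (0.5)(0.18) + (0.2)(0.50)} = \frac{18}{59}. \]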


6. (a) Let $A_1$ denote the event that the machine is adjusted properly, let $A_2$ denote the event that it
is adjusted improperly, and let $B$ be the event that four of the five inspected items are of high
quality. Then

\[ \Pr(A_1 \mid B) = \frac{\Pr(A_1) \Pr(B \mid A_1)}{\Pr(A_1) \Pr(B \mid A_1) + \Pr(A_2) \Pr(B \mid A_2)}
= \frac{(0.9) \binom{5}{4} (0.5)^5}{(0.9) \binom{5}{4} (0.5)^5 + (0.1) \binom{5}{4} (0.25)^4 (0.75)} = \frac{96}{97}. \]

(b) The prior probabilities before this additional item is observed are the values found in part (a):
$\Pr(A_1) = 96/97$ and $\Pr(A_2) = 1/97$. Let $C$ denote the event that the additional item is of medium
quality. Then

\[ \Pr(A_1 \mid C) = \frac{\frac{96}{97} \cdot \frac{1}{2}}{\frac{96}{97} \cdot \frac{1}{2} + \frac{1}{97} \cdot \frac{3}{4}} = \frac{64}{65}. \]


7. (a) Let $\pi_i$ denote the posterior probability that coin $i$ was selected. The prior probability of each coin
is 1/5. Therefore

\[ \pi_i = \frac{\frac{1}{5} p_i}{\sum_{j=1}^{5} \frac{1}{5} p_j} \quad \text{for } i = 1,\ldots,5. \]

The five values are $\pi_1 = 0$, $\pi_2 = 0.1$, $\pi_3 = 0.2$, $\pi_4 = 0.3$, and $\pi_5 = 0.4$.


(b) The probability of obtaining another head is equal to

\[ \sum_{i=1}^{5} \Pr(\text{Coin } i) \Pr(\text{Head} \mid \text{Coin } i) = \sum_{i=1}^{5} \pi_i p_i = \frac{3}{4}. \]


(c) The posterior probability $\pi_i$ of coin $i$ would now be

\[ \pi_i = \frac{\frac{1}{5} (1 - p_i)}{\sum_{j=1}^{5} \frac{1}{5} (1 - p_j)} \quad \text{for } i = 1,\ldots,5. \]

Thus, $\pi_1 = 0.4$, $\pi_2 = 0.3$, $\pi_3 = 0.2$, $\pi_4 = 0.1$, and $\pi_5 = 0$. The probability of obtaining a head on
the next toss is therefore $\sum_{i=1}^{5} \pi_i p_i = \frac{1}{4}$.
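The three parts of Exercise 7 can be reproduced with exact rational arithmetic. The sketch assumes the head probabilities are $p_i = (i-1)/4$ for $i = 1,\ldots,5$, which is what the posterior values listed in the solution imply; the helper name `update` is illustrative.

```python
from fractions import Fraction

# Assumed head probabilities of coins 1..5: 0, 1/4, 1/2, 3/4, 1.
p = [Fraction(i, 4) for i in range(5)]
prior = [Fraction(1, 5)] * 5


def update(prior, likelihood):
    """Posterior over the coins after one observation with the given likelihoods."""
    joint = [a * b for a, b in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


post_head = update(prior, p)                    # part (a): a head was observed
post_tail = update(prior, [1 - x for x in p])   # part (c): a tail was observed

# Predictive probability of a head on the next toss under each posterior.
next_head_after_head = sum(w * x for w, x in zip(post_head, p))
next_head_after_tail = sum(w * x for w, x in zip(post_tail, p))

print(post_head, next_head_after_head)
print(post_tail, next_head_after_tail)
```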


8. (a) If coin $i$ is selected, the probability that the first head will be obtained on the fourth toss is
$(1 - p_i)^3 p_i$. Therefore, the posterior probability that coin $i$ was selected is

\[ \pi_i = \frac{\frac{1}{5} (1 - p_i)^3 p_i}{\sum_{j=1}^{5} \frac{1}{5} (1 - p_j)^3 p_j} \quad \text{for } i = 1,\ldots,5. \]

The five values are $\pi_1 = 0$, $\pi_2 = 0.5870$, $\pi_3 = 0.3478$, $\pi_4 = 0.0652$, and $\pi_5 = 0$.


(b) If coin $i$ is used, the probability that exactly three additional tosses will be required to obtain
another head is $(1 - p_i)^2 p_i$. Therefore, the desired probability is

\[ \sum_{i=1}^{5} \pi_i (1 - p_i)^2 p_i = 0.1291. \]


9. We shall continue to use the notation from the solution to Exercise 14 in Sec. 2.1. Let $C$ be the
event that exactly one out of seven observed parts is defective. We are asked to find $\Pr(B_j \mid C)$ for
$j = 1,2,3$. We need $\Pr(C \mid B_j)$ for each $j$. Let $A_i$ be the event that the $i$th part is defective. For all
$i$, $\Pr(A_i \mid B_1) = 0.02$, $\Pr(A_i \mid B_2) = 0.1$, and $\Pr(A_i \mid B_3) = 0.3$. Since the seven parts are conditionally
independent given each state of the machine, the probability of each possible sequence of seven parts
with one defective is $\Pr(A_i \mid B_j)[1 - \Pr(A_i \mid B_j)]^6$. There are seven distinct such sequences, so

\[ \Pr(C \mid B_1) = 7 \times 0.02 \times 0.98^6 = 0.1240, \]
\[ \Pr(C \mid B_2) = 7 \times 0.1 \times 0.9^6 = 0.3720, \]
\[ \Pr(C \mid B_3) = 7 \times 0.3 \times 0.7^6 = 0.2471. \]

The expression in the denominator of Bayes’ theorem is

\[ \Pr(C) = 0.8 \times 0.1240 + 0.1 \times 0.3720 + 0.1 \times 0.2471 = 0.1611. \]

Bayes’ theorem now says

\[ \Pr(B_1 \mid C) = \frac{0.8 \times 0.1240}{0.1611} = 0.6157, \]
\[ \Pr(B_2 \mid C) = \frac{0.1 \times 0.3720}{0.1611} = 0.2309, \]
\[ \Pr(B_3 \mid C) = \frac{0.1 \times 0.2471}{0.1611} = 0.1534. \]

10. Bayes’ theorem says that the posterior probability of each $B_i$ is $\Pr(B_i \mid E) = \Pr(B_i) \Pr(E \mid B_i)/\Pr(E)$.
So $\Pr(B_i \mid E) < \Pr(B_i)$ if and only if $\Pr(E \mid B_i) < \Pr(E)$. Since $\Pr(E) = 3/4$, we need to find those $i$ for
which $\Pr(E \mid B_i) < 3/4$. These are $i = 5,6$.


11. This time, we want $\Pr(B_4 \mid E^c)$. We know that $\Pr(E^c) = 1 - \Pr(E) = 1/4$ and $\Pr(E^c \mid B_4) = 1 -
\Pr(E \mid B_4) = 1/4$. This means that $E^c$ and $B_4$ are independent so that $\Pr(B_4 \mid E^c) = \Pr(B_4) = 1/4$.


12. We are doing the same calculations as in Examples 2.3.7 and 2.3.8 except that we only have five patients
and three successes. So, in particular

\[ \Pr(B_i \mid A) = \frac{\Pr(B_i) \binom{5}{3} ([i-1]/10)^3 (1 - [i-1]/10)^2}{\sum_{j=1}^{11} \Pr(B_j) \binom{5}{3} ([j-1]/10)^3 (1 - [j-1]/10)^2}. \tag{S.2.5} \]

In one case, $\Pr(B_i) = 1/11$ for all $i$, and in the other case, the prior probabilities are given in the table
in Example 2.3.8 of the text. The numbers that show up in both calculations are

 i    ([i-1]/10)^3 (1-[i-1]/10)^2      i    ([i-1]/10)^3 (1-[i-1]/10)^2
 1    0                                7    0.0346
 2    0.0008                           8    0.0309
 3    0.0051                           9    0.0205
 4    0.0132                          10    0.0073
 5    0.0230                          11    0
 6    0.0313

We can use these with the two sets of prior probabilities to compute the posterior probabilities according
to Eq. (S.2.5).

 i    Example 2.3.7   Example 2.3.8      i    Example 2.3.7   Example 2.3.8
 1    0               0                  7    0.2074          0.1641
 2    0.0049          0.0300             8    0.1852          0.0879
 3    0.0307          0.0972             9    0.1229          0.0389
 4    0.0794          0.1633            10    0.0437          0.0138
 5    0.1383          0.1969            11    0               0
 6    0.1875          0.2077

These numbers are not nearly so close as those in the examples in the text because we do not have as
much information in the small sample of five patients.
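The posterior column for the equal-prior case can be regenerated directly from the posterior formula in this solution. The following sketch covers only the uniform prior $\Pr(B_i) = 1/11$, because the prior table of Example 2.3.8 is not reproduced here; the function name is an assumption made for illustration. The binomial coefficient cancels between numerator and denominator, so it is omitted.

```python
def posterior_success_rates(prior, s, f):
    """prior[i] is Pr(B_{i+1}); under B_{i+1} the success probability is i/10."""
    weights = [pr * (i / 10) ** s * (1 - i / 10) ** f
               for i, pr in enumerate(prior)]
    total = sum(weights)
    return [w / total for w in weights]


uniform = [1 / 11] * 11
post = posterior_success_rates(uniform, s=3, f=2)

for i, value in enumerate(post, start=1):
    print(i, round(value, 4))
```

Passing a different `prior` list of length 11 gives the other column in the same way.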


13. (a) Let $B_1$ be the event that the coin is fair, and let $B_2$ be the event that the coin has two heads.
Let $H_i$ be the event that we obtain a head on the $i$th toss for $i = 1,2,3,4$. We shall apply Bayes’
theorem conditional on $H_1 \cap H_2$.

\[ \Pr(B_1 \mid H_1 \cap H_2 \cap H_3)
= \frac{\Pr(B_1 \mid H_1 \cap H_2) \Pr(H_3 \mid B_1 \cap H_1 \cap H_2)}{\Pr(B_1 \mid H_1 \cap H_2) \Pr(H_3 \mid B_1 \cap H_1 \cap H_2) + \Pr(B_2 \mid H_1 \cap H_2) \Pr(H_3 \mid B_2 \cap H_1 \cap H_2)}
= \frac{(1/5) \times (1/2)}{(1/5) \times (1/2) + (4/5) \times 1} = \frac{1}{9}. \]

(b) If the coin ever shows a tail, it can’t have two heads. Hence the posterior probability of $B_1$ becomes
1 after we observe a tail.


14. In Exercise 23 of Sec. 2.2, $B$ is the event that the programming task was easy. In that exercise, we
computed $\Pr(A \mid B) = 0.2215$ and $\Pr(A \mid B^c) = 0.02335$. We are also told that $\Pr(B) = 0.4$. Bayes’
theorem tells us that

\[ \Pr(B \mid A) = \frac{\Pr(B) \Pr(A \mid B)}{\Pr(B) \Pr(A \mid B) + \Pr(B^c) \Pr(A \mid B^c)} = \frac{0.4 \times 0.2215}{0.4 \times 0.2215 + (1 - 0.4) 0.02335} = 0.8635. \]


15. The law of total probability tells us how to compute $\Pr(E_1)$.

\[ \Pr(E_1) = \sum_{i=1}^{11} \Pr(B_i) \frac{i-1}{10}. \]

Using the numbers in Example 2.3.8 for $\Pr(B_i)$ we obtain 0.274. This is smaller than the value 0.5
computed in Example 2.3.7 because the prior probabilities in Example 2.3.8 are much higher for the $B_i$
with low values of $i$, particularly $i = 2,3,4$, and they are much smaller for those $B_i$ with large values
of $i$. Since $\Pr(E_1)$ is a weighted average of the values $(i-1)/10$ with the weights being $\Pr(B_i)$ for
$i = 1,\ldots,11$, the more weight we give to small values of $(i-1)/10$, the smaller the weighted average
will be.


16. (a) From the description of the problem $\Pr(D_i \mid B) = 0.01$ for all $i$. If we can show that $\Pr(D_i \mid B^c) =
0.01$ for all $i$, then $\Pr(D_i) = 0.01$ for all $i$. We will prove this by induction. We have assumed that
$D_1$ is independent of $B$ and hence it is independent of $B^c$. This makes $\Pr(D_1 \mid B^c) = 0.01$. Now,
assume that $\Pr(D_i \mid B^c) = 0.01$ for all $i \le j$. Write

\[ \Pr(D_{j+1} \mid B^c) = \Pr(D_{j+1} \mid D_j \cap B^c) \Pr(D_j \mid B^c) + \Pr(D_{j+1} \mid D_j^c \cap B^c) \Pr(D_j^c \mid B^c). \]

The induction hypothesis says that $\Pr(D_j \mid B^c) = 0.01$ and $\Pr(D_j^c \mid B^c) = 0.99$. In the problem
description, we have $\Pr(D_{j+1} \mid D_j \cap B^c) = 2/5$ and $\Pr(D_{j+1} \mid D_j^c \cap B^c) = 1/165$. Plugging these into
the equation above gives

\[ \Pr(D_{j+1} \mid B^c) = \frac{2}{5} \times 0.01 + \frac{1}{165} \times 0.99 = 0.01. \]

This completes the proof.

(b) It is straightforward to compute

\[ \Pr(E \mid B) = 0.99 \times 0.99 \times 0.01 \times 0.01 \times 0.99 \times 0.99 = 0.00009606. \]

By the conditional independence assumption stated in the problem description, we have

\[ \Pr(E \mid B^c) = \Pr(D_1^c \mid B^c) \Pr(D_2^c \mid D_1^c \cap B^c) \Pr(D_3 \mid D_2^c \cap B^c) \Pr(D_4 \mid D_3 \cap B^c) \Pr(D_5^c \mid D_4 \cap B^c) \Pr(D_6^c \mid D_5^c \cap B^c). \]

The six factors on the right side of this equation are respectively 0.99, 164/165, 1/165, 2/5, 3/5,
and 164/165. The product is 0.001423. It follows that

\[ \Pr(B \mid E) = \frac{\Pr(E \mid B) \Pr(B)}{\Pr(E \mid B) \Pr(B) + \Pr(E \mid B^c) \Pr(B^c)} = \frac{0.00009606 \times (2/3)}{0.00009606 \times (2/3) + 0.001423 \times (1/3)} = 0.1190. \]
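The arithmetic in part (b) is easy to mistype, so a short numerical check may be useful. The sketch below simply multiplies the six conditional factors listed in the solution and applies Bayes' theorem with $\Pr(B) = 2/3$; the variable names are illustrative.

```python
# Pr(E | B): given B the items are independent, each defective with probability 0.01.
pr_E_given_B = 0.99 * 0.99 * 0.01 * 0.01 * 0.99 * 0.99

# Pr(E | B^c): chain of one-step conditional probabilities from the solution.
factors = [0.99, 164 / 165, 1 / 165, 2 / 5, 3 / 5, 164 / 165]
pr_E_given_Bc = 1.0
for factor in factors:
    pr_E_given_Bc *= factor

pr_B = 2 / 3
numerator = pr_E_given_B * pr_B
pr_B_given_E = numerator / (numerator + pr_E_given_Bc * (1 - pr_B))

print(pr_E_given_B, pr_E_given_Bc, pr_B_given_E)
```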
2.4 The Gambler’s Ruin Problem


Commentary 


This section is independent of the rest of the book. Instructors can discuss this section at any time that they 
find convenient or they can omit it entirely. 

If Sec. 3.10 on Markov chains has been discussed before this section is discussed, it is helpful to point out
that the game considered here forms a Markov chain with stationary transition probabilities. The state of
the chain at any time is the fortune of gambler A at that time. Therefore, the possible states of the chain are
the $k + 1$ integers $0,1,\ldots,k$. If the chain is in state $i$ ($i = 1,\ldots,k-1$) at any time $n$, then at time $n + 1$ it
will move to state $i + 1$ with probability $p$ and it will move to state $i - 1$ with probability $1 - p$. It is assumed
that if the chain is either in the state 0 or the state $k$ at any time, then it will remain in that same state at
every future time. (These are absorbing states.) Therefore, the $(k + 1) \times (k + 1)$ transition matrix $P$ of the
chain is as follows:

\[
P = \begin{pmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 \\
1-p & 0 & p & 0 & \cdots & 0 & 0 & 0 \\
0 & 1-p & 0 & p & \cdots & 0 & 0 & 0 \\
0 & 0 & 1-p & 0 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1-p & 0 & p \\
0 & 0 & 0 & 0 & \cdots & 0 & 0 & 1
\end{pmatrix}
\]
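The Markov chain description can be made concrete by building the transition matrix and pushing the state distribution forward until nearly all mass sits in the absorbing states. This is a minimal pure-Python sketch; the function names are illustrative, and the closed form used for comparison is the one the solutions below apply (fortune ratio $i/k$ when $p = 1/2$, and $(r^i - 1)/(r^k - 1)$ with $r = (1-p)/p$ otherwise).

```python
def transition_matrix(k, p):
    """(k+1) x (k+1) matrix of the gambler's ruin chain; states 0 and k absorb."""
    P = [[0.0] * (k + 1) for _ in range(k + 1)]
    P[0][0] = 1.0
    P[k][k] = 1.0
    for i in range(1, k):
        P[i][i - 1] = 1 - p   # gambler A loses one dollar
        P[i][i + 1] = p       # gambler A wins one dollar
    return P


def win_probability_by_iteration(i, k, p, steps=5000):
    """Approximate Pr(A reaches k before 0) by iterating the chain from state i."""
    P = transition_matrix(k, p)
    dist = [0.0] * (k + 1)
    dist[i] = 1.0
    for _ in range(steps):
        dist = [sum(dist[a] * P[a][b] for a in range(k + 1))
                for b in range(k + 1)]
    return dist[k]


def win_probability_closed_form(i, k, p):
    if p == 0.5:
        return i / k
    r = (1 - p) / p
    return (r ** i - 1) / (r ** k - 1)


print(win_probability_by_iteration(3, 6, 0.4), win_probability_closed_form(3, 6, 0.4))
```

The iteration count of 5000 is an arbitrary choice that is more than enough for small $k$; larger chains would need more steps or a linear solve.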


Solutions to Exercises 


1. Clearly $a_i$ in Eq. (2.4.9) is an increasing function of $i$. Hence, if $a_{98} < 1/2$, then $a_i < 1/2$ for all $i \le 98$.
For $i = 98$, Eq. (2.4.9) yields almost exactly 4/9, which is less than 1/2.


2. The probability of winning a fair game is just the ratio of the initial fortune to the total funds available. 
This ratio is the same in all three cases. 


3. If the initial fortune of gambler A is $i$ dollars, then for conditions (a), (b), and (c), the initial fortune of
gambler B is $i/2$ dollars. Hence, $k = 3i/2$. If we let $r = (1 - p)/p > 1$, then it follows from Eq. (2.4.8)
that the probability that A will win under conditions (a), (b), or (c) is

\[ \frac{r^i - 1}{r^{3i/2} - 1} = \frac{1 - (1/r^i)}{r^{i/2} - (1/r^i)}. \]

If $i$ and $j$ are positive integers with $i < j$, it now follows that


1=(I/rj)  1= (fry). 1 =r) 
P= (fry) P= (fry) FPO) 


Thus the larger the initial fortune of gambler A is, the smaller is his probability of winning. Therefore, 
he has the largest probability of winning under condition (a). 


4. If we consider this problem from the point of view of gambler B, then each play of the game is 
unfavorable to her. Hence, by a procedure similar to that described in the solution to Exercise 3, it 
follows that she has the smallest probability of winning when her initial fortune is largest. Therefore, 
gambler A has the largest probability of winning when her initial fortune is largest, which corresponds
to condition (c).


5. In this exercise, $p = 1/2$ and $k = i + 2$. Therefore $a_i = i/(i + 2)$. In order to make $a_i \ge 0.99$, we must
have $i \ge 198$.


6. In this exercise $p = 2/3$, and $k = i + 2$. Therefore, by Eq. (2.4.9),

\[ a_i = \frac{(1/2)^i - 1}{(1/2)^{i+2} - 1}. \]

It follows that $a_i \ge 0.99$ if and only if

\[ 2^i \ge 75.25. \]

Therefore, we must have $i \ge 7$.


7. In this exercise $p = 1/3$ and $k = i + 2$. Therefore, by Eq. (2.4.9)

\[ a_i = \frac{2^i - 1}{2^{i+2} - 1} = \frac{1 - (1/2^i)}{4 - (1/2^i)}. \]

But for every number $x$ ($0 < x < 1$), we have $\frac{1 - x}{4 - x} < \frac{1}{4}$. Hence, $a_i < 1/4$ for every positive integer $i$.
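Exercises 5 through 7 all ask how the win probability $a_i$ behaves when gambler B starts with two dollars, so one small search routine covers them. The sketch uses exact fractions to avoid rounding at the 0.99 boundary; the function names and the `extra` parameter (B's initial fortune) are illustrative assumptions.

```python
from fractions import Fraction


def win_probability(i, k, p):
    """Probability that A, starting with i of the k total dollars, ruins B."""
    if p == Fraction(1, 2):
        return Fraction(i, k)
    r = (1 - p) / p
    return (r ** i - 1) / (r ** k - 1)


def smallest_initial_fortune(p, target=Fraction(99, 100), extra=2):
    """Smallest i with a_i >= target when B starts with `extra` dollars."""
    i = 1
    while win_probability(i, i + extra, p) < target:
        i += 1
    return i


print(smallest_initial_fortune(Fraction(1, 2)))  # Exercise 5
print(smallest_initial_fortune(Fraction(2, 3)))  # Exercise 6
# Exercise 7: with p = 1/3 the search would never end, since a_i stays below 1/4.
print(float(win_probability(50, 52, Fraction(1, 3))))
```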


8. This problem can be expressed as a gambler’s ruin problem. Suppose that the initial fortunes of both
gambler A and gambler B are 3 dollars, that gambler A will win one dollar from gambler B whenever
a head is obtained on the coin, and gambler B will win one dollar from gambler A whenever a tail is
obtained on the coin. Then the condition that $X_n = Y_n + 3$ means that A has won all of B’s fortune, and
the condition that $Y_n = X_n + 3$ means that A is ruined. Therefore, if $p = 1/2$, the required probability
is given by Eq. (2.4.6) with $i = 3$ and $k = 6$, and the answer is 1/2. If $p \ne 1/2$, the required probability
is given by Eq. (2.4.9) with $i = 3$ and $k = 6$. In either case, the answer can be expressed in the form

\[ \frac{1}{\left( \frac{1 - p}{p} \right)^3 + 1}. \]

9. This problem can be expressed as a gambler’s ruin problem. We consider the initial fortune of gambler
A to be five dollars and the initial fortune of gambler B to be ten dollars. Gambler A wins one dollar
from gambler B each time that box B is selected, and gambler B wins one dollar from gambler A each
time that box A is selected. Since $i = 5$, $k = 15$, and $p = 1/2$, it follows from Eq. (2.4.6) that the
probability that gambler A will win (and box B will become empty) is 1/3. Therefore, the probability
that box A will become empty first is 2/3.


2.5 Supplementary Exercises 


Solutions to Exercises 


1. Let $\Pr(D) = p > 0$. Then

\[ \Pr(A) = p \Pr(A \mid D) + (1 - p) \Pr(A \mid D^c) \ge p \Pr(B \mid D) + (1 - p) \Pr(B \mid D^c) = \Pr(B). \]

2. (a) Sample space:

HT      TH
HHT     TTH
HHHT    TTTH

(b) $\Pr(HHT \text{ or } TTH) = \left( \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \right) + \left( \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \right) = \frac{1}{4}$.

3. Since $\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}$ and $\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)}$, we have $\frac{\Pr(A \cap B)}{1/5} + \frac{\Pr(A \cap B)}{1/3} = \frac{2}{3}$.
Hence $\Pr(A \cap B) = 1/12$, and $\Pr(A^c \cup B^c) = 1 - \Pr(A \cap B) = 11/12$.

4. $\Pr(A \cup B^c \mid B) = \Pr(A \mid B) + \Pr(B^c \mid B) - \Pr(A \cap B^c \mid B) = \Pr(A) + 0 - 0 = \Pr(A)$.

5. The probability of obtaining the number 6 exactly three times in ten rolls is $a = \binom{10}{3} \left( \frac{1}{6} \right)^3 \left( \frac{5}{6} \right)^7$.
The probability of obtaining the number 6 on the first three rolls and on none of the subsequent
rolls is $b = \left( \frac{1}{6} \right)^3 \left( \frac{5}{6} \right)^7$. Hence, the required probability is $\frac{b}{a} = 1 \Big/ \binom{10}{3}$.

6. $\Pr(A \cap B) = \frac{\Pr(A \cap B \cap D)}{\Pr(D \mid A \cap B)} = \frac{0.04}{0.25} = 0.16$. But also, by independence,
$\Pr(A \cap B) = \Pr(A) \Pr(B) = 4[\Pr(A)]^2$.
Hence, $4[\Pr(A)]^2 = 0.16$, so $\Pr(A) = 0.2$. It now follows that
$\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B) = (0.2) + 4(0.2) - (0.16) = 0.84$.

7. The three events are always independent under the stated conditions. The proof is a straightforward
generalization of the proof of Exercise 2 in Sec. 2.2.

8. No, since $\Pr(A \cap B) = 0$ but $\Pr(A) \Pr(B) > 0$. This also follows from Theorem 2.2.3.

9. Let $\Pr(A) = p$. Then $\Pr(A \cap B) = \Pr(A \cap B \cap C) = 0$, $\Pr(A \cap C) = 4p^2$, $\Pr(B \cap C) = 8p^2$. Therefore,
by Theorem 1.10.1, $5p = p + 2p + 4p - [0 + 4p^2 + 8p^2] + 0$, and $p = 1/6$.

10. $\Pr(\text{Sum} = 7) = 2 \Pr[(1,6)] + 2 \Pr[(2,5)] + 2 \Pr[(3,4)] = 2(0.1)(0.1) + 2(0.1)(0.1) + 2(0.3)(0.3) = 0.22$.

11. $1 - \Pr(\text{losing 50 times}) = 1 - \left( \frac{49}{50} \right)^{50}$.

12. The event will occur when $(X_1, X_2, X_3)$ has the following values:

\[ \ldots, (3,2,1). \]

Each of these 20 points has probability $1/6^3$, so the answer is $20/216 = 5/54$.

13. Let $A$, $B$, and $C$ stand for the events that each of the students is in class on a particular day.

(a) We want $\Pr(A \cup B \cup C)$. We can use Theorem 1.10.1. Independence makes it easy to compute the
probabilities of the various intersections.

\[ \Pr(A \cup B \cup C) = 0.3 + 0.5 + 0.8 - [0.3 \times 0.5 + 0.3 \times 0.8 + 0.5 \times 0.8] + 0.3 \times 0.5 \times 0.8 = 0.93. \]

(b) Once again, use independence to calculate probabilities of intersections.

\[ \Pr(A \cap B^c \cap C^c) + \Pr(A^c \cap B \cap C^c) + \Pr(A^c \cap B^c \cap C) = (0.3)(0.5)(0.2) + (0.7)(0.5)(0.2) + (0.7)(0.5)(0.8) = 0.38. \]
14. Seven games will be required if and only if team A wins exactly three of the first six games. This
probability is $\binom{6}{3} p^3 (1 - p)^3$, following the model calculation in Example 2.2.5.

15. $\Pr(\text{Each box contains one red ball}) = \frac{3!}{3^3} = \frac{2}{9} = \Pr(\text{Each box contains one white ball})$.
So $\Pr(\text{Each box contains both colors}) = \left( \frac{2}{9} \right)^2$.


16. Let $A_i$ be the event that box $i$ has at least three balls. Then

\[ \Pr(A_i) = \sum_{j=3}^{5} \Pr(\text{Box } i \text{ has exactly } j \text{ balls}) = \binom{5}{3} \frac{(n-1)^2}{n^5} + \binom{5}{4} \frac{n-1}{n^5} + \frac{1}{n^5} = p, \text{ say}. \]

Since there are only five balls, it is impossible for two boxes to have at least three balls at the same
time. Therefore, the events $A_i$ are disjoint, and the probability that at least one of the events $A_i$ occurs
is $np$. Hence, the probability that no box contains more than two balls is $1 - np$.


17. $\Pr(U + V = j)$ is as follows, for $j = 0,1,\ldots,18$:

 j    Prob.      j    Prob.
 0    0.01      10    0.09
 1    0.02      11    0.08
 2    0.03      12    0.07
 3    0.04      13    0.06
 4    0.05      14    0.05
 5    0.06      15    0.04
 6    0.07      16    0.03
 7    0.08      17    0.02
 8    0.09      18    0.01
 9    0.10

Thus

\[ \Pr(U + V = W + X) = \sum_{j=0}^{18} \Pr(U + V = j) \Pr(W + X = j) = (0.01)^2 + (0.02)^2 + \cdots + (0.01)^2 = 0.067. \]
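The table and the final sum can be generated by convolving two digit distributions. The sketch assumes, as the table implies, that $U$, $V$, $W$, and $X$ are independent and each uniform on the digits 0 through 9; the function name is illustrative.

```python
from itertools import product

# Assumed distribution of a single random digit.
digit = {d: 0.1 for d in range(10)}


def sum_distribution(dist_a, dist_b):
    """Distribution of the sum of two independent discrete random variables."""
    out = {}
    for (a, pa), (b, pb) in product(dist_a.items(), dist_b.items()):
        out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out


uv = sum_distribution(digit, digit)   # also the distribution of W + X
prob_equal_sums = sum(q * q for q in uv.values())

for j in sorted(uv):
    print(j, round(uv[j], 2))
print(round(prob_equal_sums, 3))
```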
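18. Let $A_i$ denote the event that member $i$ does not serve on any of the three committees ($i = 1,\ldots,8$).
Then

\[ \Pr(A_i) = \frac{\binom{7}{3} \binom{7}{4} \binom{7}{5}}{\binom{8}{3} \binom{8}{4} \binom{8}{5}} = \frac{5}{8} \cdot \frac{4}{8} \cdot \frac{3}{8}, \]

\[ \Pr(A_i \cap A_j) = \frac{\binom{6}{3} \binom{6}{4} \binom{6}{5}}{\binom{8}{3} \binom{8}{4} \binom{8}{5}} = \left( \frac{5}{8} \cdot \frac{4}{7} \right) \left( \frac{4}{8} \cdot \frac{3}{7} \right) \left( \frac{3}{8} \cdot \frac{2}{7} \right) \quad \text{for } i < j, \]

\[ \Pr(A_i \cap A_j \cap A_k) = \frac{\binom{5}{3} \binom{5}{4} \binom{5}{5}}{\binom{8}{3} \binom{8}{4} \binom{8}{5}} = \left( \frac{5}{8} \cdot \frac{4}{7} \cdot \frac{3}{6} \right) \left( \frac{4}{8} \cdot \frac{3}{7} \cdot \frac{2}{6} \right) \left( \frac{3}{8} \cdot \frac{2}{7} \cdot \frac{1}{6} \right) \quad \text{for } i < j < k, \]

\[ \Pr(A_i \cap A_j \cap A_k \cap A_\ell) = 0 \quad \text{for } i < j < k < \ell. \]

Hence, by Theorem 1.10.2,

\[ \Pr\left( \bigcup_{i=1}^{8} A_i \right) = 8 \Pr(A_1) - \binom{8}{2} \Pr(A_1 \cap A_2) + \binom{8}{3} \Pr(A_1 \cap A_2 \cap A_3) = 0.7207. \]

Therefore, the required probability is $1 - 0.7207 = 0.2793$.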
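The inclusion-exclusion sum for the committee problem can be verified with binomial coefficients. The sketch assumes eight members and committees of sizes 3, 4, and 5, which is what the binomial coefficients in Exercise 19 indicate; the names `SIZES`, `N`, and `prob_excluded` are illustrative.

```python
from math import comb

SIZES = (3, 4, 5)   # assumed committee sizes
N = 8               # number of members


def prob_excluded(m):
    """Pr(m specified members are left off all three committees)."""
    prob = 1.0
    for size in SIZES:
        # math.comb returns 0 when fewer than `size` members remain.
        prob *= comb(N - m, size) / comb(N, size)
    return prob


prob_someone_left_out = sum(
    (-1) ** (m + 1) * comb(N, m) * prob_excluded(m) for m in range(1, N + 1)
)
prob_everyone_serves = 1 - prob_someone_left_out

print(round(prob_someone_left_out, 4), round(prob_everyone_serves, 4))
```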


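19. Let $E_i$ be the event that A and B are both selected for committee $i$ ($i = 1,2,3$) and let $\Pr(E_i) = p_i$.
Then

\[ p_1 = \frac{\binom{6}{1}}{\binom{8}{3}} = 0.1071, \qquad p_2 = \frac{\binom{6}{2}}{\binom{8}{4}} = 0.2143, \qquad p_3 = \frac{\binom{6}{3}}{\binom{8}{5}} = 0.3571. \]

Since $E_1$, $E_2$, and $E_3$ are independent, it follows from Theorem 1.10.1 that the required probability is

\[ \Pr(E_1 \cup E_2 \cup E_3) = p_1 + p_2 + p_3 - p_1 p_2 - p_2 p_3 - p_1 p_3 + p_1 p_2 p_3 = 0.5490. \]


20. Let $E$ denote the event that B wins. B will win if A misses on her first turn and B wins on her
first turn, which has probability $\frac{5}{6} \cdot \frac{1}{6}$, or if both players miss on their first turn and B then goes
on to subsequently win, which has probability $\frac{5}{6} \cdot \frac{5}{6} \Pr(E)$. (See Exercise 17, Sec. 2.2.) Hence,

\[ \Pr(E) = \left( \frac{5}{6} \right) \left( \frac{1}{6} \right) + \left( \frac{5}{6} \right) \left( \frac{5}{6} \right) \Pr(E), \]

and $\Pr(E) = \frac{5}{11}$. This problem could also be solved by summing
the infinite series

\[ \left( \frac{5}{6} \right) \left( \frac{1}{6} \right) + \left( \frac{5}{6} \right)^3 \left( \frac{1}{6} \right) + \left( \frac{5}{6} \right)^5 \left( \frac{1}{6} \right) + \cdots. \]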


21. A will win if he wins on his first toss (probability 1/2) or if all three players miss on their first tosses
(probability 1/8) and then A subsequently wins. Hence,

\[ \Pr(\text{A wins}) = \frac{1}{2} + \frac{1}{8} \Pr(\text{A wins}), \]

and $\Pr(\text{A wins}) = 4/7$.

Similarly, B will win if A misses on his first toss and B wins on his first toss, or if all three players miss
on their first tosses and then B subsequently wins. Hence,

\[ \Pr(\text{B wins}) = \frac{1}{4} + \frac{1}{8} \Pr(\text{B wins}), \]

and $\Pr(\text{B wins}) = 2/7$. Thus, $\Pr(\text{C wins}) = 1 - 4/7 - 2/7 = 1/7$.


22. Let $A_j$ denote the outcome of the $j$th roll. Then

\[ \Pr(X = x) = \Pr(A_2 \ne A_1, A_3 \ne A_2, \ldots, A_{x-1} \ne A_{x-2}, A_x = A_{x-1}) \]
\[ = \Pr(A_2 \ne A_1) \Pr(A_3 \ne A_2 \mid A_2 \ne A_1) \cdots \Pr(A_x = A_{x-1} \mid A_{x-1} \ne A_{x-2}, \text{ etc.}) \]
\[ = \underbrace{\left( \frac{5}{6} \right) \left( \frac{5}{6} \right) \cdots \left( \frac{5}{6} \right)}_{x-2 \text{ factors}} \left( \frac{1}{6} \right). \]


23. Let $A$ be the event that the person you meet is a statistician, and let $B$ be the event that he is shy.
Then

\[ \Pr(A \mid B) = \frac{(0.8)(0.1)}{(0.8)(0.1) + (0.15)(0.9)} = 0.372. \]


24. $\Pr(A \mid \text{lemon}) = \frac{(0.05)(0.2)}{(0.05)(0.2) + (0.02)(0.5) + (0.1)(0.3)} = \frac{1}{5}$.


25. (a) $\Pr(\text{Defective} \mid \text{Removed}) = \frac{(0.9)(0.3)}{(0.9)(0.3) + (0.2)(0.7)} = \frac{27}{41} = 0.659$.

(b) $\Pr(\text{Defective} \mid \text{Not Removed}) = \frac{(0.1)(0.3)}{(0.1)(0.3) + (0.8)(0.7)} = \frac{3}{59} = 0.051$.


26. Let $X$ and $Y$ denote the number of tosses required on the first experiment and second experiment,
respectively. Then $X = n$ if and only if the first $n - 1$ tosses of the first experiment are tails and the
$n$th toss is a head, which has probability $1/2^n$. Furthermore, $Y > n$ if and only if the first $n$ tosses of
the second experiment are all tails, which also has probability $1/2^n$.

Hence

\[ \Pr(Y > X) = \sum_{n=1}^{\infty} \Pr(X = n) \Pr(Y > n) = \sum_{n=1}^{\infty} \frac{1}{4^n} = \frac{1}{3}. \]


27. Let $A$ denote the event that the family has at least one boy, and $B$ the event that it has at least one
girl. Then

\[ \Pr(B) = 1 - (1/2)^n, \]
\[ \Pr(A \cap B) = 1 - \Pr(\text{All girls}) - \Pr(\text{All boys}) = 1 - (1/2)^n - (1/2)^n. \]

Hence,

\[ \Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{1 - (1/2)^{n-1}}{1 - (1/2)^n}. \]


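28. (a) Let $X$ denote the number of heads. Then

\[ \Pr(X = n - 1 \mid X \ge n - 2) = \frac{\Pr(X = n - 1)}{\Pr(X \ge n - 2)} = \frac{\binom{n}{n-1} (1/2)^n}{\left[ \binom{n}{n-2} + \binom{n}{n-1} + \binom{n}{n} \right] (1/2)^n} = \frac{n}{\frac{n(n-1)}{2} + n + 1} = \frac{2n}{n^2 + n + 2}. \]

(b) The required probability is the probability of obtaining exactly one head on the last two tosses,
namely 1/2.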


29. (a) Let $X$ denote the number of aces selected.
Then

\[ \Pr(X = i) = \frac{\binom{4}{i} \binom{48}{13 - i}}{\binom{52}{13}}, \quad i = 0,1,2,3,4. \]

\[ \Pr(X \ge 2 \mid X \ge 1) = \frac{1 - \Pr(X = 0) - \Pr(X = 1)}{1 - \Pr(X = 0)} = \frac{1 - 0.3038 - 0.4388}{1 - 0.3038} = 0.3697. \]

(b) Let $A$ denote the event that the ace of hearts and no other aces are obtained, and let $H$ denote
the event that the ace of hearts is obtained.
Then

\[ \Pr(A) = \frac{\binom{48}{12}}{\binom{52}{13}} = 0.1097, \qquad \Pr(H) = \frac{13}{52} = 0.25. \]

The required probability is

\[ \frac{\Pr(H) - \Pr(A)}{\Pr(H)} = \frac{0.25 - 0.1097}{0.25} = 0.5612. \]
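The hypergeometric probabilities in this solution can be recomputed exactly before rounding. This is a minimal sketch for a 13-card hand from a 52-card deck; the function and variable names are illustrative.

```python
from math import comb


def pr_aces(i, hand=13):
    """Probability that a random hand contains exactly i of the four aces."""
    return comb(4, i) * comb(48, hand - i) / comb(52, hand)


# Part (a): at least two aces, given at least one ace.
at_least_two_given_one = (1 - pr_aces(0) - pr_aces(1)) / (1 - pr_aces(0))

# Part (b): at least two aces, given that the ace of hearts is in the hand.
pr_only_ace_of_hearts = comb(48, 12) / comb(52, 13)
pr_ace_of_hearts = 13 / 52
at_least_two_given_heart = (pr_ace_of_hearts - pr_only_ace_of_hearts) / pr_ace_of_hearts

print(at_least_two_given_one, at_least_two_given_heart)
```

Knowing which ace is present raises the conditional probability noticeably, which is the contrast the two parts are meant to show. Small differences in the last digit can arise because the solution works from rounded intermediate values.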


30. The probability that a particular letter, say letter A, will be placed in the correct envelope is $1/n$.
The probability that none of the other $n - 1$ letters will then be placed in the correct envelope is
$q_{n-1} = 1 - p_{n-1}$. Therefore, the probability that only letter A, and no other letter, will be placed in the
correct envelope is $q_{n-1}/n$. It follows that the probability that exactly one of the $n$ letters will be placed
in the correct envelope, without specifying which letter will be correctly placed, is $n q_{n-1}/n = q_{n-1}$.


31. The probability that two specified letters will be placed in the correct envelopes is $1/[n(n-1)]$. The
probability that none of the other $n - 2$ letters will then be placed in the correct envelopes is $q_{n-2}$.
Therefore, the probability that only the two specified letters, and no other letters, will be placed
in the correct envelopes is $\frac{1}{n(n-1)} q_{n-2}$. It follows that the probability that exactly two of the $n$
letters will be placed in the correct envelopes, without specifying which pair will be correctly placed,
is $\binom{n}{2} \frac{1}{n(n-1)} q_{n-2} = \frac{1}{2} q_{n-2}$.


32. The probability that exactly one student will be in class is

\[ \Pr(A) \Pr(B^c) + \Pr(A^c) \Pr(B) = (0.8)(0.4) + (0.2)(0.6) = 0.44. \]

The probability that exactly one student will be in class and that student will be A is

\[ \Pr(A) \Pr(B^c) = 0.32. \]

Hence, the required probability is $\frac{32}{44} = \frac{8}{11}$.


33. By Exercise 3 of Sec. 1.10, the probability that a family subscribes to exactly one of the three newspapers
is 0.45. As can be seen from the solution to that exercise, the probability that a family subscribes only
to newspaper A is 0.35. Hence, the required probability is $35/45 = 7/9$.


34. A more reasonable analysis by prisoner A might proceed as follows: The pair to be executed is equally
likely to be (A, B), (A, C), or (B, C). If it is (A, B) or (A, C), the jailer will surely respond B or C,
respectively. If it is (B, C), the jailer is equally likely to respond B or C. Hence, if the jailer responds
B, the conditional probability that the pair to be executed is (A, B) is

\[ \Pr[(A, B) \mid \text{response}] = \frac{1 \cdot \Pr(A, B)}{1 \cdot \Pr(A, B) + 0 \cdot \Pr(A, C) + \frac{1}{2} \Pr(B, C)} = \frac{1/3}{1/3 + 1/6} = \frac{2}{3}. \]

Thus, the probability that A will be executed is the same 2/3 as it was before he questioned the jailer.
This answer will change if the probability that the jailer will respond B, given (B, C), is assumed to
be some value other than 1/2.


35. The second situation, with stakes of two dollars, is equivalent to the situation in which A and B have
initial fortunes of 25 dollars and bet one dollar on each play. In the notation of Sec. 2.4, we have $i = 50$
and $k = 100$ in the first situation and $i = 25$ and $k = 50$ in the second situation. Hence, if $p = 1/2$,
it follows from Eq. (2.4.6) that gambler A has the same probability 1/2 of ruining gambler B in either
situation. If $p \ne 1/2$, then it follows from Eq. (2.4.9) that the probabilities $\alpha_1$ and $\alpha_2$ of winning in the
two situations equal the values

\[ \alpha_1 = \frac{([1-p]/p)^{50} - 1}{([1-p]/p)^{100} - 1} = \frac{1}{([1-p]/p)^{50} + 1}, \]
\[ \alpha_2 = \frac{([1-p]/p)^{25} - 1}{([1-p]/p)^{50} - 1} = \frac{1}{([1-p]/p)^{25} + 1}. \]

Hence, if $p < 1/2$, then $([1-p]/p) > 1$ and $\alpha_2 > \alpha_1$. If $p > 1/2$, then $([1-p]/p) < 1$ and $\alpha_1 > \alpha_2$.


36. (a) Since each candidate is equally likely to appear at each point in the sequence, the one who happens
to be the best out of the first i has probability r/i of appearing in the first r interviews when
i > r.

(b) If i ≤ r, then A and B_i are disjoint and Pr(A ∩ B_i) = 0 because we cannot hire any of the first r
candidates. So Pr(A|B_i) = Pr(A ∩ B_i)/Pr(B_i) = 0. Next, let i > r and assume that B_i occurs.
Let C_i denote the event that we keep interviewing until we see candidate i. If C_i also occurs, then
we shall rank candidate i higher than any of the ones previously seen and the algorithm tells us
to stop and hire candidate i. In this case A occurs. This means that B_i ∩ C_i ⊂ A. However, if C_i
fails, then we shall hire someone before we get to interview candidate i and A will not occur. This
means that B_i ∩ C_i^c ∩ A = ∅. Since B_i ∩ A = (B_i ∩ C_i ∩ A) ∪ (B_i ∩ C_i^c ∩ A), we have B_i ∩ A = B_i ∩ C_i
and Pr(B_i ∩ A) = Pr(B_i ∩ C_i). So Pr(A|B_i) = Pr(C_i|B_i). Conditional on B_i, C_i occurs if and only
if the best of the first i − 1 candidates appears in the first r positions. The conditional probability
of C_i given B_i is then r/(i − 1).

(c) If we use the value r > 0 to determine our algorithm, then we can compute

\[
p_r = \Pr(A) = \sum_{i=1}^{n} \Pr(B_i)\Pr(A \mid B_i) = \sum_{i=r+1}^{n} \frac{1}{n}\,\frac{r}{i-1} = \frac{r}{n}\sum_{i=r+1}^{n}\frac{1}{i-1}.
\]


For r = 0, if we take r/r = 1, then only the first term in the sum produces a nonzero result and 
po = 1/n. This is indeed the probability that the first candidate will be the best one seen so far 
when the first interview occurs. 


(d) Using the formula for p_r with r > 0, we have

\[
q_r = p_r - p_{r-1} = \frac{1}{n}\left[\sum_{i=r+1}^{n}\frac{1}{i-1} - 1\right],
\]

which clearly decreases as r increases because the terms in the sum are the same for all r, but
there are fewer terms when r is larger. Since all the terms are positive, q_r is strictly decreasing.


(e) Since p_r = q_r + p_{r−1} for r ≥ 1, we have that p_r = p_0 + q_1 + ··· + q_r. If there exists r such that
q_r ≤ 0, then q_j < 0 for all j > r and p_j < p_{j−1} for all j > r. On the other hand, for each r such
that q_r > 0, p_r > p_{r−1}. Hence, we should choose r to be the last value such that q_r > 0.


(f) For n = 10, the first few values of g, are 


 r      1        2        3        4
q_r   0.1829   0.0829   0.0329   −0.0004


So, we should use r = 3. We can then compute p3 = 0.3987. 
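
Parts (c) through (f) can be reproduced with exact rational arithmetic. The sketch below evaluates p_r = (r/n) Σ_{i=r+1}^{n} 1/(i − 1), with p_0 = 1/n, and the differences q_r = p_r − p_{r−1} for n = 10; the function and variable names are chosen here for illustration.

```python
from fractions import Fraction

def success_probability(r, n):
    """p_r: probability of hiring the best of n candidates when the
    first r candidates are interviewed but never hired."""
    if r == 0:
        return Fraction(1, n)
    return Fraction(r, n) * sum(Fraction(1, i - 1) for i in range(r + 1, n + 1))

n = 10
p = [success_probability(r, n) for r in range(n)]      # p_0, ..., p_9
q = [None] + [p[r] - p[r - 1] for r in range(1, n)]    # q_1, ..., q_9

best_r = max(range(n), key=lambda r: p[r])
for r in range(1, 5):
    print(r, round(float(q[r]), 4))
print("best r:", best_r, "p_r:", round(float(p[best_r]), 4))
```

The maximizing r is the last index with q_r > 0, as argued in part (e).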


Chapter 3 


Random Variables and Distributions 


3.1 Random Variables and Discrete Distributions 


Solutions to Exercises 


For x = 0, 1, ..., 10, the probability of obtaining exactly x heads is

\[
\binom{10}{x}\left(\frac{1}{2}\right)^{10}.
\]

For x = 2, 3, 4, 5, the probability of obtaining exactly x red balls is

\[
\binom{7}{x}\binom{3}{5-x} \Big/ \binom{10}{5}.
\]
x 


Each of the 11 integers from 10 to 20 has the same probability of being the value of X. Six of the 11 
integers are even, so the probability that X is even is 6/11. 


The sum of the values of f(x) must be equal to 1. Since \sum_{x=1}^{5} f(x) = 15c, we must have c = 1/15.


By looking over the 36 possible outcomes enumerated in Example 1.6.5, we find that X = 0 for 6
outcomes, X = 1 for 10 outcomes, X = 2 for 8 outcomes, X = 3 for 6 outcomes, X = 4 for 4 outcomes,
and X = 5 for 2 outcomes. Hence, the p.f. f(x) is as follows:


 x      0      1      2      3      4      5
f(x)   3/18   5/18   4/18   3/18   2/18   1/18


The desired probability is the sum of the entries for k = 0, 1, 2, 3, 4, and 5 in that part of the table of
binomial probabilities given in the back of the book corresponding to n = 15 and p = 0.5. The sum is
0.1509.


Suppose that a machine produces a defective item with probability 0.7 and produces a nondefective
item with probability 0.3. If X denotes the number of defective items that are obtained when 8 items
are inspected, then the random variable X will have the binomial distribution with parameters n = 8
and p = 0.7. By the same reasoning, however, if Y denotes the number of nondefective items that are
obtained, then Y will have the binomial distribution with parameters n = 8 and p = 0.3. Furthermore,
Y = 8 − X. Therefore, X ≥ 5 if and only if Y ≤ 3 and it follows that Pr(X ≥ 5) = Pr(Y ≤ 3).
Probabilities for the binomial distribution with n = 8 and p = 0.3 are given in the table in the back of
the book. The value of Pr(Y ≤ 3) will be the sum of the entries for k = 0, 1, 2, and 3.

8. The number of red balls obtained will have the binomial distribution with parameters n = 20 and
p = 0.1. The required probability can be found from the table of binomial probabilities in the back of
the book. Add up the numbers in the n = 20 and p = 0.1 section from k = 4 to k = 20. Or add up the
numbers from k = 0 to k = 3 and subtract the sum from 1. The answer is 0.1330.
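
The table lookups in the three binomial solutions above can also be carried out directly. This is a small sketch using only the standard library; the helper names `binom_pmf` and `binom_cdf` are introduced here for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr(X = k) for the binomial distribution with parameters n and p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """Pr(X <= k) for the binomial distribution with parameters n and p."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))

# n = 15, p = 0.5: sum of the entries for k = 0, ..., 5
print(round(binom_cdf(5, 15, 0.5), 4))

# n = 8: Pr(X >= 5) with p = 0.7 equals Pr(Y <= 3) with p = 0.3
pr_x_at_least_5 = 1 - binom_cdf(4, 8, 0.7)
pr_y_at_most_3 = binom_cdf(3, 8, 0.3)
print(pr_x_at_least_5, pr_y_at_most_3)

# n = 20, p = 0.1: one minus the sum of the entries for k = 0, ..., 3
print(round(1 - binom_cdf(3, 20, 0.1), 4))
```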


9. We need \sum_{x=0}^{\infty} f(x) = 1, which means that c = 1/\sum_{x=0}^{\infty} 2^{-x}. The last sum is known from calculus to
equal 1/(1 − 1/2) = 2, so c = 1/2.


10. (a) The p.f. of X is f(x) = c(x + 1)(8 − x) for x = 0, ..., 7, where c is chosen so that \sum_{x=0}^{7} f(x) = 1. So, c
is one over \sum_{x=0}^{7} (x + 1)(8 − x), which sum equals 120, so c = 1/120. That is, f(x) = (x + 1)(8 − x)/120
for x = 0, ..., 7.


(b) Pr(X ≥ 5) = [(5 + 1)(8 − 5) + (6 + 1)(8 − 6) + (7 + 1)(8 − 7)]/120 = 1/3.
11. In order for the specified function to be a p.f., it must be the case that \sum_{x=1}^{\infty} c/x = 1 or, equivalently,
\sum_{x=1}^{\infty} 1/x = 1/c. But \sum_{x=1}^{\infty} 1/x = ∞, so there cannot be such a constant c.


3.2 Continuous Distributions 


Commentary 


This section ends with a brief discussion of probability distributions that are neither discrete nor continuous. 
Although such distributions have great theoretical interest and occasionally arise in practice, students can 
go a long way without actually concerning themselves about these distributions. 


Solutions to Exercises 
1. We compute Pr(X ≤ 8/27) by integrating the p.d.f. from 0 to 8/27.

\[
\Pr\left(X \le \frac{8}{27}\right) = \int_0^{8/27} \frac{2}{3}x^{-1/3}\,dx = x^{2/3}\Big|_0^{8/27} = \frac{4}{9}.
\]


2. The p.d.f. has the appearance of Fig. S.3.1.


Figure S.3.1: Figure for Exercise 2 of Sec. 3.2.


(a) \Pr(X < 1/2) = \int_0^{1/2} 4(1 - x^3)\,dx/3 = 0.6458.

(b) \Pr(1/4 < X < 3/4) = \int_{1/4}^{3/4} 4(1 - x^3)\,dx/3 = 0.5625.

(c) \Pr(X > 1/3) = \int_{1/3}^{1} 4(1 - x^3)\,dx/3 = 0.5597.


3. The p.d.f. has the appearance of Fig. S.3.2.


Figure S.3.2: Figure for Exercise 3 of Sec. 3.2.


(a) \Pr(X < 0) = \frac{1}{36}\int_{-3}^{0} (9 - x^2)\,dx = 0.5.

(b) \Pr(-1 < X < 1) = \frac{1}{36}\int_{-1}^{1} (9 - x^2)\,dx = 0.4815.

(c) \Pr(X > 2) = \frac{1}{36}\int_{2}^{3} (9 - x^2)\,dx = 0.07407.


The answer in part (a) could also be obtained directly from the fact that the p.d.f. is symmetric about
the point x = 0. Therefore, the probability to the left of x = 0 and the probability to the right of x = 0
must each be equal to 1/2.
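
The integrals in Exercises 2 and 3 involve polynomial p.d.f.'s, so they can be checked exactly. The sketch below assumes f(x) = (4/3)(1 − x^3) on (0, 1) for Exercise 2 and f(x) = (1/36)(9 − x^2) on (−3, 3) for Exercise 3, where the constant 1/36 is inferred from the requirement that the p.d.f. integrate to 1; `integrate_poly` is a helper written for this illustration.

```python
from fractions import Fraction

def integrate_poly(coeffs, a, b):
    """Exactly integrate sum(coeffs[k] * x**k) over the interval [a, b]."""
    def antiderivative(x):
        return sum(Fraction(c) * Fraction(x) ** (k + 1) / (k + 1)
                   for k, c in enumerate(coeffs))
    return antiderivative(b) - antiderivative(a)

# Exercise 2: f(x) = (4/3)(1 - x^3), coefficients in increasing powers of x
f2 = [Fraction(4, 3), 0, 0, Fraction(-4, 3)]
ex2 = (integrate_poly(f2, 0, Fraction(1, 2)),
       integrate_poly(f2, Fraction(1, 4), Fraction(3, 4)),
       integrate_poly(f2, Fraction(1, 3), 1))

# Exercise 3: f(x) = (9 - x^2)/36 (normalizing constant assumed)
f3 = [Fraction(9, 36), 0, Fraction(-1, 36)]
ex3 = (integrate_poly(f3, -3, 0),
       integrate_poly(f3, -1, 1),
       integrate_poly(f3, 2, 3))

print([round(float(v), 4) for v in ex2])
print([round(float(v), 5) for v in ex3])
```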


4. (a) We must have

\[
\int_{-\infty}^{\infty} f(x)\,dx = \int_1^2 c x^2\,dx = \frac{7}{3}c = 1.
\]

Therefore, c = 3/7. This p.d.f. has the appearance of Fig. S.3.3.


Figure S.3.3: Figure for Exercise 4a of Sec. 3.2.

(b) \int_{3/2}^{2} f(x)\,dx = 37/56.


5. (a) \int_0^t \frac{1}{8}x\,dx = 1/4, or t^2/16 = 1/4. Hence, t = 2.


(b) \int_t^4 \frac{1}{8}x\,dx = 1/2, or 1 − t^2/16 = 1/2. Hence, t = \sqrt{8}.

6. The value of X must be between 0 and 4. We will have

\[
Y = \begin{cases}
0 & \text{if } 0 \le X < 1/2, \\
1 & \text{if } 1/2 < X < 3/2, \\
2 & \text{if } 3/2 < X < 5/2, \\
3 & \text{if } 5/2 < X < 7/2, \\
4 & \text{if } 7/2 < X \le 4.
\end{cases}
\]

We need not worry about how to define Y if X ∈ {1/2, 3/2, 5/2, 7/2}, because the probability that X
will be equal to one of these four values is 0. It now follows that

\[
\begin{aligned}
\Pr(Y = 0) &= \int_0^{1/2} f(x)\,dx = \frac{1}{64}, \\
\Pr(Y = 1) &= \int_{1/2}^{3/2} f(x)\,dx = \frac{1}{8}, \\
\Pr(Y = 2) &= \int_{3/2}^{5/2} f(x)\,dx = \frac{1}{4}, \\
\Pr(Y = 3) &= \int_{5/2}^{7/2} f(x)\,dx = \frac{3}{8}, \\
\Pr(Y = 4) &= \int_{7/2}^{4} f(x)\,dx = \frac{15}{64}.
\end{aligned}
\]
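
The p.f. of Y can be verified from the c.d.f. of X. The sketch below assumes the p.d.f. f(x) = x/8 on [0, 4] from Exercise 5, so that F(x) = x^2/16 there; the function name `cdf` and the dictionary `pf_y` are illustrative.

```python
from fractions import Fraction

def cdf(x):
    """C.d.f. of X, assuming the p.d.f. f(x) = x/8 on the interval [0, 4]."""
    x = Fraction(min(max(x, 0), 4))   # clamp to the support of X
    return x * x / 16

half = Fraction(1, 2)
# Y is the integer closest to X, so Pr(Y = y) = F(y + 1/2) - F(y - 1/2)
pf_y = {y: cdf(y + half) - cdf(y - half) for y in range(5)}

for y, prob in pf_y.items():
    print(y, prob)
print("total:", sum(pf_y.values()))
```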


7. Since the uniform distribution extends over an interval of length 10 units, the value of the p.d.f. must
be 1/10 throughout the interval. Hence,

\[
\int_0^7 f(x)\,dx = \frac{7}{10}.
\]


8. (a) We must have

\[
\int_{-\infty}^{\infty} f(x)\,dx = \int_0^{\infty} c\exp(-2x)\,dx = \frac{c}{2} = 1.
\]

Therefore, c = 2. This p.d.f. has the appearance of Fig. S.3.4.


Figure S.3.4: Figure for Exercise 8a of Sec. 3.2.

(b) \int_1^2 f(x)\,dx = \exp(-2) - \exp(-4).


9. Since \int_0^{\infty} 1/(1 + x)\,dx = ∞, there is no constant c such that \int_0^{\infty} f(x)\,dx = 1.

10. (a) We must have

\[
\int_{-\infty}^{\infty} f(x)\,dx = \int_0^1 \frac{c}{(1 - x)^{1/2}}\,dx = 2c = 1.
\]

Therefore, c = 1/2. This p.d.f. has the appearance of Fig. S.3.5.


Figure S.3.5: Figure for Exercise 10a of Sec. 3.2.


It should be noted that although the values of f(x) become arbitrarily large in the neighborhood
of x = 1, the total area under the curve is equal to 1. It is seen, therefore, that the values of a
p.d.f. can be greater than 1 and, in fact, can be arbitrarily large.


(b) \int_0^{1/2} f(x)\,dx = 1 - (1/2)^{1/2}.


11. Since \int_0^1 (1/x)\,dx = ∞, there is no constant c such that \int_0^1 f(x)\,dx = 1.


12. We shall find the c.d.f. of Y and evaluate it at 50. The c.d.f. of a random variable Y is F(y) = Pr(Y ≤ y).
In Fig. 3.1, on page 94 of the text, the event {Y ≤ y} has area (y − 1) × (200 − 4) = 196(y − 1) if
1 ≤ y ≤ 150. We need to divide this by the area of the entire rectangle, 29,204. The c.d.f. of Y is then

\[
F(y) = \begin{cases}
0 & \text{for } y < 1, \\
\dfrac{196(y - 1)}{29204} & \text{for } 1 \le y \le 150, \\
1 & \text{for } y > 150.
\end{cases}
\]

So, in particular, Pr(Y ≤ 50) = 0.3289.


13. We find Pr(X < 20) = \int_0^{20} cx\,dx = 200c. Setting this equal to 0.9 yields c = 0.0045.


3.3. The Cumulative Distribution Function 


Commentary 


This section includes a discussion of quantile functions. These arise repeatedly in the construction of
hypothesis tests and confidence intervals later in the book.


Solutions to Exercises


1. The c.d.f. F(x) of X is 0 for x < 0. It jumps to 0.3 = Pr(X = 0) at x = 0, and it jumps to 1 and stays
there at x = 1. The c.d.f. is sketched in Fig. S.3.6.


Figure S.3.6: C.d.f. of X in Exercise 1 of Sec. 3.3.


2. The c.d.f. must have the appearance of Fig. S.3.7.


Figure S.3.7: C.d.f. for Exercise 2 of Sec. 3.3.


3. Here Pr(X = n) = 1/2^n for n = 1, 2, .... Therefore, the c.d.f. must have the appearance of Fig. S.3.8.


Figure S.3.8: C.d.f. for Exercise 3 of Sec. 3.3.


4. The numbers can be read off of the figure or found by subtracting two numbers off of the figure.
(a) The jump at x = −1 is F(−1) − F(−1^−) = 0.1.
(b) The c.d.f. to the left of x = 0 is F(0^−) = 0.1.
(c) The c.d.f. at x = 0 is F(0) = 0.2.
(d) There is no jump at x = 1, so Pr(X = 1) = 0.
(e) F(3) − F(0) = 0.6.
(f) F(3^−) − F(0) = 0.4.
(g) F(3) − F(0^−) = 0.7.
(h) F(2) − F(1) = 0.
(i) F(2) − F(1^−) = 0.
(j) 1 − F(5) = 0.
(k) 1 − F(5^−) = 0.
(l) F(4) − F(3^−) = 0.2.
5.
\[
f(x) = \frac{dF(x)}{dx} = \begin{cases}
0 & \text{for } x < 0, \\
\frac{2}{9}x & \text{for } 0 < x < 3, \\
0 & \text{for } x > 3.
\end{cases}
\]

The value of f(x) at x = 0 and x = 3 is irrelevant. This p.d.f. has the appearance of Fig. S.3.9.


Figure S.3.9: Figure for Exercise 5 of Sec. 3.3.


6.
\[
f(x) = \frac{dF(x)}{dx} = \begin{cases}
\exp(x - 3) & \text{for } x < 3, \\
0 & \text{for } x > 3.
\end{cases}
\]

The value of f(x) at x = 3 is irrelevant. This p.d.f. has the appearance of Fig. S.3.10.


Figure S.3.10: Figure for Exercise 6 of Sec. 3.3.


It should be noted that although this p.d.f. is positive over the unbounded interval where x < 3, the
total area under the curve is finite and is equal to 1.


7. The c.d.f. equals 0 for x < −2 and it equals 1 for x > 8. For −2 ≤ x ≤ 8, the c.d.f. equals

\[
F(x) = \int_{-2}^{x} \frac{dy}{10} = \frac{x + 2}{10}.
\]

The c.d.f. has the appearance of Fig. S.3.11.


Figure S.3.11: Figure for Exercise 7 of Sec. 3.3.


8. Pr(Z ≤ z) is the probability that Z lies within a circle of radius z centered at the origin. This probability
is

\[
\frac{\text{Area of circle of radius } z}{\text{Area of circle of radius } 1} = z^2, \quad \text{for } 0 \le z \le 1.
\]

The c.d.f. is plotted in Fig. S.3.12.


Figure S.3.12: C.d.f. for Exercise 8 of Sec. 3.3.


9. Pr(Y = 0) = Pr(X ≤ 1) = 1/5 and Pr(Y = 5) = Pr(X ≥ 3) = 2/5. Also, Y is distributed uniformly
between Y = 1 and Y = 3, with a total probability of 2/5. Therefore, over this interval F(y) will be
linear with a total increase of 2/5. The c.d.f. is plotted in Fig. S.3.13.


Figure S.3.13: C.d.f. for Exercise 9 of Sec. 3.3.


10. To find the quantile function F^{-1}(p) when we know the c.d.f., we can set F(x) = p and solve for x.

\[
\frac{x}{1 + x} = p; \quad x = p + px; \quad x(1 - p) = p; \quad x = \frac{p}{1 - p}.
\]

The quantile function is F^{-1}(p) = p/(1 − p).


11. As in Exercise 10, we set F(x) = p and solve for x.

\[
\frac{1}{9}x^2 = p; \quad x^2 = 9p; \quad x = 3p^{1/2}.
\]

The quantile function of X is F^{-1}(p) = 3p^{1/2}.


12. Once again, we set F(x) = p and solve for x.

\[
\exp(x - 3) = p; \quad x - 3 = \log(p); \quad x = 3 + \log(p).
\]

The quantile function of X is F^{-1}(p) = 3 + log(p).


13. VaR at probability level 0.95 is the negative of the 0.05 quantile. Using the result from Example 3.3.8,
the 0.05 quantile of the uniform distribution on the interval [−12, 24] is 0.05 × 24 − 0.95 × 12 = −10.2.
So, VaR at probability level 0.95 is 10.2.


14. Using the table of binomial probabilities in the back of the book, we can compute the c.d.f. F of
the binomial distribution with parameters 10 and 0.2. We then find the first values of x such that
F(x) ≥ 0.25, F(x) ≥ 0.5, and F(x) ≥ 0.75. The first few distinct values of the c.d.f. are


 x      0        1        2        3
F(x)   0.1074   0.3758   0.6778   0.8791


So, the quartiles are 1 and 3, while the median is 2.


15. Since f(x) = 0 for x < 0 and for x > 1, the c.d.f. F(x) will be flat (0) for x < 0 and flat (1) for x > 1.
Between 0 and 1, we compute F(x) by integrating the p.d.f. For 0 < x < 1,

\[
F(x) = \int_0^x 2y\,dy = x^2.
\]

The requested plot is identical to Fig. S.3.12 for Exercise 8 in this section.


16. For each 0 < p < 1, we solve for x in the equation F(x) = p, with F specified in (3.3.2):

\[
p = 1 - \frac{1}{1 + x}; \quad \frac{1}{1 - p} = 1 + x; \quad x = \frac{1}{1 - p} - 1.
\]

The quantile function is F^{-1}(p) = 1/(1 − p) − 1 for 0 < p < 1.
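
The quartiles and median of the binomial distribution with parameters 10 and 0.2, found above from the table, can also be obtained by applying the definition of the quantile function as the smallest x with F(x) ≥ p. The sketch below uses only the standard library; the function names are chosen for illustration.

```python
from math import comb

def binom_cdf(x, n, p):
    """F(x) = Pr(X <= x) for the binomial distribution with parameters n and p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def binom_quantile(prob, n, p):
    """Smallest integer x such that F(x) >= prob."""
    for x in range(n + 1):
        if binom_cdf(x, n, p) >= prob:
            return x
    return n  # guards against rounding error when prob is very close to 1

print([round(binom_cdf(x, 10, 0.2), 4) for x in range(4)])
print([binom_quantile(q, 10, 0.2) for q in (0.25, 0.5, 0.75)])
```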


17. (a) Let 0 < p_1 < p_2 < 1. Define A_i = {x : F(x) ≥ p_i} for i = 1, 2. Since p_1 < p_2 and F is
nondecreasing, it follows that A_2 ⊂ A_1. Hence, the smallest number in A_1 (which equals F^{-1}(p_1)
by definition) is no greater than the smallest number in A_2 (which equals F^{-1}(p_2) by definition).
That is, F^{-1}(p_1) ≤ F^{-1}(p_2), and the quantile function is nondecreasing.

(b) Let x_0 = \lim_{p \to 0,\, p > 0} F^{-1}(p). We are asked to prove that x_0 is the greatest lower bound of the set
C = {c : F(c) > 0}. First, we show that no x > x_0 is a lower bound on C. Let x > x_0 and
x_1 = (x + x_0)/2. Then x_0 < x_1 < x. Because F^{-1}(p) is nondecreasing, it follows that there
exists p > 0 such that F^{-1}(p) < x_1, which in turn implies that p ≤ F(x_1), and F(x_1) > 0.
Hence x_1 ∈ C, and x is not a lower bound on C. Next, we prove that x_0 is a lower bound on C.
Let x ∈ C. We need only prove that x_0 ≤ x. Because F^{-1}(p) is nondecreasing, we must have
\lim_{p \to 0,\, p > 0} F^{-1}(p) ≤ F^{-1}(q) for all q > 0. Hence, x_0 ≤ F^{-1}(q) for all q > 0. Because x ∈ C, we have
F(x) > 0. Let q = F(x) so that q > 0. Then x_0 ≤ F^{-1}(q) ≤ x. The proof that x_1 is the least
upper bound on the set of all d such that F(d) < 1 is very similar.


(c) Let 0 < p < 1. Because F^{-1} is nondecreasing, F^{-1}(p^−) is the least upper bound on the set
C = {F^{-1}(q) : q < p}. We need to show that F^{-1}(p) is also that least upper bound. Clearly,
F^{-1}(p) is an upper bound, because F^{-1} is nondecreasing and p > q for all q < p. To see that
F^{-1}(p) is the least upper bound, let y be an upper bound. We need to show F^{-1}(p) ≤ y. By
definition, F^{-1}(p) is the greatest lower bound on the set D = {x : F(x) ≥ p}. Because y is an
upper bound on C, it follows that F^{-1}(q) ≤ y for all q < p. Hence, F(y) ≥ q for all q < p. Because
F is nondecreasing, we have F(y) ≥ p, hence y ∈ D, and F^{-1}(p) ≤ y.


18. We know that Pr(X = c) = F(c) − F(c^−). We will prove that p_1 = F(c) and p_0 = F(c^−). For each
p ∈ (0, 1) define
C_p = {x : F(x) ≥ p}.

Condition (i) says that, for every p ∈ (p_0, p_1), c is the greatest lower bound on the set C_p. Hence
F(c) ≥ p for all p < p_1 and F(c) ≥ p_1. If F(c) > p_1, then for p = (p_1 + F(c))/2, F^{-1}(p) ≤ c, and
condition (iii) rules this out. So F(c) = p_1. The rest of the proof is broken into two cases. First, if
p_0 = 0, then for every ε > 0, c is the greatest lower bound on the set C_ε. This means that F(x) < ε for
all x < c. Since this is true for all ε > 0, F(x) = 0 for all x < c, and F(c^−) = 0 = p_0. For the second
case, assume p_0 > 0. Condition (ii) says F^{-1}(p_0) < c. Since F^{-1}(p_0) is the greatest lower bound on
the set C_{p_0}, we have F(x) < p_0 for all x < c. Hence, p_0 ≥ F(c^−). Also, for all p < p_0, p ≤ F(c^−), hence
p_0 ≤ F(c^−). Together, the last two inequalities imply p_0 = F(c^−).


19. First, we show that F^{-1}(F(x)) ≤ x. By definition F^{-1}(F(x)) is the smallest y such that F(y) ≥ F(x).
Clearly F(x) ≥ F(x), hence F^{-1}(F(x)) ≤ x. Next, we show that, if p > F(x), then F^{-1}(p) > x. Let
p > F(x). By Exercise 17, we know that F^{-1}(p) ≥ x. By definition, F^{-1}(p) is the greatest lower bound
on the set C_p = {y : F(y) ≥ p}. All y ∈ C_p satisfy F(y) > (p + F(x))/2. Since F is continuous from
the right, F(F^{-1}(p)) ≥ (p + F(x))/2. But F(x) < (p + F(x))/2, so x ≠ F^{-1}(p), hence F^{-1}(p) > x.


20. Figure S.3.14 has the plotted c.d.f., which equals 0.0045x^2/2 for 0 < x < 20. On the plot, we see that
F(10) = 0.225.


Figure S.3.14: C.d.f. for Exercise 20 of Sec. 3.3.


3.4 Bivariate Distributions


Commentary


The bivariate distribution function is mentioned at the end of this section. The only part of this discussion
that is used later in the text is the fact that the joint p.d.f. is the second mixed partial derivative of the
bivariate c.d.f. (in the discussion of functions of two or more random variables in Sec. 3.9). If an instructor
prefers not to discuss how to calculate probabilities of rectangles and is not going to cover functions of two
or more random variables, there will be no loss of continuity.


Solutions to Exercises 
1. (a) Let the constant value of the p.d.f. on the rectangle be c. The area of the rectangle is 2. So, the 
integral of the p.d.f. is 2c = 1, hence c = 1/2. 


(b) Pr(X > Y) is the integral of the p.d.f. over that part of the rectangle where x > y. This region is
shaded in Fig. S.3.15. The region is a trapezoid with area 1 × (1 + 2)/2 = 1.5. The integral of the
constant 1/2 over this region is then 0.75 = Pr(X > Y).


Figure S.3.15: Region where x > y in Exercise 1b of Sec. 3.4.


2. The answers are found by summing the following entries in the table:


(a) The entries in the third row of the table: 0.27.

(b) The last three columns of the table: 0.53.

(c) The nine entries in the upper left corner of the table: 0.69.

(d) (0, 0), (1, 1), (2, 2), and (3, 3): 0.3.

(e) (1, 0), (2, 0), (3, 0), (2, 1), (3, 1), (3, 2): 0.25.


3. (a) If we sum f(x,y) over the 25 possible pairs of values (x,y), we obtain 40c. Since this sum must
be equal to 1, it follows that c = 1/40.


(b) f(0, −2) = 2/40 = 1/20.

(c) Pr(X = 1) = \sum_{y=-2}^{2} f(1, y) = 7/40.

(d) The answer is found by summing f(x,y) over the following pairs: (−2,−2), (−2,−1), (−1,−2),
(−1,−1), (−1,0), (0,−1), (0,0), (0,1), (1,0), (1,1), (1,2), (2,1), and (2,2). The sum is 0.7.
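
The sums in this exercise can be checked by enumerating the 25 pairs. The sketch below assumes the joint p.f. has the form f(x, y) = c|x + y| for x and y in {−2, −1, 0, 1, 2}, which is consistent with the totals quoted above; this form and the variable names are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import product

values = range(-2, 3)
# Unnormalized weights |x + y| (assumed form of the joint p.f.)
weights = {(x, y): abs(x + y) for x, y in product(values, repeat=2)}

c = Fraction(1, sum(weights.values()))            # normalizing constant
f = {pair: c * w for pair, w in weights.items()}  # joint p.f.

pr_x_equals_1 = sum(f[(1, y)] for y in values)
pr_close = sum(prob for (x, y), prob in f.items() if abs(x - y) <= 1)

print(c, f[(0, -2)], pr_x_equals_1, pr_close)
```

The pairs with |x − y| ≤ 1 are exactly the 13 pairs listed in part (d).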


4. (a) \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,dx\,dy = \int_0^1\int_0^2 cy^2\,dx\,dy = 2c/3. Since the value of this integral must be 1, it
follows that c = 3/2.

(b) The region over which to integrate is shaded in Fig. S.3.16.


Figure S.3.16: Region of integration for Exercise 4b of Sec. 3.4.


\[
\Pr(X + Y > 2) = \iint_{\text{shaded region}} f(x,y)\,dx\,dy = \int_1^2\int_{2-x}^{1} \frac{3}{2}y^2\,dy\,dx = \frac{3}{8}.
\]

(c) The region over which to integrate is shaded in Fig. S.3.17.


Figure S.3.17: Region of integration for Exercise 4c of Sec. 3.4.


\[
\Pr\left(Y < \frac{1}{2}\right) = \int_0^2\int_0^{1/2} \frac{3}{2}y^2\,dy\,dx = \frac{1}{8}.
\]

(d) The region over which to integrate is shaded in Fig. S.3.18.


Figure S.3.18: Region of integration for Exercise 4d of Sec. 3.4.


\[
\Pr(X \le 1) = \int_0^1\int_0^1 \frac{3}{2}y^2\,dy\,dx = \frac{1}{2}.
\]

(e) The probability that (X,Y) will lie on the line x = 3y is 0 for every continuous joint distribution.


5. (a) By sketching the curve y = 1 − x^2, we find that y ≤ 1 − x^2 for all points on or below this curve.
Also, y ≥ 0 for all points on or above the x-axis. Therefore, 0 ≤ y ≤ 1 − x^2 only for points in the
shaded region in Fig. S.3.19.


Figure S.3.19: Figure for Exercise 5a of Sec. 3.4.


Hence,

\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,dx\,dy = \int_{-1}^{1}\int_0^{1-x^2} c(x^2 + y)\,dy\,dx = \frac{4}{5}c.
\]

Therefore, c = 5/4.

(b) Integration is done over the shaded region in Fig. S.3.20.

\[
\Pr\left(0 \le X \le \frac{1}{2}\right) = \iint_{\text{shaded region}} f(x,y)\,dx\,dy = \int_0^{1/2}\int_0^{1-x^2} \frac{5}{4}(x^2 + y)\,dy\,dx = \frac{79}{256}.
\]


Figure S.3.20: Region of integration for Exercise 5b of Sec. 3.4.


(c) The region over which to integrate is shaded in Fig. S.3.21.


Figure S.3.21: Region of integration for Exercise 5c of Sec. 3.4.


\[
\begin{aligned}
\Pr(Y \le X + 1) &= \iint_{\text{shaded region}} f(x,y)\,dx\,dy = 1 - \iint_{\text{unshaded region}} f(x,y)\,dx\,dy \\
&= 1 - \int_{-1}^{0}\int_{x+1}^{1-x^2} \frac{5}{4}(x^2 + y)\,dy\,dx = \frac{13}{16}.
\end{aligned}
\]

(d) The probability that (X,Y) will lie on the curve y = x^2 is 0 for every continuous joint distribution.
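
The double integrals in this exercise can be checked numerically. The sketch below uses a simple midpoint rule over a region of the form a < x < b, y_lo(x) < y < y_hi(x); the integrator and its parameter names are written for this illustration and are not a library routine.

```python
def integrate_region(g, x_lo, x_hi, y_lo, y_hi, n=400):
    """Midpoint-rule approximation of the integral of g(x, y) over
    {(x, y): x_lo < x < x_hi, y_lo(x) < y < y_hi(x)}."""
    total = 0.0
    dx = (x_hi - x_lo) / n
    for i in range(n):
        x = x_lo + (i + 0.5) * dx
        lo, hi = y_lo(x), y_hi(x)
        dy = (hi - lo) / n
        for j in range(n):
            y = lo + (j + 0.5) * dy
            total += g(x, y) * dx * dy
    return total

def kernel(x, y):
    return x * x + y          # f(x, y) / c on the region 0 <= y <= 1 - x^2

def parabola(x):
    return 1.0 - x * x

c = 1.0 / integrate_region(kernel, -1.0, 1.0, lambda x: 0.0, parabola)
part_b = c * integrate_region(kernel, 0.0, 0.5, lambda x: 0.0, parabola)
part_c = 1.0 - c * integrate_region(kernel, -1.0, 0.0, lambda x: x + 1.0, parabola)

print(round(c, 3), round(part_b, 4), round(part_c, 4))
```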


6. (a) The region S is the shaded region in Fig. S.3.22. Since the area of S is 2, and the joint p.d.f. is to
be constant over S, then the value of the constant must be 1/2.


Figure S.3.22: Figure for Exercise 6a of Sec. 3.4.


(b) The probability that (X,Y) will belong to any subset S_0 is proportional to the area of that subset.
Therefore,

\[
\Pr[(X,Y) \in S_0] = \iint_{S_0} \frac{1}{2}\,dx\,dy = \frac{1}{2}(\text{area of } S_0) = \frac{\alpha}{2}.
\]


7. (a) Pr(X ≤ 1/4) will be equal to the sum of the probabilities of the corners (0, 0) and (0, 1) and
the probability that the point is an interior point of the square and lies in the shaded region in
Fig. S.3.23. The probability that the point will be an interior point of the square rather than one
of the four corners is 1 − (0.1 + 0.2 + 0.4 + 0.1) = 0.2. The probability that it will lie in the shaded
region, given that it is an interior point, is 1/4. Therefore,

\[
\Pr\left(X \le \frac{1}{4}\right) = 0.1 + 0.4 + (0.2)\left(\frac{1}{4}\right) = 0.55.
\]


Figure S.3.23: Figure for Exercise 7a of Sec. 3.4.


(b) The region over which to integrate is shaded in Fig. S.3.24.


Figure S.3.24: Figure for Exercise 7b of Sec. 3.4.


\[
\Pr(X + Y \le 1) = 0.1 + 0.2 + 0.4 + (0.2)\left(\frac{1}{2}\right) = 0.8.
\]


8. (a) Since the joint c.d.f. is continuous and is twice-differentiable in the given rectangle, the joint
distribution of X and Y is continuous. Therefore,

\[
\begin{aligned}
\Pr(1 \le X \le 2 \text{ and } 1 \le Y \le 2) &= \Pr(1 < X \le 2 \text{ and } 1 < Y \le 2) \\
&= F(2,2) - F(1,2) - F(2,1) + F(1,1) \\
&= \frac{24}{156} - \frac{6}{156} - \frac{10}{156} + \frac{2}{156} = \frac{5}{78}.
\end{aligned}
\]

(b)
\[
\begin{aligned}
\Pr(2 \le X \le 4 \text{ and } 2 \le Y \le 4) &= \Pr(2 < X \le 3 \text{ and } 2 < Y \le 4) \\
&= F(3,4) - F(2,4) - F(3,2) + F(2,2) \\
&= 1 - \frac{64}{156} - \frac{66}{156} + \frac{24}{156} = \frac{25}{78}.
\end{aligned}
\]

(c) Since y must lie in the interval 0 ≤ y ≤ 4, F_2(y) = 0 for y < 0 and F_2(y) = 1 for y > 4. For
0 ≤ y ≤ 4,

\[
F_2(y) = \lim_{x \to \infty} F(x,y) = \lim_{x \to 3} \frac{1}{156}xy(x^2 + y) = \frac{1}{52}y(9 + y).
\]

(d) We have f(x,y) = 0 unless 0 ≤ x ≤ 3 and 0 ≤ y ≤ 4. In this rectangle we have

\[
f(x,y) = \frac{\partial^2 F(x,y)}{\partial x\,\partial y} = \frac{1}{156}(3x^2 + 2y).
\]

(e) The region over which to integrate is shaded in Fig. S.3.25.

\[
\Pr(Y \le X) = \iint_{\text{shaded region}} f(x,y)\,dx\,dy = \int_0^3\int_0^x \frac{1}{156}(3x^2 + 2y)\,dy\,dx = \frac{93}{208}.
\]


Figure S.3.25: Figure for Exercise 8e of Sec. 3.4.


9. The joint p.d.f. of water demand X and electricity demand Y is in (3.4.2), and is repeated here:

\[
f(x,y) = \begin{cases}
1/29204 & \text{if } 4 \le x \le 200 \text{ and } 1 \le y \le 150, \\
0 & \text{otherwise.}
\end{cases}
\]

We need to integrate this function over the set where x > y. That region can be written as {(x,y) :
4 < x < 200, 1 < y < min{x, 150}}. The reason for the complicated upper limit on y is that we require
both y < x and y < 150.

\[
\begin{aligned}
\int_4^{200}\int_1^{\min\{x,150\}} \frac{1}{29204}\,dy\,dx
&= \int_4^{200} \frac{\min\{x - 1, 149\}}{29204}\,dx \\
&= \int_4^{150} \frac{x - 1}{29204}\,dx + \int_{150}^{200} \frac{149}{29204}\,dx \\
&= \frac{(x - 1)^2}{2 \times 29204}\bigg|_{x=4}^{150} + \frac{50 \times 149}{29204} \\
&= \frac{149^2 - 3^2}{58408} + \frac{7450}{29204} = 0.63505.
\end{aligned}
\]
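
The arithmetic in Exercise 9 can be reproduced exactly by splitting the region where x > y into the same two pieces used in the integral. The variable names below are chosen for illustration.

```python
from fractions import Fraction

# Area of the full rectangle 4 <= x <= 200, 1 <= y <= 150
area = (200 - 4) * (150 - 1)

# Piece 1: 4 < x < 150 and 1 < y < x, i.e. the integral of (x - 1) dx from 4 to 150
piece1 = Fraction((150 - 1) ** 2 - (4 - 1) ** 2, 2)

# Piece 2: 150 < x < 200 and 1 < y < 150, a 50-by-149 rectangle
piece2 = (200 - 150) * (150 - 1)

prob = (piece1 + piece2) / area
print(area, prob, round(float(prob), 5))
```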
10. (a) The sum of f(x,y) over all x for each fixed y is

\[
\exp(-3y)\sum_{x=0}^{\infty} \frac{(2y)^x}{x!} = \exp(-3y)\exp(2y) = \exp(-y),
\]

where the first equality follows from the power series expansion of exp(2y). The integral of the
resulting sum is easily calculated to be 1.

(b) We can compute Pr(X = 0) by integrating f(0, y) over all y:

\[
\Pr(X = 0) = \int_0^{\infty} \frac{(2y)^0}{0!}\exp(-3y)\,dy = \frac{1}{3}.
\]


11. Let f(x,y) stand for the joint p.f. in Table 3.3 in the text for x = 0, 1 and y = 1, 2, 3, 4.
(a) We are asked for the probability for the set {Y ∈ {2, 3}} ∩ {X = 1}, which is f(1,2) + f(1,3) =
0.166 + 0.107 = 0.273.
(b) This time, we want Pr(X = 0) = f(0,1) + f(0,2) + f(0,3) + f(0,4) = 0.513.


3.5 Marginal Distributions 


Commentary 


Students can get confused when solving problems like Exercises 7 and 8 in this section. They notice that the
functional form of f(x,y) factors into g_1(x)g_2(y) for those (x,y) pairs such that f(x,y) > 0, but they don't
understand that the factorization needs to hold even for those (x,y) pairs such that f(x,y) = 0. When the
two marginal p.d.f.'s are both strictly positive on intervals, then the set of (x,y) pairs where f_1(x)f_2(y) > 0
must be a rectangle (with sides parallel to the coordinate axes), even if the rectangle is infinite in one or
more directions. Hence, it is a necessary condition for independence that the set of (x,y) pairs such that
f(x,y) > 0 be a rectangle with sides parallel to the coordinate axes. Of course, it is also necessary that
f(x,y) = g_1(x)g_2(y) for those (x,y) such that f(x,y) > 0. The two necessary conditions together are
sufficient to ensure independence, but neither is sufficient alone. See the solution to Exercise 8 below for an
illustration of this point.


Solutions to Exercises 


1. The joint p.d.f. is constant over a rectangle with sides parallel to the coordinate axes. So, for each x, 
the integral over y will equal the constant times the length of the interval of y values, namely d — c. 
Similarly, for each y, the integral over x will equal the constant times the length of the interval of 
x values, namely b — a. Of course the constant k must equal one over the area of the rectangle. So 
k = 1/[(b—a)(d —c)]. So the marginal p.d.f.’s of X and Y are 


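\[
f_1(x) = \begin{cases}
\dfrac{1}{b - a} & \text{for } a \le x \le b, \\
0 & \text{otherwise,}
\end{cases}
\qquad
f_2(y) = \begin{cases}
\dfrac{1}{d - c} & \text{for } c \le y \le d, \\
0 & \text{otherwise.}
\end{cases}
\]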


2. (a) For x = 0, 1, 2, we have

\[
f_1(x) = \sum_{y=0}^{3} f(x,y) = \frac{1}{30}(4x + 6) = \frac{1}{15}(2x + 3).
\]

Similarly, for y = 0, 1, 2, 3, we have

\[
f_2(y) = \sum_{x=0}^{2} f(x,y) = \frac{1}{30}(3 + 3y) = \frac{1}{10}(1 + y).
\]

(b) X and Y are not independent because it is not true that f(x,y) = f_1(x)f_2(y) for all possible
values of x and y.


3. (a) For 0 ≤ x ≤ 2, we have

\[
f_1(x) = \int_0^1 f(x,y)\,dy = \frac{1}{2}.
\]

Also, f_1(x) = 0 for x outside the interval 0 ≤ x ≤ 2. Similarly, for 0 ≤ y ≤ 1,

\[
f_2(y) = \int_0^2 f(x,y)\,dx = 3y^2.
\]

Also, f_2(y) = 0 for y outside the interval 0 ≤ y ≤ 1.
(b) X and Y are independent because f(x,y) = f_1(x)f_2(y) for −∞ < x < ∞ and −∞ < y < ∞.
(c) We have

\[
\begin{aligned}
\Pr\left(X < 1 \text{ and } Y \ge \frac{1}{2}\right) &= \int_0^1\int_{1/2}^{1} f(x,y)\,dy\,dx \\
&= \int_0^1\int_{1/2}^{1} f_1(x)f_2(y)\,dy\,dx \\
&= \int_0^1 f_1(x)\,dx \int_{1/2}^{1} f_2(y)\,dy = \Pr(X < 1)\Pr\left(Y \ge \frac{1}{2}\right).
\end{aligned}
\]

Therefore, by the definition of the independence of two events (Definition 2.2.1), the two given
events are independent.

We can also reach this answer, without carrying out the above calculation, by reasoning as follows:
Since the random variables X and Y are independent, and since the occurrence or nonoccurrence
of the event {X < 1} depends on the value of X only while the occurrence or nonoccurrence of
the event {Y ≥ 1/2} depends on the value of Y only, it follows that these two events must be
independent.


4. (a) The region where f(x,y) is non-zero is the shaded region in Fig. S.3.26. It can be seen that the
possible values of X are confined to the interval −1 ≤ X ≤ 1. Hence, f_1(x) = 0 for values of x
outside this interval. For −1 ≤ x ≤ 1, we have

\[
f_1(x) = \int_0^{1-x^2} f(x,y)\,dy = \frac{15}{4}x^2(1 - x^2).
\]


Figure S.3.26: Figure for Exercise 4a of Sec. 3.5.


Similarly, it can be seen from the sketch that the possible values of Y are confined to the interval
0 ≤ Y ≤ 1. Hence, f_2(y) = 0 for values of y outside this interval. For 0 ≤ y ≤ 1, we have

\[
f_2(y) = \int_{-(1-y)^{1/2}}^{(1-y)^{1/2}} f(x,y)\,dx = \frac{5}{2}(1 - y)^{3/2}.
\]

(b) X and Y are not independent because f(x,y) ≠ f_1(x)f_2(y).


5. (a) Since X and Y are independent,

\[
f(x,y) = \Pr(X = x \text{ and } Y = y) = \Pr(X = x)\Pr(Y = y) = p_x p_y.
\]

(b) \Pr(X = Y) = \sum_{x=0}^{3} f(x,x) = \sum_{x=0}^{3} p_x^2 = 0.3.


(c) Pr(X > Y) = f(1,0) + f(2,0) + f(3,0) + f(2,1) + f(3,1) + f(3,2) = 0.35.


6. (a) Since X and Y are independent,

\[
f(x,y) = f_1(x)f_2(y) = g(x)g(y) = \begin{cases}
\dfrac{9}{64}x^2y^2 & \text{for } 0 \le x \le 2,\ 0 \le y \le 2, \\
0 & \text{otherwise.}
\end{cases}
\]

(b) Since X and Y have a continuous joint distribution, Pr(X = Y) = 0.
(c) Since X and Y are independent random variables with the same probability distribution, it must
be true that Pr(X > Y) = Pr(Y > X). Since Pr(X = Y) = 0, it therefore follows that Pr(X >
Y) = 1/2.
(d) \Pr(X + Y \le 1) = \Pr(\text{shaded region in Fig. S.3.27}) = \int_0^1\int_0^{1-y} f(x,y)\,dx\,dy = \frac{1}{1280}.


Figure S.3.27: Figure for Exercise 6d of Sec. 3.5.


7. Since f(x,y) = 0 outside a rectangle and f(x,y) can be factored as in Eq. (3.5.7) inside the rectangle
(use h_1(x) = 2x and h_2(y) = exp(−y)), it follows that X and Y are independent.


8. Although f(x,y) can be factored as in Eq. (3.5.7) inside the triangle where f(x,y) > 0, the fact that
f(x,y) > 0 inside a triangle, rather than a rectangle, implies that X and Y cannot be independent.
(Note that y ≥ 0 should have appeared as part of the condition for f(x,y) > 0 in the statement of
the exercise.) For example, to factor f(x,y) as in Eq. (3.5.7) we write f(x,y) = g_1(x)g_2(y). Since
f(1/3, 1/4) = 2 and f(1/6, 3/4) = 3, it must be that g_1(1/3) > 0 and g_2(3/4) > 0. However, since
f(1/3, 3/4) = 0, it must be that either g_1(1/3) = 0 or g_2(3/4) = 0. These facts contradict each other,
hence f cannot have a factorization as in (3.5.7).


9. (a) Since f(x,y) is constant over the rectangle S and the area of S is 6 units, it follows that f(x,y) =
1/6 inside S and f(x,y) = 0 outside S. Next, for 0 ≤ x ≤ 2,

\[
f_1(x) = \int_{-\infty}^{\infty} f(x,y)\,dy = \int_1^4 \frac{1}{6}\,dy = \frac{1}{2}.
\]

Also, f_1(x) = 0 otherwise. Similarly, for 1 ≤ y ≤ 4,

\[
f_2(y) = \int_0^2 \frac{1}{6}\,dx = \frac{1}{3}.
\]

Also, f_2(y) = 0 otherwise. Thus, the marginal distributions of both X and Y are uniform
distributions.
(b) Since f(x,y) = f_1(x)f_2(y) for all values of x and y, it follows that X and Y are independent.


10. (a) f(x,y) is constant over the circle S in Fig. $.3.28. The area of S is 7 units, and it follows that 
f(x,y) = 1/a inside S and f(x,y) = 0 outside S. Next, the possible values of x range from —1 to 
1. For any value of x in this interval, f(x,y) > 0 only for values of y between —(1 — x?)!/? and 
(1—«?)'/?. Hence, for -1< 2 <1, 


(1—2?)1/2 il 2 
- Spat wee 
Ale) = faye gh) 


Also, fi(x) = 0 otherwise. By symmetry, the random variable Y will have the same marginal 
p.d.f. as X. 
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a, 


12. 
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x? + y2=1 


= 
Figure $.3.28: Figure for Exercise 10 of Sec. 3.5. 


(b) Since f(x,y) 4 fi(x) fo(y), X and Y are not independent. 
The conclusions found in this exercise in which X and Y have a uniform distribution over a circle 
should be contrasted with the conclusions found in Exercise 9, in which X and Y had a uniform 
distribution over a rectangle with sides parallel to the axes. 


11. Let X and Y denote the arrival times of the two persons, measured in terms of the number of minutes 
after 5 P.M. Then X and Y each have the uniform distribution on the interval (0, 60) and they are 
independent. Therefore, the joint p.d.f. of X and Y is 

$$f(x,y) = \begin{cases} \dfrac{1}{3600} & \text{for } 0 < x < 60,\ 0 < y < 60, \\ 0 & \text{otherwise.} \end{cases}$$

We must calculate Pr(|X − Y| < 10), which is equal to the probability that the point (X,Y) lies in the 
shaded region in Fig. S.3.29. Since the joint p.d.f. of X and Y is constant over the entire square, this 
probability is equal to (area of shaded region)/3600. The two unshaded corner triangles each have legs 
of length 50, so the area of the shaded region is $3600 - 50^2 = 1100$. Therefore, 
the required probability is 1100/3600 = 11/36. 

Figure S.3.29: Figure for Exercise 11 of Sec. 3.5. 
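
The area argument for Exercise 11 of Sec. 3.5 can be checked with exact arithmetic. The following is a minimal sketch, not part of the original solution; the function name `meeting_probability` and its parameters are illustrative assumptions.

```python
from fractions import Fraction


def meeting_probability(window: int, tolerance: int) -> Fraction:
    """Pr(|X - Y| < tolerance) for independent X, Y uniform on (0, window)."""
    square_area = Fraction(window * window)
    # The two unshaded corner triangles together form a square
    # with side (window - tolerance).
    unshaded_area = Fraction((window - tolerance) ** 2)
    return (square_area - unshaded_area) / square_area


print(meeting_probability(60, 10))  # 11/36
```

Using `Fraction` keeps the result in lowest terms, so it can be compared directly with the value 11/36 obtained above.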


12. Let the rectangular region be $R = \{(x,y) : x_0 < x < x_1,\ y_0 < y < y_1\}$ with $x_0$ and/or $y_0$ possibly $-\infty$ 
and $x_1$ and/or $y_1$ possibly $\infty$. For the “if” direction, assume that $f(x,y) = h_1(x) h_2(y)$ for all (x,y) 
that satisfy f(x,y) > 0. Then define 

$$h_1^*(x) = \begin{cases} h_1(x) & \text{if } x_0 < x < x_1, \\ 0 & \text{otherwise,} \end{cases} \qquad h_2^*(y) = \begin{cases} h_2(y) & \text{if } y_0 < y < y_1, \\ 0 & \text{otherwise.} \end{cases}$$

Then $h_1^*(x) h_2^*(y) = h_1(x) h_2(y) = f(x,y)$ for all $(x,y) \in R$ and $h_1^*(x) h_2^*(y) = 0 = f(x,y)$ for all 
$(x,y) \notin R$. Hence $f(x,y) = h_1^*(x) h_2^*(y)$ for all (x,y), and X and Y are independent. 


For the “only if” direction, assume that X and Y are independent. According to Theorem 3.5.5, 
$f(x,y) = h_1(x) h_2(y)$ for all (x,y). Then $f(x,y) = h_1(x) h_2(y)$ for all $(x,y) \in R$. 


13. Since f(x,y) = f(y,x) for all (x,y), it follows that the marginal p.d.f.’s will be the same. Each of those 
marginals will equal the integral of f(x,y) over the other variable. For example, to find $f_1(x)$, note 
that for each x, the values of y such that f(x,y) > 0 form the interval $[-\sqrt{1 - x^2}, \sqrt{1 - x^2}]$. Then, for 
$-1 \le x \le 1$, 

$$f_1(x) = \int f(x,y)\,dy = \int_{-\sqrt{1-x^2}}^{\sqrt{1-x^2}} k x^2 y^2\,dy = k x^2 \left.\frac{y^3}{3}\right|_{y=-\sqrt{1-x^2}}^{\sqrt{1-x^2}} = \frac{2k x^2 (1 - x^2)^{3/2}}{3}.$$


14. The set in Fig. 3.12 is not rectangular, so X and Y are not independent. 


15. (a) Figure S.3.30 shows the region where f(x,y) > 0 as the union of two shaded rectangles. Although 
the region is not a rectangle, it is a product set. That is, it has the form $\{(x,y) : x \in A,\ y \in B\}$ 
for two sets A and B of real numbers. 

Figure S.3.30: Region of positive p.d.f. for Exercise 15a of Sec. 3.5. 
(b) The marginal p.d.f. of X is 
1 
file) =f fle,u)dy = 
The marginal p.d.f. of Y is 
m1 8] 
=/ -d —dz =1 


1 
for 0 < y < 1. The distribution of Y is the uniform distribution on the interval (0, 1]. 


ifl<a<3, 
if6<ar2<8. 


Drew 


(c) The product of the two marginal p.d.f.’s is 


; ifl<a<3and0<y<1l, 
filz)foly)=4 3 if6<a2<8and0<y<\1, 
0 otherwise, 


which is the same as f(x,y), hence the two random variables are independent. Although the 
region where f(x,y) > 0 is not a rectangle, it is a product set as we saw in part (a). Although 
it is sufficient in Theorem 3.5.6 for the region where f(x,y) > 0 to be a rectangle, it is necessary 
that the region be a product set. Technically, it is necessary that there is a version of the p.d.f. 
that is strictly positive on a product set. For continuous joint distributions, one can set the p.d.f. 
to arbitrary values on arbitrary one-dimensional curves without changing its being a joint p.d.f. 


3.6 Conditional Distributions 


Commentary 


When introducing conditional distributions given continuous random variables, it is important to stress that 
we are not conditioning on a set of 0 probability, even if the popular notation makes it appear that way. 
The note on page 146 can be helpful for students who understand two-variable calculus. Also, Exercise 25 in 
Sec. 3.11 can provide additional motivation for the idea that the conditional distribution of X given Y = y 
is really a surrogate for the conditional distribution of X given that Y is close to y, but we don’t wish to 
say precisely how close. Exercise 26 in Sec. 3.11 (the Borel paradox) brings home the point that conditional 
distributions really are not conditional on the probability 0 events such as {Y = y}. 

Also, it is useful to stress that conditional distributions behave just like distributions. In particular, 
conditional probabilities can be calculated from conditional p.f.’s and conditional p.d.f.’s in the same way 
that probabilities are calculated from p.f.’s and p.d.f.’s. Also, be sure to advertise that all future concepts 
and theorems will have conditional versions that behave just like the marginal versions. 


Solutions to Exercises 


1. We begin by finding the marginal p.d.f. of Y. The set of x values for which f(x,y) > 0 is the interval 
$[-(1 - y^2)^{1/2}, (1 - y^2)^{1/2}]$. So, the marginal p.d.f. of Y is, for $-1 \le y \le 1$, 

$$f_2(y) = \int_{-(1-y^2)^{1/2}}^{(1-y^2)^{1/2}} k x^2 y^2\,dx = \left.\frac{k y^2}{3} x^3\right|_{x=-(1-y^2)^{1/2}}^{(1-y^2)^{1/2}} = \frac{2k}{3} y^2 (1 - y^2)^{3/2},$$

and 0 otherwise. The conditional p.d.f. of X given Y = y is the ratio of the joint p.d.f. to the marginal 
p.d.f. just found. 

$$g_1(x|y) = \begin{cases} \dfrac{3x^2}{2(1 - y^2)^{3/2}} & \text{for } -(1-y^2)^{1/2} \le x \le (1-y^2)^{1/2}, \\ 0 & \text{otherwise.} \end{cases}$$
2. (a) We have Pr(Junior) = 0.04 + 0.20 + 0.09 = 0.33. Therefore, 

$$\Pr(\text{Never} \mid \text{Junior}) = \frac{\Pr(\text{Junior and Never})}{\Pr(\text{Junior})} = \frac{0.04}{0.33} = \frac{4}{33}.$$

(b) The only way we can use the fact that a student visited the museum three times is to classify the 
student as having visited more than once. We have 

Pr(More than once) = 0.04 + 0.04 + 0.09 + 0.10 = 0.27. 

Therefore, 

$$\Pr(\text{Senior} \mid \text{More than once}) = \frac{\Pr(\text{Senior and More than once})}{\Pr(\text{More than once})} = \frac{0.10}{0.27} = \frac{10}{27}.$$


3. The joint p.d.f. of X and Y is positive for all points inside the circle S shown in the sketch. Since the 
area of S is $9\pi$ and the joint p.d.f. of X and Y is constant over S, this joint p.d.f. must have the form: 

$$f(x,y) = \begin{cases} \dfrac{1}{9\pi} & \text{for } (x,y) \in S, \\ 0 & \text{otherwise.} \end{cases}$$

Figure S.3.31: Figure for Exercise 3 of Sec. 3.6. 


It can be seen from Fig. S.3.31 that the possible values of X lie between −2 and 4. Therefore, for 
$-2 < x < 4$, 

$$f_1(x) = \int_{-2-[9-(x-1)^2]^{1/2}}^{-2+[9-(x-1)^2]^{1/2}} \frac{1}{9\pi}\,dy = \frac{2}{9\pi}\,[9 - (x-1)^2]^{1/2}.$$

(a) It follows that for $-2 < x < 4$ and $-2 - [9 - (x-1)^2]^{1/2} < y < -2 + [9 - (x-1)^2]^{1/2}$, 

$$g_2(y|x) = \frac{f(x,y)}{f_1(x)} = \frac{1}{2}\,[9 - (x-1)^2]^{-1/2}.$$

(b) When X = 2, it follows from part (a) that 

$$g_2(y \mid x = 2) = \begin{cases} \dfrac{1}{2\sqrt{8}} & \text{for } -2 - \sqrt{8} < y < -2 + \sqrt{8}, \\ 0 & \text{otherwise.} \end{cases}$$

Therefore, 

$$\Pr(Y > 0 \mid X = 2) = \int_0^{-2+\sqrt{8}} g_2(y \mid x = 2)\,dy = \frac{-2 + \sqrt{8}}{2\sqrt{8}} = \frac{2 - \sqrt{2}}{4}.$$

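4. (a) For $0 \le y \le 1$, the marginal p.d.f. of Y is 

$$f_2(y) = \int_0^1 f(x,y)\,dx = c\left(\frac{1}{2} + y^2\right).$$

Therefore, for $0 \le x \le 1$ and $0 \le y \le 1$, the conditional p.d.f. of X given that Y = y is 

$$g_1(x|y) = \frac{f(x,y)}{f_2(y)} = \frac{x + y^2}{\frac{1}{2} + y^2}.$$

It should be noted that it was not necessary to evaluate the constant c in order to determine this 
conditional p.d.f. 


(b) When Y = 1/2, it follows from part (a) that 

$$g_1\!\left(x \,\Big|\, y = \frac{1}{2}\right) = \begin{cases} \dfrac{4}{3}\left(x + \dfrac{1}{4}\right) & \text{for } 0 \le x \le 1, \\ 0 & \text{otherwise.} \end{cases}$$

Therefore, 

$$\Pr\!\left(X < \frac{1}{2} \,\Big|\, Y = \frac{1}{2}\right) = \int_0^{1/2} g_1\!\left(x \,\Big|\, y = \frac{1}{2}\right) dx = \frac{1}{3}.$$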
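
The conditional p.d.f. in Exercise 4 of Sec. 3.6 can be checked numerically. The sketch below assumes the joint p.d.f. has the form c(x + y²) on the unit square, which is the form implied by the marginal c(1/2 + y²); the helper names `midpoint_integral` and `conditional_pdf` are illustrative and not from the text.

```python
def midpoint_integral(func, lower, upper, steps=1000):
    """Approximate the integral of func over [lower, upper] by the midpoint rule."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width


def conditional_pdf(x, y):
    """g1(x | y) = (x + y^2) / (1/2 + y^2) for 0 < x < 1; the constant c cancels."""
    if not 0 < x < 1:
        return 0.0
    return (x + y ** 2) / (0.5 + y ** 2)


# A conditional p.d.f. must integrate to 1 over x for each fixed y.
total_mass = midpoint_integral(lambda x: conditional_pdf(x, 0.5), 0.0, 1.0)
# Pr(X < 1/2 | Y = 1/2), which the solution gives as 1/3.
prob_below_half = midpoint_integral(lambda x: conditional_pdf(x, 0.5), 0.0, 0.5)
print(round(total_mass, 6), round(prob_below_half, 6))
```

Because the integrand is linear in x, the midpoint rule is exact here up to rounding, so a modest number of steps is enough.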
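5. (a) The joint p.d.f. f(x,y) is given by Eq. (3.6.15) and the marginal p.d.f. $f_2(y)$ was also given in 
Example 3.6.10. Hence, for 0 < y < 1 and 0 < x < y, we have 

$$g_1(x|y) = \frac{f(x,y)}{f_2(y)} = \frac{-1}{(1 - x)\log(1 - y)}.$$

(b) When Y = 3/4, it follows from part (a) that 

$$g_1\!\left(x \,\Big|\, y = \frac{3}{4}\right) = \begin{cases} \dfrac{1}{(1 - x)\log 4} & \text{for } 0 < x < \dfrac{3}{4}, \\ 0 & \text{otherwise.} \end{cases}$$

Therefore, 

$$\Pr\!\left(X > \frac{1}{2} \,\Big|\, Y = \frac{3}{4}\right) = \int_{1/2}^{3/4} g_1\!\left(x \,\Big|\, y = \frac{3}{4}\right) dx = \frac{\log 4 - \log 2}{\log 4} = \frac{1}{2}.$$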


6. Since f(x,y) = 0 outside a rectangle with sides parallel to the x and y axes and since f(x,y) can be 
factored as in Eq. (3.5.7), with $g_1(x) = c \sin(x)$ and $g_2(y) = 1$, it follows that X and Y are independent 
random variables. Furthermore, for $0 \le y \le 3$, the marginal p.d.f. $f_2(y)$ must be proportional to $g_2(y)$. 
In other words, $f_2(y)$ must be constant for $0 \le y \le 3$. Hence, Y has the uniform distribution on the 
interval [0,3] and 

$$f_2(y) = \begin{cases} \dfrac{1}{3} & \text{for } 0 \le y \le 3, \\ 0 & \text{otherwise.} \end{cases}$$

(a) Since X and Y are independent, the conditional p.d.f. of Y for any given value of X is the same 
as the marginal p.d.f. $f_2(y)$. 
(b) Since X and Y are independent, 

$$\Pr(1 < Y < 2 \mid X = 0.73) = \Pr(1 < Y < 2) = \int_1^2 f_2(y)\,dy = \frac{1}{3}.$$


7. The joint p.d.f. of X and Y is positive inside the triangle S shown in Fig. S.3.32. It is seen from 
Fig. S.3.32 that the possible values of X lie between 0 and 2. Hence, for 0 < x < 2, 

$$f_1(x) = \int_0^{4-2x} f(x,y)\,dy = \frac{3}{8}(2 - x)^2.$$

Figure S.3.32: Figure for Exercise 7 of Sec. 3.6. 


(a) It follows that for 0 < x < 2 and 0 < y < 4 − 2x, 

$$g_2(y|x) = \frac{f(x,y)}{f_1(x)} = \frac{4 - 2x - y}{2(x - 2)^2}.$$

(b) When X = 1/2, it follows from part (a) that 

$$g_2\!\left(y \,\Big|\, x = \frac{1}{2}\right) = \begin{cases} \dfrac{2}{9}(3 - y) & \text{for } 0 < y < 3, \\ 0 & \text{otherwise.} \end{cases}$$

Therefore, 

$$\Pr\!\left(Y \ge 2 \,\Big|\, X = \frac{1}{2}\right) = \int_2^3 g_2\!\left(y \,\Big|\, x = \frac{1}{2}\right) dy = \frac{1}{9}.$$


8. (a) The answer is 

$$\int_0^1 \int_{0.8}^1 f(x,y)\,dx\,dy = 0.264.$$

(b) For $0 \le y \le 1$, the marginal p.d.f. of Y is 

$$f_2(y) = \int_0^1 f(x,y)\,dx = \frac{2}{5}(1 + 3y).$$

Hence, for $0 \le x \le 1$ and $0 \le y \le 1$, 

$$g_1(x|y) = \frac{2x + 3y}{1 + 3y}.$$

When Y = 0.3, it follows that 

$$g_1(x \mid y = 0.3) = \frac{2x + 0.9}{1.9} \quad \text{for } 0 \le x \le 1.$$

Hence, 

$$\Pr(X > 0.8 \mid Y = 0.3) = \int_{0.8}^1 g_1(x \mid y = 0.3)\,dx = 0.284.$$

(c) For $0 \le x \le 1$, the marginal p.d.f. of X is 

$$f_1(x) = \int_0^1 f(x,y)\,dy = \frac{2}{5}\left(2x + \frac{3}{2}\right).$$

Hence, for $0 \le x \le 1$ and $0 \le y \le 1$, 

$$g_2(y|x) = \frac{2x + 3y}{2x + \frac{3}{2}}.$$

When X = 0.3, it follows that 

$$g_2(y \mid x = 0.3) = \frac{0.6 + 3y}{2.1} \quad \text{for } 0 \le y \le 1.$$

Hence, 

$$\Pr(Y > 0.8 \mid X = 0.3) = \int_{0.8}^1 g_2(y \mid x = 0.3)\,dy = 0.314.$$

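9. Let Y denote the instrument that is chosen. Then Pr(Y = 1) = Pr(Y = 2) = 1/2. In this exercise the 
distribution of X is continuous and the distribution of Y is discrete. Hence, the joint distribution of X 
and Y is a mixed distribution, as described in Sec. 3.4. In this case, the joint p.f./p.d.f. of X and Y is 
as follows: 

$$f(x,y) = \begin{cases} \dfrac{1}{2} \cdot 2x = x & \text{for } y = 1 \text{ and } 0 < x < 1, \\ \dfrac{1}{2} \cdot 3x^2 = \dfrac{3}{2}x^2 & \text{for } y = 2 \text{ and } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases}$$

(a) It follows that for 0 < x < 1, 

$$f_1(x) = \sum_{y=1}^{2} f(x,y) = x + \frac{3}{2}x^2,$$

and $f_1(x) = 0$ otherwise. 


(b) For y = 1, 2 and 0 < x < 1, we have 

$$\Pr(Y = y \mid X = x) = g_2(y|x) = \frac{f(x,y)}{f_1(x)}.$$

Hence, 

$$\Pr\!\left(Y = 1 \,\Big|\, X = \frac{1}{4}\right) = \frac{f\!\left(\frac{1}{4}, 1\right)}{f_1\!\left(\frac{1}{4}\right)} = \frac{\frac{1}{4}}{\frac{1}{4} + \frac{3}{2} \cdot \frac{1}{16}} = \frac{8}{11}.$$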


10. Let Y = 1 if a head is obtained when the coin is tossed and let Y = 0 if a tail is obtained. Then 
Pr(Y = 1 | X = x) = x and Pr(Y = 0 | X = x) = 1 − x. In this exercise, the distribution of X is 
continuous and the distribution of Y is discrete. Hence, the joint distribution of X and Y is a mixed 
distribution as described in Sec. 3.4. The conditional p.f. of Y given that X = x is 

$$g_2(y|x) = \begin{cases} x & \text{for } y = 1, \\ 1 - x & \text{for } y = 0, \\ 0 & \text{otherwise.} \end{cases}$$

The marginal p.d.f. $f_1(x)$ of X is given in the exercise, and the joint p.f./p.d.f. of X and Y is 
$f(x,y) = f_1(x) g_2(y|x)$. Thus, we have 

$$f(x,y) = \begin{cases} 6x^2(1 - x) & \text{for } 0 < x < 1 \text{ and } y = 1, \\ 6x(1 - x)^2 & \text{for } 0 < x < 1 \text{ and } y = 0, \\ 0 & \text{otherwise.} \end{cases}$$

Furthermore, for y = 0, 1, 

$$\Pr(Y = y) = f_2(y) = \int_0^1 f(x,y)\,dx.$$

Hence, 

$$\Pr(Y = 1) = \int_0^1 6x^2(1 - x)\,dx = \int_0^1 (6x^2 - 6x^3)\,dx = \frac{1}{2}.$$

(This result could also have been derived by noting that the p.d.f. $f_1(x)$ is symmetric about the point 
x = 1/2.) 
It now follows that the conditional p.d.f. of X given that Y = 1 is, for 0 < x < 1, 

$$g_1(x \mid y = 1) = \frac{f(x,1)}{\Pr(Y = 1)} = \frac{6x^2(1 - x)}{1/2} = 12x^2(1 - x).$$


11. Let $F_2$ be the c.d.f. of Y. Since $f_2$ is continuous at both $y_0$ and $y_1$, we can write, for i = 0, 1, 

$$\Pr(Y \in A_i) = F_2(y_i + \epsilon) - F_2(y_i - \epsilon) = 2\epsilon f_2(y_i'),$$

where $y_i'$ is within $\epsilon$ of $y_i$. This last equation follows from the mean value theorem of calculus. So 

$$\frac{\Pr(Y \in A_0)}{\Pr(Y \in A_1)} = \frac{f_2(y_0')}{f_2(y_1')}. \qquad \text{(S.3.1)}$$

Since $f_2$ is continuous, $\lim_{\epsilon \to 0} f_2(y_i') = f_2(y_i)$, and the limit of (S.3.1) is $0/f_2(y_1) = 0$. 


12. (a) The joint p.f./p.d.f. of X and Y is the product $f_2(y) g_1(x|y)$. 

$$f(x,y) = \begin{cases} (2y)^x \exp(-3y)/x! & \text{if } y > 0 \text{ and } x = 0, 1, \ldots, \\ 0 & \text{otherwise.} \end{cases}$$

The marginal p.f. of X is obtained by integrating over y. 

$$f_1(x) = \int_0^\infty \frac{(2y)^x}{x!} \exp(-3y)\,dy = \frac{1}{3}\left(\frac{2}{3}\right)^x,$$

for x = 0, 1, .... 
(b) The conditional p.d.f. of Y given X = 0 is the ratio of the joint p.f./p.d.f. to $f_1(0)$. 

$$g_2(y|0) = \frac{(2y)^0 \exp(-3y)/0!}{1/3} = 3\exp(-3y),$$

for y > 0. 
(c) The conditional p.d.f. of Y given X = 1 is the ratio of the joint p.f./p.d.f. to $f_1(1)$. 

$$g_2(y|1) = \frac{(2y)^1 \exp(-3y)/1!}{(1/3)(2/3)^1} = 9y\exp(-3y),$$

for y > 0. 
(d) The ratio of the two conditional p.d.f.’s is 

$$\frac{g_2(y|1)}{g_2(y|0)} = \frac{9y\exp(-3y)}{3\exp(-3y)} = 3y.$$

The ratio is greater than 1 if y > 1/3. This corresponds to the intuition that if we observe more 
calls, then we should think the rate is higher. 
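
The formulas in Exercise 12 of Sec. 3.6 can be checked with a few lines of code. This is a minimal sketch assuming the joint p.f./p.d.f. (2y)^x exp(−3y)/x! stated in part (a); the function names are illustrative, and the marginal uses the standard gamma integral rather than numerical integration.

```python
import math


def marginal_pf(x: int) -> float:
    """f1(x): integral over y > 0 of (2y)^x exp(-3y) / x!."""
    # The integral of y^x exp(-3y) over (0, inf) is Gamma(x + 1) / 3^(x + 1).
    return 2 ** x * math.gamma(x + 1) / (math.factorial(x) * 3 ** (x + 1))


def conditional_pdf(y: float, x: int) -> float:
    """g2(y | x): joint p.f./p.d.f. divided by the marginal p.f. of X."""
    if y <= 0:
        return 0.0
    joint = (2 * y) ** x * math.exp(-3 * y) / math.factorial(x)
    return joint / marginal_pf(x)


# Compare the marginal with the closed form (1/3)(2/3)^x.
for x in range(4):
    print(x, marginal_pf(x), (1 / 3) * (2 / 3) ** x)

# The ratio g2(y | 1) / g2(y | 0) should equal 3y.
y = 0.5
ratio = conditional_pdf(y, 1) / conditional_pdf(y, 0)
print(ratio, 3 * y)
```

The ratio exceeds 1 exactly when y > 1/3, matching the conclusion of part (d).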


13. There are four different treatments on which we are asked to condition. The marginal p.f. of treatment 
Y is given in the bottom row of Table 3.6 in the text. The conditional p.f. of response given each 
treatment is the ratio of the two rows above that to the bottom row: 


ntait) = | EE p00 ies 
nto?) = | HE poser eos 
ales) = | HE oan tee 
atoll) = | BE psa ire 


The fourth one looks quite different from the others, especially from the second. 


3.7 Multivariate Distributions 


Commentary 


The material around Definition 3.7.8 and Example 3.7.8 reintroduces the concept of conditionally independent 
random variables. This concept is important in Bayesian inference, but outside of Bayesian inference, it 
generally appears only in more advanced applications such as expert systems and latent variable models. 
If an instructor is going to forego all discussion of Bayesian inference then this material (and Exercises 13 
and 14) could be skipped. 


Solutions to Exercises 


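1. (a) We have 

$$\int_0^1 \int_0^1 \int_0^1 f(x_1, x_2, x_3)\,dx_1\,dx_2\,dx_3 = 3c.$$

Since the value of this integral must be equal to 1, it follows that c = 1/3. 
(b) For $0 \le x_1 \le 1$ and $0 \le x_3 \le 1$, 

$$f_{13}(x_1, x_3) = \int_0^1 f(x_1, x_2, x_3)\,dx_2 = \frac{1}{3}(x_1 + 1 + 3x_3).$$

(c) The conditional p.d.f. of $X_3$ given that $X_1 = 1/4$ and $X_2 = 3/4$ is, for $0 \le x_3 \le 1$, 

$$g_3\!\left(x_3 \,\Big|\, x_1 = \frac{1}{4}, x_2 = \frac{3}{4}\right) = \frac{f\!\left(\frac{1}{4}, \frac{3}{4}, x_3\right)}{f_{12}\!\left(\frac{1}{4}, \frac{3}{4}\right)} = \frac{7}{13} + \frac{12}{13}x_3.$$

Therefore, 

$$\Pr\!\left(X_3 < \frac{1}{2} \,\Big|\, X_1 = \frac{1}{4}, X_2 = \frac{3}{4}\right) = \int_0^{1/2} \left(\frac{7}{13} + \frac{12}{13}x_3\right) dx_3 = \frac{5}{13}.$$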


2. (a) First, integrate over $x_1$. We need to compute $\int_0^1 c\,x_1^{1+x_2+x_3}(1 - x_1)^{3-x_2-x_3}\,dx_1$. The two exponents 
always add to 4 and each is always at least 1. So the possible pairs of exponents are (1,3), (2,2), 
and (3,1). By the symmetry of the function, the first and last will give the same value of the 
integral. In this case, the values are 

$$\int_0^1 c[x_1^3 - x_1^4]\,dx_1 = \frac{c}{4} - \frac{c}{5} = \frac{c}{20}. \qquad \text{(S.3.2)}$$

In the other case, the integral is 

$$\int_0^1 c[x_1^2 - 2x_1^3 + x_1^4]\,dx_1 = \frac{c}{3} - \frac{c}{2} + \frac{c}{5} = \frac{c}{30}. \qquad \text{(S.3.3)}$$

Finally, sum over the possible $(x_2, x_3)$ pairs. The mapping between $(x_2, x_3)$ values and the 
exponents in the integral is as follows: 

(x2, x3)    Exponents 
(0, 0)      (1, 3) 
(0, 1)      (2, 2) 
(1, 0)      (2, 2) 
(1, 1)      (3, 1) 

Summing over the four possible $(x_2, x_3)$ pairs gives the sum of c/6, so c = 6. 


(b) The marginal joint p.f. of $(X_2, X_3)$ is given by setting c = 6 in (S.3.2) and (S.3.3) and using the 
above table. 

$$f_{23}(x_2, x_3) = \begin{cases} 0.3 & \text{if } (x_2, x_3) \in \{(0,0), (1,1)\}, \\ 0.2 & \text{if } (x_2, x_3) \in \{(1,0), (0,1)\}. \end{cases}$$

(c) The conditional p.d.f. of $X_1$ given $X_2 = 1$ and $X_3 = 1$ is 1/0.3 times the joint p.f./p.d.f. evaluated 
at $x_2 = x_3 = 1$: 

$$g_1(x_1 \mid 1, 1) = \begin{cases} 20x_1^3(1 - x_1) & \text{if } 0 < x_1 < 1, \\ 0 & \text{otherwise.} \end{cases}$$
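
The constant in Exercise 2 of Sec. 3.7 can be confirmed with exact arithmetic. The sketch below assumes the joint p.f./p.d.f. is c·x1^(1+x2+x3)·(1−x1)^(3−x2−x3) for 0 < x1 < 1 and x2, x3 in {0, 1}, as in part (a), and uses the standard identity for the integral of x^a (1−x)^b over (0, 1). The function name `beta_integral` is illustrative.

```python
from fractions import Fraction
from math import factorial


def beta_integral(a: int, b: int) -> Fraction:
    """Integral of x^a (1 - x)^b over (0, 1), equal to a! b! / (a + b + 1)!."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Integral over x1 (without the constant c) for each (x2, x3) pair.
integrals = {
    (x2, x3): beta_integral(1 + x2 + x3, 3 - x2 - x3) for x2, x3 in pairs
}
# The total mass must equal 1, which determines c.
c = 1 / sum(integrals.values())
# Marginal joint p.f. of (X2, X3).
marginal = {pair: c * value for pair, value in integrals.items()}
print(c, marginal)
```

The computed values agree with c = 6 and with the marginal probabilities 0.3 and 0.2 in part (b).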


3. The p.d.f. should be positive for all $x_i > 0$, not just for all $x_i > 1$ as stated in early printings. This will 
match the answers in the back of the text. 
(a) We have 

$$\int_0^\infty \int_0^\infty \int_0^\infty f(x_1, x_2, x_3)\,dx_1\,dx_2\,dx_3 = \frac{1}{6}c.$$

Since the value of this integral must be equal to 1, it follows that c = 6. If one used $x_i > 1$ instead, 
then the integral would equal exp(−6)/6, so that c = 6 exp(6). 


(b) For $x_1 > 0$, $x_3 > 0$, 

$$f_{13}(x_1, x_3) = \int_0^\infty f(x_1, x_2, x_3)\,dx_2 = 3\exp[-(x_1 + 3x_3)].$$

If one used $x_i > 1$ instead, then for $x_1 > 1$ and $x_3 > 1$, $f_{13}(x_1, x_3) = 3\exp(-x_1 - 3x_3 + 4)$. 


(c) It is helpful at this stage to recognize that the random variables $X_1$, $X_2$, and $X_3$ are independent 
because their joint p.d.f. $f(x_1, x_2, x_3)$ can be factored as in Eq. (3.7.7); i.e., for $x_i > 0$ (i = 1, 2, 3), 

$$f(x_1, x_2, x_3) = (\exp(-x_1))(2\exp(-2x_2))(3\exp(-3x_3)).$$

It follows that 

$$\Pr(X_1 < 1 \mid X_2 = 2, X_3 = 1) = \Pr(X_1 < 1) = \int_0^1 f_1(x_1)\,dx_1 = \int_0^1 \exp(-x_1)\,dx_1 = 1 - \frac{1}{e}.$$

This answer could also be obtained without explicitly using the independence of $X_1$, $X_2$, and $X_3$ 
by calculating first the marginal joint p.d.f. 

$$f_{23}(x_2, x_3) = \int_0^\infty f(x_1, x_2, x_3)\,dx_1,$$

then calculating the conditional p.d.f. 

$$g_1(x_1 \mid x_2 = 2, x_3 = 1) = \frac{f(x_1, 2, 1)}{f_{23}(2, 1)},$$

and finally calculating the probability 

$$\Pr(X_1 < 1 \mid X_2 = 2, X_3 = 1) = \int_0^1 g_1(x_1 \mid x_2 = 2, x_3 = 1)\,dx_1.$$

If one used $x_i > 1$ instead, then the probability in this part is 0. 


4. The joint p.d.f. $f(x_1, x_2, x_3)$ is constant over the cube S. Since 

$$\iiint_S dx_1\,dx_2\,dx_3 = \int_0^1 \int_0^1 \int_0^1 dx_1\,dx_2\,dx_3 = 1,$$

it follows that $f(x_1, x_2, x_3) = 1$ for $(x_1, x_2, x_3) \in S$. Hence, the probability of any subset of S will be 
equal to the volume of that subset. 


(a) The set of points such that $(x_1 - 1/2)^2 + (x_2 - 1/2)^2 + (x_3 - 1/2)^2 \le 1/4$ is a sphere of radius 1/2 
with center at the point (1/2, 1/2, 1/2). Hence, this sphere is entirely contained within the cube 
S. Since the volume of any sphere is $4\pi r^3/3$, the volume of this sphere, and also its probability, is 
$4\pi(1/2)^3/3 = \pi/6$. 

(b) The set of points such that $x_1^2 + x_2^2 + x_3^2 \le 1$ is a sphere of radius 1 with center at the origin (0, 0, 
0). Hence, the volume of this sphere is $4\pi/3$. However, only one octant of this sphere, the octant 
in which all three coordinates are nonnegative, lies in S. Hence, the volume of the intersection of 
the sphere with the set S, and also its probability, is $\frac{1}{8} \cdot \frac{4}{3}\pi = \frac{\pi}{6}$. 
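
Both probabilities in Exercise 4 of Sec. 3.7 equal π/6 ≈ 0.5236, which a simple Monte Carlo experiment can corroborate. The following is an illustrative sketch only; the sample size, the seed, and the function name `estimate_probabilities` are arbitrary choices, and the estimates carry sampling error.

```python
import math
import random


def estimate_probabilities(num_points=200_000, seed=0):
    """Estimate the two probabilities by sampling points uniformly from the unit cube."""
    rng = random.Random(seed)
    in_centered_sphere = 0
    in_origin_sphere = 0
    for _ in range(num_points):
        x1, x2, x3 = rng.random(), rng.random(), rng.random()
        # Part (a): sphere of radius 1/2 centered at (1/2, 1/2, 1/2).
        if (x1 - 0.5) ** 2 + (x2 - 0.5) ** 2 + (x3 - 0.5) ** 2 <= 0.25:
            in_centered_sphere += 1
        # Part (b): sphere of radius 1 centered at the origin.
        if x1 * x1 + x2 * x2 + x3 * x3 <= 1.0:
            in_origin_sphere += 1
    return in_centered_sphere / num_points, in_origin_sphere / num_points


estimate_a, estimate_b = estimate_probabilities()
print(estimate_a, estimate_b, math.pi / 6)
```

With 200,000 points the standard error of each estimate is roughly 0.001, so agreement with π/6 to about two decimal places is what one should expect.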


5. (a) The probability that all n independent components will function properly is the product of their 
individual probabilities and is therefore equal to $\prod_{i=1}^n p_i$. 
(b) The probability that all n independent components will not function properly is the product of 
their individual probabilities of not functioning properly and is therefore equal to $\prod_{i=1}^n (1 - p_i)$. The 
probability that at least one component will function properly is $1 - \prod_{i=1}^n (1 - p_i)$. 


6. Since the n random variables $X_1, \ldots, X_n$ are i.i.d. and each has the p.f. f, the probability that a particular 
variable $X_i$ will be equal to a particular value x is f(x), and the probability that all n variables will 
be equal to a particular value x is $[f(x)]^n$. Hence, the probability that all n variables will be equal, 
without any specification of their common value, is $\sum_x [f(x)]^n$. 


7. The probability that a particular variable $X_i$ will lie in the interval (a, b) is $p = \int_a^b f(x)\,dx$. Since the 
variables $X_1, \ldots, X_n$ are independent, the probability that exactly i of these variables will lie in the 
interval (a, b) is $\binom{n}{i} p^i (1 - p)^{n-i}$. Therefore, the required probability is 

$$\sum_{i=k}^n \binom{n}{i} p^i (1 - p)^{n-i}.$$

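8. For any given value x of X, the random variables $Y_1, \ldots, Y_n$ are i.i.d., each with the p.d.f. g(y|x). 
Therefore, the conditional joint p.d.f. of $Y_1, \ldots, Y_n$ given that X = x is 

$$h(y_1, \ldots, y_n \mid x) = g(y_1|x) \cdots g(y_n|x) = \begin{cases} \dfrac{1}{x^n} & \text{for } 0 < y_i < x,\ i = 1, \ldots, n, \\ 0 & \text{otherwise.} \end{cases}$$

The joint p.d.f. of X and $Y_1, \ldots, Y_n$ is, therefore, 

$$f(x)\,h(y_1, \ldots, y_n \mid x) = \begin{cases} \dfrac{1}{n!}\exp(-x) & \text{for } 0 < y_i < x\ (i = 1, \ldots, n), \\ 0 & \text{otherwise.} \end{cases}$$

This joint p.d.f. is positive if and only if each $y_i > 0$ and x is greater than every $y_i$. In other words, x 
must be greater than $m = \max\{y_1, \ldots, y_n\}$. 

(a) For $y_i > 0$ (i = 1, ..., n), the marginal joint p.d.f. of $Y_1, \ldots, Y_n$ is 

$$g_0(y_1, \ldots, y_n) = \int_{-\infty}^{\infty} f(x)\,h(y_1, \ldots, y_n \mid x)\,dx = \int_m^\infty \frac{1}{n!}\exp(-x)\,dx = \frac{1}{n!}\exp(-m).$$

(b) For $y_i > 0$ (i = 1, ..., n), the conditional p.d.f. of X given that $Y_i = y_i$ (i = 1, ..., n) is 

$$g_1(x \mid y_1, \ldots, y_n) = \frac{f(x)\,h(y_1, \ldots, y_n \mid x)}{g_0(y_1, \ldots, y_n)} = \begin{cases} \exp(-(x - m)) & \text{for } x > m, \\ 0 & \text{otherwise.} \end{cases}$$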


9. (a) Since $X_i = X$ for i = 1, 2, we know that $X_i$ has the same distribution as X. Since X has a 
continuous distribution, then so does $X_i$ for i = 1, 2. 
(b) We know that $\Pr(X_1 = X_2) = 1$. Let $A = \{(x_1, x_2) : x_1 = x_2\}$. Then $\Pr((X_1, X_2) \in A) = 1$. 
However, for every function f, $\iint_A f(x_1, x_2)\,dx_1\,dx_2 = 0$. So there is no possible joint p.d.f. 
10. The marginal p.d.f. of Z is 2exp(−2z), for z > 0. The coordinates of X are conditionally i.i.d. given 
Z = z with p.d.f. z exp(−zx), for x > 0. This makes the joint p.d.f. of (Z, X) equal to $2z^5 \exp(-z[2 + 
x_1 + \cdots + x_5])$ for all variables positive. The marginal joint p.d.f. of X is obtained by integrating z out 
of this. 

$$f_2(\boldsymbol{x}) = \int_0^\infty 2z^5 \exp(-z[2 + x_1 + \cdots + x_5])\,dz = \frac{240}{(2 + x_1 + \cdots + x_5)^6}, \quad \text{for all } x_i > 0.$$

Here, we use the formula $\int_0^\infty y^k \exp(-y)\,dy = k!$ from Exercise 12 in Sec. 3.6. The conditional p.d.f. of 
Z given $\boldsymbol{X} = (x_1, \ldots, x_5)$ is then 

$$g_1(z \mid \boldsymbol{x}) = \frac{(2 + x_1 + \cdots + x_5)^6}{120}\, z^5 \exp(-z[2 + x_1 + \cdots + x_5]),$$

for z > 0. 


11. Since $X_1, \ldots, X_n$ are independent, their joint p.f., p.d.f., or p.f./p.d.f. factors as 

$$f(x_1, \ldots, x_n) = f_1(x_1) \cdots f_n(x_n),$$

where each $f_i$ is a p.f. or p.d.f. If we sum or integrate over all $x_j$ such that $j \notin \{i_1, \ldots, i_k\}$ we obtain 
the joint p.f., p.d.f., or p.f./p.d.f. of $X_{i_1}, \ldots, X_{i_k}$ equal to $f_{i_1}(x_{i_1}) \cdots f_{i_k}(x_{i_k})$, which is factored in a way 
that makes it clear that $X_{i_1}, \ldots, X_{i_k}$ are independent. 


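12. Let h(y,w) be the marginal joint p.d.f. of Y and W, and let $h_2(w)$ be the marginal p.d.f. of W. Then 

$$h(y, w) = \int f(y, z, w)\,dz, \qquad h_2(w) = \iint f(y, z, w)\,dz\,dy,$$

and the conditional p.d.f. of Y given W = w is 

$$\frac{h(y, w)}{h_2(w)} = \frac{\int f(y, z, w)\,dz}{h_2(w)} = \int g_1(y, z \mid w)\,dz.$$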


13. Let $f(x_1, x_2, x_3, z)$ be the joint p.d.f. of $(X_1, X_2, X_3, Z)$. Let $f_{12}(x_1, x_2)$ be the marginal joint p.d.f. of 
$(X_1, X_2)$. Then the conditional p.d.f. of $X_3$ given $(X_1, X_2) = (x_1, x_2)$ is 

$$\frac{\int f(x_1, x_2, x_3, z)\,dz}{f_{12}(x_1, x_2)} = \frac{\int g(x_1|z) g(x_2|z) g(x_3|z) f_0(z)\,dz}{f_{12}(x_1, x_2)} = \int g(x_3|z)\, \frac{g(x_1|z) g(x_2|z) f_0(z)}{f_{12}(x_1, x_2)}\,dz.$$

According to Bayes’ theorem for random variables, the fraction in this last integral is $g_0(z \mid x_1, x_2)$. Using 
the specific formulas in the text, we can calculate the last integral as 

$$\int_0^\infty z \exp(-z x_3)\, \frac{1}{2}(2 + x_1 + x_2)^3 z^2 \exp(-z(2 + x_1 + x_2))\,dz$$
$$= \frac{(2 + x_1 + x_2)^3}{2} \int_0^\infty z^3 \exp(-z(2 + x_1 + x_2 + x_3))\,dz$$
$$= \frac{(2 + x_1 + x_2)^3}{2} \cdot \frac{6}{(2 + x_1 + x_2 + x_3)^4} = \frac{3(2 + x_1 + x_2)^3}{(2 + x_1 + x_2 + x_3)^4}.$$

The joint p.d.f. of $(X_1, X_2, X_3)$ can be computed in a manner similar to the joint p.d.f. of $(X_1, X_2)$ and 
it is 

$$f_{123}(x_1, x_2, x_3) = \frac{12}{(2 + x_1 + x_2 + x_3)^4}.$$

The ratio of $f_{123}(x_1, x_2, x_3)$ to $f_{12}(x_1, x_2)$ is the conditional p.d.f. calculated above. 


14. (a) We can substitute $x_1 = 5$ and $x_2 = 7$ in the conditional p.d.f. computed in Exercise 13. 

$$g_3(x_3 \mid 5, 7) = \frac{3(2 + 5 + 7)^3}{(2 + 5 + 7 + x_3)^4} = \frac{8232}{(14 + x_3)^4},$$

for $x_3 > 0$. 
(b) The conditional probability we want is the integral of the p.d.f. above from 3 to $\infty$. 

$$\int_3^\infty \frac{8232}{(14 + x_3)^4}\,dx_3 = \left. -\frac{2744}{(14 + x_3)^3} \right|_{x_3=3}^{\infty} = \frac{2744}{17^3} = 0.5585.$$

In Example 3.7.9, we computed the marginal probability $\Pr(X_3 > 3) = 0.4$. Now that we have 
observed two service times that are both longer than 3, namely 5 and 7, we think that the 
probability of $X_3 > 3$ should be larger. 
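
The tail probability in Exercise 14 of Sec. 3.7 can be evaluated directly from the conditional p.d.f. 3(2 + x1 + x2)³/(2 + x1 + x2 + x3)⁴. The sketch below is illustrative; the function names and default arguments are assumptions, and the closed-form tail comes from integrating that p.d.f.

```python
def conditional_pdf_x3(x3, x1=5.0, x2=7.0):
    """Conditional p.d.f. of X3 given X1 = x1 and X2 = x2 (from Exercise 13)."""
    if x3 <= 0:
        return 0.0
    s = 2.0 + x1 + x2
    return 3.0 * s ** 3 / (s + x3) ** 4


def tail_probability(t, x1=5.0, x2=7.0):
    """Pr(X3 > t | X1 = x1, X2 = x2), using the antiderivative -s^3 / (s + x3)^3."""
    s = 2.0 + x1 + x2
    return (s / (s + t)) ** 3


conditional_tail = tail_probability(3.0)
print(round(conditional_tail, 4))  # 0.5585
# For comparison, the marginal value quoted from Example 3.7.9 is 0.4.
```

A quick sanity check is that `conditional_pdf_x3(0.0001)` is close to 3/14, the value of the p.d.f. near the origin when x1 + x2 = 12.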


15. Let A be an arbitrary n-dimensional set. Because Pr(W = c) = 1, we have 

$$\Pr((X_1, \ldots, X_n) \in A,\ W = w) = \begin{cases} \Pr((X_1, \ldots, X_n) \in A) & \text{if } w = c, \\ 0 & \text{otherwise.} \end{cases}$$

It follows that 

$$\Pr((X_1, \ldots, X_n) \in A \mid W = w) = \begin{cases} \Pr((X_1, \ldots, X_n) \in A) & \text{if } w = c, \\ 0 & \text{otherwise.} \end{cases}$$

Hence the conditional joint distribution of $X_1, \ldots, X_n$ given W is the same as the unconditional joint 
distribution of $X_1, \ldots, X_n$, which is the distribution of independent random variables. 


3.8 Functions of a Random Variable 


Commentary 


A brief discussion of simulation appears at the end of this section. This can be considered a teaser for the 
more detailed treatment in Chapter 12. Simulation is becoming a very important tool in statistics and applied 
probability. Even those instructors who prefer not to cover Chapter 12 have the option of introducing the 
topic here for the benefit of students who will need to study simulation in more detail in another course. 

If you wish to use the statistical software R, then the function runif will be most useful. For the purposes 
of this section, runif(n) will return n pseudo-uniform random numbers on the interval [0, 1]. Of course, either 
n must be assigned a value before expecting R to understand runif(n), or one must put an explicit value 
of n into the function. The following two options both produce 1,000 pseudo-uniform random numbers and 
store them in an object called unumbs: 


• unumbs=runif(1000) 


• n=1000 
  unumbs=runif(n) 


Solutions to Exercises 


1. The inverse transformation is $x = (1 - y)^{1/2}$, whose derivative is $-(1 - y)^{-1/2}/2$. The p.d.f. of Y is then 

$$g(y) = f([1 - y]^{1/2})\,(1 - y)^{-1/2}/2 = \frac{3}{2}(1 - y)^{1/2},$$

for 0 < y < 1. 


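2. For each possible value of x, we have the following value of $y = x^2 - x$: 

x    −3   −2   −1    0    1    2    3 
y    12    6    2    0    0    2    6 

Collecting the probabilities of the x values that map to each y gives the p.f. of Y: 

y       0     2     6     12 
g(y)   2/7   2/7   2/7   1/7 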


3. It is seen from Fig. S.3.33 that as x varies over the interval 0 < x < 2, y varies over the interval 
$0 < y \le 1$. 

Figure S.3.33: Figure for Exercise 3 of Sec. 3.8. 

Therefore, for $0 < y \le 1$, 

$$\begin{aligned} G(y) = \Pr(Y \le y) &= \Pr[X(2 - X) \le y] = \Pr(X^2 - 2X \ge -y) \\ &= \Pr(X^2 - 2X + 1 \ge 1 - y) = \Pr[(X - 1)^2 \ge 1 - y] \\ &= \Pr(X - 1 \le -\sqrt{1 - y}) + \Pr(X - 1 \ge \sqrt{1 - y}) \\ &= \Pr(X \le 1 - \sqrt{1 - y}) + \Pr(X \ge 1 + \sqrt{1 - y}) \\ &= \int_0^{1 - \sqrt{1 - y}} \frac{1}{2}x\,dx + \int_{1 + \sqrt{1 - y}}^{2} \frac{1}{2}x\,dx \\ &= 1 - \sqrt{1 - y}. \end{aligned}$$

It follows that, for 0 < y < 1, 

$$g(y) = \frac{dG(y)}{dy} = \frac{1}{2(1 - y)^{1/2}}.$$


4. The function $y = 4 - x^3$ is strictly decreasing for 0 < x < 2. When x = 0, we have y = 4, and when 
x = 2 we have y = −4. Therefore, as x varies over the interval 0 < x < 2, y varies over the interval 
−4 < y < 4. The inverse function is $x = (4 - y)^{1/3}$ and 

$$\frac{dx}{dy} = -\frac{1}{3}(4 - y)^{-2/3}.$$

Therefore, for −4 < y < 4, 

$$g(y) = f[(4 - y)^{1/3}] \left| \frac{dx}{dy} \right| = \frac{1}{2}(4 - y)^{1/3} \cdot \frac{1}{3}(4 - y)^{-2/3} = \frac{1}{6(4 - y)^{1/3}}.$$


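5. If y = ax + b, the inverse function is x = (y − b)/a and dx/dy = 1/a. Therefore, 

$$g(y) = f\!\left[\frac{1}{a}(y - b)\right] \left| \frac{dx}{dy} \right| = \frac{1}{|a|}\, f\!\left(\frac{y - b}{a}\right).$$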


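6. X lies between 0 and 2 if and only if Y lies between 2 and 8. Therefore, it follows from Exercise 3 that 
for 2 < y < 8, 

$$g(y) = \frac{1}{3}\, f\!\left(\frac{y - 2}{3}\right) = \frac{1}{3} \cdot \frac{1}{2} \cdot \frac{y - 2}{3} = \frac{1}{18}(y - 2).$$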


7. (a) If $y = x^2$, then as x varies over the interval (0,1), y also varies over the interval (0,1). Also, 
$x = y^{1/2}$ and $dx/dy = y^{-1/2}/2$. Hence, for 0 < y < 1, 

$$g(y) = f(y^{1/2}) \left| \frac{dx}{dy} \right| = 1 \cdot \frac{1}{2} y^{-1/2} = \frac{1}{2} y^{-1/2}.$$

(b) If $y = -x^3$, then as x varies over the interval (0,1), y varies over the interval (−1,0). Also, 
$x = -y^{1/3}$ and $dx/dy = -\frac{1}{3} y^{-2/3}$. Hence, for −1 < y < 0, 

$$g(y) = f(-y^{1/3}) \left| \frac{dx}{dy} \right| = \frac{1}{3} |y|^{-2/3}.$$

(c) If $y = x^{1/2}$, then as x varies over the interval (0,1), y also varies over the interval (0,1). Also, 
$x = y^2$ and dx/dy = 2y. Hence, for 0 < y < 1, $g(y) = f(y^2)\,2y = 2y$. 


8. As x varies over all positive values, y also varies over all positive values. Also, $x = y^2$ and dx/dy = 2y. 
Therefore, for y > 0, 

$$g(y) = f(y^2)(2y) = 2y \exp(-y^2).$$


9. The c.d.f. G(y) corresponding to the p.d.f. g(y) is, for 0 < y < 2, 

$$G(y) = \int_0^y g(t)\,dt = \int_0^y \frac{3}{8} t^2\,dt = \frac{1}{8} y^3.$$

We know that the c.d.f. of the random variable $Y = G^{-1}(X)$ will be G. We must therefore determine 
the inverse function $G^{-1}$. If $X = G(Y) = Y^3/8$ then $Y = G^{-1}(X) = 2X^{1/3}$. It follows that $Y = 2X^{1/3}$. 
10. For 0 < x < 2, the c.d.f. of X is 

$$F(x) = \int_0^x f(t)\,dt = \int_0^x \frac{1}{2} t\,dt = \frac{1}{4} x^2.$$

Therefore, by the probability integral transformation, we know that $U = F(X) = X^2/4$ will have 
the uniform distribution on the interval [0,1]. Since U has this uniform distribution, we know from 
Exercise 8 that $Y = 2U^{1/3}$ will have the required p.d.f. g. Therefore, the required transformation is 
$Y = 2U^{1/3} = 2(X^2/4)^{1/3} = 2^{1/3} X^{2/3}$. 


11. We can use the probability integral transformation if we can find the inverse of the c.d.f. The c.d.f. is, 
for 0 < y < 1, 

$$G(y) = \int_0^y g(t)\,dt = \frac{1}{2} \int_0^y (2t + 1)\,dt = \frac{1}{2}(y^2 + y).$$

The inverse of this function can be found by setting G(y) = p and solving for y. 

$$\frac{1}{2}(y^2 + y) = p; \qquad y^2 + y - 2p = 0; \qquad y = \frac{-1 + (1 + 8p)^{1/2}}{2}.$$

So, we should generate four independent uniform pseudo-random variables $P_1, P_2, P_3, P_4$ and let $Y_i = 
[-1 + (1 + 8P_i)^{1/2}]/2$ for i = 1, 2, 3, 4. 
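
The recipe in Exercise 11 of Sec. 3.8 is an instance of inverse-c.d.f. sampling, and it can be sketched in a few lines. The code below assumes the c.d.f. G(y) = (y² + y)/2 on (0, 1) and its inverse as derived above; the seed and the names `cdf` and `inverse_cdf` are illustrative choices, and Python's `random` module stands in for whatever uniform generator is actually used.

```python
import math
import random


def cdf(y):
    """G(y) = (y^2 + y) / 2 for 0 < y < 1."""
    return 0.5 * (y * y + y)


def inverse_cdf(p):
    """Solve G(y) = p for y, taking the root that lies in (0, 1)."""
    return (-1.0 + math.sqrt(1.0 + 8.0 * p)) / 2.0


rng = random.Random(12)  # fixed seed so the run is reproducible
uniforms = [rng.random() for _ in range(4)]
samples = [inverse_cdf(p) for p in uniforms]
print(samples)
```

Applying `cdf` to each sample should return the uniform value it was generated from, which is a convenient check that the inverse was derived correctly.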


12. Let X have the uniform distribution on [0,1], and let F be a c.d.f. Let $F^{-1}(p)$ be defined as the smallest 
x such that $F(x) \ge p$. Define $Y = F^{-1}(X)$. We need to show that $\Pr(Y \le y) = F(y)$ for all y. First, 
suppose that y is the unique x such that F(x) = F(y). Then $Y \le y$ if and only if $X \le F(y)$. Since 
X has a uniform distribution, $\Pr(X \le F(y)) = F(y)$. Next, suppose that F(x) = F(y) for all x in the 
interval [a,b) or [a,b] with b > a, and suppose that F(x) < F(y) for all x < a. Then $F^{-1}(X) \le y$ if 
and only if $X \le F(a) = F(y)$. Once again $\Pr(X \le F(y)) = F(y)$. 


13. The inverse transformation is x = 1/t with derivative $-1/t^2$. The p.d.f. of T is 
$g(t) = f(1/t)/t^2 = 2\exp(-2/t)/t^2$, 
for t > 0. 


14. Let Y = cX + d. The inverse transformation is x = (y − d)/c. Assume that c > 0. The derivative of 
the inverse is 1/c. The p.d.f. of Y is 

$$g(y) = f((y - d)/c)/c = [c(b - a)]^{-1}, \quad \text{for } a \le (y - d)/c \le b.$$

It is easy to see that $a \le (y - d)/c \le b$ if and only if $ca + d \le y \le cb + d$, so g is the p.d.f. of the uniform 
distribution on the interval [ca + d, cb + d]. If c < 0, the distribution of Y would be uniform on the 
interval [cb + d, ca + d]. If c = 0, the distribution of Y is degenerate at the value d, i.e., Pr(Y = d) = 1. 


15. Let F be the c.d.f. of X. First, find the c.d.f. of Y, namely, for y > 0, 

$$\Pr(Y \le y) = \Pr(X^2 \le y) = \Pr(-y^{1/2} \le X \le y^{1/2}) = F(y^{1/2}) - F(-y^{1/2}).$$

Now, the p.d.f. of Y is the derivative of the above expression, namely, 

$$\frac{d}{dy}\left[F(y^{1/2}) - F(-y^{1/2})\right] = \frac{f(y^{1/2})}{2y^{1/2}} + \frac{f(-y^{1/2})}{2y^{1/2}}.$$

This equals the expression in the exercise. 
16. Because 0 < X < 1 with probability 1, squaring X produces smaller values. There are wide intervals 
of values of X that produce small values of X? but the values of X that produce large values of X? are 
more limited. For example, to get Y € [0.9, 1], you need X € [0.9487, 1], whereas to get Y € [0,0.1] (an 
interval of the same length), you need X € [0, 0.3162], a much bigger set. 


17. (a) According to the problem description, Y = 0 if $X \le 100$, Y = X − 100 if $100 < X \le 5100$, and 
Y = 5000 if X > 5100. So, Y = r(X), where 

$$r(x) = \begin{cases} 0 & \text{if } x \le 100, \\ x - 100 & \text{if } 100 < x \le 5100, \\ 5000 & \text{if } x > 5100. \end{cases}$$

(b) Let G be the c.d.f. of Y. Then G(y) = 0 for y < 0, and G(y) = 1 for $y \ge 5000$. For $0 \le y < 5000$, 

$$\Pr(Y \le y) = \Pr(r(X) \le y) = \Pr(X \le y + 100) = \int_0^{y+100} \frac{dx}{(1 + x)^2} = 1 - \frac{1}{y + 101}.$$

In summary, 

$$G(y) = \begin{cases} 0 & \text{if } y < 0, \\ 1 - \dfrac{1}{y + 101} & \text{if } 0 \le y < 5000, \\ 1 & \text{if } y \ge 5000. \end{cases}$$

(c) There is positive probability that Y = 5000, but the rest of the distribution of Y is spread out in 
a continuous manner between 0 and 5000. 
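
The transformation r and the c.d.f. G from Exercise 17 of Sec. 3.8 translate directly into code, which makes the jumps in G easy to inspect. This is a sketch only; the function names `payment` and `cdf_payment` are illustrative, and the formulas assume the p.d.f. 1/(1 + x)² for x > 0 used in the integral of part (b).

```python
def payment(x):
    """r(x): 0 up to 100, then x - 100, capped at 5000 once x exceeds 5100."""
    if x <= 100:
        return 0.0
    if x <= 5100:
        return x - 100.0
    return 5000.0


def cdf_payment(y):
    """G(y) from part (b)."""
    if y < 0:
        return 0.0
    if y < 5000:
        return 1.0 - 1.0 / (y + 101.0)
    return 1.0


# Size of the jump of G at y = 5000, i.e. Pr(Y = 5000) = Pr(X > 5100).
mass_at_cap = 1.0 - (1.0 - 1.0 / 5101.0)
# Value of G at y = 0 under the formula in part (b), i.e. Pr(X <= 100).
value_at_zero = cdf_payment(0.0)
print(mass_at_cap, value_at_zero)
```

Evaluating the formula at y = 0 gives 100/101, so under these assumptions G also jumps at 0; readers may want to keep that in mind when interpreting part (c).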


3.9 Functions of Two or More Random Variables 


Commentary 


The material in this section can be very difficult, even for students who have studied calculus. Many textbooks 
at this level avoid the topic of general bivariate and multivariate transformations altogether. If an instructor 
wishes to avoid discussion of Jacobians and multivariate transformations, it might still be useful to introduce 
convolution, and the extremes of a random sample. The text is organized so that these topics appear early 
in the section, before any discussion of Jacobians. In the remainder of the text, the method of Jacobians is 
used in the following places: 


• The proof of Theorem 5.8.1, the derivation of the beta distribution p.d.f. 

• The proof of Theorem 5.10.1, the derivation of the joint p.d.f. of the bivariate normal distribution. 

• The proof of Theorem 8.3.1, the derivation of the joint distribution of the sample mean and sample 
  variance from a random sample of normal random variables. 

• The proof of Theorem 8.4.1, the derivation of the p.d.f. of the t distribution. 


Solutions to Exercises 


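1. The joint p.d.f. of $X_1$ and $X_2$ is 

$$f(x_1, x_2) = \begin{cases} 1 & \text{for } 0 < x_1 < 1,\ 0 < x_2 < 1, \\ 0 & \text{otherwise.} \end{cases}$$

By Eq. (3.9.5), the p.d.f. of Y is 

$$g(y) = \int_{-\infty}^{\infty} f(y - z, z)\,dz.$$

The integrand is positive only for 0 < y − z < 1 and 0 < z < 1. Therefore, for $0 < y \le 1$ it is positive 
only for 0 < z < y and we have 

$$g(y) = \int_0^y 1 \cdot dz = y.$$

For 1 < y < 2, the integrand is positive only for y − 1 < z < 1 and we have 

$$g(y) = \int_{y-1}^1 1 \cdot dz = 2 - y.$$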

2. Let f be the p.d.f. of $Y = X_1 + X_2$ found in Exercise 1, and let Z = Y/2. The inverse of this 
transformation is y = 2z with derivative 2. The p.d.f. of Z is 

$$g(z) = 2f(2z) = \begin{cases} 4z & \text{for } 0 < z \le 1/2, \\ 4(1 - z) & \text{for } 1/2 < z < 1, \\ 0 & \text{otherwise.} \end{cases}$$
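
The two triangular densities from Exercises 1 and 2 of Sec. 3.9 can be written as functions and checked to integrate to 1. This is a minimal sketch; the names `pdf_sum`, `pdf_mean`, and `midpoint_integral` are illustrative, and the densities are the ones derived above for the sum and the average of two independent uniform(0, 1) variables.

```python
def midpoint_integral(func, lower, upper, steps=2000):
    """Approximate the integral of func over [lower, upper] by the midpoint rule."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width


def pdf_sum(y):
    """p.d.f. of Y = X1 + X2: y on (0, 1], 2 - y on (1, 2), 0 elsewhere."""
    if 0 < y <= 1:
        return y
    if 1 < y < 2:
        return 2.0 - y
    return 0.0


def pdf_mean(z):
    """p.d.f. of Z = Y / 2, obtained from the change of variable y = 2z."""
    return 2.0 * pdf_sum(2.0 * z)


mass_sum = midpoint_integral(pdf_sum, 0.0, 2.0)
mass_mean = midpoint_integral(pdf_mean, 0.0, 1.0)
print(round(mass_sum, 6), round(mass_mean, 6), pdf_mean(0.25), pdf_mean(0.75))
```

Defining `pdf_mean` through `pdf_sum` mirrors the relation g(z) = 2f(2z), so the piecewise form 4z and 4(1 − z) does not need to be coded separately.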


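3. The inverse transformation is: 

$$x_1 = y_1, \qquad x_2 = y_2/y_1, \qquad x_3 = y_3/y_2.$$

Furthermore, the set S where $0 < x_i < 1$ for i = 1, 2, 3 corresponds to the set T where $0 < y_3 < y_2 < 
y_1 < 1$. We also have 

$$J = \det \begin{bmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} & \dfrac{\partial x_1}{\partial y_3} \\ \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} & \dfrac{\partial x_2}{\partial y_3} \\ \dfrac{\partial x_3}{\partial y_1} & \dfrac{\partial x_3}{\partial y_2} & \dfrac{\partial x_3}{\partial y_3} \end{bmatrix} = \det \begin{bmatrix} 1 & 0 & 0 \\ -\dfrac{y_2}{y_1^2} & \dfrac{1}{y_1} & 0 \\ 0 & -\dfrac{y_3}{y_2^2} & \dfrac{1}{y_2} \end{bmatrix} = \frac{1}{y_1 y_2}.$$

Therefore, for $0 < y_3 < y_2 < y_1 < 1$, the joint p.d.f. of $Y_1$, $Y_2$, and $Y_3$ is 

$$g(y_1, y_2, y_3) = f\!\left(y_1, \frac{y_2}{y_1}, \frac{y_3}{y_2}\right) |J| = 8 y_1 \cdot \frac{y_2}{y_1} \cdot \frac{y_3}{y_2} \cdot \frac{1}{y_1 y_2} = \frac{8 y_3}{y_1 y_2}.$$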
4. As a convenient device, let $Z = X_1$. Then the transformation from $X_1$ and $X_2$ to $Y$ and $Z$ is a
one-to-one transformation between the set $S$ where $0 < x_1 < 1$ and $0 < x_2 < 1$ and the set $T$ where
$0 < y < z < 1$. The inverse transformation is

\[ x_1 = z, \qquad x_2 = \frac{y}{z}. \]

Therefore,

\[ J = \det \begin{bmatrix} \partial x_1/\partial y & \partial x_1/\partial z \\ \partial x_2/\partial y & \partial x_2/\partial z \end{bmatrix}
= \det \begin{bmatrix} 0 & 1 \\ 1/z & -y/z^2 \end{bmatrix} = -\frac{1}{z}. \]

For $0 < y < z < 1$, the joint p.d.f. of $Y$ and $Z$ is

\[ g(y, z) = f\!\left(z, \frac{y}{z}\right) |J| = \left(z + \frac{y}{z}\right)\left(\frac{1}{z}\right). \]

It follows that for $0 < y < 1$, the marginal p.d.f. of $Y$ is

\[ g_1(y) = \int_y^1 g(y, z)\, dz = 2(1 - y). \]


5. As a convenient device, let $Y = X_2$. Then the transformation from $X_1$ and $X_2$ to $Y$ and $Z$ is a
one-to-one transformation between the set $S$ where $0 < x_1 < 1$ and $0 < x_2 < 1$ and the set $T$ where
$0 < y < 1$ and $0 < yz < 1$. The inverse transformation is

\[ x_1 = yz, \qquad x_2 = y. \]

Therefore,

\[ J = \det \begin{bmatrix} z & y \\ 1 & 0 \end{bmatrix} = -y. \]

The region where the p.d.f. of $(Z, Y)$ is positive is in Fig. S.3.34.

Figure S.3.34: The region where the p.d.f. of (Z, Y) is positive in Exercise 5 of Sec. 3.9.

For $0 < y < 1$ and $0 < yz < 1$, the joint p.d.f. of $Y$ and $Z$ is

\[ g(y, z) = f(yz, y)|J| = (yz + y)(y). \]

It follows that for $0 < z \le 1$, the marginal p.d.f. of $Z$ is

\[ g_2(z) = \int_0^1 g(y, z)\, dy = \frac{1}{3}(z + 1). \]

Also, for $z > 1$,

\[ g_2(z) = \int_0^{1/z} g(y, z)\, dy = \frac{1}{3z^3}(z + 1). \]


6. By Eq. (3.9.5) (with a change in notation),

\[ g(z) = \int_{-\infty}^{\infty} f(z - t, t)\, dt \quad \text{for } -\infty < z < \infty. \]

However, the integrand is positive only for $0 < z - t < t < 1$. Therefore, for $0 < z < 1$, it is positive
only for $z/2 < t < z$ and we have

\[ g(z) = \int_{z/2}^{z} 2z\, dt = z^2. \]

For $1 < z < 2$, the integrand is positive only for $z/2 < t < 1$ and we have

\[ g(z) = \int_{z/2}^{1} 2z\, dt = z(2 - z). \]


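7. Let $Z = -X_2$. Then the p.d.f. of $Z$ is

\[ f_2(z) = \begin{cases} \exp(z) & \text{for } z < 0, \\ 0 & \text{for } z \ge 0. \end{cases} \]

Since $X_1$ and $Z$ are independent, the joint p.d.f. of $X_1$ and $Z$ is

\[ f(x, z) = \begin{cases} \exp(-(x - z)) & \text{for } x > 0,\ z < 0, \\ 0 & \text{otherwise.} \end{cases} \]

It now follows from Eq. (3.9.5) that the p.d.f. of $Y = X_1 - X_2 = X_1 + Z$ is

\[ g(y) = \int_{-\infty}^{\infty} f(y - z, z)\, dz. \]

The integrand is positive only for $y - z > 0$ and $z < 0$. Therefore, for $y \le 0$,

\[ g(y) = \int_{-\infty}^{y} \exp(-(y - 2z))\, dz = \frac{1}{2} \exp(y). \]

Also, for $y > 0$,

\[ g(y) = \int_{-\infty}^{0} \exp(-(y - 2z))\, dz = \frac{1}{2} \exp(-y). \]
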
8. We have

\[ \Pr(Y_n > 0.99) = 1 - \Pr(Y_n \le 0.99) = 1 - \Pr(\text{All } n \text{ observations} \le 0.99) = 1 - (0.99)^n. \]

Next, $1 - (0.99)^n \ge 0.95$ if and only if

\[ (0.99)^n \le 0.05 \quad \text{or} \quad n \log(0.99) \le \log(0.05) \quad \text{or} \quad n \ge \frac{\log(0.05)}{\log(0.99)} \approx 298.1. \]

So, $n \ge 299$ is needed.
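
The bound above can be checked numerically. The following is a minimal sketch, not part of the original solution; the function name `smallest_sample_size` is hypothetical and the loop only guards against floating-point rounding at the boundary.

```python
import math


def smallest_sample_size(p_below: float, target: float) -> int:
    """Smallest n with 1 - p_below**n >= target (hypothetical helper)."""
    # Solve p_below**n <= 1 - target; log(p_below) < 0 flips the inequality.
    bound = math.log(1.0 - target) / math.log(p_below)
    n = math.ceil(bound)
    # Guard against rounding error right at the boundary.
    while 1.0 - p_below ** n < target:
        n += 1
    return n


print(smallest_sample_size(0.99, 0.95))
```

With `p_below = 0.99` and `target = 0.95` this reproduces the value of n found in the solution.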


9. It was shown in this section that the joint c.d.f. of $Y_1$ and $Y_n$ is, for $-\infty < y_1 < y_n < \infty$,

\[ G(y_1, y_n) = [F(y_n)]^n - [F(y_n) - F(y_1)]^n. \]

Since $F(y) = y$ for the given uniform distribution, we have

\[ \Pr(Y_1 \le 0.1, Y_n \le 0.8) = G(0.1, 0.8) = (0.8)^n - (0.7)^n. \]

10. We have

\[ \Pr(Y_1 \le 0.1 \text{ and } Y_n \ge 0.8) = \Pr(Y_1 \le 0.1) - \Pr(Y_1 \le 0.1 \text{ and } Y_n \le 0.8). \]

It was shown in this section that the c.d.f. of $Y_1$ is

\[ G_1(y) = 1 - [1 - F(y)]^n. \]

Therefore, $\Pr(Y_1 \le 0.1) = G_1(0.1) = 1 - (0.9)^n$. Also, by Exercise 9,

\[ \Pr(Y_1 \le 0.1 \text{ and } Y_n \le 0.8) = (0.8)^n - (0.7)^n. \]

Therefore,

\[ \Pr(Y_1 \le 0.1 \text{ and } Y_n \ge 0.8) = 1 - (0.9)^n - (0.8)^n + (0.7)^n. \]


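11. The required probability is equal to

\[ \Pr\!\left(\text{All } n \text{ observations} < \tfrac{1}{3}\right) + \Pr\!\left(\text{All } n \text{ observations} > \tfrac{1}{3}\right) = \left(\tfrac{1}{3}\right)^n + \left(\tfrac{2}{3}\right)^n. \]

This exercise could also be solved by using techniques similar to those used in Exercise 10.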


12. The p.d.f. $h_1(w)$ of $W$ was derived in Example 3.9.8. Therefore,

\[ \Pr(W > 0.9) = \int_{0.9}^{1} h_1(w)\, dw = \int_{0.9}^{1} n(n - 1) w^{n-2} (1 - w)\, dw = 1 - n(0.9)^{n-1} + (n - 1)(0.9)^n. \]


13. If $X$ has the uniform distribution on the interval [0, 1], then $aX + b$ ($a > 0$) has the uniform distribution
on the interval $[b, a + b]$. Therefore, $8X - 3$ has the uniform distribution on the interval [-3, 5]. It
follows that if $X_1, \ldots, X_n$ form a random sample from the uniform distribution on the interval [0, 1],
then the $n$ random variables $8X_1 - 3, \ldots, 8X_n - 3$ will have the same joint distribution as a random
sample from the uniform distribution on the interval [-3, 5].

Next, it follows that if the range of the sample $X_1, \ldots, X_n$ is $W$, then the range of the sample
$8X_1 - 3, \ldots, 8X_n - 3$ will be $8W$. Therefore, if $W$ is the range of a random sample from the uniform
distribution on the interval [0, 1], then $Z = 8W$ will have the same distribution as the range of a random
sample from the uniform distribution on the interval [-3, 5].

The p.d.f. $h(w)$ of $W$ was given in Example 3.9.8. Therefore, the p.d.f. $g(z)$ of $Z = 8W$ is

\[ g(z) = h\!\left(\frac{z}{8}\right) \cdot \frac{1}{8} = \frac{n(n - 1)}{8} \left(\frac{z}{8}\right)^{n-2} \left(1 - \frac{z}{8}\right), \]

for $0 < z < 8$.

This p.d.f. $g(z)$ could also have been derived from first principles as in Example 3.9.8.


14. Following the hint given in this exercise, we have

\[ \begin{aligned}
G(y) &= \Pr(\text{At least } n - 1 \text{ observations are} \le y) \\
&= \Pr(\text{Exactly } n - 1 \text{ observations are} \le y) + \Pr(\text{All } n \text{ observations are} \le y) \\
&= n y^{n-1}(1 - y) + y^n = n y^{n-1} - (n - 1) y^n.
\end{aligned} \]

Therefore, for $0 < y < 1$,

\[ g(y) = n(n - 1) y^{n-2} - n(n - 1) y^{n-1} = n(n - 1) y^{n-2} (1 - y). \]

It is a curious result that for this uniform distribution, the p.d.f. of $Y$ is the same as the p.d.f. of the
range $W$, as given in Example 3.9.8. There actually is intuition to support those two distributions
being the same.


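15. For any $n$ sets of real numbers $A_1, \ldots, A_n$, we have

\[ \begin{aligned}
\Pr(Y_1 \in A_1, \ldots, Y_n \in A_n) &= \Pr[r_1(X_1) \in A_1, \ldots, r_n(X_n) \in A_n] \\
&= \Pr[r_1(X_1) \in A_1] \cdots \Pr[r_n(X_n) \in A_n] \\
&= \Pr(Y_1 \in A_1) \cdots \Pr(Y_n \in A_n).
\end{aligned} \]

Therefore, $Y_1, \ldots, Y_n$ are independent by Definition 3.5.2.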


16. If $f$ factors in the form given in this exercise, then there must exist a constant $c > 0$ such that the
marginal joint p.d.f. of $X_1$ and $X_2$ is

\[ f_{12}(x_1, x_2) = c\, g(x_1, x_2) \quad \text{for } (x_1, x_2) \in R^2, \]

the marginal joint p.d.f. of $X_3$, $X_4$, and $X_5$ is

\[ f_{345}(x_3, x_4, x_5) = \frac{1}{c}\, h(x_3, x_4, x_5) \quad \text{for } (x_3, x_4, x_5) \in R^3, \]

and, therefore, for every point $(x_1, \ldots, x_5) \in R^5$ we have

\[ f(x_1, \ldots, x_5) = f_{12}(x_1, x_2)\, f_{345}(x_3, x_4, x_5). \]

It now follows that for any sets of real numbers $A_1$ and $A_2$,

\[ \begin{aligned}
\Pr(Y_1 \in A_1 \text{ and } Y_2 \in A_2)
&= \int \cdots \int_{\substack{r_1(x_1, x_2) \in A_1 \text{ and} \\ r_2(x_3, x_4, x_5) \in A_2}} f(x_1, \ldots, x_5)\, dx_1 \cdots dx_5 \\
&= \iint_{r_1(x_1, x_2) \in A_1} f_{12}(x_1, x_2)\, dx_1\, dx_2 \iiint_{r_2(x_3, x_4, x_5) \in A_2} f_{345}(x_3, x_4, x_5)\, dx_3\, dx_4\, dx_5 \\
&= \Pr(Y_1 \in A_1) \Pr(Y_2 \in A_2).
\end{aligned} \]

Therefore, by definition, $Y_1$ and $Y_2$ are independent.
17. We need to transform $(X, Y)$ to $(Z, W)$, where $Z = XY$ and $W = Y$. The joint p.d.f. of $(X, Y)$ is

\[ f(x, y) = \begin{cases} y \exp(-xy) f_2(y) & \text{if } x > 0, \\ 0 & \text{otherwise.} \end{cases} \]

The inverse transformation is $x = z/w$ and $y = w$. The Jacobian is

\[ J = \det \begin{bmatrix} 1/w & -z/w^2 \\ 0 & 1 \end{bmatrix} = \frac{1}{w}. \]

The joint p.d.f. of $(Z, W)$ is

\[ g(z, w) = f(z/w, w)/w = w \exp(-z) f_2(w)/w = \exp(-z) f_2(w), \quad \text{for } z > 0. \]

This is clearly factored in the appropriate way to show that $Z$ and $W$ are independent. Indeed, if we
integrate $g(z, w)$ over $w$, we obtain the marginal p.d.f. of $Z$, namely $g_1(z) = \exp(-z)$, for $z > 0$. This
is the same as the function in (3.9.18).


18. We need to transform $(X, Y)$ to $(Z, W)$, where $Z = X/Y$ and $W = Y$. The joint p.d.f. of $(X, Y)$ is

\[ f(x, y) = \begin{cases} 3x^2 f_2(y)/y^3 & \text{if } 0 < x < y, \\ 0 & \text{otherwise.} \end{cases} \]

The inverse transformation is $x = zw$ and $y = w$. The Jacobian is

\[ J = \det \begin{bmatrix} w & z \\ 0 & 1 \end{bmatrix} = w. \]

The joint p.d.f. of $(Z, W)$ is

\[ g(z, w) = f(zw, w)\, w = 3z^2 w^2 f_2(w)\, w/w^3 = 3z^2 f_2(w), \quad \text{for } 0 < z < 1. \]

This is clearly factored in the appropriate way to show that $Z$ and $W$ are independent. Indeed, if we
integrate $g(z, w)$ over $w$, we obtain the marginal p.d.f. of $Z$, namely $g_1(z) = 3z^2$, for $0 < z < 1$.
19. This is a convolution. Let $g$ be the p.d.f. of $Y$. By (3.9.5) we have, for $y > 0$,

\[ g(y) = \int f(y - z) f(z)\, dz = \int_0^y e^{z - y} e^{-z}\, dz = y e^{-y}. \]

Clearly, $g(y) = 0$ for $y < 0$, so the p.d.f. of $Y$ is

\[ g(y) = \begin{cases} y e^{-y} & \text{for } y > 0, \\ 0 & \text{otherwise.} \end{cases} \]


20. Let $f_1$ stand for the marginal p.d.f. of $X_1$, namely $f_1(x) = \int f(x, x_2)\, dx_2$. With $a_2 = 0$ and
$a_1 = a$ in (3.9.2) we get

\[ g(y) = \int_{-\infty}^{\infty} f\!\left(\frac{y - b}{a}, x_2\right) \frac{1}{|a|}\, dx_2 = \frac{1}{|a|} f_1\!\left(\frac{y - b}{a}\right), \]

which is the same as (3.8.1).


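21. Transforming to $Z_1 = X_1/X_2$ and $Z_2 = X_1$ has the inverse $X_1 = Z_2$ and $X_2 = Z_2/Z_1$. The set of
values where the joint p.d.f. of $Z_1$ and $Z_2$ is positive is where $0 < z_2 < 1$ and $0 < z_2/z_1 < 1$. This can
be written as $0 < z_2 < \min\{1, z_1\}$. The Jacobian is the determinant of the matrix

\[ \begin{bmatrix} 0 & 1 \\ -z_2/z_1^2 & 1/z_1 \end{bmatrix}, \]

which has absolute value $z_2/z_1^2$. The joint p.d.f. of $Z_1$ and $Z_2$ is then

\[ g(z_1, z_2) = 4 z_2 \cdot \frac{z_2}{z_1} \cdot \frac{z_2}{z_1^2} = \frac{4 z_2^3}{z_1^3}, \]

for $0 < z_2 < \min\{1, z_1\}$. Integrating $z_2$ out of this yields, for $z_1 > 0$,

\[ g_1(z_1) = \int_0^{\min\{1, z_1\}} \frac{4 z_2^3}{z_1^3}\, dz_2 = \frac{\min\{z_1, 1\}^4}{z_1^3}
= \begin{cases} z_1 & \text{if } z_1 < 1, \\ z_1^{-3} & \text{if } z_1 \ge 1. \end{cases} \]

This is the same thing we got in Example 3.9.11.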
3.10 Markov Chains


Commentary 


Instructors can discuss this section at any time that they find convenient or they can omit it entirely. 
Instructors who wish to cover Sec. 12.5 (Markov chain Monte Carlo) and who wish to give some theoretical 
justification for the methodology will want to discuss some of this material before covering Sec. 12.5. On 
the other hand, one could cover Sec. 12.5 and skip the justification for the methodology without introducing 
Markov chains at all. 

Students may notice the following property, which is exhibited in some of the exercises at the end of this
section: Suppose that the Markov chain is in a given state $s_i$ at time $n$. Then the probability of being in a
particular state $s_j$ a few periods later, say at time $n + 3$ or $n + 4$, is approximately the same for each possible
given state $s_i$ at time $n$. For example, in Exercise 2, the probability that it will be sunny on Saturday is
approximately the same regardless of whether it is sunny or cloudy on the preceding Wednesday, three days
earlier. In Exercise 5, for given probabilities on Wednesday, the probability that it will be cloudy on Friday is
approximately the same as the probability that it will be cloudy on Saturday. In Exercise 7, the probability
that the student will be on time on the fourth day of class is approximately the same regardless of whether
he was late or on time on the first day of class. In Exercise 10, the probabilities for $n = 3$ and $n = 4$ are
generally similar. In Exercise 11, the answers in part (a) and part (b) are almost identical.

This property is a reflection of the fact that for many Markov chains, the $n$th power of the transition
matrix $P^n$ will converge, as $n \to \infty$, to a matrix for which all the elements in any given column are equal.
For example, in Exercise 2, the matrix $P^n$ converges to the following matrix:

\[ \begin{bmatrix} 2/3 & 1/3 \\ 2/3 & 1/3 \end{bmatrix}. \]

This type of convergence is an example of Theorem 3.10.4. This theorem, and analogs for more complicated
Markov chains, provide the justification of the Markov chain Monte Carlo method introduced in
Sec. 12.5.
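
The convergence described above can be illustrated with a short computation. This is only a sketch: the matrix `P` below (states ordered sunny, cloudy) is inferred from the answers to Exercises 2, 3, and 19 rather than quoted from the exercise, and `mat_mul` and `mat_pow` are hypothetical helper names.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_pow(p, n):
    """n-th power of a square matrix by repeated multiplication."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result


# Assumed transition matrix for the weather chain (sunny, cloudy).
P = [[0.7, 0.3],
     [0.6, 0.4]]

for n in (1, 3, 10):
    print(n, [[round(x, 4) for x in row] for row in mat_pow(P, n)])
```

Both rows of the printed powers approach the same vector, which is the behavior the commentary attributes to Theorem 3.10.4.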


Solutions to Exercises 


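1. The transition matrix for this Markov chain is

\[ P = \begin{bmatrix} 1/3 & 2/3 \\ 2/3 & 1/3 \end{bmatrix}. \]

(a) If we multiply the initial probability vector by this matrix we get

\[ vP = \left( \frac{1}{2} \cdot \frac{1}{3} + \frac{1}{2} \cdot \frac{2}{3},\ \frac{1}{2} \cdot \frac{2}{3} + \frac{1}{2} \cdot \frac{1}{3} \right) = \left( \frac{1}{2}, \frac{1}{2} \right). \]

(b) The two-step transition matrix is $P^2$, namely

\[ P^2 = \begin{bmatrix} 5/9 & 4/9 \\ 4/9 & 5/9 \end{bmatrix}. \]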

2. (a) 0.4, the lower right corner of the matrix.
(b) (0.7)(0.7) = 0.49.
(c) The probability that it will be cloudy on the next three days is (0.4)(0.4)(0.4) = 0.064. The desired
probability is 1 - 0.064 = 0.936.


3. Saturday is three days after Wednesday, so we first compute

\[ P^3 = \begin{bmatrix} 0.667 & 0.333 \\ 0.666 & 0.334 \end{bmatrix}. \]

Therefore, the answers are (a) 0.667 and (b) 0.666.
4. (a) From Exercise 3, the probability that it will be sunny on Saturday is 0.667. Therefore, the answer 
is (0.667) (0.7) = 0.4669. 
(b) From Exercise 3, the probability that it will be sunny on Saturday is 0.666. Therefore, the answer 
is (0.666) (0.7) = 0.4662. 
5. Let v = (0.2, 0.8). 
(a) The answer will be the second component of the vector vP. We easily compute vP = (0.62, 0.38), 
so the probability is 0.38. 


(b) The answer will be the second component of $vP^2$. We can compute $vP^2$ by multiplying $vP$ by
$P$ to get (0.662, 0.338), so the probability is 0.338.

(c) The answer will be the second component of $vP^3$. Since $vP^3$ = (0.6662, 0.3338), the answer is
0.3338.


6. In this exercise (and the next two) the transition matrix P is

              Late    On time
   Late       0.2     0.8
   On time    0.5     0.5

(a) (0.8)(0.5)(0.5) = 0.2.
(b) (0.5)(0.2)(0.2) = 0.02.


7. Using the matrix in Exercise 6, it is found that

\[ P^3 = \begin{bmatrix} 0.368 & 0.632 \\ 0.395 & 0.605 \end{bmatrix}. \]

Therefore, the answers are (a) 0.632 and (b) 0.605.
8. Let v = (0.7,0.3). 


(a) The answer will be the first component of the vector vP. We can easily compute vP = (0.29, 0.71), 
so the answer is 0.29. 


(b) The answer will be the second component of the vector $vP^3$. We compute $vP^3$ = (0.3761, 0.6239),
so the answer is 0.6239.
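
The vectors quoted in Exercise 8 can be reproduced by stepping the distribution forward one day at a time. This is a minimal sketch under the assumption that the matrix of Exercise 6 is the one implied by the probabilities used in Exercises 6 to 8 (states ordered late, on time); `step` is a hypothetical helper.

```python
def step(v, p):
    """One step of the chain: row vector v times transition matrix p."""
    return [sum(v[i] * p[i][j] for i in range(len(v))) for j in range(len(p[0]))]


# Assumed transition matrix (late, on time), inferred from Exercises 6-8.
P = [[0.2, 0.8],
     [0.5, 0.5]]
v = [0.7, 0.3]

dist = v
history = []
for day in range(1, 4):
    dist = step(dist, P)
    history.append(dist)
    print(day, [round(x, 4) for x in dist])
```

The first and third printed vectors are the ones used in parts (a) and (b).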
9. (a) It is found that

\[ P^2 = \begin{bmatrix}
3/16 & 7/16 & 2/16 & 4/16 \\
0 & 1 & 0 & 0 \\
6/16 & 2/16 & 4/16 & 4/16 \\
4/16 & 6/16 & 3/16 & 3/16
\end{bmatrix}. \]

The answer is given by the element in the third row and second column.
(b) The answer is the element in the first row and third column of $P^3$, namely 0.125.


10. Let $v$ be the initial probability vector specified in the exercise.

(a) The probabilities for $s_1$, $s_2$, $s_3$, and $s_4$ will be the four components of the vector $vP$.
(b) The required probabilities will be the four components of $vP^2$.
(c) The required probabilities will be the four components of $vP^3$.


11. The transition matrix for the states A and B is

\[ P = \begin{bmatrix} 1/3 & 2/3 \\ 2/3 & 1/3 \end{bmatrix}. \]

It is found that

\[ P^4 = \begin{bmatrix} 41/81 & 40/81 \\ 40/81 & 41/81 \end{bmatrix}. \]

Therefore, the answers are (a) 40/81 and (b) 41/81.


12. (a) Using the transition probabilities stated in the exercise, we construct

\[ P = \begin{bmatrix} 0.0 & 0.2 & 0.8 \\ 0.6 & 0.0 & 0.4 \\ 0.5 & 0.5 & 0.0 \end{bmatrix}. \]

(b) It is found that

\[ P^2 = \begin{bmatrix} 0.52 & 0.40 & 0.08 \\ 0.20 & 0.32 & 0.48 \\ 0.30 & 0.10 & 0.60 \end{bmatrix}. \]

Let $v = (1/3, 1/3, 1/3)$. The probabilities that A, B, and C will have the ball are equal to the three
components of $vP^2$. Since the third component is largest, it is most likely that C will have the
ball at time $n + 2$.
13. The states are triples of possible outcomes: (HHH), (HHT), (HTH), etc. There are a total of eight
such triples. The conditional probabilities of the possible values of the outcomes on trials (n - 1, n, n + 1)
given all trials up to time n depend only on the trials (n - 2, n - 1, n) and not on n itself, hence we
have a Markov chain with stationary transition probabilities. Every row of the transition matrix has
the following form except the two corresponding to (HHH) and (TTT). Let a, b, c stand for three
arbitrary elements of {H, T}, not all equal. The row for (abc) has 0 in every column except for the two
columns (bcH) and (bcT), which have 1/2 in each. In the (HHH) row, every column has 0 except the
(HHT) column, which has 1. In the (TTT) row, every column has 0 except the (TTH) column, which
has 1.


14. Since we switch a pair of balls during each operation, there are always three balls in box A during this
process. There are a total of nine red balls available, so there are four possible states of the proposed
Markov chain, 0, 1, 2, 3, each state giving the number of red balls in box A. The possible compositions
of box A after the nth operation clearly depend only on the composition after the (n - 1)st operation, so
we have a Markov chain. Also, balls are drawn at random during each operation, so the probabilities of 
transition depend only on the current state. Hence, the transition probabilities are stationary. If there 
are currently 0 red balls in box A, then we shall certainly remove a green ball. The probability that we 
get a red ball from box B is 9/10, otherwise we stay in state 0. So, the first row of P is (1/10, 9/10, 0,0). 
If we start with 1 red ball, then we remove that ball with probability 1/3. We replace whatever we 
draw with a red ball with probability 8/10. So we can either go to state 0 (probability 1/3 x 2/10), 
stay in state 1 (probability 1/3 x 8/10 + 2/3 x 2/10), or go to state 2 (probability 2/3 x 8/10). The 
second row of P is (1/15, 2/5,8/15,0). If we start with 2 red balls, we remove one with probability 2/3 
and we replace it with red with probability 7/10. So, the third row of P is (0,1/5, 17/30, 7/30). If we 
start with 3 red balls, we certainly remove one and we replace it by red with probability 6/10, so the 
fourth row of P is (0,0, 2/5,3/5). 
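
The four rows derived above can be generated mechanically. The sketch below uses exact fractions; the box sizes (3 balls in box A, 10 in box B) are inferred from the probabilities quoted in the solution rather than stated in it, and `transition_row` is a hypothetical helper.

```python
from fractions import Fraction


def transition_row(red_in_a, a_size=3, b_size=10, total_red=9):
    """Row of P for the state 'red_in_a red balls in box A' (assumed box sizes)."""
    red_in_b = total_red - red_in_a
    p_remove_red = Fraction(red_in_a, a_size)   # ball leaving box A is red
    p_draw_red = Fraction(red_in_b, b_size)     # ball arriving from box B is red
    row = [Fraction(0)] * (a_size + 1)
    if red_in_a > 0:
        row[red_in_a - 1] = p_remove_red * (1 - p_draw_red)
    row[red_in_a] = p_remove_red * p_draw_red + (1 - p_remove_red) * (1 - p_draw_red)
    if red_in_a < a_size:
        row[red_in_a + 1] = (1 - p_remove_red) * p_draw_red
    return row


P = [transition_row(k) for k in range(4)]
for row in P:
    print([str(x) for x in row])
```

Each printed row sums to 1 and matches the corresponding row given in the solution.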


15. We are asked to verify the numbers in the second and fifth rows of the matrix in Example 3.10.6. For
the second row, the parents have genotypes AA and Aa, so that the only possible offspring are AA and 
Aa. Each of these occurs with probability 1/2 because they are determined by which allele comes from 
the Aa parent. Since the two offspring in the second generation are independent, we will get {AA, AA} 
with probability (1/2)? = 1/4 and we will get {Aa, Aa} with probability 1/4 also. The remaining 
probability, 1/2, is the probability of {AA, Aa}. For the fifth row, the parents have genotypes Aa and
aa. The only possible offspring are Aa and aa. Indeed, the situation is identical to the second row with 
a and A switched. The resulting probabilities are also the same after this same switch. 


16. We have to multiply the initial probability vector into the transition matrix and do the arithmetic. For
the first coordinate, we obtain

\[ \frac{1}{16} \times 1 + \frac{1}{4} \times 0.25 + \frac{1}{4} \times 0.0625 = \frac{9}{64}. \]

The other five elements are calculated in a similar fashion. The resulting vector is

\[ \left( \frac{9}{64}, \frac{3}{16}, \frac{1}{32}, \frac{5}{16}, \frac{3}{16}, \frac{9}{64} \right). \]

17. (a) We are asked to find the conditional distribution of $X_n$ given $X_{n-1} = \{Aa, aa\}$ and
$X_{n+1} = \{AA, aa\}$. For each possible state $x_n$, we can find

\[ \Pr(X_n = x_n \mid X_{n-1} = \{Aa, aa\}, X_{n+1} = \{AA, aa\})
= \frac{\Pr(X_n = x_n, X_{n+1} = \{AA, aa\} \mid X_{n-1} = \{Aa, aa\})}{\Pr(X_{n+1} = \{AA, aa\} \mid X_{n-1} = \{Aa, aa\})}. \tag{S.3.4} \]

The denominator is 0.0313 from the 2-step transition matrix in Example 3.10.9. The numerator
is the product of two terms from the 1-step transition matrix: one from {Aa, aa} to $x_n$ and the
other from $x_n$ to {AA, aa}. These products are as follows:

   x_n         {AA,AA}   {AA,Aa}   {AA,aa}   {Aa,Aa}         {Aa,aa}   {aa,aa}
   Numerator   0         0         0         0.25 x 0.125    0         0

Plugging these into (S.3.4) gives

\[ \Pr(X_n = \{Aa, Aa\} \mid X_{n-1} = \{Aa, aa\}, X_{n+1} = \{AA, aa\}) = 1, \]

and all other states have probability 0.

(b) This time, we want

\[ \Pr(X_n = x_n \mid X_{n-1} = \{Aa, aa\}, X_{n+1} = \{aa, aa\})
= \frac{\Pr(X_n = x_n, X_{n+1} = \{aa, aa\} \mid X_{n-1} = \{Aa, aa\})}{\Pr(X_{n+1} = \{aa, aa\} \mid X_{n-1} = \{Aa, aa\})}. \]

The denominator is 0.3906. The numerator products and their ratios to the denominator are:

   x_n         {AA,AA}   {AA,Aa}   {AA,aa}   {Aa,Aa}          {Aa,aa}       {aa,aa}
   Numerator   0         0         0         0.25 x 0.0625    0.5 x 0.25    0.25 x 1
   Ratio       0         0         0         0.0400           0.3200        0.6400

This time, we get

\[ \Pr(X_n = x_n \mid X_{n-1} = \{Aa, aa\}, X_{n+1} = \{aa, aa\}) =
\begin{cases} 0.04 & \text{if } x_n = \{Aa, Aa\}, \\ 0.32 & \text{if } x_n = \{Aa, aa\}, \\ 0.64 & \text{if } x_n = \{aa, aa\}, \end{cases} \]

and all others are 0.
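
The ratios in part (b) amount to normalizing the products of one-step probabilities into and out of the middle state. The following sketch uses only the one-step probabilities quoted in the table above; the dictionary layout and the name `middle_state_posterior` are illustrative choices, not the book's notation.

```python
def middle_state_posterior(into_middle, out_of_middle):
    """Normalize products of one-step probabilities over the middle state."""
    products = {s: into_middle[s] * out_of_middle.get(s, 0.0) for s in into_middle}
    total = sum(products.values())  # equals the two-step transition probability
    return total, {s: p / total for s, p in products.items() if p > 0}


# One-step probabilities from {Aa,aa} into each middle state with nonzero mass.
into_middle = {"{Aa,Aa}": 0.25, "{Aa,aa}": 0.5, "{aa,aa}": 0.25}
# One-step probabilities from each middle state into {aa,aa}.
out_of_middle = {"{Aa,Aa}": 0.0625, "{Aa,aa}": 0.25, "{aa,aa}": 1.0}

denominator, posterior = middle_state_posterior(into_middle, out_of_middle)
print(round(denominator, 4), posterior)
```

The normalizing total agrees with the two-step probability used as the denominator in part (b).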


18. We can see from the 2-step transition matrix that it is possible to get from every non-absorbing state 
into each of the absorbing states in two steps. So, no matter what non-absorbing state we start in, 
the probability is one that we will eventually end up in one of the absorbing states. Hence, no distribution
with positive probability on any non-absorbing state can be a stationary distribution. 


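19. The matrix $G$ and its inverse are

\[ G = \begin{bmatrix} -0.3 & 1 \\ 0.6 & 1 \end{bmatrix}, \qquad
G^{-1} = -\frac{10}{9} \begin{bmatrix} 1 & -1 \\ -0.6 & -0.3 \end{bmatrix}. \]

The bottom row of $G^{-1}$ is (2/3, 1/3), the unique stationary distribution.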


20. The argument is essentially the same as in Exercise 18. All probability in non-absorbing states
eventually moves into the absorbing states after sufficiently many transitions.


3.11 Supplementary Exercises 


Solutions to Exercises

1. We can calculate the c.d.f. of $Z$ directly.

\[ F(z) = \Pr(Z \le z) = \Pr(Z = X) \Pr(X \le z) + \Pr(Z = Y) \Pr(Y \le z) = \frac{1}{2} \Pr(X \le z) + \frac{1}{2} \Pr(Y \le z). \]

The graph is in Fig. S.3.35.

Figure S.3.35: Graph of c.d.f. for Exercise 1 of Sec. 3.11.


2. Let $x_1, \ldots, x_k$ be the finitely many values for which $f_1(x) > 0$. Since $X$ and $Y$ are independent, the
conditional distribution of $Z = X + Y$ given $X = x$ is the same as the distribution of $x + Y$, which
has the p.d.f. $f_2(z - x)$, and the c.d.f. $F_2(z - x)$. By the law of total probability the c.d.f. of $Z$ is
$\sum_{i=1}^{k} F_2(z - x_i) f_1(x_i)$. Notice that this is a weighted average of continuous functions of $z$, $F_2(z - x_i)$
for $i = 1, \ldots, k$, hence it is a continuous function. The p.d.f. of $Z$ can easily be found by differentiating
the c.d.f. to obtain $\sum_{i=1}^{k} f_2(z - x_i) f_1(x_i)$.


3. Since $F(x)$ is continuous and differentiable everywhere except at the points $x = 0$, 1, and 2,

\[ f(x) = \frac{dF(x)}{dx} = \begin{cases} 2/5 & \text{for } 0 < x < 1, \\ 3/5 & \text{for } 1 < x < 2, \\ 0 & \text{otherwise.} \end{cases} \]

Figure S.3.36: Graph of c.d.f. for Exercise 3 of Sec. 3.11.


4. Since $f(x)$ is symmetric with respect to $x = 0$, $F(0) = \Pr(X \le 0) = 0.5$. Hence,

\[ \int_0^{x_0} f(x)\, dx = \frac{1}{2} \int_0^{x_0} \exp(-x)\, dx = 0.4. \]

It follows that $\exp(-x_0) = 0.2$ and $x_0 = \log 5$.


5. $X_1$ and $X_2$ have the uniform distribution over the square, which has area 1. The area of the quarter
circle in Fig. S.3.37, which is the required probability, is $\pi/4$.

Figure S.3.37: Region for Exercise 5 of Sec. 3.11.


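6. (a) $\Pr(X \text{ divisible by } n) = f(n) + f(2n) + f(3n) + \cdots = \sum_{x=1}^{\infty} \frac{1}{c(p)(nx)^p} = \frac{1}{n^p}$.

(b) By part (a), $\Pr(X \text{ even}) = 1/2^p$. Therefore, $\Pr(X \text{ odd}) = 1 - 1/2^p$.

7. We have

\[ \begin{aligned}
\Pr(X_1 + X_2 \text{ even}) &= \Pr(X_1 \text{ even}) \Pr(X_2 \text{ even}) + \Pr(X_1 \text{ odd}) \Pr(X_2 \text{ odd}) \\
&= \left(\frac{1}{2^p}\right)\left(\frac{1}{2^p}\right) + \left(1 - \frac{1}{2^p}\right)\left(1 - \frac{1}{2^p}\right)
= 1 - \frac{1}{2^{p-1}} + \frac{1}{2^{2p-1}}.
\end{aligned} \]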


8. Let $G(x)$ denote the c.d.f. of the time until the system fails, let $A$ denote the event that component 1
is still operating at time $x$, and let $B$ denote the event that at least one of the other three components
is still operating at time $x$. Then

\[ 1 - G(x) = \Pr(\text{System still operating at time } x) = \Pr(A \cap B) = \Pr(A) \Pr(B) = [1 - F(x)][1 - F^3(x)]. \]

Hence, $G(x) = F(x)[1 + F^2(x) - F^3(x)]$.


9. Let $A$ denote the event that the tack will land with its point up on all three tosses. Then
$\Pr(A \mid X = x) = x^3$. Hence,

\[ \Pr(A) = \int_0^1 x^3 f(x)\, dx. \]


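10. Let $Y$ denote the area of the circle. Then $Y = \pi X^2$, so the inverse transformation is

\[ x = (y/\pi)^{1/2} \quad \text{and} \quad \frac{dx}{dy} = \frac{1}{2(\pi y)^{1/2}}. \]

Also, if $0 < x < 2$, then $0 < y < 4\pi$. Thus,

\[ g(y) = \frac{1}{16(\pi y)^{1/2}} \left[ 3\left(\frac{y}{\pi}\right)^{1/2} + 1 \right] \quad \text{for } 0 < y < 4\pi, \]

and $g(y) = 0$ otherwise.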
11. $F(x) = 1 - \exp(-2x)$ for $x > 0$. Therefore, by the probability integral transformation, $F(X)$ will have
the uniform distribution on the interval [0, 1]. Therefore,

\[ Y = 5F(X) = 5(1 - \exp(-2X)) \]

will have the uniform distribution on the interval [0, 5].

It might be noted that if $Z$ has the uniform distribution on the interval [0, 1], then $1 - Z$ has the same
uniform distribution. Therefore,

\[ Y = 5[1 - F(X)] = 5\exp(-2X) \]

will also have the uniform distribution on the interval [0, 5].


12. This exercise, in different words, is exactly the same as Exercise 7 of Sec. 1.7 and, therefore, the solution
is the same. 


13. Only in (c) and (d) is the joint p.d.f. of X and Y positive over a rectangle with sides parallel to the
axes, so only in (c) and (d) is there the possibility of X and Y being independent. Since the uniform 
density is constant, it can be regarded as being factored in the form of Eq. (3.5.7). Hence, X and Y 
are independent in (c) and (d). 


14. The required probability $p$ is the probability of the shaded area in Fig. S.3.38. Therefore,

\[ p = 1 - \Pr(A) = 1 - \iint f(x) f(y)\, dx\, dy = 1 - 1/3 = 2/3. \]

Figure S.3.38: Figure for Exercise 14 of Sec. 3.11.


15. This problem is similar to Exercise 11 of Sec. 3.5, but now we have Fig. S.3.39. The area of the shaded
region is now 550 + 787.5 = 1337.5. Hence, the required probability is 1337.5/3600 = 0.3715.

Figure S.3.39: Figure for Exercise 15 of Sec. 3.11.

16. For $0 < x < 1$,

\[ f_1(x) = \int_x^1 2(x + y)\, dy = 1 + 2x - 3x^2. \]

Therefore,

\[ \Pr\!\left(X \le \frac{1}{2}\right) = \int_0^{1/2} f_1(x)\, dx = \frac{1}{2} + \frac{1}{4} - \frac{1}{8} = \frac{5}{8}. \]

Finally, for $0 < x < y < 1$,

\[ g_2(y \mid x) = \frac{f(x, y)}{f_1(x)} = \frac{2(x + y)}{1 + 2x - 3x^2}. \]


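17. We have

\[ f(x, y) = f_1(x)\, g_2(y \mid x) = \frac{9y^2}{x} \quad \text{for } 0 < y < x < 1. \]

Hence,

\[ f_2(y) = \int_y^1 f(x, y)\, dx = -9y^2 \log(y) \quad \text{for } 0 < y < 1 \]

and

\[ g_1(x \mid y) = \frac{f(x, y)}{f_2(y)} = -\frac{1}{x \log(y)} \quad \text{for } 0 < y < x < 1. \]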


18. $X$ and $Y$ have the uniform distribution over the region shown in Fig. S.3.40. The area of this region is
4. The area in the second plus the fourth quadrants is 1. Therefore, the area in the first plus the third
quadrants is 3, and

\[ \Pr(XY > 0) = \frac{3}{4}. \]

Figure S.3.40: Region for Exercise 18 of Sec. 3.11.

Furthermore, for any value of $x$ ($-1 < x < 1$), the conditional distribution of $Y$ given that $X = x$ will
be a uniform distribution over the interval $[x - 1, x + 1]$. Hence,

\[ g(y \mid x) = \begin{cases} \frac{1}{2} & \text{for } x - 1 < y < x + 1, \\ 0 & \text{otherwise.} \end{cases} \]


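19. We have

\[ f_1(x) = \int_x^1 \int_y^1 6\, dz\, dy = 3 - 6x + 3x^2 = 3(1 - x)^2 \quad \text{for } 0 < x < 1, \]

\[ f_2(y) = \int_0^y \int_y^1 6\, dz\, dx = 6y(1 - y) \quad \text{for } 0 < y < 1, \]

\[ f_3(z) = \int_0^z \int_0^y 6\, dx\, dy = 3z^2 \quad \text{for } 0 < z < 1. \]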
0 JO 


20. Since f(x,y, z) can be factored in the form g(x, y)h(z) it follows that Z is independent of the random 
variables X and Y. Hence, the required conditional probability is the same as the unconditional 
probability Pr(3X > Y). Furthermore, it follows from the factorization just given that the marginal 
p.d.f. h(z) is constant for 0 < z <1. Thus, this constant must be 1 and the marginal joint p.d.f. of X 
and Y must be simply g(x,y) = 2, for0 <x <y <1. Therefore, 


\[ \Pr(3X > Y) = \int_0^1 \int_{y/3}^{y} 2\, dx\, dy = \frac{2}{3}. \]

The range of integration is illustrated in Fig. S.3.41.

Figure S.3.41: Range of integration for Exercise 20 of Sec. 3.11.


21. (a) We have

\[ f(x, y) = \begin{cases} \exp(-(x + y)) & \text{for } x > 0,\ y > 0, \\ 0 & \text{otherwise.} \end{cases} \]

Also, $x = uv$ and $y = (1 - u)v$, so

\[ J = \det \begin{bmatrix} v & u \\ -v & 1 - u \end{bmatrix} = v > 0. \]

Therefore,

\[ g(u, v) = f(uv, (1 - u)v)\, |J| = \begin{cases} v \exp(-v) & \text{for } 0 < u < 1,\ v > 0, \\ 0 & \text{otherwise.} \end{cases} \]

(b) Because $g$ can be appropriately factored (the factor involving $u$ is constant) and it is positive over
an appropriate rectangle, it follows that $U$ and $V$ are independent.


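22. Here, $x = uv$ and $y = v$, so

\[ J = \det \begin{bmatrix} v & u \\ 0 & 1 \end{bmatrix} = v > 0. \]

Therefore,

\[ g(u, v) = f(uv, v)\, |J| = \begin{cases} 8uv^3 & \text{for } 0 < u < 1,\ 0 < v < 1, \\ 0 & \text{otherwise.} \end{cases} \]

$X$ and $Y$ are not independent because $f(x, y) > 0$ over a triangle, not a rectangle. However, it can be
seen that $U$ and $V$ are independent.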
23. Here, $f(x) = \dfrac{dF(x)}{dx} = \exp(-x)$ for $x > 0$. It follows from the results in Sec. 3.9 that

\[ g(y_1, y_n) = n(n - 1)(\exp(-y_1) - \exp(-y_n))^{n-2} \exp(-(y_1 + y_n)) \]

for $0 < y_1 < y_n$. Also, the marginal p.d.f. of $Y_n$ is

\[ g_n(y_n) = n(1 - \exp(-y_n))^{n-1} \exp(-y_n) \quad \text{for } y_n > 0. \]

Hence,

\[ h(y_1 \mid y_n) = \frac{(n - 1)(\exp(-y_1) - \exp(-y_n))^{n-2} \exp(-y_1)}{(1 - \exp(-y_n))^{n-1}} \quad \text{for } 0 < y_1 < y_n. \]
24. As in Example 3.9.7, let $W = Y_n - Y_1$ and $Z = Y_1$. The joint p.d.f. $g(w, z)$ of $(W, Z)$ is, for $0 < w < 1$
and $0 < z < 1 - w$,

\[ g(w, z) = 24[(w + z)^2 - z^2]\, z\, (w + z) = 24w(2z^3 + 3wz^2 + w^2 z), \]

and 0 otherwise. Hence, the p.d.f. of the range is, for $0 < w < 1$,

\[ h(w) = \int_0^{1-w} g(w, z)\, dz = 12w(1 - w)^2. \]
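
The integration over z in Exercise 24 can be sanity-checked numerically. The sketch below applies a midpoint rule to the joint p.d.f. given above and compares the result with the closed form for the range; the function names and the number of steps are arbitrary illustrative choices.

```python
def joint_range_min(w: float, z: float) -> float:
    """Joint p.d.f. g(w, z) from Exercise 24 on 0 < w < 1, 0 < z < 1 - w."""
    return 24.0 * ((w + z) ** 2 - z ** 2) * z * (w + z)


def range_pdf_numeric(w: float, steps: int = 20000) -> float:
    """Midpoint-rule approximation of the integral of g(w, z) over 0 < z < 1 - w."""
    h = (1.0 - w) / steps
    return h * sum(joint_range_min(w, (k + 0.5) * h) for k in range(steps))


def range_pdf_closed_form(w: float) -> float:
    return 12.0 * w * (1.0 - w) ** 2


for w in (0.1, 0.5, 0.9):
    print(w, round(range_pdf_numeric(w), 6), round(range_pdf_closed_form(w), 6))
```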


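25. (a) Let $f_2$ be the marginal p.d.f. of $Y$. We approximate

\[ \Pr(y - \epsilon < Y \le y + \epsilon) = \int_{y-\epsilon}^{y+\epsilon} f_2(t)\, dt \approx 2\epsilon f_2(y). \]

(b) For each $s$, we approximate

\[ \int_{y-\epsilon}^{y+\epsilon} f(s, t)\, dt \approx 2\epsilon f(s, y). \]

Using this, we can approximate

\[ \Pr(X \le x,\ y - \epsilon < Y \le y + \epsilon) = \int_{-\infty}^{x} \int_{y-\epsilon}^{y+\epsilon} f(s, t)\, dt\, ds \approx 2\epsilon \int_{-\infty}^{x} f(s, y)\, ds. \]

(c) Taking the ratio of the approximation in part (b) to the approximation in part (a), we obtain

\[ \begin{aligned}
\Pr(X \le x \mid y - \epsilon < Y \le y + \epsilon)
&= \frac{\Pr(X \le x,\ y - \epsilon < Y \le y + \epsilon)}{\Pr(y - \epsilon < Y \le y + \epsilon)} \\
&\approx \frac{\int_{-\infty}^{x} f(s, y)\, ds}{f_2(y)} = \int_{-\infty}^{x} g_1(s \mid y)\, ds.
\end{aligned} \]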


26. (a) Let $Y = X_1$. The transformation is $Y = X_1$ and $Z = X_1 - X_2$. The inverse is $x_1 = y$ and
$x_2 = y - z$. The Jacobian has absolute value 1. The joint p.d.f. of $(Y, Z)$ is

\[ g(y, z) = \exp(-y - (y - z)) = \exp(-2y + z), \]

for $y > 0$ and $z < y$.

(b) The marginal p.d.f. of $Z$ is

\[ \int_{\max\{0, z\}}^{\infty} \exp(-2y + z)\, dy = \frac{1}{2} \exp(z) \exp(-2\max\{0, z\})
= \frac{1}{2} \begin{cases} \exp(-z) & \text{if } z \ge 0, \\ \exp(z) & \text{if } z < 0. \end{cases} \]

The conditional p.d.f. of $Y = X_1$ given $Z = 0$ is the ratio of these two with $z = 0$, namely

\[ g_1(x_1 \mid 0) = 2\exp(-2x_1), \quad \text{for } x_1 > 0. \]

(c) Let $Y = X_1$. The transformation is now $Y = X_1$ and $W = X_1/X_2$. The inverse is $x_1 = y$ and
$x_2 = y/w$. The Jacobian is

\[ J = \det \begin{bmatrix} 1 & 0 \\ 1/w & -y/w^2 \end{bmatrix} = -\frac{y}{w^2}. \]

The joint p.d.f. of $(Y, W)$ is

\[ g(y, w) = \exp(-y - y/w)\, y/w^2, \]

for $y, w > 0$.

(d) The marginal p.d.f. of $W$ is

\[ \int_0^{\infty} y \exp(-y(1 + 1/w))/w^2\, dy = \frac{1}{w^2 (1 + 1/w)^2} = \frac{1}{(1 + w)^2}, \]

for $w > 0$. The conditional p.d.f. of $Y = X_1$ given $W = 1$ is the ratio of these two with $w = 1$,
namely

\[ h_1(x_1 \mid 1) = 4x_1 \exp(-2x_1), \quad \text{for } x_1 > 0. \]

(e) The conditional p.d.f. $g_1$ in part (b) is supposed to be the conditional p.d.f. of $X_1$ given that $Z$
is close to 0, that is, that $|X_1 - X_2|$ is small. The conditional p.d.f. $h_1$ in part (d) is supposed
to be the conditional p.d.f. of $X_1$ given that $W$ is close to 1, that is, that $|X_1/X_2 - 1|$ is small.
The sets of $(x_1, x_2)$ values such that $|x_1 - x_2|$ is small and that $|x_1/x_2 - 1|$ is small are drawn
in Fig. S.3.42. One can see how the two sets, although close, are different enough to account for
different conditional distributions.


27. The transition matrix is as follows:

                          Players in game n + 1
   Players in game n      (A,B)    (A,C)    (B,C)
   (A,B)                  0        0.3      0.7
   (A,C)                  0.6      0        0.4
   (B,C)                  0.8      0.2      0

Figure S.3.42: Boundaries of the two regions where |x1 - x2| < 0.1 and |x1/x2 - 1| < 0.1 in Exercise 26e of
Sec. 3.11.


28. If A and B play in the first game, then there are the following two sequences of outcomes which result 
in their playing in the fourth game: 
i) A beats B in the first game, C beats A in the second game, B beats C' in the third game; 
ii) B beats A in the first game, C beats B in the second game, A beats C in the third game. 


The probability of the first sequence is (0.3) (0.4) (0.8) = 0.096. The probability of the second sequence 
is (0.7) (0.2) (0.6) = 0.084. Therefore, the overall probability that A and B will play again in the fourth 
game is 0.18. The same sort of calculations show that this answer will be the same if A and C play in
the first game or if B and C play in the first game. 
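
The claim that the answer is the same for every starting pair can be checked by enumerating all three-step paths that return to the starting state. This is an illustrative sketch: the transition probabilities are those of Exercise 27 (with the (A,B) row inferred from the win probabilities quoted above), and `return_probability` is a hypothetical helper.

```python
from itertools import product

# Transition probabilities between pairs of players (see Exercise 27).
P = {
    "AB": {"AC": 0.3, "BC": 0.7},
    "AC": {"AB": 0.6, "BC": 0.4},
    "BC": {"AB": 0.8, "AC": 0.2},
}


def return_probability(start: str, steps: int = 3) -> float:
    """Probability of being back in `start` after exactly `steps` transitions."""
    total = 0.0
    for middle in product(P, repeat=steps - 1):
        states = (start, *middle, start)
        prob = 1.0
        for a, b in zip(states, states[1:]):
            prob *= P[a].get(b, 0.0)  # missing entries are impossible moves
        total += prob
    return total


for pair in P:
    print(pair, round(return_probability(pair), 4))
```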


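29. The matrix $G$ and its inverse are

\[ G = \begin{bmatrix} -1.0 & 0.3 & 1.0 \\ 0.6 & -1.0 & 1.0 \\ 0.8 & 0.2 & 1.0 \end{bmatrix}, \qquad
G^{-1} = \begin{bmatrix} -0.5505 & -0.0459 & 0.5963 \\ 0.0917 & -0.8257 & 0.7339 \\ 0.4220 & 0.2018 & 0.3761 \end{bmatrix}. \]

The bottom row of $G^{-1}$ is the unique stationary distribution, (0.4220, 0.2018, 0.3761).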
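
The bottom row of the inverse can be obtained without inverting the whole matrix: a row vector x with xG = (0, ..., 0, 1) is exactly that row. The sketch below solves this system by Gauss-Jordan elimination in exact arithmetic; `stationary_distribution` is a hypothetical helper, and it assumes G is nonsingular.

```python
from fractions import Fraction


def stationary_distribution(p):
    """Solve x G = (0, ..., 0, 1), where G is P - I with its last column set to 1."""
    n = len(p)
    g = [[Fraction(p[i][j]) - (1 if i == j else 0) for j in range(n)] for i in range(n)]
    for i in range(n):
        g[i][n - 1] = Fraction(1)
    # Augmented system for the transpose: G^T x^T = e_n.
    a = [[g[j][i] for j in range(n)] + [Fraction(1 if i == n - 1 else 0)]
         for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [x / a[col][col] for x in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0:
                factor = a[r][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] for i in range(n)]


F = Fraction
P = [[F(0), F(3, 10), F(7, 10)],
     [F(6, 10), F(0), F(4, 10)],
     [F(8, 10), F(2, 10), F(0)]]
pi = stationary_distribution(P)
print([str(x) for x in pi], [round(float(x), 4) for x in pi])
```

Using exact fractions avoids the rounding visible in the four-decimal entries of the inverse.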
Chapter 4


Expectation 


4.1 The Expectation of a Random Variable 


Commentary 


It is useful to stress the fact that the expectation of a random variable depends only on the distribution 
of the random variable. Every two random variables with the same distribution will have the same mean. 
This also applies to variance (Sec. 4.3), other moments and m.g.f. (Sec. 4.4), and median (Sec. 4.5). For this 
reason, one often refers to means, variance, quantiles, etc. of a distribution rather than of a random variable. 
One need not even have a random variable in mind in order to calculate the mean of a distribution. 


Solutions to Exercises 


1. The mean of $X$ is

\[ E(X) = \int x f(x)\, dx = \int_a^b \frac{x}{b - a}\, dx = \frac{b^2 - a^2}{2(b - a)} = \frac{a + b}{2}. \]

2. $E(X) = \dfrac{1}{100}(1 + 2 + \cdots + 100) = \dfrac{1}{100} \cdot \dfrac{(100)(101)}{2} = 50.5$.


3. The total number of students is 50. Therefore,

\[ E(X) = 18\left(\frac{20}{50}\right) + 19\left(\frac{22}{50}\right) + 20\left(\frac{4}{50}\right) + 21\left(\frac{3}{50}\right) + 25\left(\frac{1}{50}\right) = 18.92. \]
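
The weighted sum above is a direct application of the definition of expectation for a discrete distribution. A minimal sketch in exact arithmetic, using the counts that appear in the expression for E(X); the name `expectation` is an illustrative choice.

```python
from fractions import Fraction


def expectation(pmf):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(value * prob for value, prob in pmf.items())


# Number of students at each age, taken from the sum in Exercise 3.
counts = {18: 20, 19: 22, 20: 4, 21: 3, 25: 1}
total = sum(counts.values())
pmf = {age: Fraction(n, total) for age, n in counts.items()}

mean_age = expectation(pmf)
print(total, mean_age, float(mean_age))
```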


4. There are eight words in the sentence and they are each equally probable. Therefore, the possible values
of $X$ and their probabilities are as follows:

   x       2      3      4      9
   f(x)    1/8    5/8    1/8    1/8

It follows that $E(X) = 2\left(\frac{1}{8}\right) + 3\left(\frac{5}{8}\right) + 4\left(\frac{1}{8}\right) + 9\left(\frac{1}{8}\right) = \frac{15}{4}$.
5. There are 30 letters and they are each equally probable:

   2 letters appear in the only two-letter word;
   15 letters appear in three-letter words;
   4 letters appear in the only four-letter word;
   9 letters appear in the only nine-letter word.

Therefore, the possible values of $Y$ and their probabilities are as follows:

   y       2       3        4       9
   g(y)    2/30    15/30    4/30    9/30
y          2      3       4      9
Pr(Y = y)  2/30   15/30   4/30   9/30

7. $E(1/X) = \int_0^1 \frac{1}{x}\,dx = -\lim_{x \to 0} \log(x) = \infty.$ Since the integral is not finite, E(1/X) does not exist. 

8. $E(XY) = \int_0^1 \int_0^x xy \cdot 12y^2\,dy\,dx = \frac{1}{2}.$


9. If X denotes the point at which the stick is broken, then X has the uniform distribution on the interval [0, 1]. If Y denotes the length of the longer piece, then Y = max{X, 1 − X}. Therefore, 

$E(Y) = \int_0^1 \max(x, 1-x)\,dx = \int_0^{1/2} (1-x)\,dx + \int_{1/2}^1 x\,dx = \frac{3}{4}.$
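
As a quick numerical check of the value 3/4, the integral of max(x, 1 − x) over [0, 1] can be approximated with a midpoint rule. This is only a sketch; the function name and the number of subintervals are arbitrary choices.

```python
def expected_longer_piece(n=100_000):
    """Midpoint-rule approximation of the integral of max(x, 1 - x) over [0, 1]."""
    h = 1.0 / n
    return sum(max((i + 0.5) * h, 1.0 - (i + 0.5) * h) for i in range(n)) * h

approx = expected_longer_piece()
print(approx)
```

Because the integrand is piecewise linear with its kink at 1/2, the midpoint rule with an even number of subintervals is exact up to floating-point error.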


10. Since α has the uniform distribution on the interval [−π/2, π/2], the p.d.f. of α is 

$f(\alpha) = \begin{cases} \frac{1}{\pi} & \text{for } -\frac{\pi}{2} < \alpha < \frac{\pi}{2}, \\ 0 & \text{otherwise.} \end{cases}$

Figure S.4.1: Figure for Exercise 10 of Sec. 4.1. 
Also, Y = tan(α). Therefore, the inverse transformation is α = tan^{-1}(Y) and dα/dy = 1/(1 + y²). As α varies over the interval (−π/2, π/2), Y varies over the entire real line. Therefore, for −∞ < y < ∞, the p.d.f. of Y is 

$g(y) = f(\tan^{-1} y) \cdot \frac{1}{1 + y^2} = \frac{1}{\pi(1 + y^2)}.$


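11. The p.d.f.’s of Y_1 and Y_n were found in Sec. 3.9. For the given uniform distribution, the p.d.f. of Y_1 is 

$g_1(y) = \begin{cases} n(1-y)^{n-1} & \text{for } 0 < y < 1, \\ 0 & \text{otherwise.} \end{cases}$

Therefore, 

$E(Y_1) = \int_0^1 y\,n(1-y)^{n-1}\,dy = \frac{1}{n+1}.$

The p.d.f. of Y_n is 

$g_n(y) = \begin{cases} n y^{n-1} & \text{for } 0 < y < 1, \\ 0 & \text{otherwise.} \end{cases}$

Therefore, 

$E(Y_n) = \int_0^1 y \cdot n y^{n-1}\,dy = \frac{n}{n+1}.$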


12. It follows from the probability integral transformation that the joint distribution of F(X_1), ..., F(X_n) 
is the same as the joint distribution of a random sample from the uniform distribution on the interval 
[0, 1]. Since F(Y_1) is the smallest of these values and F(Y_n) is the largest, the distributions of these 
variables will be the same as the distributions of the smallest and the largest values in a random sample 
from the uniform distribution on the interval [0, 1]. Therefore, E[F(Y_1)] and E[F(Y_n)] will be equal to 
the values found in Exercise 11. 


13. Let p = Pr(X = 300). Then E(X) = 300p + 100(1 — p) = 200p + 100. For risk-neutrality, we need 
E(X) = 110 × (1.058) = 116.38. Setting 200p + 100 = 116.38 yields p = 0.0819. The option has a value 
of 150 if X = 300 and it has a value of 0 if X = 100, so the mean of the option value is 150p = 12.285. 
The present value of this amount is 12.285/1.058 = 11.61, the risk-neutral price of the option. 
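
The arithmetic in this solution can be reproduced in a few lines. This is a sketch only; the variable names are ours, and the numbers (prices 300 and 100, current price 110, interest rate 5.8 percent, option payoff 150) are those used in the solution above.

```python
rate = 0.058                          # interest rate used in the solution
p = (110 * (1 + rate) - 100) / 200    # solves 200p + 100 = 110(1.058)
expected_option_value = 150 * p       # option is worth 150 if X = 300, else 0
price = expected_option_value / (1 + rate)
print(round(p, 4), round(expected_option_value, 3), round(price, 2))
```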


14. For convenience, we shall not use dollar signs in these calculations. 


(a) We need to check the investor’s net worth at the end of the year in four situations: 


i. X = 180 and she makes the transactions 
ii. X = 180 and she doesn’t make the transactions 
iii. X = 260 and she makes the transactions 
iv. X = 260 and she doesn’t make the transactions 

Since we don’t know the investor’s entire net worth, we shall only calculate it relative to all other 
investments. This means that we only need to pretend as if the investor had one share of the 
stock worth 200. We don’t care what else she has. We need to show that cases (i) and (ii) lead 
to the same net worth and that cases (iii) and (iv) lead to the same net worth. In case (ii), her 
net worth will change by —20. In case (iv), her net worth will change by 60. In case (i), nobody 
will exercise the options. So she will sell the three extra shares for 180 each (total 540) and pay 
the loan of 519.24 plus interest 20.77 for a net 0.01 loss. Plus her one original share of stock has 
lost 20 and her net worth has changed a total of −20.01, which is the same as case (ii) except for 
the accumulated rounding error. In case (iii), the options will be exercised, and she will receive 
800 for four shares of the stock. She will have to pay back the loan of 519.24 plus 20.77 in interest 
for a net gain of 259.99. But she no longer has the one share of stock that was worth 200, so her 
change in net worth is 59.99, the same as case (iv) to within the same one cent of rounding. 


(b) If the option price is x < 20.19, then the investor only receives 4x for selling the options, but still 
needs to pay 600 for the three shares, so she must borrow 600 − 4x. The rest of the calculations 
proceed just as in part (a) but we must replace 519.24 by 600 − 4x, and the interest 20.77 must be 
replaced by 0.04(600 − 4x). That is, to pay back the loan with interest, she must pay 624 − 4.16x 
instead of 540.01. So she pays an additional 83.99 − 4.16x relative to the situation in part (a) 
regardless of what happens to the stock price. 

(c) This situation is the same as part (b) except now the value of 83.99 − 4.16x is negative instead 
of positive, so the investor pays back less and hence makes additional profit rather than suffers 
additional loss. 


15. The value of the option is 0 if X = 260 and it is 40 if X = 180, so the expected value of the option is 
40(1 — p) = 40 x 0.65 = 26. The present value of this amount is 26/1.04 = 25. 


16. If f is the p.f. of X, and Y = |X|, then for y > 0, Pr(Y = y) = Pr(X = y) + Pr(X = −y). In 
Example 4.1.4, Pr(X = y) = Pr(X = −y) = 1/[2y(y + 1)], and this makes Pr(Y = y) the p.f. in 
Example 4.1.5. 


4.2 Properties of Expectations 


Commentary 


Be sure to stress the fact that Theorem 4.2.6 on the expected value of a product of random variables has the 
condition that the random variables are independent. This section ends with a derivation for the expectation 
of a nonnegative discrete random variable. Although this method has theoretical interest, it is not central to 
the rest of the text. 


Solutions to Exercises 


1. The random variable Y is equal to 10(R — 1.5) in dollars. The mean of Y is 10[E(R) — 1.5]. From 
Exercise 1 in Sec. 4.1, we know that E(R) = (−3 + 7)/2 = 2, so E(Y) = 5. 


2. E(2X_1 − 3X_2 + X_3 − 4) = 2E(X_1) − 3E(X_2) + E(X_3) − 4 = 2(5) − 3(5) + 5 − 4 = −4. 

3. $E[(X_1 - 2X_2 + X_3)^2] = E(X_1^2 + 4X_2^2 + X_3^2 - 4X_1X_2 + 2X_1X_3 - 4X_2X_3)$
$= E(X_1^2) + 4E(X_2^2) + E(X_3^2) - 4E(X_1X_2) + 2E(X_1X_3) - 4E(X_2X_3).$

Since X_1, X_2, and X_3 are independent, 
E(X_i X_j) = E(X_i)E(X_j) for i ≠ j. 
Therefore, the above expectation can be written in the form: 

$E(X_1^2) + 4E(X_2^2) + E(X_3^2) - 4E(X_1)E(X_2) + 2E(X_1)E(X_3) - 4E(X_2)E(X_3).$

Also, since each X_i has the uniform distribution on the interval [0, 1], then E(X_i) = 1/2 and 

$E(X_i^2) = \int_0^1 x^2\,dx = \frac{1}{3}.$

Hence, the desired expectation has the value 1/2. 


4. The area of the rectangle is XY. Since X and Y are independent, E(XY) = E(X)E(Y). Also, 
E(X) = 1/2 and E(Y) = 7. Therefore, E(XY) = 7/2. 


5. For i = 1, ..., n, let Y_i = 1 if the observation X_i falls within the interval (a, b), and let Y_i = 0 otherwise. 
Then $E(Y_i) = \Pr(Y_i = 1) = \int_a^b f(x)\,dx.$ The total number of observations that fall within the interval 
(a, b) is Y_1 + ··· + Y_n, and 

$E(Y_1 + \cdots + Y_n) = E(Y_1) + \cdots + E(Y_n) = n \int_a^b f(x)\,dx.$


6. Let X_i = 1 if the ith jump of the particle is one unit to the right and let X_i = −1 if the ith jump is 
one unit to the left. Then, for i = 1, ..., n, 

E(X_i) = (−1)p + (1)(1 − p) = 1 − 2p. 

The position of the particle after n jumps is X_1 + ··· + X_n, and 

E(X_1 + ··· + X_n) = E(X_1) + ··· + E(X_n) = n(1 − 2p). 


7. For i = 1, ..., n, let X_i = 2 if the gambler’s fortune is doubled on the ith play of the game and let 
X_i = 1/2 if his fortune is cut in half on the ith play. Then 

E(X_i) = 2(1/2) + (1/2)(1/2) = 5/4. 

After the first play of the game, the gambler’s fortune will be cX_1, after the second play it will be 
(cX_1)X_2, and by continuing in this way it is seen that after n plays the gambler’s fortune will be 
cX_1X_2···X_n. Since X_1, ..., X_n are independent, 

$E(cX_1 \cdots X_n) = cE(X_1) \cdots E(X_n) = c\left(\frac{5}{4}\right)^n.$
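
The product formula can be confirmed by brute force for small n. The sketch below enumerates all 2^n equally likely sequences of doublings and halvings with exact fractions; the function name `expected_fortune` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def expected_fortune(c, n):
    """Exact expected fortune after n plays, averaging over all 2**n outcomes."""
    total = Fraction(0)
    for outcome in product([Fraction(2), Fraction(1, 2)], repeat=n):
        fortune = Fraction(c)
        for factor in outcome:
            fortune *= factor
        total += fortune * Fraction(1, 2) ** n   # each sequence has probability 2**-n
    return total

print(expected_fortune(1, 5), Fraction(5, 4) ** 5)
```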


8. It follows from Example 4.2.4 that 

E(X) = 8(10/25) = 16/5. 

Since Y = 8 − X, E(Y) = 8 − E(X) = 24/5. Finally, E(X − Y) = E(X) − E(Y) = −8/5. 

9. We know that E(X) = np. Since Y = n − X, E(X − Y) = E(2X − n) = 2E(X) − n = n(2p − 1). 


10. (a) Since the probability of success on any trial is p = 1/2, it follows from the material presented at 
the end of this section that the expected number of tosses is 1/p = 2. 


(b) The number of tails that will be obtained is equal to the total number of tosses minus one (the 
final head). Therefore, the expected number of tails is 2—1 = 1. 


11. We shall use the notation presented in the hint for this exercise. It follows from part (a) of Exercise 10 
that E(X;) = 2 fori=1,...,k. Therefore, 


E(X) = E(X_1) + ··· + E(X_k) = 2k. 


12. (a) We need the p.d.f. of X = 54R_1 + 110R_2 where R_1 has the uniform distribution on the interval 
[−10, 20] and R_2 has the uniform distribution on the interval [−4.5, 10]. We can rewrite X as 
X_1 + X_2 where X_1 = 54R_1 has the uniform distribution on the interval [−540, 1080] and X_2 = 
110R_2 has the uniform distribution on the interval [−495, 1100]. Let f_i be the p.d.f. of X_i for 
i = 1, 2, and use the same technique as in Example 3.9.5. First, compute 

$f_1(z) f_2(x - z) = \begin{cases} 3.87 \times 10^{-7} & \text{for } -540 < z < 1080 \text{ and } -495 < x - z < 1100, \\ 0 & \text{otherwise.} \end{cases}$

We need to integrate this over z for each fixed x. The set of x for which the function above is ever 
positive is the interval [−1035, 2180]. For −1035 < x < 560, we must integrate z from −540 to 
x + 495. For 560 < x < 585, we must integrate z from x − 1100 to x + 495. For 585 < x < 2180, 
we must integrate z from x − 1100 to 1080. The resulting integral is 

$g(x) = \begin{cases} 3.87 \times 10^{-7} x + 4.01 \times 10^{-4} & \text{for } -1035 < x < 560, \\ 6.17 \times 10^{-4} & \text{for } 560 < x < 585, \\ 8.44 \times 10^{-4} - 3.87 \times 10^{-7} x & \text{for } 585 < x < 2180. \end{cases}$


(b) We need the negative of the 0.03 quantile. For −1035 < x < 560, the c.d.f. of X is 

$F(x) = \frac{3.87 \times 10^{-7}}{2}(x^2 - 1035^2) + 4.01 \times 10^{-4}(x + 1035).$

This function is a second degree polynomial in x. To be sure that the 0.03 quantile is between 
−1035 and 560, we compute F(−1035) = 0 and F(560) = 0.493, which assures us that the 0.03 
quantile is in this range. Setting F(x) = 0.03 and solving for x using the quadratic formula yields 
x = −642.4, so VaR is 642.4. 
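
The quadratic-formula step in part (b) can be sketched as follows. The code uses the rounded coefficients 3.87 × 10^{-7} and 4.01 × 10^{-4} as printed in the solution (with unrounded coefficients the root shifts by roughly one unit); the variable names are ours.

```python
import math

# Rounded coefficients of g(x) on -1035 < x < 560, as printed in the solution.
slope = 3.87e-7
intercept = 4.01e-4
lower = -1035.0

def cdf(x):
    """c.d.f. of X on the lowest linear piece of the p.d.f."""
    return slope * (x**2 - lower**2) / 2 + intercept * (x - lower)

# F(x) = 0.03 rearranged as a*x^2 + b*x + c = 0.
a = slope / 2
b = intercept
c = -slope * lower**2 / 2 - intercept * lower - 0.03
root = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)  # the root inside (-1035, 560)
value_at_risk = -root
print(round(root, 1), round(cdf(560.0), 3))
```

The other root of the quadratic lies below −1035 and is therefore discarded.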


13. Use Taylor’s theorem with remainder to write 

$g(X) = g(\mu) + (X - \mu)g'(\mu) + \frac{(X - \mu)^2}{2} g''(Y),$ (S.4.1) 

where μ = E(X) and Y is between X and μ. Take the mean of both sides of (S.4.1). We get 

$E[g(X)] = g(\mu) + 0 + E\left[\frac{(X - \mu)^2}{2} g''(Y)\right].$

The random variable whose mean is on the far right is nonnegative, hence the mean is nonnegative and 
E[g(X)] ≥ g(μ). 
4.3 Variance 
Commentary 
Be sure to stress the fact that Theorem 4.3.5 on the variance of a sum of random variables has the condition 
that the random variables are independent. 
Solutions to Exercises 
1. We found in Exercise 1 of Sec. 4.1 that E(X) = (0 + 1)/2 = 1/2. We can find 

$E(X^2) = \int_0^1 x^2\,dx = \frac{1}{3}.$

So Var(X) = 1/3 − (1/2)² = 1/12. 


2. The p.f. of X and the value of E(X) were determined in Exercise 4 of Sec. 4.1. From the p.f. given 
there we also find that 

E(X²) = 2²(1/8) + 3²(5/8) + 4²(1/8) + 9²(1/8) = 73/4. 

Therefore, Var(X) = E(X²) − [E(X)]² = 73/4 − (15/4)² = 67/16. 


3. The p.d.f. of this distribution is 

$f(x) = \begin{cases} \frac{1}{b-a} & \text{for } a < x < b, \\ 0 & \text{otherwise.} \end{cases}$

Therefore, E(X) = (b + a)/2 and 

$E(X^2) = \int_a^b \frac{x^2}{b-a}\,dx = \frac{b^3 - a^3}{3(b-a)}.$

It follows that Var(X) = E(X²) − [E(X)]² = (b − a)²/12. 

4. E[X(X − 1)] = E(X² − X) = E(X²) − μ = Var(X) + [E(X)]² − μ = σ² + μ² − μ. 

5. E[(X − c)²] = E(X²) − 2cE(X) + c² = Var(X) + [E(X)]² − 2cμ + c² = σ² + μ² − 2cμ + c² = σ² + (μ − c)². 

6. Since E(X) = E(Y), E(X − Y) = 0. Therefore, 
E[(X − Y)²] = Var(X − Y) = Var[X + (−Y)]. 
Since X and −Y are independent, it follows that 
E[(X − Y)²] = Var(X) + Var(−Y) = Var(X) + Var(Y). 

7. (a) Since X and Y are independent, Var(X − Y) = Var(X) + Var(Y) = 3 + 3 = 6. 
(b) Var(2X − 3Y + 1) = 2² Var(X) + 3² Var(Y) = 4(3) + 9(3) = 39. 
8. Consider a p.d.f. of the form 

$f(x) = \begin{cases} \frac{c}{x^3} & \text{for } x \ge 1, \\ 0 & \text{for } x < 1. \end{cases}$

Here $E(X) = \int_1^\infty \frac{c}{x^2}\,dx$ is finite, but $E(X^2) = \int_1^\infty \frac{c}{x}\,dx$ is not finite. 

Therefore, E(X) is finite but E(X²) is not. Therefore, Var(X) is not finite. 


9. The mean of X is (n + 1)/2, and the mean of X² is $\sum_{k=1}^n k^2/n = (n+1)(2n+1)/6.$ So, 

$\mathrm{Var}(X) = \frac{(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4} = \frac{n^2 - 1}{12}.$
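
The closed form (n² − 1)/12 is easy to confirm for particular n. The sketch below computes the variance of the uniform distribution on the integers 1, ..., n directly from the definition with exact fractions; the function name is hypothetical.

```python
from fractions import Fraction

def discrete_uniform_variance(n):
    """Variance of the uniform distribution on 1, ..., n, from E(X^2) - [E(X)]^2."""
    mean = Fraction(sum(range(1, n + 1)), n)
    mean_sq = Fraction(sum(k * k for k in range(1, n + 1)), n)
    return mean_sq - mean**2

print(discrete_uniform_variance(10), Fraction(10**2 - 1, 12))
```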


10. The example efficient portfolio has s_1 = 524.7, s_2 = 609.7, and s_3 = 39250. 

(a) We know that R_1 has a mean of 6 and a variance of 55, while R_2 has a mean of 4 and a variance of 
28. Since we are assuming that R_i has the uniform distribution on the interval [a_i, b_i] for i = 1, 2, 
we know that 

$E(R_i) = \frac{a_i + b_i}{2}, \qquad \mathrm{Var}(R_i) = \frac{(b_i - a_i)^2}{12}.$

(See Exercise 1 of this section for the variance of a uniform distribution.) For i = 1, we set 
(a_1 + b_1)/2 = 6 and (b_1 − a_1)²/12 = 55. The solution is a_1 = −6.845 and b_1 = 18.845. For i = 2, 
we set (a_2 + b_2)/2 = 4 and (b_2 − a_2)²/12 = 28. The solution is a_2 = −5.165 and b_2 = 13.165. 

(b) Let X_1 = s_1R_1 and X_2 = s_2R_2. Then the distribution of X_1 is the uniform distribution on the 
interval [−3591.6, 9888.0], and X_2 has the uniform distribution on the interval [−3149.1, 8026.7]. 
The value of the return on the portfolio is Y = X_1 + X_2 + 1413. We need to find the 0.03 quantile 
of Y. As in Exercise 12 of Sec. 4.2, the p.d.f. of Y will be linear for the lowest values of y. Those 
values are −5327.7 < y < 5848.1. The line is g(y) = 6.638 × 10^{-9} y + 3.537 × 10^{-5}. In this range, 
the c.d.f. is 

$G(y) = \frac{6.638 \times 10^{-9}}{2}(y^2 - 5327.7^2) + 3.537 \times 10^{-5}(y + 5327.7).$

Since G(−5327.7) = 0 and G(5848.1) = 0.4146, we know that the 0.03 quantile is in this range. 
Setting G(y) = 0.03, we find y = −2321.9. So VaR is 2321.9. 


11. The quantile function of X can be found from Example 3.3.8 with a = 0 and b = 1. It is F^{-1}(p) = p. 
So, the IQR is 0.75 — 0.25 = 0.5. 


12. The c.d.f. is F(x) = 1 − exp(−x) for x > 0 and F(x) = 0 for x ≤ 0. The quantile function is 
F^{-1}(p) = −log(1 − p). So, the 0.25 and 0.75 quantiles are respectively −log(0.75) = 0.2877 and 
−log(0.25) = 1.3863. The IQR is then 1.3863 − 0.2877 = 1.0986. 

13. From Table 3.1, we find the 0.25 and 0.75 quantiles of the distribution of X to be 1 and 2 respectively. 
This makes the IQR equal to 2 − 1 = 1. 

14. The result will follow from the following general result: If x is the p quantile of X and a > 0, then ax 
is the p quantile of Y = aX. To prove this, let F be the c.d.f. of X. Note that x is the greatest lower 
bound on the set C_p = {z : F(z) ≥ p}. Let G be the c.d.f. of Y, then G(z) = F(z/a) because Y ≤ z if 
and only if aX ≤ z if and only if X ≤ z/a. The p quantile of Y is the greatest lower bound on the set 

D_p = {y : G(y) ≥ p} = {y : F(y/a) ≥ p} = {za : F(z) ≥ p} = aC_p, 

where the third equality follows from the fact that F(y/a) ≥ p if and only if y = za where F(z) ≥ p. 
The greatest lower bound on aC_p is a times the greatest lower bound on C_p because a > 0. 


4.4 Moments 


Commentary 


The moment generating function (m.g.f.) is a challenging topic that is introduced in this section. The m.g.f. 
is used later in the text to outline a proof of the central limit theorem (Sec. 6.3). It is also used in a few 
places to show that certain sums of random variables have particular distributions (e.g., Poisson, Bernoulli, 
exponential). If students are not going to study the proofs of these results, one could skip the material on 
moment generating functions. 


Solutions to Exercises 


1. Since the uniform p.d.f. is symmetric with respect to its mean μ = (a + b)/2, it follows that E[(X − μ)^5] = 0. 

2. The mean of X is (b + a)/2, so the 2kth central moment of X is the mean of (X − [b + a]/2)^{2k}. Note 
that Y = X − [b + a]/2 has the uniform distribution on the interval [−(b − a)/2, (b − a)/2]. Also, 
Z = 2Y/(b − a) has the uniform distribution on the interval [−1, 1]. So E(Y^{2k}) = [(b − a)/2]^{2k} E(Z^{2k}). 
Since the p.d.f. of Z is 1/2 on [−1, 1], $E(Z^{2k}) = \int_{-1}^{1} \frac{z^{2k}}{2}\,dz = \frac{1}{2k+1}.$ 
So, the 2kth central moment of X is [(b − a)/2]^{2k}/(2k + 1). 

3. E[(X − μ)³] = E[(X − 1)³] = E(X³ − 3X² + 3X − 1) = 5 − 3(2) + 3(1) − 1 = 1. 

4. Since Var(X) ≥ 0 and Var(X) = E(X²) − [E(X)]², it follows that E(X²) ≥ [E(X)]². The second part 
of the exercise follows from Theorem 4.3.3. 

5. Let Y = (X − μ)². Then by Exercise 4, 

E(Y²) = E[(X − μ)⁴] ≥ [E(Y)]² = [Var(X)]² = σ⁴. 
6. Since 

$f(x) = \begin{cases} \frac{1}{b-a} & \text{for } a < x < b, \\ 0 & \text{otherwise,} \end{cases}$

then 

$\psi(t) = \int_a^b \exp(tx)\,\frac{1}{b-a}\,dx.$

Therefore, for t ≠ 0, 

$\psi(t) = \frac{\exp(tb) - \exp(ta)}{t(b-a)}.$

As always, ψ(0) = 1. 


7. ψ'(t) = (1/4)(3exp(t) − exp(−t)) and ψ''(t) = (1/4)(3exp(t) + exp(−t)). Therefore, μ = ψ'(0) = 1/2 and 

σ² = ψ''(0) − μ² = 1 − (1/2)² = 3/4. 

8. ψ'(t) = (2t + 3)exp(t² + 3t) and ψ''(t) = (2t + 3)² exp(t² + 3t) + 2exp(t² + 3t). Therefore, μ = ψ'(0) = 3 
and σ² = ψ''(0) − μ² = 11 − (3)² = 2. 

9. ψ_2'(t) = cψ_1'(t) exp(c[ψ_1(t) − 1]) and ψ_2''(t) = {[cψ_1'(t)]² + cψ_1''(t)} exp(c[ψ_1(t) − 1]). We know that 
ψ_1(0) = 1, ψ_1'(0) = μ, and ψ_1''(0) = σ² + μ². 

Therefore, E(Y) = ψ_2'(0) = cμ and 

Var(Y) = ψ_2''(0) − [E(Y)]² = {(cμ)² + c(σ² + μ²)} − (cμ)² = c(σ² + μ²). 
10. The m.g.f. of Z is 

ψ_1(t) = E(exp(tZ)) = E[exp(t(2X − 3Y + 4))] 
= exp(4t)E(exp(2tX) exp(−3tY)) 
= exp(4t)E(exp(2tX))E(exp(−3tY)) (since X and Y are independent) 
= exp(4t)ψ(2t)ψ(−3t) 
= exp(4t) exp(4t² + 6t) exp(9t² − 9t) 
= exp(13t² + t). 

11. If X can take only a finite number of values x_1, ..., x_k with probabilities p_1, ..., p_k, respectively, then 
the m.g.f. of X will be 

ψ(t) = p_1 exp(tx_1) + p_2 exp(tx_2) + ··· + p_k exp(tx_k). 

By matching this expression for ψ(t) with the expression given in the exercise, it can be seen that X 
can take only the three values 1, 4, and 8, and that f(1) = 1/5, f(4) = 2/5, and f(8) = 2/5. 


12. We shall first rewrite ψ(t) as follows: 

ψ(t) = (4/6)exp(0) + (1/6)exp(t) + (1/6)exp(−t). 

By reasoning as in Exercise 11, it can now be seen that X can take only the three values 0, 1, and −1, 
and that f(0) = 4/6, f(1) = 1/6, and f(−1) = 1/6. 

13. The m.g.f. of a Cauchy random variable would be 

$\psi(t) = \int_{-\infty}^{\infty} \frac{\exp(tx)}{\pi(1 + x^2)}\,dx.$ (S.4.2) 

If t > 0, $\lim_{x \to \infty} \exp(tx)/(1 + x^2) = \infty$, so the integral in Eq. (S.4.2) is infinite. Similarly, if t < 0, 
$\lim_{x \to -\infty} \exp(tx)/(1 + x^2) = \infty$, so the integral is still infinite. Only for t = 0 is the integral finite, and 
that value is ψ(0) = 1 as it is for every random variable. 

14. The m.g.f. is 

$\psi(t) = \int_1^{\infty} \frac{\exp(tx)}{x^2}\,dx.$

If t ≤ 0, exp(tx) is bounded, so the integral is finite. If t > 0, then $\lim_{x \to \infty} \exp(tx)/x^2 = \infty$, and the 
integral is infinite. 

15. Let X have a discrete distribution with p.f. f(x). Assume that E(|X|^a) < ∞ for some a > 0. Let 
0 < b < a. Then 

$E(|X|^b) = \sum_x |x|^b f(x) = \sum_{|x| \le 1} |x|^b f(x) + \sum_{|x| > 1} |x|^b f(x) \le 1 + \sum_{|x| > 1} |x|^a f(x) \le 1 + E(|X|^a) < \infty,$

where the first inequality follows from the fact that 0 ≤ |x|^b ≤ 1 for all |x| ≤ 1 and |x|^b < |x|^a for all 
|x| > 1. The next-to-last inequality follows from the fact that the final sum is only part of the sum 
that makes up E(|X|^a). 

16. Let Z = n − X. It is easy to see that Z has the same distribution as Y since, if X is the number of 
successes in n independent Bernoulli trials with probability of success p, then Z is the number of failures 
and the probability of failure is 1 − p. It is known from Theorem 4.3.5 that Var(Z) = Var(X), which 
also equals Var(Y). Also E(Z) = n − E(X), so Z − E(Z) = n − X − n + E(X) = E(X) − X. Hence 
the third central moment of Z is the negative of the third central moment of X and the skewnesses are 
negatives of each other. 

17. We already computed the mean μ = 1 and variance σ² = 1 in Example 4.4.3. Using the m.g.f., the 
third moment is computed from the third derivative: 

$\psi'''(t) = \frac{6}{(1 - t)^4}.$

The third moment is 6. The third central moment is 

E[(X − 1)³] = E(X³) − 3E(X²) + 3E(X) − 1 = 6 − 6 + 3 − 1 = 2. 

The skewness is then 2/1 = 2. 
4.5 The Mean and the Median 


Solutions to Exercises 


1. The 1/2 quantile from the definition of quantiles applies to a continuous random variable whose c.d.f. is 
one-to-one. The 1/2 quantile is then x_0 = F^{-1}(1/2). That is, F(x_0) = 1/2. In order for a number m 
to be a median as defined in this section, it must be that Pr(X ≤ m) ≥ 1/2 and Pr(X ≥ m) ≥ 1/2. 
If X has a continuous distribution, then Pr(X ≤ m) = F(m) and Pr(X ≥ m) = 1 − F(m). Since 
F(x_0) = 1/2, m = x_0 is a median. 


2. In this example, $\sum_{x=1}^{6} f(x) = 21c.$ Therefore, c = 1/21. Since Pr(X ≤ 5) = 15/21 and Pr(X ≥ 5) = 
11/21, it follows that 5 is a median and it can be verified that it is the unique median. 

3. A median m must satisfy the equation 

$\int_0^m \exp(-x)\,dx = \frac{1}{2}.$

Therefore, 1 − exp(−m) = 1/2. It follows that m = log 2 is the unique median of this distribution. 


4. Let X denote the number of children per family. Then 

Pr(X ≤ 2) = (21 + 40 + 42)/153 ≥ 1/2 

and 

Pr(X ≥ 2) = (42 + 27 + 23)/153 ≥ 1/2. 

Therefore, 2 is the unique median. Since all families with 4 or more children are in the upper half of 
the distribution no matter how many children they have (so long as it is at least 4), it doesn’t matter 
how they are distributed among the values 4, 5, .... Next, let Y = min{X, 4}, that is Y is the number 
of children per family if we assume that all families with more than 4 children have exactly 4. We can 
compute the mean of Y as 

E(Y) = (1/153)(0 × 21 + 1 × 40 + 2 × 42 + 3 × 27 + 4 × 23) = 1.941. 
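
The median and the capped mean can be read off the frequency table mechanically. The following sketch uses the counts appearing in the solution above (21, 40, 42, 27, 23 families with 0, 1, 2, 3, and 4 or more children); the dictionary and function names are ours.

```python
from fractions import Fraction

# Number of families by number of children; the key 4 stands for "4 or more".
counts = {0: 21, 1: 40, 2: 42, 3: 27, 4: 23}
total = sum(counts.values())

def is_median(m):
    """m is a median when Pr(X <= m) >= 1/2 and Pr(X >= m) >= 1/2."""
    at_most = sum(c for v, c in counts.items() if v <= m)
    at_least = sum(c for v, c in counts.items() if v >= m)
    return 2 * at_most >= total and 2 * at_least >= total

medians = [m for m in counts if is_median(m)]
mean_capped = Fraction(sum(v * c for v, c in counts.items()), total)
print(medians, round(float(mean_capped), 3))
```

Only the two inequalities are checked, so the way families are spread over 4, 5, ... children does not affect the median, in line with the remark in the solution.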


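5. The p.d.f. of X will be h(x) = [f(x) + g(x)]/2 for −∞ < x < ∞. Therefore, 

$E(X) = \frac{1}{2} \int_{-\infty}^{\infty} x[f(x) + g(x)]\,dx = \frac{1}{2}(\mu_f + \mu_g).$

Since $\int_{-\infty}^{1} h(x)\,dx = \int_0^1 \frac{1}{2} f(x)\,dx = \frac{1}{2}$ and $\int_2^{\infty} h(x)\,dx = \int_2^4 \frac{1}{2} g(x)\,dx = \frac{1}{2}$, it follows that every value 
of m in the interval 1 ≤ m ≤ 2 will be a median. 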


6. (a) The required value is the mean E(X), and 

$E(X) = \int_0^1 x \cdot 2x\,dx = \frac{2}{3}.$

(b) The required value is the median m, where 

$\int_0^m 2x\,dx = \frac{1}{2}.$

Therefore, m = 1/√2. 

7. (a) The required value is E(X), and 

$E(X) = \int_0^1 x\left(x + \frac{1}{2}\right)dx = \frac{7}{12}.$

(b) The required value is the median m, where 

$\int_0^m \left(x + \frac{1}{2}\right)dx = \frac{1}{2}(m^2 + m) = \frac{1}{2}.$

Therefore, m = (√5 − 1)/2. 


8. E[(X − d)⁴] = E(X⁴) − 4E(X³)d + 6E(X²)d² − 4E(X)d³ + d⁴. Since the distribution of X is symmetric 
with respect to 0, 

E(X) = E(X³) = 0. 

Therefore, 

E[(X − d)⁴] = E(X⁴) + 6E(X²)d² + d⁴. 

For any given nonnegative values of E(X⁴) and E(X²), this is a polynomial of fourth degree in d and 
it is a minimum when d = 0. 


9. (a) The required point is the mean E(X), and 
E(X) = (0.2)(−3) + (0.1)(−1) + (0.1)(0) + (0.4)(1) + (0.2)(2) = 0.1. 

(b) The required point is the median m. Since Pr(X ≤ 1) = 0.8 and Pr(X ≥ 1) = 0.6, the point 1 is 
the unique median. 

10. Let x_1 < ··· < x_n denote the locations of the n houses and let d denote the location of the store. We 
must choose d to minimize $\sum_{i=1}^n |x_i - d|$ or equivalently to minimize $\sum_{i=1}^n |x_i - d|/n$. This sum can be 
interpreted as the M.A.E. of d for a discrete distribution in which each of the n points x_1, ..., x_n has 
probability 1/n. Therefore, d should be chosen equal to a median of this distribution. If n is odd, then 
the middle value among x_1, ..., x_n is the unique median. If n is even, then any point between the two 
middle values among x_1, ..., x_n (including the two middle values themselves) will be a median. 
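
To illustrate the conclusion, the sketch below compares the total distance at the median with the total distance at a grid of other store locations. The house locations are hypothetical values chosen only for illustration, and `total_distance` is our own helper.

```python
import statistics

def total_distance(locations, d):
    """Sum of |x_i - d| over all house locations."""
    return sum(abs(x - d) for x in locations)

houses = [1.0, 2.0, 4.0, 7.0, 11.0]      # hypothetical locations, n odd
best = statistics.median(houses)          # middle value
grid = [k / 10 for k in range(0, 121)]    # candidate store locations 0.0, 0.1, ..., 12.0
print(best, total_distance(houses, best),
      min(total_distance(houses, d) for d in grid))
```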


11. The M.S.E. of any prediction is a minimum when the prediction is equal to the mean of the variable 
being predicted, and the minimum value of the M.S.E. is then the variance of the variable. It was 
shown in the derivation of Eq. (4.3.3) that the variance of the binomial distribution with parameters 
n and p is np(1—p). Therefore, the minimum M.S.E. that can be attained when predicting X is 
Var(X) = 7(1/4)(3/4) = 21/16 and the minimum M.S.E. that can be attained when predicting Y is 
Var(Y) = 5(1/2)(1/2) = 5/4 = 20/16. Thus, Y can be predicted with the smaller M.S.E. 


12. (a) The required value is the mean E(X). The random variable X will have the binomial distribution 
with parameters n = 15 and p = 0.3. Therefore, E(X) = np = 4.5. 

(b) The required value is the median of the binomial distribution with parameters n = 15 and p = 0.3. 
From the table of this distribution given in the back of the book, it is found that 4 is the unique 
median. 


13. To say that the distribution of X is symmetric around m means that X and 2m − X have the same 
distribution. That is, Pr(X ≤ x) = Pr(2m − X ≤ x) for all x. This can be rewritten as Pr(X ≤ x) = 
Pr(X ≥ 2m − x). With x = m, we see that Pr(X ≤ m) = Pr(X ≥ m). If Pr(X ≤ m) < 1/2, then 
Pr(X ≤ m) + Pr(X ≥ m) < 1, which is impossible. Hence Pr(X ≤ m) ≥ 1/2 and Pr(X ≥ m) ≥ 1/2, 
and m is a median. 

14. The Cauchy distribution is symmetric around 0, so 0 is a median by Exercise 13. Since the p.d.f. of 
the Cauchy distribution is strictly positive everywhere, the c.d.f. will be one-to-one and 0 is the unique 
median. 


15. (a) Since a is assumed to be a median, F(a) = Pr(X ≤ a) ≥ 1/2. Since b > a is assumed to be a 
median, Pr(X ≥ b) ≥ 1/2. If Pr(X ≤ a) > 1/2, then Pr(X ≤ a) + Pr(X ≥ b) > 1. But {X ≤ a} 
and {X ≥ b} are disjoint events, so the sum of their probabilities can’t be greater than 1. This 
means that F(a) > 1/2 is impossible, so F(a) = 1/2. 

(b) The c.d.f. F is nondecreasing, so A = {x : F(x) = 1/2} is an interval. Since F is continuous 
from the right, the lower endpoint c of the interval A must also be in A. For every x, Pr(X ≤ 
x) + Pr(X ≥ x) ≥ 1. For every x ∈ A, Pr(X ≤ x) = 1/2, hence it must be that Pr(X ≥ x) ≥ 1/2 
and x is a median. Let d be the upper endpoint of the interval A. We need to show that d is 
also a median. Since F is not necessarily continuous from the left, F(d) > 1/2 is possible. If 
F(d) = 1/2, then d ∈ A and d is a median by the argument just given. If F(d) > 1/2, then 
Pr(X = d) = F(d) − 1/2. This makes 

Pr(X ≥ d) = Pr(X > d) + Pr(X = d) = 1 − F(d) + F(d) − 1/2 = 1/2. 

Hence d is also a median. 

(c) If X has a discrete distribution, then clearly F must be discontinuous at d, otherwise F(x) = 1/2 
even for some x > d and d would not be the right endpoint of A. 


16. We know that 1 = Pr(X < m) + Pr(X = m) + Pr(X > m). Since Pr(X < m) = Pr(X > m), 
both Pr(X < m) ≤ 1/2 and Pr(X > m) ≤ 1/2, otherwise their sum would be more than 1. Since 
Pr(X < m) ≤ 1/2, Pr(X ≥ m) = 1 − Pr(X < m) ≥ 1/2. Similarly, Pr(X ≤ m) = 1 − Pr(X > m) ≥ 1/2. 
Hence m is a median. 

17. As in the previous problem, 1 = Pr(X < m) + Pr(X = m) + Pr(X > m). Since Pr(X < m) < 1/2 and 
Pr(X > m) < 1/2, we have Pr(X ≥ m) = 1 − Pr(X < m) > 1/2 and Pr(X ≤ m) = 1 − Pr(X > m) > 
1/2. Hence m is a median. Let k > m. Then Pr(X ≥ k) ≤ Pr(X > m) < 1/2, and k is not a median. 
Similarly, if k < m, then Pr(X ≤ k) ≤ Pr(X < m) < 1/2, and k is not a median. So, m is the unique 
median. 

18. Let m be the p quantile of X, and let r be strictly increasing. Let Y = r(X) and let G(y) be the c.d.f. 
of Y while F(x) is the c.d.f. of X. Since Y ≤ y if and only if r(X) ≤ y if and only if X ≤ r^{-1}(y), we 
have G(y) = F(r^{-1}(y)). The p quantile of Y is the smallest element of the set 

C_p = {y : G(y) ≥ p} = {y : F(r^{-1}(y)) ≥ p} = {r(x) : F(x) ≥ p}. 

Also, m is the smallest x such that F(x) ≥ p. Because r is strictly increasing, the smallest r(x) such 
that F(x) ≥ p is r(m). Hence, r(m) is the smallest number in C_p. 
4.6 Covariance and Correlation 


Solutions to Exercises 


1. The location of the circle makes no difference since it only affects the means of X and Y. So, we 
shall assume that the circle is centered at (0,0). As in Example 4.6.5, Cov(X,Y) = 0. It follows that 
ρ(X, Y) = 0 also. 


2. We shall follow the hint given in this exercise. The relation [(X − μ_X) − (Y − μ_Y)]² ≥ 0 implies that 

$(X - \mu_X)(Y - \mu_Y) \le \frac{1}{2}[(X - \mu_X)^2 + (Y - \mu_Y)^2].$

Similarly, the relation [(X − μ_X) + (Y − μ_Y)]² ≥ 0 implies that 

$-(X - \mu_X)(Y - \mu_Y) \le \frac{1}{2}[(X - \mu_X)^2 + (Y - \mu_Y)^2].$

Hence, it follows that 

$|(X - \mu_X)(Y - \mu_Y)| \le \frac{1}{2}[(X - \mu_X)^2 + (Y - \mu_Y)^2].$

By taking expectations on both sides of this relation, we find that 

$E[|(X - \mu_X)(Y - \mu_Y)|] \le \frac{1}{2}[\mathrm{Var}(X) + \mathrm{Var}(Y)] < \infty.$

Since the expectation on the left side of this relation is finite, it follows that 
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] 
exists and is finite. 


3. Since the p.d.f. of X is symmetric with respect to 0, it follows that E(X) = 0 and that E(X^k) = 0 for 
every odd positive integer k. Therefore, E(XY) = E(X^7) = 0. Since E(XY) = 0 and E(X)E(Y) = 0, 
it follows that Cov(X, Y) = 0 and ρ(X, Y) = 0. 

4. It follows from the assumption that 0 < E(X⁴) < ∞, that 0 < σ_X² < ∞ and 0 < σ_Y² < ∞. Hence, 
ρ(X, Y) is well defined. Since the distribution of X is symmetric with respect to 0, E(X) = 0 and 
E(X³) = 0. Therefore, E(XY) = E(X³) = 0. It now follows that Cov(X, Y) = 0 and ρ(X, Y) = 0. 

5. We have E(aX + b) = aμ_X + b and E(cY + d) = cμ_Y + d. Therefore, 

Cov(aX + b, cY + d) = E[(aX + b − aμ_X − b)(cY + d − cμ_Y − d)] 
= E[ac(X − μ_X)(Y − μ_Y)] = ac Cov(X, Y). 

6. By Exercise 5, Cov(U, V) = ac Cov(X, Y). Also, Var(U) = a²σ_X² and Var(V) = c²σ_Y². Hence 

$\rho(U, V) = \frac{ac\,\mathrm{Cov}(X, Y)}{|a|\sigma_X \cdot |c|\sigma_Y} = \begin{cases} \rho(X, Y) & \text{if } ac > 0, \\ -\rho(X, Y) & \text{if } ac < 0. \end{cases}$
7. We have E(aX + bY + c) = aμ_X + bμ_Y + c. Therefore, 

Cov(aX + bY + c, Z) = E[(aX + bY + c − aμ_X − bμ_Y − c)(Z − μ_Z)] 
= E{[a(X − μ_X) + b(Y − μ_Y)](Z − μ_Z)} 
= aE[(X − μ_X)(Z − μ_Z)] + bE[(Y − μ_Y)(Z − μ_Z)] 
= a Cov(X, Z) + b Cov(Y, Z). 

8. We have 

$\mathrm{Cov}\left(\sum_{i=1}^m a_i X_i, \sum_{j=1}^n b_j Y_j\right) = E\left[\sum_{i=1}^m a_i (X_i - \mu_{X_i}) \sum_{j=1}^n b_j (Y_j - \mu_{Y_j})\right]$
$= E\left[\sum_{i=1}^m \sum_{j=1}^n a_i b_j (X_i - \mu_{X_i})(Y_j - \mu_{Y_j})\right]$
$= \sum_{i=1}^m \sum_{j=1}^n a_i b_j E[(X_i - \mu_{X_i})(Y_j - \mu_{Y_j})]$
$= \sum_{i=1}^m \sum_{j=1}^n a_i b_j\,\mathrm{Cov}(X_i, Y_j).$

9. Let U = X + Y and V = X − Y. Then 
E(UV) = E[(X + Y)(X − Y)] = E(X² − Y²) = E(X²) − E(Y²). 
Also, 
E(U)E(V) = E(X + Y)E(X − Y) = (μ_X + μ_Y)(μ_X − μ_Y) = μ_X² − μ_Y². 
Therefore, 
Cov(U, V) = E(UV) − E(U)E(V) = [E(X²) − μ_X²] − [E(Y²) − μ_Y²] 
= Var(X) − Var(Y) = 0. 
It follows that ρ(U, V) = 0. 

10. Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) and Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y). 
Since Cov(X, Y) < 0, it follows that 

Var(X + Y) < Var(X − Y). 

11. For the given values, 

Var(X) = E(X²) − [E(X)]² = 10 − 9 = 1, 
Var(Y) = E(Y²) − [E(Y)]² = 29 − 4 = 25, 
Cov(X, Y) = E(XY) − E(X)E(Y) = 0 − 6 = −6. 

Therefore, 

ρ(X, Y) = −6/[(1)(5)] = −6/5, which is impossible. 


12. 
$E(X) = \int_0^1 \int_0^2 x \cdot \frac{1}{3}(x + y)\,dy\,dx = \frac{5}{9},$
$E(Y) = \int_0^1 \int_0^2 y \cdot \frac{1}{3}(x + y)\,dy\,dx = \frac{11}{9},$
$E(X^2) = \int_0^1 \int_0^2 x^2 \cdot \frac{1}{3}(x + y)\,dy\,dx = \frac{7}{18},$
$E(Y^2) = \int_0^1 \int_0^2 y^2 \cdot \frac{1}{3}(x + y)\,dy\,dx = \frac{16}{9},$
$E(XY) = \int_0^1 \int_0^2 xy \cdot \frac{1}{3}(x + y)\,dy\,dx = \frac{2}{3}.$

Therefore, 

Var(X) = 7/18 − (5/9)² = 13/162, 
Var(Y) = 16/9 − (11/9)² = 23/81, 
Cov(X, Y) = 2/3 − (5/9)(11/9) = −1/81. 

It now follows that 

Var(2X − 3Y + 8) = 4 Var(X) + 9 Var(Y) − (2)(2)(3) Cov(X, Y) = 245/81. 
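
The fractions in this solution are easy to mistype, so a short exact check may be useful. The sketch below takes the five moments computed above as inputs and recomputes the variances, the covariance, and Var(2X − 3Y + 8); the variable names are ours.

```python
from fractions import Fraction as F

# Moments taken from the integrals in the solution above.
ex, ey = F(5, 9), F(11, 9)
ex2, ey2, exy = F(7, 18), F(16, 9), F(2, 3)

var_x = ex2 - ex**2
var_y = ey2 - ey**2
cov_xy = exy - ex * ey
# Var(aX + bY + c) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y), with a = 2, b = -3.
var_combo = 4 * var_x + 9 * var_y + 2 * 2 * (-3) * cov_xy
print(var_x, var_y, cov_xy, var_combo)
```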
13. Cov(X, Y) = ρ(X, Y)σ_Xσ_Y = −(1/6)(3)(2) = −1. 

(a) Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) = 11. 
(b) Var( 

14. (a) Var(X + Y + Z) = Var(X) + Var(Y) + Var(Z) + 2 Cov(X, Y) + 2 Cov(X, Z) + 2 Cov(Y, Z) = 17. 
(b) Var( 


59. 


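15. Since each variance is equal to 1 and each covariance is equal to 1/4, 

$\mathrm{Var}(X_1 + \cdots + X_n) = \sum_i \mathrm{Var}(X_i) + 2 \sum\sum_{i<j} \mathrm{Cov}(X_i, X_j) = n(1) + 2 \cdot \frac{n(n-1)}{2} \cdot \frac{1}{4} = n + \frac{n(n-1)}{4}.$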


16. We need the cost to be 6000 dollars, so that 50s_1 + 30s_2 = 6000. We also need the variance to be 0. 
The variance of s_1R_1 + s_2R_2 is 

s_1² Var(R_1) + s_2² Var(R_2) + 2s_1s_2 Cov(R_1, R_2). 

The variances of R_1 and R_2 are Var(R_1) = 75 and Var(R_2) = 17.52. Since the correlation between 
R_1 and R_2 is −1, their covariance is −1(75)^{1/2}(17.52)^{1/2} = −36.25. To make the variance 0, we need 
75s_1² + 17.52s_2² − 2(36.25)s_1s_2 = 0. This equation can be rewritten (75^{1/2}s_1 − 17.52^{1/2}s_2)² = 0. So, we 
need to solve the two equations 

75^{1/2}s_1 − 17.52^{1/2}s_2 = 0, and 50s_1 + 30s_2 = 6000. 

The solution is s_1 = 53.54 and s_2 = 110.77. The reason that such a portfolio is unrealistic is that it 
has positive mean (1126.2) but zero variance, that is one can earn money with no risk. Such a “money 
pump” would surely dry up the moment anyone recognized it. 
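
The two equations can be solved directly, and the resulting variance checked. This is a minimal sketch using the numbers in the solution above (variances 75 and 17.52, prices 50 and 30, budget 6000, correlation −1); the variable names are ours.

```python
import math

var1, var2 = 75.0, 17.52
price1, price2, budget = 50.0, 30.0, 6000.0

# sqrt(var1)*s1 - sqrt(var2)*s2 = 0  gives  s2 = ratio * s1.
ratio = math.sqrt(var1) / math.sqrt(var2)
s1 = budget / (price1 + price2 * ratio)   # from 50*s1 + 30*s2 = 6000
s2 = ratio * s1

cov = -math.sqrt(var1 * var2)             # correlation -1
variance = s1**2 * var1 + s2**2 * var2 + 2 * s1 * s2 * cov
print(round(s1, 2), round(s2, 2), variance)
```

The printed variance is zero up to floating-point rounding, which is the defining property of the portfolio.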


17. Let $\mu_X = E(X)$ and $\mu_Y = E(Y)$. Apply Theorem 4.6.2 with $U = X - \mu_X$ and $V = Y - \mu_Y$. Then
(4.6.4) becomes
\[
\operatorname{Cov}(X,Y)^2 \le \operatorname{Var}(X)\operatorname{Var}(Y). \tag{S.4.3}
\]
Now $|\rho(X,Y)| = 1$ is equivalent to equality in (S.4.3). According to Theorem 4.6.2, we get equality
in (4.6.4) and (S.4.3) if and only if there exist constants $a$ and $b$ such that $aU + bV = 0$, that is
$a(X - \mu_X) + b(Y - \mu_Y) = 0$, with probability 1. So $|\rho(X, Y)| = 1$ implies $aX + bY = a\mu_X + b\mu_Y$ with
probability 1.


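18. The means of X and Y are the same since $f(x,y) = f(y,x)$ for all $x$ and $y$. The mean of X (and the
mean of Y) is
\[
E(X) = \int_0^1\!\int_0^1 x(x + y)\,dx\,dy = \int_0^1 \left(\frac{1}{3} + \frac{y}{2}\right)dy = \frac{1}{3} + \frac{1}{4} = \frac{7}{12}.
\]
Also,
\[
E(XY) = \int_0^1\!\int_0^1 xy(x + y)\,dx\,dy = \int_0^1 \left(\frac{y}{3} + \frac{y^2}{2}\right)dy = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}.
\]
So,
\[
\operatorname{Cov}(X,Y) = \frac{1}{3} - \left(\frac{7}{12}\right)^2 = -\frac{1}{144} \approx -0.00694.
\]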


4.7 Conditional Expectation 


Solutions to Exercises 


1. The M.S.E. after observing $X = 18$ is $\operatorname{Var}(P \mid 18) = 19 \times (41 - 18)/[42^2 \times 43] = 0.00576$. This is about
seven percent of the marginal M.S.E.


2. If X denotes the score of the selected student, then 
E(X) = E[E(X | School)] = (0.2)(80) + (0.3)(76) + (0.5)(84) = 80.8. 


3. Since $E(X \mid Y) = c$, then $E(X) = E[E(X \mid Y)] = c$ and
\[
E(XY) = E[E(XY \mid Y)] = E[Y E(X \mid Y)] = E(cY) = cE(Y).
\]
Therefore,
\[
\operatorname{Cov}(X,Y) = E(XY) - E(X)E(Y) = cE(Y) - cE(Y) = 0.
\]

4. Since X is symmetric with respect to 0, $E(X^k) = 0$ for every odd integer $k$. Therefore,
\[
E(X^{2m}Y) = E[E(X^{2m}Y \mid X)] = E[X^{2m}E(Y \mid X)] = E(aX^{2m+1} + bX^{2m}) = bE(X^{2m}).
\]
Also,
\[
E(Y) = E[E(Y \mid X)] = E(aX + b) = b.
\]
It follows that
\[
\operatorname{Cov}(X^{2m}, Y) = E(X^{2m}Y) - E(X^{2m})E(Y) = bE(X^{2m}) - bE(X^{2m}) = 0.
\]

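5. For any given value $x_{n-1}$ of $X_{n-1}$, $E(X_n \mid x_{n-1})$ will be the midpoint of the interval $(x_{n-1}, 1)$. Therefore,
\[
E(X_n \mid X_{n-1}) = \frac{1 + X_{n-1}}{2}.
\]
It follows that
\[
E(X_n) = E[E(X_n \mid X_{n-1})] = \frac{1}{2} + \frac{1}{2}E(X_{n-1}).
\]
Similarly, $E(X_{n-1}) = \frac{1}{2} + \frac{1}{2}E(X_{n-2})$, etc. Since $E(X_1) = \frac{1}{2}$, we obtain
\[
E(X_n) = \frac{1}{2} + \frac{1}{4} + \cdots + \frac{1}{2^n} = 1 - \frac{1}{2^n}.
\]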


6. The joint p.d.f. of X and Y is
\[
f(x,y) = \begin{cases} \dfrac{1}{\pi} & \text{for } x^2 + y^2 < 1,\\[4pt] 0 & \text{otherwise.} \end{cases}
\]
Therefore, for any given value of $y$ in the interval $-1 < y < 1$, the conditional p.d.f. of X given that
$Y = y$ will be of the form
\[
g_1(x \mid y) = \frac{f(x,y)}{f_2(y)} = \begin{cases} \dfrac{1}{\pi f_2(y)} & \text{for } -\sqrt{1-y^2} < x < \sqrt{1-y^2},\\[4pt] 0 & \text{otherwise.} \end{cases}
\]
For each given value of $y$, this conditional p.d.f. is a constant over an interval of values of $x$ symmetric
with respect to $x = 0$. Therefore, $E(X \mid y) = 0$ for each value of $y$.


7. The marginal p.d.f. of X is
\[
f_1(x) = \int_0^1 (x + y)\,dy = x + \frac{1}{2} \quad\text{for } 0 \le x \le 1.
\]
Therefore, for $0 \le x \le 1$, the conditional p.d.f. of Y given that $X = x$ is
\[
g(y \mid x) = \frac{f(x,y)}{f_1(x)} = \frac{2(x + y)}{2x + 1} \quad\text{for } 0 \le y \le 1.
\]
Hence,
\[
E(Y \mid x) = \int_0^1 \frac{2(xy + y^2)}{2x+1}\,dy = \frac{3x + 2}{3(2x + 1)},
\qquad
E(Y^2 \mid x) = \int_0^1 \frac{2(xy^2 + y^3)}{2x+1}\,dy = \frac{4x + 3}{6(2x + 1)},
\]
and
\[
\operatorname{Var}(Y \mid x) = \frac{4x+3}{6(2x+1)} - \left[\frac{3x+2}{3(2x+1)}\right]^2 = \frac{1}{36}\left[3 - \frac{1}{(2x+1)^2}\right].
\]

8. The prediction is $E\left(Y \mid X = \frac{1}{2}\right) = \frac{7}{12}$ and the M.S.E. is $\operatorname{Var}\left(Y \mid X = \frac{1}{2}\right) = \frac{11}{144}$.


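9. The overall M.S.E. is
\[
E[\operatorname{Var}(Y \mid X)] = \int_0^1 \frac{1}{36}\left[3 - \frac{1}{(2x+1)^2}\right] f_1(x)\,dx.
\]
It was found in the solution of Exercise 7 that
\[
f_1(x) = x + \frac{1}{2} \quad\text{for } 0 \le x \le 1.
\]
Therefore, it can be found that $E[\operatorname{Var}(Y \mid X)] = \dfrac{1}{12} - \dfrac{\log 3}{144}$.

10. It was found in Exercise 9 that when Y is predicted from X, the overall M.S.E. is $\dfrac{1}{12} - \dfrac{\log 3}{144}$. Therefore,
the total loss would be
\[
\frac{1}{12} - \frac{\log 3}{144} + c.
\]
If Y is predicted without using X, the M.S.E. is $\operatorname{Var}(Y)$. It is found that
\[
E(Y) = \int_0^1\!\int_0^1 y(x + y)\,dx\,dy = \frac{7}{12}
\]
and
\[
E(Y^2) = \int_0^1\!\int_0^1 y^2(x + y)\,dx\,dy = \frac{5}{12}.
\]
Hence, $\operatorname{Var}(Y) = \dfrac{5}{12} - \left(\dfrac{7}{12}\right)^2 = \dfrac{11}{144}$. The total loss when X is used for predicting Y will be less than
$\operatorname{Var}(Y)$ if and only if $c < \dfrac{\log 3 - 1}{144}$.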


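11. Let $E(Y) = \mu_Y$. Then
\[
\begin{aligned}
\operatorname{Var}(Y) &= E[(Y - \mu_Y)^2] = E\{[(Y - E(Y \mid X)) + (E(Y \mid X) - \mu_Y)]^2\}\\
&= E\{[Y - E(Y \mid X)]^2\} + 2E\{[Y - E(Y \mid X)][E(Y \mid X) - \mu_Y]\} + E\{[E(Y \mid X) - \mu_Y]^2\}.
\end{aligned}
\]
We shall now consider further each of the three expectations in the final sum. First,
\[
E\{[Y - E(Y \mid X)]^2\} = E\big(E\{[Y - E(Y \mid X)]^2 \mid X\}\big) = E[\operatorname{Var}(Y \mid X)].
\]
Next,
\[
\begin{aligned}
E\{[Y - E(Y \mid X)][E(Y \mid X) - \mu_Y]\} &= E\big(E\{[Y - E(Y \mid X)][E(Y \mid X) - \mu_Y] \mid X\}\big)\\
&= E\big([E(Y \mid X) - \mu_Y]\,E\{Y - E(Y \mid X) \mid X\}\big)\\
&= E\big([E(Y \mid X) - \mu_Y]\cdot 0\big)\\
&= 0.
\end{aligned}
\]
Finally, since the mean of $E(Y \mid X)$ is $E[E(Y \mid X)] = \mu_Y$, we have
\[
E\{[E(Y \mid X) - \mu_Y]^2\} = \operatorname{Var}[E(Y \mid X)].
\]
It now follows that
\[
\operatorname{Var}(Y) = E[\operatorname{Var}(Y \mid X)] + \operatorname{Var}[E(Y \mid X)].
\]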
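
The decomposition can be checked numerically for the density $f(x,y) = x + y$ on the unit square used in Exercises 7 to 10. The sketch below is illustrative only: it relies on the conditional mean $(3x+2)/[3(2x+1)]$ and conditional variance $\frac{1}{36}[3 - (2x+1)^{-2}]$ from Exercise 7, and the midpoint-rule helper is a hypothetical convenience, not something from the text.

```python
import math

def midpoint_integral(func, a, b, steps=20000):
    """Approximate the integral of func over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

def marginal_x(x):
    return x + 0.5                                  # f1(x) for f(x, y) = x + y

def cond_mean(x):
    return (3 * x + 2) / (3 * (2 * x + 1))          # E(Y | x)

def cond_var(x):
    return (3 - 1 / (2 * x + 1) ** 2) / 36          # Var(Y | x)

mean_y = midpoint_integral(lambda x: cond_mean(x) * marginal_x(x), 0, 1)
expected_cond_var = midpoint_integral(lambda x: cond_var(x) * marginal_x(x), 0, 1)
var_cond_mean = midpoint_integral(
    lambda x: (cond_mean(x) - mean_y) ** 2 * marginal_x(x), 0, 1
)
total_variance = expected_cond_var + var_cond_mean
print(mean_y, expected_cond_var, total_variance)
```

The two pieces should add up to $\operatorname{Var}(Y) = 11/144$, with the first piece matching the overall M.S.E. $1/12 - (\log 3)/144$ from Exercise 9.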
12. Since $E(Y) = E[E(Y \mid X)]$, then
\[
E(Y) = aE(X) + b.
\]
Also, as found in Example 4.7.7,
\[
E(XY) = aE(X^2) + bE(X).
\]
By solving these two equations simultaneously for $a$ and $b$ we obtain
\[
a = \frac{E(XY) - E(X)E(Y)}{E(X^2) - [E(X)]^2} = \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)}
\]
and
\[
b = E(Y) - aE(X).
\]


13. (a) The prediction is the mean of Y:
\[
E(Y) = \int_0^1\!\int_0^1 y\cdot\frac{2}{5}(2x + 3y)\,dx\,dy = \frac{3}{5}.
\]
(b) The prediction is the median $m$ of X. The marginal p.d.f. of X is
\[
f_1(x) = \int_0^1 \frac{2}{5}(2x + 3y)\,dy = \frac{1}{5}(4x + 3) \quad\text{for } 0 \le x \le 1.
\]
We must have
\[
\int_0^m \frac{1}{5}(4x + 3)\,dx = \frac{1}{2}.
\]
Therefore, $4m^2 + 6m - 5 = 0$ and $m = \dfrac{\sqrt{29} - 3}{4}$.


14. First,
\[
E(XY) = \int_0^1\!\int_0^1 xy\cdot\frac{2}{5}(2x + 3y)\,dx\,dy = \frac{1}{3}.
\]
Next, the marginal p.d.f. $f_1$ of X was found in Exercise 13(b). Therefore,
\[
E(X) = \int_0^1 x f_1(x)\,dx = \frac{17}{30}.
\]
Furthermore, it was found in Exercise 13(a) that $E(Y) = 3/5$. It follows that $\operatorname{Cov}(X,Y) = 1/3 - (17/30)(3/5) = 17/51 - 17/50 < 0$. Therefore, X and Y are negatively correlated.


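15. (a) For $0 \le x \le 1$ and $0 \le y \le 1$, the conditional p.d.f. of Y given that $X = x$ is
\[
g(y \mid x) = \frac{f(x,y)}{f_1(x)} = \frac{2(2x + 3y)}{4x + 3}.
\]
When $X = 0.8$, the prediction of Y is
\[
E(Y \mid X = 0.8) = \int_0^1 y\,g(y \mid 0.8)\,dy = \int_0^1 \frac{y(1.6 + 3y)}{3.1}\,dy = \frac{1.8}{3.1} \approx 0.5806.
\]
(b) The marginal p.d.f. of Y is
\[
f_2(y) = \int_0^1 \frac{2}{5}(2x + 3y)\,dx = \frac{2}{5}(1 + 3y) \quad\text{for } 0 \le y \le 1.
\]
Therefore, for $0 \le x \le 1$ and $0 \le y \le 1$, the conditional p.d.f. of X given that $Y = y$ is
\[
h(x \mid y) = \frac{f(x,y)}{f_2(y)} = \frac{2x + 3y}{1 + 3y}.
\]
When $Y = 1/3$, the prediction of X is the median $m$ of the conditional p.d.f. $h(x \mid y = 1/3)$. We
must have
\[
\int_0^m \frac{2x + 1}{2}\,dx = \frac{1}{2}.
\]
Hence, $m^2 + m = 1$ and $m = (\sqrt{5} - 1)/2$.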


16. Rather than repeat the entire proof of Theorem 4.7.3 with the necessary changes, we shall merely point
out what changes need to be made. Let $d(X)$ be a conditional median of Y given X. Replace all
squared differences by absolute differences. For example, $[Y - d(X)]^2$ becomes $|Y - d(X)|$, $[Y - d^*(x)]^2$
becomes $|Y - d^*(x)|$, and so on. When we refer to Sec. 4.5 near the end of the proof, replace each
“M.S.E.” by “M.A.E.” and replace the word “mean” by “median” each time it appears in the last four
sentences.

17. Let $Z = r(X,Y)$, and let $(X,Y)$ have joint p.f. $f(x,y)$. Also, let $W = r(x_0, Y)$, for some possible value
$x_0$ of X. We need to show that the conditional p.f. of Z given $X = x_0$ is the same as the conditional
p.f. of W given $X = x_0$ for all $x_0$.

Let $f_1(x)$ be the marginal p.f. of X. For each possible value $(z,x)$ of $(Z,X)$, define $B_{(z,x)} = \{y : r(x,y) = z\}$. Then $(Z,X) = (z,x)$ if and only if $X = x$ and $Y \in B_{(z,x)}$. The joint p.f. of $(Z, X)$ is then
\[
g(z,x) = \sum_{y \in B_{(z,x)}} f(x,y).
\]
The conditional p.f. of Z given $X = x_0$ is $g_1(z \mid x_0) = g(z,x_0)/f_1(x_0)$ for all $z$ and all $x_0$.

Next, notice that $(W,X) = (w,x)$ if and only if $X = x$ and $Y \in B_{(w,x_0)}$. The joint p.f. of $(W, X)$ is
then
\[
h(w,x) = \sum_{y \in B_{(w,x_0)}} f(x,y).
\]
The conditional p.f. of W given $X = x$ is $h_1(w \mid x) = h(w, x)/f_1(x)$. Now, for $x = x_0$, we get $h_1(w \mid x_0) = h(w,x_0)/f_1(x_0)$. But $h(w, x_0) = g(w, x_0)$ for all $w$ and all $x_0$. Hence $h_1(w \mid x_0) = g_1(w \mid x_0)$ for all $w$ and
all $x_0$. This is the desired conclusion.


4.8 Utility 


Commentary 


It is interesting to have the students in the class determine their own utility functions for any possible gain
between, say, 0 dollars and 100 dollars; in other words, to have each student determine their own function
$U(x)$ for $0 \le x \le 100$. One method for determining various points on a person’s utility function is as follows:

First, notice that if $U(x)$ is a person’s utility function, then the function $V(x) = aU(x) + b$, where $a$ and
$b$ are constants with $a > 0$, could also be used as the person’s utility function, because for any two gambles X
and Y, we will have $E[U(X)] > E[U(Y)]$ if and only if $E[V(X)] = aE[U(X)] + b > E[V(Y)] = aE[U(Y)] + b$.
Therefore, the function V reflects exactly the same preferences as U. The effect of being able to transform
a person’s utility function in this way by choosing any constants $a > 0$ and $b$ is that we can arbitrarily fix
the values of the person’s utility function at the two points $x = 0$ and $x = 100$, as long as we use values such
that $U(0) < U(100)$. For convenience, we shall assume that $U(0) = 0$ and $U(100) = 100$.

Now determine a value $x_1$ such that the person is indifferent between accepting a gamble from which the
gain will be either 100 dollars with probability 1/2 or 0 dollars with probability 1/2 and accepting $x_1$ dollars
as a sure thing. For this value $x_1$, we must have
\[
U(x_1) = \frac{1}{2}U(0) + \frac{1}{2}U(100) = \frac{1}{2}\cdot 0 + \frac{1}{2}\cdot 100 = 50.
\]
Hence, $U(x_1) = 50$.

Next, we might determine a value $x_2$ such that the person is indifferent between accepting a gamble from
which the gain will be either $x_1$ dollars with probability 1/2 or 0 dollars with probability 1/2 and accepting
$x_2$ dollars as a sure thing. For this value $x_2$, we must have
\[
U(x_2) = \frac{1}{2}U(x_1) + \frac{1}{2}U(0) = \frac{1}{2}\cdot 50 + \frac{1}{2}\cdot 0 = 25.
\]
Hence, $U(x_2) = 25$.

Similarly, we can determine a value $x_3$ such that the person is indifferent between accepting a gamble
from which the gain will be either $x_1$ dollars with probability 1/2 or 100 dollars with probability 1/2 and
accepting $x_3$ dollars as a sure thing. For this value $x_3$, we must have
\[
U(x_3) = \frac{1}{2}U(x_1) + \frac{1}{2}U(100) = \frac{1}{2}\cdot 50 + \frac{1}{2}\cdot 100 = 75.
\]
Hence, $U(x_3) = 75$.

By continuing in this way, arbitrarily many points on a person’s utility function can be determined and
the curve $U(x)$ for $0 \le x \le 100$ can then be sketched. The difficulty is in having the person determine the
values of $x_1, x_2, x_3$, etc., honestly and accurately in a hypothetical situation where he will not actually have
to gamble. For this reason, it is necessary to check and recheck the values that are determined. For example,
since
\[
U(x_1) = 50 = \frac{1}{2}U(x_2) + \frac{1}{2}U(x_3),
\]
the person should be indifferent between accepting $x_1$ dollars as a sure thing and accepting a gamble from
which the gain will be either $x_2$ dollars with probability 1/2 or $x_3$ dollars with probability 1/2. By repeatedly
carrying out checks of this type and allowing the person to adjust his answers, a reasonably accurate
representation of a person’s utility function can usually be obtained.
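
The elicitation procedure described above can be sketched in a few lines of code. This is only an illustration under stated assumptions: `certainty_equivalent` is a hypothetical callback standing in for the person's answers, and the simulated respondent with utility $10\sqrt{x}$ is invented for the example.

```python
def elicit_utility_points(certainty_equivalent, low=0.0, high=100.0, depth=2):
    """Return {x: U(x)} with U(low) = 0 and U(high) = 100, refined by 50-50 gambles.

    certainty_equivalent(a, b) is a hypothetical callback: it returns the sure
    amount the person judges equivalent to a 50-50 gamble between a and b.
    """
    points = {low: 0.0, high: 100.0}
    intervals = [(low, high)]
    for _ in range(depth):
        next_intervals = []
        for a, b in intervals:
            x = certainty_equivalent(a, b)
            # Indifference forces U(x) to be the average of the endpoint utilities.
            points[x] = 0.5 * points[a] + 0.5 * points[b]
            next_intervals += [(a, x), (x, b)]
        intervals = next_intervals
    return dict(sorted(points.items()))

def sqrt_person(a, b):
    """Simulated respondent whose true utility is 10 * sqrt(x) on [0, 100]."""
    return ((a ** 0.5 + b ** 0.5) / 2) ** 2

points = elicit_utility_points(sqrt_person)
for amount, utility in points.items():
    print(amount, utility)
```

With `depth=2` the sketch produces the points called $x_1$, $x_2$, and $x_3$ in the commentary, with utilities 50, 25, and 75. A real elicitation would also include the consistency checks the commentary recommends.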


Solutions to Exercises 


1. The utility of not buying the ticket is $U(0) = 0$. If the decision maker buys the ticket, the utility is
$U(499)$ if the ticket is a winner and $U(-1)$ if the ticket is a loser. That is, the utility is $499^\alpha$ with
probability 0.001 and it is $-1$ with probability 0.999. The expected utility is then $0.001 \times 499^\alpha - 0.999$.
The decision maker prefers buying the ticket if this expected utility is greater than 0. Setting the
expected utility greater than 0 means $499^\alpha > 999$. Taking logarithms of both sides yields $\alpha > 1.11$.


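2.
\[
\begin{aligned}
E[U(X)] &= \tfrac{1}{2}\cdot 5^2 + \tfrac{1}{2}\cdot 25^2 = 325,\\
E[U(Y)] &= \tfrac{1}{2}\cdot 10^2 + \tfrac{1}{2}\cdot 20^2 = 250,\\
E[U(Z)] &= 15^2 = 225.
\end{aligned}
\]
Hence, X is preferred.

3.
\[
\begin{aligned}
E[U(X)] &= \tfrac{1}{2}\sqrt{5} + \tfrac{1}{2}\sqrt{25} = 3.618,\\
E[U(Y)] &= \tfrac{1}{2}\sqrt{10} + \tfrac{1}{2}\sqrt{20} = 3.817,\\
E[U(Z)] &= \sqrt{15} = 3.873.
\end{aligned}
\]
Hence, Z is preferred.

4. For any gamble X, $E[U(X)] = aE(X) + b$. Therefore, among any set of gambles, the one for which the
expected gain is largest will be preferred. We have
\[
\begin{aligned}
E(X) &= \tfrac{1}{2}\cdot 5 + \tfrac{1}{2}\cdot 25 = 15,\\
E(Y) &= \tfrac{1}{2}\cdot 10 + \tfrac{1}{2}\cdot 20 = 15,\\
E(Z) &= 15.
\end{aligned}
\]
Hence, all three gambles are equally preferred.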
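
The three comparisons in Exercises 2 to 4 use the same gambles under different utility functions, so they can be reproduced with one helper. This is a small sketch; the dictionary layout and the constants $a = 2$, $b = 1$ for the linear utility are arbitrary illustrative choices, since any $a > 0$ gives the same ranking.

```python
import math

# Gambles from Exercises 2-4, written as lists of (gain, probability) pairs
gambles = {
    "X": [(5, 0.5), (25, 0.5)],
    "Y": [(10, 0.5), (20, 0.5)],
    "Z": [(15, 1.0)],
}

def expected_utility(gamble, utility):
    """E[U(G)] for a discrete gamble G."""
    return sum(prob * utility(gain) for gain, prob in gamble)

def ranking(utility):
    """Expected utility of each gamble under the given utility function."""
    return {name: expected_utility(g, utility) for name, g in gambles.items()}

quadratic = ranking(lambda x: x ** 2)       # Exercise 2: U(x) = x^2
square_root = ranking(math.sqrt)            # Exercise 3: U(x) = sqrt(x)
linear = ranking(lambda x: 2 * x + 1)       # Exercise 4 with hypothetical a = 2, b = 1

print(quadratic, square_root, linear)
```

The convex utility favors the riskiest gamble X, the concave utility favors the sure amount Z, and the linear utility is indifferent among all three.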
5. Since the person is indifferent between the gamble and the sure thing,
\[
U(50) = \frac{1}{3}U(0) + \frac{2}{3}U(100) = \frac{1}{3}\cdot 0 + \frac{2}{3}\cdot 1 = \frac{2}{3}.
\]
6. Since the person is indifferent between X and Y, E[U(X)] = E[U(Y)]. Therefore, 
(0.6)U(—1) + (0.2)U(0) + (0.2)U(2) = (0.9)U(0) + (0.1)U (1). 


It follows from the given values that U(—1) = 23/6. 


7. For any given value of $a$,
\[
E[U(X)] = p\log a + (1 - p)\log(1 - a).
\]
The maximum of this expected utility can be found by elementary differentiation. We have
\[
\frac{\partial E[U(X)]}{\partial a} = \frac{p}{a} - \frac{1-p}{1-a}.
\]
When this derivative is set equal to 0, we find that $a = p$. Since
\[
\frac{\partial^2 E[U(X)]}{\partial a^2} = -\frac{p}{a^2} - \frac{1-p}{(1-a)^2} < 0,
\]
it follows that $E[U(X)]$ is a maximum when $a = p$.
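8. For any given value of $a$,
\[
E[U(X)] = pa^{1/2} + (1-p)(1-a)^{1/2}.
\]
Therefore,
\[
\frac{\partial E[U(X)]}{\partial a} = \frac{p}{2a^{1/2}} - \frac{1-p}{2(1-a)^{1/2}}.
\]
When this derivative is set equal to 0, we find that
\[
a = \frac{p^2}{p^2 + (1-p)^2}.
\]
Since $\dfrac{\partial^2 E[U(X)]}{\partial a^2} < 0$, it follows that $E[U(X)]$ is a maximum at this value of $a$.
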
132 Chapter 4. Expectation 


Figure S.4.2: Figure for Exercise 9 of Sec. 4.8, sketches (i) and (ii).


9. For any given value of $a$,
\[
E[U(X)] = pa + (1 - p)(1 - a).
\]
This is a linear function of $a$. If $p < 1/2$, it has the form shown in sketch (i) of Fig. S.4.2.
Therefore, $E[U(X)]$ is a maximum when $a = 0$. If $p > 1/2$, it has the form shown in sketch (ii) of
Fig. S.4.2. Therefore, $E[U(X)]$ is a maximum when $a = 1$. If $p = 1/2$, then $E[U(X)] = 1/2$ for all
values of $a$.


10. The person will prefer $X_3$ to $X_4$ if and only if
\[
E[U(X_3)] = (0.3)U(0) + (0.3)U(1) + (0.4)U(2) > E[U(X_4)] = (0.5)U(0) + (0.5)U(2).
\]
Therefore, the person will prefer $X_3$ to $X_4$ if and only if
\[
(0.2)U(0) - (0.3)U(1) + (0.1)U(2) < 0.
\]
Since the person prefers $X_1$ to $X_2$, we know that
\[
E[U(X_1)] > E[U(X_2)],
\]
which, after the given distributions of $X_1$ and $X_2$ are substituted and terms are collected, implies that
\[
(0.2)U(0) - (0.3)U(1) + (0.1)U(2) < 0.
\]
This is precisely the inequality which was needed to conclude that the person will prefer $X_3$ to $X_4$.


11. For any given value of $b$,
\[
E[U(X)] = p\log(A + b) + (1 - p)\log(A - b).
\]
Therefore,
\[
\frac{\partial E[U(X)]}{\partial b} = \frac{p}{A+b} - \frac{1-p}{A-b}.
\]
When this derivative is set equal to 0, we find that
\[
b = (2p - 1)A.
\]
Since $\dfrac{\partial^2 E[U(X)]}{\partial b^2} < 0$, this value of $b$ does yield a maximum value of $E[U(X)]$. If $p \ge 1/2$, this value of
$b$ lies between 0 and A as required. However, if $p < 1/2$, this value of $b$ is negative and not permissible.
In this case, it can be shown that the maximum value of $E[U(X)]$ for $0 \le b \le A$ occurs when $b = 0$;
that is, when the person does not bet at all.


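12. For any given value of $b$,
\[
E[U(X)] = p(A + b)^{1/2} + (1 - p)(A - b)^{1/2}.
\]
Therefore,
\[
\frac{\partial E[U(X)]}{\partial b} = \frac{p}{2(A+b)^{1/2}} - \frac{1-p}{2(A-b)^{1/2}}.
\]
When this derivative is set equal to 0, we find that
\[
b = \left[\frac{p^2 - (1-p)^2}{p^2 + (1-p)^2}\right]A.
\]
As in Exercise 11, if $p \ge 1/2$, then this value of $b$ lies in the interval $0 \le b \le A$ and will maximize
$E[U(X)]$. However, if $p < 1/2$, the value of $b$ in the interval $0 \le b \le A$ for which $E[U(X)]$ is a maximum
is $b = 0$.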


13. For any given value of $b$,
\[
E[U(X)] = p(A + b) + (1 - p)(A - b).
\]
This is a linear function of $b$. If $p > 1/2$, it has the form shown in sketch (i) of Fig. S.4.3 and $b = A$
is best. If $p < 1/2$, it has the form shown in sketch (ii) of Fig. S.4.3 and $b = 0$ is best. If $p = 1/2$,
$E[U(X)] = A$ for all values of $b$.


14. For any given value of $b$,
\[
E[U(X)] = p(A + b)^2 + (1 - p)(A - b)^2.
\]
This is a parabola in $b$. If $p \ge 1/2$, it has the shape shown in sketch (i) of Fig. S.4.4. Therefore,
$E[U(X)]$ is a maximum for $b = A$. If $1/4 < p < 1/2$, it has the shape shown in sketch (ii) of Fig. S.4.4.
Therefore, $E[U(X)]$ is again a maximum for $b = A$. If $0 \le p < 1/4$, it has the shape shown in sketch
(iii) of Fig. S.4.4. Therefore, $E[U(X)]$ is a maximum for $b = 0$. Finally, if $p = 1/4$, then it is symmetric
with respect to the point $b = A/2$, as shown in sketch (iv) of Fig. S.4.4. Therefore, $E[U(X)]$ is a
maximum for $b = 0$ and $b = A$.

Figure S.4.3: Figure for Exercise 13 of Sec. 4.8, sketches (i) and (ii).
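
The optimal bets in Exercises 11 to 14 can be cross-checked by a brute-force search over $b$ in $[0, A]$. The sketch below is illustrative only: the grid search, the fortune $A = 100$, and the sample values of $p$ are assumptions made for the example, and the logarithm is extended to return $-\infty$ at 0 so that the endpoint $b = A$ can be evaluated.

```python
import math

def best_bet(utility, p, fortune, steps=10000):
    """Grid-search the bet b in [0, fortune] that maximizes p*U(A + b) + (1 - p)*U(A - b)."""
    best_b, best_value = 0.0, -math.inf
    for i in range(steps + 1):
        b = fortune * i / steps
        value = p * utility(fortune + b) + (1 - p) * utility(fortune - b)
        if value > best_value:
            best_b, best_value = b, value
    return best_b

def safe_log(x):
    """log(x), with the convention log(0) = -infinity."""
    return math.log(x) if x > 0 else -math.inf

A = 100.0
log_bet = best_bet(safe_log, 0.7, A)                 # Exercise 11: (2p - 1)A
sqrt_bet = best_bet(math.sqrt, 0.7, A)               # Exercise 12
linear_bet = best_bet(lambda x: x, 0.7, A)           # Exercise 13: bet everything
square_bet_hi = best_bet(lambda x: x * x, 0.3, A)    # Exercise 14, 1/4 < p < 1/2
square_bet_lo = best_bet(lambda x: x * x, 0.2, A)    # Exercise 14, p < 1/4
print(log_bet, sqrt_bet, linear_bet, square_bet_hi, square_bet_lo)
```

For $p = 0.7$ the closed forms give $b = 40$ for the logarithmic utility and $b = (0.49 - 0.09)/(0.49 + 0.09) \times 100 \approx 68.97$ for the square-root utility, which the search should reproduce to within the grid spacing.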


15. The expected utility for the lottery ticket is
\[
E[U(X)] = \int_0^4 x^\alpha\cdot\frac{1}{4}\,dx = \frac{4^\alpha}{\alpha + 1}.
\]
The utility of accepting $x_0$ dollars instead of the lottery ticket is $U(x_0) = x_0^\alpha$. Therefore, the person
will prefer to sell the lottery ticket for $x_0$ dollars if
\[
x_0^\alpha > \frac{4^\alpha}{\alpha + 1}, \quad\text{or if}\quad x_0 > \frac{4}{(\alpha + 1)^{1/\alpha}}.
\]
It can be shown that the right-hand side of this last inequality is an increasing function of $\alpha$.


16. The expected utility from choosing the prediction $d$ is
\[
E(U(-[Y - d]^2)) = -E(|Y - d|).
\]
We already saw (in Sec. 4.5) that $d$ equal to a median of the distribution of Y minimizes $E(|Y - d|)$, and hence maximizes this expected utility.


17. The gain is $10^6$ if $P > 1/2$ and $-10^6$ if $P \le 1/2$. The utility of continuing to promote is then $10^{5.4}$
if $P > 1/2$ and $-10^6$ if $P \le 1/2$. To find the expected utility, we need $\Pr(P \le 1/2)$. Using the
stated p.d.f. for P, we get
\[
\Pr(P \le 1/2) = \int_0^{1/2} 56p^6(1 - p)\,dp = 0.03516.
\]
The expected utility is then
$10^{5.4} \times (1 - 0.03516) - 10^6 \times 0.03516 = 207197$. This is greater than 0, so we would continue to promote
the treatment.
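
The arithmetic in Exercise 17 can be reproduced exactly. The sketch below assumes the p.d.f. $56p^6(1-p)$, whose antiderivative is $8p^7 - 7p^8$, and the utilities $10^{5.4}$ and $-10^6$ quoted in the solution; the variable names are illustrative.

```python
from fractions import Fraction

# Pr(P <= 1/2) for the p.d.f. 56 p^6 (1 - p); the antiderivative is 8 p^7 - 7 p^8
half = Fraction(1, 2)
prob_fail = 8 * half ** 7 - 7 * half ** 8

utility_success = 10 ** 5.4      # utility of a gain of 10^6
utility_failure = -(10 ** 6)     # utility of a loss of 10^6

expected_utility = (
    utility_success * (1 - float(prob_fail)) + utility_failure * float(prob_fail)
)
print(prob_fail, float(prob_fail), expected_utility)
```

The exact probability is 9/256. Using it unrounded gives an expected utility of roughly 207,200, a few units above the figure obtained with the rounded value 0.03516; either way the expected utility is positive.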


4.9 Supplementary Exercises 


Solutions to Exercises 


1. If $u > 0$,
\[
\int_u^\infty x f(x)\,dx \ge u\int_u^\infty f(x)\,dx = u[1 - F(u)].
\]
Since
\[
\lim_{u\to\infty}\int_{-\infty}^u x f(x)\,dx = E(X) = \int_{-\infty}^\infty x f(x)\,dx < \infty,
\]
it follows that
\[
\lim_{u\to\infty}\left[E(X) - \int_{-\infty}^u x f(x)\,dx\right] = \lim_{u\to\infty}\int_u^\infty x f(x)\,dx = 0.
\]

Figure S.4.4: Figure for Exercise 14 of Sec. 4.8, sketches (i) through (iv).


2. We use integration by parts. Let $u = 1 - F(x)$ and $dv = dx$. Then $du = -f(x)\,dx$ and $v = x$, and the
integral given in this exercise becomes
\[
[uv]_0^\infty - \int_0^\infty v\,du = \int_0^\infty x f(x)\,dx = E(X).
\]

3. Let $x_1, x_2, \ldots$ denote the possible values of X. Since $F(x)$ is a step function, the integral given in
Exercise 1 becomes the following sum:
\[
\begin{aligned}
(x_1 - 0) &+ [1 - f(x_1)](x_2 - x_1) + [1 - f(x_1) - f(x_2)](x_3 - x_2) + \cdots\\
&= x_1 f(x_1) + x_2 f(x_2) + x_3 f(x_3) + \cdots\\
&= E(X).
\end{aligned}
\]


4. If X, Y, and Z each had the required uniform distribution, then
\[
E(X + Y + Z) = E(X) + E(Y) + E(Z) = \frac{1}{2} + \frac{1}{2} + \frac{1}{2} = \frac{3}{2}.
\]
But since $X + Y + Z \le 1.3$, this is impossible.

5. We need $E(Y) = a\mu + b = 0$ and
\[
\operatorname{Var}(Y) = a^2\sigma^2 = 1.
\]
Therefore, $a = \pm\dfrac{1}{\sigma}$ and $b = -a\mu$.


6. The p.d.f. $h_1(w)$ of the range W is given at the end of Sec. 3.9. Therefore,
\[
E(W) = n(n - 1)\int_0^1 w^{n-1}(1 - w)\,dw = \frac{n-1}{n+1}.
\]


7. The dealer’s expected gain is
\[
E(Y - X) = \frac{1}{36}\int_0^6\!\int_0^y (y - x)x\,dx\,dy = \frac{3}{2}.
\]


8. It follows from Sec. 3.9 that the p.d.f. of $Y_n$ is
\[
g_n(y) = n[F(y)]^{n-1} f(y).
\]
Here, $F(y) = \int_0^y 2x\,dx = y^2$, so
\[
g_n(y) = 2ny^{2n-1} \quad\text{for } 0 < y < 1.
\]
Hence, $E(Y_n) = \displaystyle\int_0^1 y\,g_n(y)\,dy = \frac{2n}{2n + 1}$.


9. Suppose first that $r(X)$ is nondecreasing. Then
\[
\Pr[Y \ge r(m)] = \Pr[r(X) \ge r(m)] \ge \Pr(X \ge m) \ge \frac{1}{2}
\]
and
\[
\Pr[Y \le r(m)] = \Pr[r(X) \le r(m)] \ge \Pr(X \le m) \ge \frac{1}{2}.
\]
Hence, $r(m)$ is a median of the distribution of Y. If $r(X)$ is nonincreasing, then
\[
\Pr[Y \ge r(m)] \ge \Pr(X \le m) \ge \frac{1}{2}
\]
and
\[
\Pr[Y \le r(m)] \ge \Pr(X \ge m) \ge \frac{1}{2}.
\]

10. Since $m$ is the median of a continuous distribution,
$\Pr(X < m) = \Pr(X > m) = \frac{1}{2}$. Hence,
\[
\Pr(Y_n > m) = 1 - \Pr(\text{All } X_i\text{'s} < m) = 1 - \frac{1}{2^n}.
\]


11. Suppose that you order $s$ liters. If the demand is $x < s$, you will make a profit of $gx$ cents on the $x$
liters sold and suffer a loss of $c(s - x)$ cents on the $s - x$ liters that you do not sell. Therefore, your
net profit will be $gx - c(s - x) = (g + c)x - cs$. If the demand is $x \ge s$, then you will sell all $s$ liters
and make a profit of $gs$ cents. Hence, your expected net gain is
\[
\begin{aligned}
E &= \int_0^s [(g + c)x - cs] f(x)\,dx + gs\int_s^\infty f(x)\,dx\\
&= \int_0^s (g + c)x f(x)\,dx - csF(s) + gs[1 - F(s)].
\end{aligned}
\]
To find the value of $s$ that maximizes E, we find, after some calculations, that
\[
\frac{dE}{ds} = g - (g + c)F(s).
\]
Thus, $\dfrac{dE}{ds} = 0$ and E is maximized when $s$ is chosen so that $F(s) = g/(g + c)$.
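
The critical-fractile condition $F(s) = g/(g+c)$ can be checked numerically for a specific demand distribution. The sketch below assumes, purely for illustration, that demand is uniform on $[0, 1]$ (so $F(s) = s$) and uses hypothetical values $g = 3$ and $c = 1$; neither assumption comes from the exercise.

```python
def expected_gain(s, g, c, steps=2000):
    """Expected net gain for order size s when demand is uniform on [0, 1]."""
    h = s / steps
    # Demand x < s: integrate [(g + c) x - c s] f(x) with f(x) = 1 by the midpoint rule.
    unsold_part = sum(((g + c) * (i + 0.5) * h - c * s) * h for i in range(steps))
    # Demand x >= s: all s liters are sold, with probability 1 - F(s) = 1 - s.
    return unsold_part + g * s * (1 - s)

def best_order(g, c, grid=1000):
    """Order size on a grid over [0, 1] with the largest expected gain."""
    return max((i / grid for i in range(grid + 1)), key=lambda s: expected_gain(s, g, c))

g, c = 3.0, 1.0
s_star = best_order(g, c)
print(s_star, g / (g + c))
```

For these values the search should return an order size of about 0.75, matching $g/(g+c)$.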


12. Suppose that you return at time $t$. If the machine has failed at time $x \le t$, then your cost is $c(t - x)$.
If the machine has not yet failed ($x > t$), then your cost is $b$. Therefore, your expected cost is
\[
E = \int_0^t c(t - x) f(x)\,dx + b\int_t^\infty f(x)\,dx = ctF(t) - c\int_0^t x f(x)\,dx + b[1 - F(t)].
\]
Hence,
\[
\frac{dE}{dt} = cF(t) - bf(t)
\]
and E will be minimized at a time $t$ such that $cF(t) = bf(t)$.

13. $E(Z) = 5(3) - 1 + 15 = 29$ in all three parts of this exercise. Also,
\[
\operatorname{Var}(Z) = 25\operatorname{Var}(X) + \operatorname{Var}(Y) - 10\operatorname{Cov}(X, Y) = 109 - 10\operatorname{Cov}(X,Y).
\]
Hence, $\operatorname{Var}(Z) = 109$ in parts (a) and (b). In part (c),
\[
\operatorname{Cov}(X,Y) = \rho\sigma_X\sigma_Y = (.25)(2)(3) = 1.5,
\]
so $\operatorname{Var}(Z) = 94$.

14. In this exercise, $\sum_{j=1}^n Y_j = X_n - X_0$. Therefore,
\[
\operatorname{Var}(\overline{Y}_n) = \frac{1}{n^2}\operatorname{Var}(X_n - X_0).
\]
Since $X_n$ and $X_0$ are independent,
\[
\operatorname{Var}(X_n - X_0) = \operatorname{Var}(X_n) + \operatorname{Var}(X_0).
\]
Hence, $\operatorname{Var}(\overline{Y}_n) = \dfrac{2\sigma^2}{n^2}$.

15. Let $v^2 = \operatorname{Var}(X_1 + \cdots + X_n) = \sum_i \operatorname{Var}(X_i) + 2\mathop{\sum\sum}_{i<j}\operatorname{Cov}(X_i, X_j)$. In this problem $\operatorname{Var}(X_i) = \sigma^2$ for all $i$
and $\operatorname{Cov}(X_i, X_j) = \rho\sigma^2$ for all $i \ne j$. Therefore,
\[
v^2 = n\sigma^2 + n(n - 1)\rho\sigma^2.
\]
Since $v^2 \ge 0$, it follows that $\rho \ge -1/(n - 1)$.


16. Since the correlation is unaffected by a translation of the distribution of X and Y in the xy-plane,
we can assume without loss of generality that the origin is at the center of the rectangle. Hence, by
symmetry, $E(X) = E(Y) = 0$. But it also follows from symmetry that $E(XY) = 0$ because, for any
positive value of XY in the first or third quadrant, there is a corresponding negative value in the second
or fourth quadrant with the same constant density. Thus, $\operatorname{Cov}(X,Y) = 0$ and $\rho(X, Y) = 0$.

More directly, one can argue that the joint p.d.f. of $(X,Y)$ factors into constants times the indicator
functions of the two intervals that define the sides of the rectangle, hence X and Y are independent
and uncorrelated.


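17. For $i = 1, \ldots, n$, let $X_i = 1$ if the $i$th letter is placed in the correct envelope and let $X_i = 0$ otherwise.
Then $E(X_i) = 1/n$ and, for $i \ne j$,
\[
E(X_iX_j) = \Pr(X_iX_j = 1) = \Pr(X_i = 1 \text{ and } X_j = 1) = \frac{1}{n(n - 1)}.
\]
Also, $E(X_i^2) = E(X_i) = 1/n$. Hence,
\[
\operatorname{Var}(X_i) = \frac{1}{n} - \frac{1}{n^2} = \frac{n-1}{n^2}
\]
and $\operatorname{Cov}(X_i, X_j) = \dfrac{1}{n(n-1)} - \dfrac{1}{n^2} = \dfrac{1}{n^2(n-1)}$. The total number of correct matches is $X = \sum_{i=1}^n X_i$.
Therefore,
\[
\operatorname{Var}(X) = \sum_{i=1}^n \operatorname{Var}(X_i) + 2\mathop{\sum\sum}_{i<j}\operatorname{Cov}(X_i, X_j) = n\cdot\frac{n-1}{n^2} + n(n - 1)\cdot\frac{1}{n^2(n-1)} = 1.
\]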
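
The conclusion that the number of correct matches has variance 1 regardless of $n$ (for $n \ge 2$) can be confirmed by exhaustive enumeration for small $n$. This is an illustrative check, not part of the original solution; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def match_moments(n):
    """Exact mean and variance of the number of fixed points of a uniformly random permutation."""
    counts = [
        sum(1 for position, letter in enumerate(perm) if position == letter)
        for perm in permutations(range(n))
    ]
    total = len(counts)
    mean = Fraction(sum(counts), total)
    variance = Fraction(sum(k * k for k in counts), total) - mean ** 2
    return mean, variance

for n in range(2, 7):
    print(n, match_moments(n))
```

Exact rational arithmetic is used so that the result is not blurred by rounding; the enumeration grows as $n!$, so it is only practical for small $n$.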


18.
\[
\begin{aligned}
E[(X - \mu)^3] &= E(X^3) - 3\mu E(X^2) + 3\mu^2E(X) - \mu^3\\
&= E(X^3) - 3\mu(\sigma^2 + \mu^2) + 3\mu^3 - \mu^3.
\end{aligned}
\]

19.
\[
c'(t) = \frac{\psi'(t)}{\psi(t)} \quad\text{and}\quad c''(t) = \frac{\psi(t)\psi''(t) - [\psi'(t)]^2}{[\psi(t)]^2}.
\]
Since $\psi(0) = 1$, $\psi'(0) = \mu$, and $\psi''(0) = E(X^2) = \sigma^2 + \mu^2$, it follows that $c'(0) = \mu$ and $c''(0) = \sigma^2$.

20. It was shown in Exercise 12 of Sec. 4.7 that if $E(Y \mid X) = aX + b$, then
\[
a = \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)} = \frac{\rho\sigma_Y}{\sigma_X}
\]
and $b = \mu_Y - a\mu_X$. The desired result now follows immediately.

21. Since the coefficient of X in $E(Y \mid X)$ is negative, it follows from Exercise 20 that $\rho < 0$. Furthermore,
it follows from Exercise 20 that the product of the coefficients of X and Y in $E(Y \mid X)$ and $E(X \mid Y)$
must be $\rho^2$. Hence, $\rho^2 = 1/4$ and, since $\rho < 0$, $\rho = -1/2$.


22. Let X and Y denote the lengths of the longer and shorter pieces, respectively. Since $Y = 3 - X$ with
probability 1, it follows that $\rho = -1$.


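23.
\[
\operatorname{Cov}(X, X + bY) = \operatorname{Var}(X) + b\operatorname{Cov}(X,Y) = 1 + b\rho.
\]
Also, $\operatorname{Var}(X) = 1$ and $\operatorname{Var}(X + bY) = 1 + b^2 + 2b\rho$.
Hence,
\[
\rho(X, X + bY) = \frac{1 + b\rho}{(1 + b^2 + 2b\rho)^{1/2}}.
\]
If we set this quantity equal to $\rho$, square both sides, and solve for $b$, we obtain $b = -1/(2\rho)$.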


24. The p.f. of the distribution of employees is
$f(0) = .1$, $f(1) = .2$, $f(3) = .3$, and $f(5) = .4$.


(a) The unique median of this distribution is 3, so the new office should be located at the point 3. 


(b) The mean of this distribution is (.1)(0) + (.2)(1) + (.3)(3) + (.4)(5) = 3.1, so the new office should 
be located at the point 3.1. 


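25. (a) The marginal p.d.f. of X is
\[
f_1(x) = \int_0^x 8xy\,dy = 4x^3 \quad\text{for } 0 < x < 1.
\]
Therefore, the conditional p.d.f. of Y given that $X = .2$ is
\[
g_1(y \mid X = .2) = \frac{f(.2, y)}{f_1(.2)} = 50y \quad\text{for } 0 < y < .2.
\]
The mean of this distribution is
\[
E(Y \mid X = .2) = \int_0^{.2} 50y^2\,dy = \frac{2}{15} = .1333.
\]
(b) The median of $g_1(y \mid X = .2)$ is $m = \left(\dfrac{1}{50}\right)^{1/2} = .1414$.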


26.
\[
\begin{aligned}
\operatorname{Cov}(X, Y) &= E[(X - \mu_X)(Y - \mu_Y)]\\
&= E\{[X - E(X \mid Z) + E(X \mid Z) - \mu_X][Y - E(Y \mid Z) + E(Y \mid Z) - \mu_Y]\}\\
&= E\{[X - E(X \mid Z)][Y - E(Y \mid Z)]\} + E\{[X - E(X \mid Z)][E(Y \mid Z) - \mu_Y]\}\\
&\quad + E\{[E(X \mid Z) - \mu_X][Y - E(Y \mid Z)]\} + E\{[E(X \mid Z) - \mu_X][E(Y \mid Z) - \mu_Y]\}.
\end{aligned}
\]
Consider these final four expectations. In the first one, if we first calculate the conditional expectation
given Z and then take the expectation over Z we obtain $E[\operatorname{Cov}(X,Y \mid Z)]$. In the second and third
expectations, we obtain the value 0 when we take the conditional expectation given Z. The fourth
expectation is $\operatorname{Cov}[E(X \mid Z), E(Y \mid Z)]$.

27. Let N be the number of balls in the box. Since the proportion of red balls is $p$, there are $Np$ red balls
in the box. (Clearly, $p$ must be an integer multiple of $1/N$.) There are $N(1 - p)$ blue balls in the box.
Let $K = Np$ so that there are $N - K$ blue balls and $K$ red balls. If $n > K$, then $\Pr(Y = n) = 0$ since
there are not enough red balls. Since $\Pr(X = n) > 0$ for all $n$, the result is true if $n > K$. For $n \le K$,
let $X_i = 1$ if the $i$th ball is red for $i = 1, \ldots, n$. For sampling without replacement,
\[
\Pr(Y = n) = \Pr(X_1 = 1)\prod_{i=2}^n \Pr(X_i = 1 \mid X_1 = 1, \ldots, X_{i-1} = 1) = \frac{K}{N}\cdot\frac{K-1}{N-1}\cdots\frac{K-n+1}{N-n+1}. \tag{S.4.4}
\]
For sampling with replacement, the $X_i$’s are independent, so
\[
\Pr(X = n) = \prod_{i=1}^n \Pr(X_i = 1) = \left(\frac{K}{N}\right)^n. \tag{S.4.5}
\]
For $j = 1, \ldots, n - 1$, $KN - jN < KN - jK$, so $(K - j)/(N - j) < K/N$. Hence the product in (S.4.4)
is smaller than the product in (S.4.5). This argument makes sense only if N is finite. If N is infinite,
then sampling with and without replacement are equivalent.

28. The expected utility from the gamble X is $E[U(X)] = E(X^2)$. The utility of receiving $E(X)$ is
$U[E(X)] = [E(X)]^2$. We know from Theorem 4.3.1 that $E(X^2) \ge [E(X)]^2$ for any gamble X, and
from Theorem 4.3.3 that there is strict inequality unless X is actually constant with probability 1.

29. The expected utility from allocating the amounts $a$ and $m - a$ is
\[
\begin{aligned}
E &= p\log(g_1a) + (1 - p)\log[g_2(m - a)]\\
&= p\log a + (1 - p)\log(m - a) + p\log g_1 + (1 - p)\log g_2.
\end{aligned}
\]
The maximum over all values of $a$ can now be found by elementary differentiation, as in Exercise 7 of
Sec. 4.8, and we obtain $a = pm$.

Special Distributions


5.2 The Bernoulli and Binomial Distributions 


Commentary 


If one is using the statistical software R, then the functions dbinom, pbinom, and qbinom give the p.f., the
c.d.f., and the quantile function of binomial distributions. The syntax is that the first argument is the
argument of the function, and the next two are $n$ and $p$ respectively. The function rbinom gives a random
sample of binomial random variables. The first argument is how many you want, and the next two are $n$ and
$p$. All of the solutions that require the calculation of binomial probabilities can be done using these functions
instead of tables.
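
For readers without R at hand, the same three quantities are easy to compute directly from the definition. The following is a Python analogue written for illustration, not the R functions themselves; the function names are made up for this sketch and only the standard library is used.

```python
from math import comb, floor

def binom_pmf(x, n, p):
    """Pr(X = x) for X ~ binomial(n, p); plays the role of dbinom(x, n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x) if 0 <= x <= n else 0.0

def binom_cdf(x, n, p):
    """Pr(X <= x); plays the role of pbinom(x, n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(0, min(floor(x), n) + 1))

def binom_quantile(q, n, p):
    """Smallest x with Pr(X <= x) >= q; plays the role of qbinom(q, n, p)."""
    total = 0.0
    for x in range(n + 1):
        total += binom_pmf(x, n, p)
        if total >= q:
            return x
    return n

# Two table look-ups from the solutions in this section, done by direct computation
more_heads_than_tails = 1 - binom_cdf(5, 10, 0.5)                 # Exercise 3
between_six_and_nine = binom_cdf(9, 15, 0.4) - binom_cdf(5, 15, 0.4)  # Exercise 4
print(more_heads_than_tails, between_six_and_nine, binom_quantile(0.5, 10, 0.5))
```

Direct computation avoids the rounding in four-decimal tables, so the last digit may differ slightly from a sum of table entries.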


Solutions to Exercises 


1. Since $E(X^k)$ has the same value for every positive integer $k$, we might try to find a random variable
X such that $X, X^2, X^3, X^4, \ldots$ all have the same distribution. If X can take only the values 0 and 1,
then $X^k = X$ for every positive integer $k$ since $0^k = 0$ and $1^k = 1$. If $\Pr(X = 1) = p = 1 - \Pr(X = 0)$,
then in order for $E(X^k) = 1/3$, as required, we must have $p = 1/3$. Therefore, a random variable X
such that $\Pr(X = 1) = 1/3$ and $\Pr(X = 0) = 2/3$ satisfies the required conditions.


2. We wish to express $f(x)$ in the form $p^{\alpha(x)}(1 - p)^{\beta(x)}$, where $\alpha(x) = 1$ and $\beta(x) = 0$ for $x = a$ and
$\alpha(x) = 0$ and $\beta(x) = 1$ for $x = b$. If we choose $\alpha(x)$ and $\beta(x)$ to be linear functions of the form
$\alpha(x) = \alpha_1 + \alpha_2x$ and $\beta(x) = \beta_1 + \beta_2x$, then the following two pairs of equations must be satisfied:
\[
\alpha_1 + \alpha_2a = 1, \qquad \alpha_1 + \alpha_2b = 0,
\]
and
\[
\beta_1 + \beta_2a = 0, \qquad \beta_1 + \beta_2b = 1.
\]
Hence,
\[
\alpha_1 = -\frac{b}{a-b}, \quad \alpha_2 = \frac{1}{a-b}, \quad \beta_1 = -\frac{a}{b-a}, \quad \beta_2 = \frac{1}{b-a}.
\]

3. Let X be the number of heads obtained. Then strictly more heads than tails are obtained if $X \in \{6,7,8,9,10\}$. The probability of this event is the sum of the numbers in the binomial table corresponding
to $p = 0.5$ and $n = 10$ for $k = 6, \ldots, 10$. By the symmetry of this binomial distribution, we
can also compute the sum as $(1 - \Pr(X = 5))/2 = (1 - 0.2461)/2 = 0.37695$.


4. It is found from a table of the binomial distribution with parameters $n = 15$ and $p = 0.4$ that
\[
\begin{aligned}
\Pr(6 \le X \le 9) &= \Pr(X = 6) + \Pr(X = 7) + \Pr(X = 8) + \Pr(X = 9)\\
&= .2066 + .1771 + .1181 + .0612 = .5630.
\end{aligned}
\]

5. The tables do not include the value $p = 0.6$, so we must use the trick described in Exercise 7 of
Sec. 3.1. The number of tails X will have the binomial distribution with parameters $n = 9$ and $p = 0.4$.
Therefore,
\[
\begin{aligned}
\Pr(\text{Even number of heads}) &= \Pr(\text{Odd number of tails})\\
&= \Pr(X = 1) + \Pr(X = 3) + \Pr(X = 5) + \Pr(X = 7) + \Pr(X = 9)\\
&= .0605 + .2508 + .1672 + .0212 + .0003\\
&= .5000.
\end{aligned}
\]


6. Let $N_A$, $N_B$, and $N_C$ denote the number of times each man hits the target. Then


$E(N_A + N_B + N_C) = E(N_A) + E(N_B) + E(N_C)$
1 it i. 

= $:24682 4322 =. 

gt ra aR 


7. If we assume that $N_A$, $N_B$, and $N_C$ are independent, then
$\operatorname{Var}(N_A + N_B + N_C) = \operatorname{Var}(N_A) + \operatorname{Var}(N_B) + \operatorname{Var}(N_C)$


= 3-5-:54+5----42 


8. The number X of components that fail will have the binomial distribution with parameters $n = 10$ and
$p = 0.2$. Therefore,
\[
\Pr(X \ge 2 \mid X \ge 1) = \frac{\Pr(X \ge 2)}{\Pr(X \ge 1)} = \frac{1 - \Pr(X = 0) - \Pr(X = 1)}{1 - \Pr(X = 0)}
= \frac{1 - .1074 - .2684}{1 - .1074} = \frac{.6242}{.8926} = .6993.
\]

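9.
\[
\Pr\left(X_1 = 1 \,\Big|\, \sum_{i=1}^n X_i = k\right)
= \frac{\Pr\left(X_1 = 1 \text{ and } \sum_{i=1}^n X_i = k\right)}{\Pr\left(\sum_{i=1}^n X_i = k\right)}
= \frac{\Pr\left(X_1 = 1 \text{ and } \sum_{i=2}^n X_i = k - 1\right)}{\Pr\left(\sum_{i=1}^n X_i = k\right)}.
\]
Since the random variables $X_1, \ldots, X_n$ are independent, it follows that $X_1$ and $\sum_{i=2}^n X_i$ are independent.
Therefore, the final expression can be rewritten as
\[
\frac{\Pr(X_1 = 1)\Pr\left(\sum_{i=2}^n X_i = k - 1\right)}{\Pr\left(\sum_{i=1}^n X_i = k\right)}.
\]
The sum $\sum_{i=2}^n X_i$ has the binomial distribution with parameters $n - 1$ and $p$, and the sum $\sum_{i=1}^n X_i$ has
the binomial distribution with parameters $n$ and $p$. Therefore,
\[
\Pr\left(\sum_{i=2}^n X_i = k - 1\right) = \binom{n-1}{k-1}p^{k-1}(1 - p)^{(n-1)-(k-1)} = \binom{n-1}{k-1}p^{k-1}(1 - p)^{n-k},
\]
and
\[
\Pr\left(\sum_{i=1}^n X_i = k\right) = \binom{n}{k}p^k(1 - p)^{n-k}.
\]
Also, $\Pr(X_1 = 1) = p$. It now follows that
\[
\Pr\left(X_1 = 1 \,\Big|\, \sum_{i=1}^n X_i = k\right) = \frac{\binom{n-1}{k-1}p^k(1 - p)^{n-k}}{\binom{n}{k}p^k(1 - p)^{n-k}} = \frac{k}{n}.
\]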


10. The number of children X in the family who will inherit the disease has the binomial distribution with
parameters $n$ and $p$. Let $f(x \mid n,p)$ denote the p.f. of this distribution. Then
\[
\Pr(X \ge 1) = 1 - \Pr(X = 0) = 1 - f(0 \mid n,p) = 1 - (1 - p)^n.
\]
For $x = 1, 2, \ldots, n$,
\[
\Pr(X = x \mid X \ge 1) = \frac{\Pr(X = x)}{\Pr(X \ge 1)} = \frac{f(x \mid n,p)}{1 - (1 - p)^n}.
\]
Therefore, the conditional p.f. of X given that $X \ge 1$ is $f(x \mid n,p)/(1 - [1 - p]^n)$ for $x = 1, 2, \ldots, n$. The
required expectation $E(X \mid X \ge 1)$ is the mean of this conditional distribution. Therefore,
\[
E(X \mid X \ge 1) = \sum_{x=1}^n x\,\frac{f(x \mid n,p)}{1 - (1 - p)^n} = \frac{1}{1 - (1 - p)^n}\sum_{x=1}^n x f(x \mid n, p).
\]
However, we know that the mean of the binomial distribution is $np$; i.e.,
\[
E(X) = \sum_{x=0}^n x f(x \mid n,p) = np.
\]
Furthermore, we can drop the term corresponding to $x = 0$ from this summation without affecting the
value of the summation, because the value of that term is 0. Hence, $\sum_{x=1}^n x f(x \mid n,p) = np$. It now follows
that $E(X \mid X \ge 1) = np/(1 - [1 - p]^n)$.


144 Chapter 5. Special Distributions

11. Since the value of the term being summed here will be 0 for $x = 0$ and for $x = 1$, we may change
the lower limit of the summation from $x = 2$ to $x = 0$, without affecting the value of the sum. The
summation can then be rewritten as
\[
\sum_{x=0}^n x^2\binom{n}{x}p^x(1 - p)^{n-x} - \sum_{x=0}^n x\binom{n}{x}p^x(1 - p)^{n-x}.
\]
If X has the binomial distribution with parameters $n$ and $p$, then the first summation is simply $E(X^2)$
and the second summation is simply $E(X)$. Finally,
\[
E(X^2) - E(X) = \operatorname{Var}(X) + [E(X)]^2 - E(X) = np(1 - p) + (np)^2 - np = n(n - 1)p^2.
\]

12. Assuming that $p$ is not 0 or 1,
\[
\frac{f(x + 1 \mid n,p)}{f(x \mid n,p)} = \frac{(n - x)p}{(x + 1)(1 - p)}.
\]
Therefore,
\[
\frac{f(x + 1 \mid n,p)}{f(x \mid n,p)} \ge 1 \quad\text{if and only if } x \le (n + 1)p - 1.
\]
It follows from this relation that the values of $f(x \mid n,p)$ will increase as $x$ increases from 0 up to the
greatest integer less than $(n + 1)p$, and will then decrease as $x$ continues increasing up to $n$. Therefore,
if $(n + 1)p$ is not an integer, the unique mode will be the greatest integer less than $(n + 1)p$. If $(n + 1)p$
is an integer, then both $(n + 1)p$ and $(n + 1)p - 1$ are modes. If $p = 0$, the mode is 0 and if $p = 1$, the
mode is $n$.


13. Let X be the number of successes in the group with probability 0.5 of success. Let Y be the number
of successes in the group with probability 0.6 of success. We want $\Pr(X \ge Y)$. Both X and Y have
discrete (binomial) distributions with possible values $0, \ldots, 5$. There are 36 possible $(X,Y)$ pairs and
we need the sum of the probabilities of the 21 of them for which $X \ge Y$. To save time, we shall calculate
the probabilities of the 15 other ones and subtract the total from 1. Since X and Y are independent,
we can write $\Pr(X = x, Y = y) = \Pr(X = x)\Pr(Y = y)$, and find each of the factors in the binomial
table in the back of the book. For example, for $x = 1$ and $y = 2$, we get $0.1562 \times 0.2304 = 0.03599$.
Adding up all 15 of these and subtracting from 1 we get 0.4957.
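
The sum over the 21 pairs with $X \ge Y$ can also be computed directly instead of through the complement. The sketch below is an illustrative check using only the standard library; the function names are invented for the example.

```python
from math import comb

def binom_pmf(x, n, p):
    """Pr(X = x) for X ~ binomial(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def prob_first_at_least_second(n, p_first, p_second):
    """Pr(X >= Y) for independent X ~ binomial(n, p_first) and Y ~ binomial(n, p_second)."""
    return sum(
        binom_pmf(x, n, p_first) * binom_pmf(y, n, p_second)
        for x in range(n + 1)
        for y in range(x + 1)          # the pairs with y <= x
    )

answer = prob_first_at_least_second(5, 0.5, 0.6)
pair_count = sum(1 for x in range(6) for y in range(x + 1))
print(pair_count, round(answer, 4))
```

The count of pairs confirms the 21 used in the solution, and the probability agrees with the table-based value 0.4957 to four decimals.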


14. Before we prove the three facts, we shall show that they imply the desired result. According to (c),
every distribution with the specified moments must take only the values 0 and 1. The mean of such a
distribution is $\Pr(X = 1)$. This number, $\Pr(X = 1)$, uniquely determines every distribution that can
only take the two values 0 and 1.

(a) Suppose that $\Pr(|X| > 1) > 0$. Then there exists $\epsilon > 0$ such that $\Pr(|X| > 1 + \epsilon) > 0$. Then
\[
E(X^{2k}) \ge (1 + \epsilon)^{2k}\Pr(|X| > 1 + \epsilon).
\]
Since the right side of this inequality goes to $\infty$ as $k \to \infty$, it cannot be the case that $E(X^{2k}) = 1/3$
for all $k$. This contradiction means that our assumption that $\Pr(|X| > 1) > 0$ must be false. That
is, $\Pr(|X| \le 1) = 1$.
(b) Since $X^4 < X^2$ whenever $|X| \le 1$ and $X^2 \notin \{0,1\}$, it follows that $E(X^4) < E(X^2)$ whenever
$\Pr(|X| \le 1) = 1$ and $\Pr(X^2 \notin \{0,1\}) > 0$. Since we know that $E(X^4) = E(X^2)$ and $\Pr(|X| \le 1) = 1$, it must be that $\Pr(X^2 \notin \{0,1\}) = 0$. That is, $\Pr(X^2 \in \{0,1\}) = 1$.
(c) From (b) we know that $\Pr(X \in \{-1,0,1\}) = 1$. We also know that
\[
\begin{aligned}
E(X) &= \Pr(X = 1) - \Pr(X = -1),\\
E(X^2) &= \Pr(X = 1) + \Pr(X = -1).
\end{aligned}
\]
Since these two are equal, it follows that $\Pr(X = -1) = 0$.


15. We need the maximum number of tests if and only if every first-stage and second-stage subgroup has
at least one positive result. In that case, we would need 10 + 100 + 1000 = 1110 total tests. The
probability that we have to run this many tests is the probability that every $Y_{2,i,k} = 1$, which in turn
is the probability that every $Z_{2,i,k} > 0$. The $Z_{2,i,k}$'s are independent binomial random variables with
parameters 10 and 0.002, and there are 100 of them altogether. The probability that each is positive
is 0.0198, as computed in Example 5.2.7. The probability that they are all positive is
$(0.0198)^{100} = 4.64 \times 10^{-171}$.


16. We use notation like that in Example 5.2.7 with one extra stage. For i = 1,...,5, let $Z_{1,i}$ be the
number of people in group i who test positive. Let $Y_{1,i} = 1$ if $Z_{1,i} > 0$ and $Y_{1,i} = 0$ if not. Then $Z_{1,i}$
has the binomial distribution with parameters 200 and 0.002, while $Y_{1,i}$ has the Bernoulli distribution
with parameter $1 - 0.998^{200} = 0.3299$. Let $Z_{2,i,k}$ be the number of people who test positive in the kth
subgroup of group i for k = 1,...,5. Let $Y_{2,i,k} = 1$ if $Z_{2,i,k} > 0$ and $Y_{2,i,k} = 0$ if not. Each $Z_{2,i,k}$ has
the binomial distribution with parameters 40 and 0.002, while $Y_{2,i,k}$ has the Bernoulli distribution with
parameter $1 - 0.998^{40} = 0.0770$. Finally, let $Z_{3,i,k,j}$ be the number of people who test positive in the
jth sub-subgroup of the kth subgroup of the ith group. Let $Y_{3,i,k,j} = 1$ if $Z_{3,i,k,j} > 0$ and $Y_{3,i,k,j} = 0$
otherwise. Then $Z_{3,i,k,j}$ has the binomial distribution with parameters 8 and 0.002, while $Y_{3,i,k,j}$ has
the Bernoulli distribution with parameter $1 - 0.998^{8} = 0.0159$.


The maximum number of tests is needed if and only if there is at least one positive amongst every one of
the 125 sub-subgroups of size 8. In that case, we need to make 1000 + 125 + 25 + 5 = 1155 total tests. Let
$Y_1 = \sum_{i=1}^{5} Y_{1,i}$, which is the number of groups that need further attention. Let
$Y_2 = \sum_{i=1}^{5} \sum_{k=1}^{5} Y_{2,i,k}$, which is the number of subgroups that need further attention. Let
$Y_3 = \sum_{i=1}^{5} \sum_{k=1}^{5} \sum_{j=1}^{5} Y_{3,i,k,j}$, which is the number of sub-subgroups that need all 8
members tested. The actual number of tests needed is $Y = 5 + 5Y_1 + 5Y_2 + 8Y_3$. The mean of $Y_1$ is
5 × 0.3299 = 1.6497. The mean of $Y_2$ is 25 × 0.0770 = 1.9239. The mean of $Y_3$ is 125 × 0.0159 = 1.9861.
The mean of Y is then


E(Y) = 5 + 5 × 1.6497 + 5 × 1.9239 + 8 × 1.9861 = 38.7569.
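
The expected number of tests can be reproduced without rounding the intermediate Bernoulli parameters. The following is a sketch under the group sizes stated above (5 groups of 200, 25 subgroups of 40, 125 sub-subgroups of 8); the function name and its default argument are assumptions made for illustration.

```python
def expected_tests(prevalence=0.002):
    """Expected number of tests for the three-stage scheme described above."""
    p_group = 1 - (1 - prevalence) ** 200    # a group of 200 has a positive
    p_subgroup = 1 - (1 - prevalence) ** 40  # a subgroup of 40 has a positive
    p_subsub = 1 - (1 - prevalence) ** 8     # a sub-subgroup of 8 has a positive
    mean_y1 = 5 * p_group
    mean_y2 = 25 * p_subgroup
    mean_y3 = 125 * p_subsub
    # Y = 5 + 5*Y1 + 5*Y2 + 8*Y3, so the mean follows by linearity.
    return 5 + 5 * mean_y1 + 5 * mean_y2 + 8 * mean_y3


print(round(expected_tests(), 2))
```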


5.3 The Hypergeometric Distributions


Commentary 


The hypergeometric distribution arises in finite population sampling and in some theoretical calculations. 
It actually does not figure in the remainder of this text, and this section could be omitted despite the fact 
that it is not marked with an asterisk. The section ends with a discussion of how to extend the definition of 
binomial coefficients in order to make certain formulas easier to write. This discussion is not central to the 
rest of the text. It does arise again in a theoretical discussion at the end of Sec. 5.5. 

If one is using the statistical software R, then the functions dhyper, phyper, and qhyper give the p.f.,
the c.d.f., and the quantile function of hypergeometric distributions. The syntax is that the first argument is
the argument of the function, and the next three are A, B, and n in the notation of the text. The function
rhyper gives a random sample of hypergeometric random variables. The first argument is how many you
want, and the next three are A, B, and n. All of the solutions that require the calculation of hypergeometric
probabilities can be done using these functions.


Solutions to Exercises 


1. Using Eq. (5.3.1) with the parameters A = 10, B = 24, and n = 11, we obtain the desired probability

\[ \Pr(X = 10) = \frac{\binom{10}{10}\binom{24}{1}}{\binom{34}{11}} = 8.389 \times 10^{-8}. \]

2. Let X denote the number of red balls that are obtained. Then X has the hypergeometric distribution
with parameters A = 5, B = 10, and n = 7. The maximum value of X is min{n, A} = 5, hence,

\[ \Pr(X \ge 3) = \sum_{x=3}^{5} \frac{\binom{5}{x}\binom{10}{7-x}}{\binom{15}{7}} = \frac{2745}{6435} \approx 0.4266. \]
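
As a cross-check of the two hypergeometric probabilities above, the p.f. can be evaluated directly from binomial coefficients. This is an illustrative sketch; `hypergeometric_pf` is a hypothetical helper written for this example, with A, B, and n as in the notation of the text.

```python
from math import comb


def hypergeometric_pf(x, A, B, n):
    """p.f. of the hypergeometric distribution with parameters A, B, and n."""
    if x < max(0, n - B) or x > min(n, A):
        return 0.0
    return comb(A, x) * comb(B, n - x) / comb(A + B, n)


# Parameters A = 10, B = 24, n = 11, evaluated at x = 10.
prob_exercise_1 = hypergeometric_pf(10, 10, 24, 11)
# Parameters A = 5, B = 10, n = 7, summed over x = 3, 4, 5.
prob_exercise_2 = sum(hypergeometric_pf(x, 5, 10, 7) for x in range(3, 6))

print(prob_exercise_1, prob_exercise_2)
```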


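3. As in Exercise 2, let X denote the number of red balls in the sample. Then, by Eqs. (5.3.3) and (5.3.4),

\[ E(X) = \frac{nA}{A+B} = \frac{7}{3} \quad \text{and} \quad \mathrm{Var}(X) = \frac{nAB}{(A+B)^2} \cdot \frac{A+B-n}{A+B-1} = \frac{8}{9}. \]

Since $\bar{X} = X/n$,

\[ E(\bar{X}) = \frac{1}{7} E(X) = \frac{1}{3} \quad \text{and} \quad \mathrm{Var}(\bar{X}) = \frac{1}{49} \mathrm{Var}(X) = \frac{8}{441}. \]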


4. By Eq. (5.3.4), with A + B = 28,

\[ \mathrm{Var}(X) = \frac{AB}{(28)^2 (27)}\, n(28 - n). \]

The quadratic function n(28 - n) is a maximum when n = 14.


5. By Eq. (5.3.4), with T = A + B,

\[ \mathrm{Var}(X) = \frac{AB}{T^2 (T-1)}\, n(T - n). \]

If T is an even integer, then the quadratic function n(T - n) is a maximum when n = T/2. If T is
an odd integer, then the maximum value of n(T - n), for n = 0,1,2,...,T, occurs at the two integers
(T - 1)/2 and (T + 1)/2.
6. For x = 0,1,...,k,

\[ \Pr(X_1 = x \mid X_1 + X_2 = k) = \frac{\Pr(X_1 = x \text{ and } X_1 + X_2 = k)}{\Pr(X_1 + X_2 = k)} = \frac{\Pr(X_1 = x \text{ and } X_2 = k - x)}{\Pr(X_1 + X_2 = k)}. \]

Since $X_1$ and $X_2$ are independent,

\[ \Pr(X_1 = x \text{ and } X_2 = k - x) = \Pr(X_1 = x) \Pr(X_2 = k - x). \]

Furthermore, it follows from a result given in Sec. 5.2 that $X_1 + X_2$ will have the binomial distribution
with parameters $n_1 + n_2$ and p. Therefore,

\[ \Pr(X_1 = x) = \binom{n_1}{x} p^x (1-p)^{n_1 - x}, \]
\[ \Pr(X_2 = k - x) = \binom{n_2}{k-x} p^{k-x} (1-p)^{n_2 - k + x}, \]
\[ \Pr(X_1 + X_2 = k) = \binom{n_1 + n_2}{k} p^k (1-p)^{n_1 + n_2 - k}. \]

By substituting these values into the expression given earlier, we find that for x = 0,1,...,k,

\[ \Pr(X_1 = x \mid X_1 + X_2 = k) = \frac{\binom{n_1}{x}\binom{n_2}{k-x}}{\binom{n_1+n_2}{k}}. \]

It can be seen that this conditional distribution is a hypergeometric distribution with parameters $n_1$, $n_2$,
and k.
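
The identity can be confirmed numerically for particular parameter values. In this sketch the values n1 = 6, n2 = 9, k = 5, and p = 0.37 are arbitrary choices made for illustration, and the function names are hypothetical; note that p cancels, as the derivation predicts.

```python
from math import comb


def binomial_pf(x, n, p):
    """p.f. of the binomial distribution with parameters n and p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def conditional_pf(x, k, n1, n2, p):
    """Pr(X1 = x | X1 + X2 = k) for independent binomials sharing the same p."""
    joint = binomial_pf(x, n1, p) * binomial_pf(k - x, n2, p)
    return joint / binomial_pf(k, n1 + n2, p)


def hypergeometric_pf(x, A, B, n):
    """p.f. of the hypergeometric distribution with parameters A, B, and n."""
    return comb(A, x) * comb(B, n - x) / comb(A + B, n)


n1, n2, k, p = 6, 9, 5, 0.37
gaps = [
    abs(conditional_pf(x, k, n1, n2, p) - hypergeometric_pf(x, n1, n2, k))
    for x in range(k + 1)
]
print(max(gaps))
```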


7. (a) The probability of obtaining exactly x defective items is

\[ \frac{\binom{0.3T}{x}\binom{0.7T}{10-x}}{\binom{T}{10}}. \]

Therefore, the probability of obtaining not more than one defective item is the sum of these
probabilities for x = 0 and x = 1. Since

\[ \binom{0.3T}{0} = 1 \quad \text{and} \quad \binom{0.3T}{1} = 0.3T, \]

this sum is equal to

\[ \frac{\binom{0.7T}{10} + 0.3T \binom{0.7T}{9}}{\binom{T}{10}}. \]

(b) The probability of obtaining exactly x defectives, according to the binomial distribution, is

\[ \binom{10}{x} (0.3)^x (0.7)^{10-x}. \]

The desired probability is the sum of these probabilities for x = 0 and x = 1, which is

\[ (0.7)^{10} + 10(0.3)(0.7)^{9}. \]

For a large value of T, the answers in (a) and (b) will be very close to each other, although this
fact is not obvious from the different forms of the two answers.
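
The closeness of the two answers for large T is easy to see numerically. The sketch below assumes T is a multiple of 10 so that 0.3T and 0.7T are integers; the function name and the particular values of T are assumptions made for this example.

```python
from math import comb


def at_most_one_defective(T):
    """Answer to part (a); T must be a multiple of 10 so 0.3T is an integer."""
    defective = 3 * T // 10
    good = T - defective
    return (comb(good, 10) + defective * comb(good, 9)) / comb(T, 10)


# Answer to part (b), which does not depend on T.
binomial_answer = 0.7**10 + 10 * 0.3 * 0.7**9

for T in (20, 100, 1000, 100000):
    print(T, at_most_one_defective(T), binomial_answer)
```

The gap between the two columns shrinks as T grows, which is the behavior the remark above describes.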


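8. If we let $X_i$ denote the height of the ith person selected, for i = 1,...,n, then $X = X_1 + \cdots + X_n$.
Furthermore, since $X_i$ is equally likely to have any one of the T values $a_1, \ldots, a_T$, then

\[ E(X_i) = \frac{1}{T} \sum_{i=1}^{T} a_i = \mu \]

and

\[ \mathrm{Var}(X_i) = \frac{1}{T} \sum_{i=1}^{T} (a_i - \mu)^2 = \sigma^2. \]

It follows that E(X) = nμ. Furthermore, by Theorem 4.6.7,

\[ \mathrm{Var}(X) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + 2 \mathop{\sum\sum}_{i<j} \mathrm{Cov}(X_i, X_j). \]

Because of the symmetry among the variables $X_1, \ldots, X_n$, it follows that

\[ \mathrm{Var}(X) = n\sigma^2 + n(n-1) \mathrm{Cov}(X_1, X_2). \]

We know that Var(X) = 0 for n = T. Therefore,

\[ \mathrm{Cov}(X_1, X_2) = -\frac{\sigma^2}{T-1}. \]

It now follows that

\[ \mathrm{Var}(X) = n\sigma^2 - \frac{n(n-1)}{T-1}\sigma^2 = n\sigma^2 \left( \frac{T-n}{T-1} \right). \]
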
9. By Eq. (5.3.14),

\[ \binom{3/2}{4} = \frac{(3/2)(1/2)(-1/2)(-3/2)}{4!} = \frac{3}{128}. \]

10. By Eq. (5.3.14),

\[ \binom{-n}{k} = \frac{(-n)(-n-1)\cdots(-n-k+1)}{k!} = \frac{(-1)^k (n)(n+1)\cdots(n+k-1)}{k!}. \]

If we reverse the order of the factors in the numerator, we can rewrite this relation as follows:

\[ \binom{-n}{k} = (-1)^k \frac{(n+k-1)(n+k-2)\cdots(n)}{k!} = (-1)^k \binom{n+k-1}{k}. \]

11. Write $(1 + a_n)^{c_n} e^{-a_n c_n} = \exp[c_n \log(1 + a_n) - a_n c_n]$. The result is proven if we can show that

\[ \lim_{n \to \infty} [c_n \log(1 + a_n) - a_n c_n] = 0. \qquad \text{(S.5.1)} \]

Use Taylor's theorem with remainder to write

\[ \log(1 + a_n) = a_n - \frac{a_n^2}{2(1 + y_n)^2}, \]

where $y_n$ is between 0 and $a_n$. It follows that

\[ c_n \log(1 + a_n) - a_n c_n = c_n a_n - \frac{c_n a_n^2}{2(1 + y_n)^2} - a_n c_n = -\frac{c_n a_n^2}{2(1 + y_n)^2}. \]

We have assumed that $c_n a_n^2$ goes to 0. Since $y_n$ is between 0 and $a_n$, and $a_n$ goes to 0, we have
that $1/[2(1 + y_n)^2]$ goes to 1/2. The product therefore goes to 0, which establishes (S.5.1).


5.4 The Poisson Distributions 


Commentary 


This section ends with a more theoretical look at the assumptions underlying the Poisson process. This 
material is designed for the more mathematically inclined students who might wish to see a derivation of the 
Poisson distribution from those assumptions. Such a derivation is outlined in Exercise 16 in this section. 

If one is using the statistical software R, then the functions dpois, ppois, and qpois give the p.f., the 
c.d.f., and the quantile function of Poisson distributions. The syntax is that the first argument is the argument 
of the function, and the second is the mean. The function rpois gives a random sample of Poisson random 
variables. The first argument is how many you want, and the second is the mean. All of the solutions that 
require the calculation of Poisson probabilities can be done using these functions instead of tables.


Solutions to Exercises 


1. The number of oocysts X in t = 100 liters of water has the Poisson distribution with mean
0.2 × 0.1 × 100 = 2. Using the Poisson distribution table in the back of the book, we find


Pr(X ≥ 2) = 1 - Pr(X ≤ 1) = 1 - 0.1353 - 0.2707 = 0.594.


2. From the table of the Poisson distribution in the back of the book it is found that


Pr(X ≥ 3) = .0284 + .0050 + .0007 + .0001 + .0000 = .0342.


3. Since the number of defects on each bolt has the Poisson distribution with mean 0.4, and the
observations for the five bolts are independent, the sum of the numbers of defects on five bolts will have the
Poisson distribution with mean 5(0.4) = 2. It is found from the table of the Poisson distribution that


Pr(X ≥ 6) = .0120 + .0034 + .0009 + .0002 + .0000 = .0165.


There is some rounding error in this, and 0.0166 is closer.
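
Tail probabilities like these can be computed without the table, which also shows where the rounding error comes from. This is a minimal sketch; the helper names are hypothetical, and the two calls use the mean of 2 that appears in the solutions above.

```python
from math import exp, factorial


def poisson_pf(x, mean):
    """p.f. of the Poisson distribution with the given mean."""
    return exp(-mean) * mean**x / factorial(x)


def poisson_upper_tail(k, mean):
    """Pr(X >= k), computed as one minus the sum of the first k terms."""
    return 1.0 - sum(poisson_pf(x, mean) for x in range(k))


at_least_two = poisson_upper_tail(2, 2.0)  # mean 2, at least 2 occurrences
at_least_six = poisson_upper_tail(6, 2.0)  # mean 2, at least 6 occurrences
print(round(at_least_two, 4), round(at_least_six, 4))
```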
4. If f(x|λ) is the p.f. of the Poisson distribution with mean λ, then
Pr(X = 0) = f(0|λ) = exp(-λ).


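5. Let Y denote the number of misprints on a given page. Then the probability p that a given page will
contain more than k misprints is

\[ p = \Pr(Y > k) = \sum_{i=k+1}^{\infty} f(i \mid \lambda) = \sum_{i=k+1}^{\infty} \frac{\exp(-\lambda)\lambda^i}{i!}. \]

Therefore,

\[ 1 - p = \sum_{i=0}^{k} f(i \mid \lambda) = \sum_{i=0}^{k} \frac{\exp(-\lambda)\lambda^i}{i!}. \]

Now let X denote the number of pages, among the n pages in the book, on which there are more than
k misprints. Then for x = 0,1,...,n,

\[ \Pr(X = x) = \binom{n}{x} p^x (1-p)^{n-x} \]

and

\[ \Pr(X \ge m) = \sum_{x=m}^{n} \binom{n}{x} p^x (1-p)^{n-x}. \]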


6. We shall assume that defects occur in accordance with a Poisson process. Then the number of defects
in 1200 feet of tape will have the Poisson distribution with mean μ = 3(1.2) = 3.6. Therefore, the
probability that there will be no defects is exp(-μ) = exp(-3.6).


7. We shall assume that customers are served in accordance with a Poisson process. Then the number of
customers served in a two-hour period will have the Poisson distribution with mean μ = 2(15) = 30.
Therefore, the probability that more than 20 customers will be served is

\[ \Pr(X > 20) = \sum_{x=21}^{\infty} \frac{\exp(-30)(30)^x}{x!}. \]

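8. For x = 0,1,...,k,

\[ \Pr(X_1 = x \mid X_1 + X_2 = k) = \frac{\Pr(X_1 = x \text{ and } X_1 + X_2 = k)}{\Pr(X_1 + X_2 = k)} = \frac{\Pr(X_1 = x \text{ and } X_2 = k - x)}{\Pr(X_1 + X_2 = k)}. \]

Since $X_1$ and $X_2$ are independent,

\[ \Pr(X_1 = x \text{ and } X_2 = k - x) = \Pr(X_1 = x) \Pr(X_2 = k - x). \]

Also, by Theorem 5.4.4 the sum $X_1 + X_2$ will have the Poisson distribution with mean $\lambda_1 + \lambda_2$. Hence,

\[ \Pr(X_1 = x) = \frac{\exp(-\lambda_1)\lambda_1^x}{x!}, \]
\[ \Pr(X_2 = k - x) = \frac{\exp(-\lambda_2)\lambda_2^{k-x}}{(k-x)!}, \]
\[ \Pr(X_1 + X_2 = k) = \frac{\exp(-(\lambda_1 + \lambda_2))(\lambda_1 + \lambda_2)^k}{k!}. \]

It now follows that

\[ \Pr(X_1 = x \mid X_1 + X_2 = k) = \frac{k!}{x!(k-x)!} \left( \frac{\lambda_1}{\lambda_1 + \lambda_2} \right)^{x} \left( \frac{\lambda_2}{\lambda_1 + \lambda_2} \right)^{k-x} = \binom{k}{x} p^x (1-p)^{k-x}, \]

where $p = \lambda_1/(\lambda_1 + \lambda_2)$. It can now be seen that this conditional distribution is a binomial distribution
with parameters k and $p = \lambda_1/(\lambda_1 + \lambda_2)$.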


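9. Let N denote the total number of items produced by the machine and let X denote the number of
defective items produced by the machine. Then, for x = 0,1,...,

\[ \Pr(X = x) = \sum_{n=0}^{\infty} \Pr(X = x \mid N = n) \Pr(N = n). \]

Clearly, it must be true that X ≤ N. Therefore, the terms in this summation for n < x will be 0, and
we may write

\[ \Pr(X = x) = \sum_{n=x}^{\infty} \Pr(X = x \mid N = n) \Pr(N = n). \]

Clearly, Pr(X = 0|N = 0) = 1. Also, given that N = n > 0, the conditional distribution of X will be
a binomial distribution with parameters n and p. Therefore,

\[ \Pr(X = x \mid N = n) = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x}. \]

Also, since N has the Poisson distribution with mean λ,

\[ \Pr(N = n) = \frac{\exp(-\lambda)\lambda^n}{n!}. \]

Hence,

\[ \Pr(X = x) = \sum_{n=x}^{\infty} \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x} \frac{\exp(-\lambda)\lambda^n}{n!} = \frac{1}{x!} p^x \exp(-\lambda) \sum_{n=x}^{\infty} \frac{1}{(n-x)!} (1-p)^{n-x} \lambda^n. \]

If we let t = n - x, then

\[ \Pr(X = x) = \frac{1}{x!} p^x \exp(-\lambda) \sum_{t=0}^{\infty} \frac{(1-p)^t \lambda^{t+x}}{t!} = \frac{1}{x!} (\lambda p)^x \exp(-\lambda) \sum_{t=0}^{\infty} \frac{[\lambda(1-p)]^t}{t!} \]
\[ = \frac{1}{x!} (\lambda p)^x \exp(-\lambda) \exp(\lambda(1-p)) = \frac{\exp(-\lambda p)(\lambda p)^x}{x!}. \]

It can be seen that this final term is the value of the p.f. of the Poisson distribution with mean λp.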
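
The series manipulation can be checked by evaluating the sum over n numerically. In this sketch λ = 4 and p = 0.25 are arbitrary illustrative values, the function names are hypothetical, and the infinite sum is truncated at n = 100, where the remaining terms are negligible for this λ.

```python
from math import comb, exp, factorial


def poisson_pf(x, mean):
    """p.f. of the Poisson distribution with the given mean."""
    return exp(-mean) * mean**x / factorial(x)


def defective_pf_by_summation(x, lam, p, n_max=100):
    """Sum Pr(X = x | N = n) Pr(N = n) over n = x, ..., n_max."""
    total = 0.0
    for n in range(x, n_max + 1):
        binomial_term = comb(n, x) * p**x * (1 - p) ** (n - x)
        total += binomial_term * poisson_pf(n, lam)
    return total


lam, p = 4.0, 0.25
gaps = [
    abs(defective_pf_by_summation(x, lam, p) - poisson_pf(x, lam * p))
    for x in range(11)
]
print(max(gaps))
```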
10. It must be true that X + Y = N. Therefore, for any nonnegative integers x and y,

\[ \Pr(X = x \text{ and } Y = y) = \Pr(X = x \text{ and } N = x + y) = \Pr(X = x \mid N = x + y) \Pr(N = x + y) \]
\[ = \frac{(x+y)!}{x!\,y!} p^x (1-p)^y \frac{\exp(-\lambda)\lambda^{x+y}}{(x+y)!} = \exp(-\lambda) \frac{(\lambda p)^x}{x!} \cdot \frac{[\lambda(1-p)]^y}{y!}. \]

The fact that we have factored Pr(X = x and Y = y) into the product of a function of x and a function
of y is sufficient for us to be able to conclude that X and Y are independent. However, if we continue
further and write

\[ \exp(-\lambda) = \exp(-\lambda p) \exp(-\lambda(1-p)), \]

then we can obtain the factorization

\[ \Pr(X = x \text{ and } Y = y) = \frac{\exp(-\lambda p)(\lambda p)^x}{x!} \cdot \frac{\exp(-\lambda(1-p))[\lambda(1-p)]^y}{y!}. \]


11. If f(x|λ) denotes the p.f. of the Poisson distribution with mean λ, then

\[ \frac{f(x+1 \mid \lambda)}{f(x \mid \lambda)} = \frac{\lambda}{x+1}. \]

Therefore, f(x|λ) < f(x + 1|λ) if and only if x + 1 < λ. It follows that if λ is not an integer, then the
mode of this distribution will be the largest integer x that is less than λ or, equivalently, the smallest
integer x such that x + 1 > λ. If λ is an integer, then both the values λ - 1 and λ will be modes.


12. It can be assumed that the exact distribution of the number of colorblind people in the group is a
binomial distribution with parameters n = 600 and p = 0.005. Therefore, this distribution can be
approximated by a Poisson distribution with mean 600(0.005) = 3. It is found from the table of the
Poisson distribution that


Pr(X ≤ 1) = .0498 + .1494 = .1992.


13. It can be assumed that the exact distribution of the number of sets of triplets in this hospital is a
binomial distribution with parameters n = 700 and p = 0.001. Therefore, this distribution can be
approximated by a Poisson distribution with mean 700(0.001) = 0.7. It is found from the table of the
Poisson distribution that


Pr(X = 1) = 0.3476.


14. Let X denote the number of people who do not appear for the flight. Then everyone who does appear
will have a seat if and only if X ≥ 2. It can be assumed that the exact distribution of X is a binomial
distribution with parameters n = 200 and p = 0.01. Therefore, this distribution can be approximated
by a Poisson distribution with mean 200(0.01) = 2. It is found from the table of the Poisson distribution
that


Pr(X ≥ 2) = 1 - Pr(X ≤ 1) = 1 - .1353 - .2707 = .5940.
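
The quality of the Poisson approximation in these solutions can be gauged by computing the exact binomial probabilities alongside it. The following is an illustrative sketch with hypothetical helper names; it uses the parameters n = 600, p = 0.005 and n = 200, p = 0.01 quoted above.

```python
from math import comb, exp, factorial


def binomial_cdf(k, n, p):
    """Pr(X <= k) for the binomial distribution with parameters n and p."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))


def poisson_cdf(k, mean):
    """Pr(X <= k) for the Poisson distribution with the given mean."""
    return sum(exp(-mean) * mean**x / factorial(x) for x in range(k + 1))


# At most one colorblind person among 600, each with probability 0.005.
exact_colorblind = binomial_cdf(1, 600, 0.005)
approx_colorblind = poisson_cdf(1, 3.0)

# At least two no-shows among 200 passengers, each with probability 0.01.
exact_flight = 1 - binomial_cdf(1, 200, 0.01)
approx_flight = 1 - poisson_cdf(1, 2.0)

print(exact_colorblind, approx_colorblind)
print(exact_flight, approx_flight)
```

In both cases the exact and approximate values differ only in the third decimal place.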
15. The joint p.f./p.d.f. of X and Λ is the Poisson p.f. with parameter λ times f(λ), which equals

\[ \exp(-\lambda) \frac{\lambda^x}{x!} \, 2\exp(-2\lambda) = 2\exp(-3\lambda) \frac{\lambda^x}{x!}. \qquad \text{(S.5.2)} \]

We need to compute the marginal p.f. of X at x = 1 and divide that into (S.5.2) to get the conditional
p.d.f. of Λ given X = 1. The marginal p.f. of X at x = 1 is the integral of (S.5.2) over λ when x = 1 is
plugged in:

\[ f_1(1) = \int_0^{\infty} 2\lambda \exp(-3\lambda)\, d\lambda = \frac{2}{9}. \]

This makes the conditional p.d.f. of Λ equal to 9λ exp(-3λ) for λ > 0.
16. (a) Let $A = \cup_{i=1}^{n} A_i$. Then

\[ \{X = k\} = (\{X = k\} \cap A) \cup (\{X = k\} \cap A^c). \]

The second event on the right side of this equation is $\{W_n = k\}$. Call the first event on the right
side of this equation B. Then B ⊂ A. Since B and $\{W_n = k\}$ are disjoint,
$\Pr(X = k) = \Pr(W_n = k) + \Pr(B)$.


(b) Since the subintervals are disjoint, the events $A_1, \ldots, A_n$ are independent. Since the subintervals
all have the same length t/n, each $A_i$ has the same probability. It follows that

\[ \Pr\left( \cap_{i=1}^{n} A_i^c \right) = [1 - \Pr(A_1)]^n. \]

By assumption, $\Pr(A_i) = o(1/n)$, so

\[ \Pr(A) = 1 - \Pr\left( \cap_{i=1}^{n} A_i^c \right) = 1 - [1 - o(1/n)]^n. \]

So,

\[ \lim_{n \to \infty} \Pr(A) = 1 - \lim_{n \to \infty} [1 - o(1/n)]^n = 1 - 1 = 0, \]

according to Eq. (5.4.9). Since B ⊂ A, it follows that Pr(B) also goes to 0.


(c) Since the $Y_i$ are i.i.d. Bernoulli random variables with parameter $p_n = \lambda t/n + o(1/n)$, we know
that $W_n$ has the binomial distribution with parameters n and $p_n$. Hence

\[ \Pr(W_n = k) = \binom{n}{k} p_n^k (1 - p_n)^{n-k} = \frac{n!}{k!(n-k)!} \left[ \frac{\lambda t}{n} + o(1/n) \right]^k \left[ 1 - \frac{\lambda t}{n} - o(1/n) \right]^{n-k}. \]

For fixed k,

\[ \lim_{n \to \infty} n^k \left[ \frac{\lambda t}{n} + o(1/n) \right]^k = (\lambda t)^k. \]

Also, using the formula stated in the exercise,

\[ \lim_{n \to \infty} \left[ 1 - \frac{\lambda t}{n} - o(1/n) \right]^{n-k} = \exp(-\lambda t). \]

It follows that

\[ \lim_{n \to \infty} \Pr(W_n = k) = \exp(-\lambda t) \frac{(\lambda t)^k}{k!} \lim_{n \to \infty} \frac{n!}{n^k (n-k)!}. \]

We can write

\[ \frac{n!}{n^k (n-k)!} = \frac{n(n-1)\cdots(n-k+1)}{n \cdot n \cdots n}. \]

For fixed k, the limit of this ratio is 1 as n → ∞.

(d) We have established that

\[ \Pr(X = k) = \Pr(W_n = k) + \Pr(B). \]

Since the left side of this equation does not depend on n, we can write

\[ \Pr(X = k) = \lim_{n \to \infty} \Pr(W_n = k) + \lim_{n \to \infty} \Pr(B). \]

In earlier parts of this exercise we showed that the two limits on the right are $\exp(-\lambda t)(\lambda t)^k/k!$
and 0 respectively. So, X has the Poisson distribution with mean λt.


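17. Because $n_T A_T/(A_T + B_T)$ converges to λ, $n_T/(A_T + B_T)$ goes to 0. Hence, $B_T$ eventually gets larger
than $n_T$. Once $B_T$ is larger than $n_T + x$ and $A_T$ is larger than x, we have

\[ \Pr(X_T = x) = \frac{\binom{A_T}{x}\binom{B_T}{n_T - x}}{\binom{A_T + B_T}{n_T}} = \frac{A_T!\, B_T!\, n_T!\, (A_T + B_T - n_T)!}{x!\,(A_T - x)!\,(n_T - x)!\,(B_T - n_T + x)!\,(A_T + B_T)!}. \]

Apply Stirling's formula to each of the factorials in the above expression except x!. A little manipulation
gives that

\[ \lim_{T \to \infty} \frac{A_T^{A_T + 1/2}\, B_T^{B_T + 1/2}\, n_T^{n_T + 1/2}\, (A_T + B_T - n_T)^{A_T + B_T - n_T + 1/2}\, e^{-x}}{\Pr(X_T = x)\, x!\, (A_T - x)^{A_T - x + 1/2}\, (n_T - x)^{n_T - x + 1/2}\, (B_T - n_T + x)^{B_T - n_T + x + 1/2}\, (A_T + B_T)^{A_T + B_T + 1/2}} = 1. \qquad \text{(S.5.3)} \]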

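Each of the following limits follows from Theorem 5.3.3:

\[ \lim_{T \to \infty} \left( \frac{A_T}{A_T - x} \right)^{A_T - x + 1/2} = e^{x}, \]
\[ \lim_{T \to \infty} \left( \frac{B_T}{B_T - n_T + x} \right)^{B_T - n_T + x + 1/2} e^{-n_T + x} = 1, \]
\[ \lim_{T \to \infty} \left( \frac{A_T + B_T - n_T}{A_T + B_T} \right)^{A_T + B_T - n_T + 1/2} e^{n_T} = 1, \]
\[ \lim_{T \to \infty} \left( \frac{n_T}{n_T - x} \right)^{n_T - x + 1/2} = e^{x}, \]
\[ \lim_{T \to \infty} \left( \frac{B_T}{A_T + B_T} \right)^{n_T - x} = e^{-\lambda}. \]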
Inserting these limits in (S.5.3) yields

\[ \lim_{T \to \infty} \frac{e^{-\lambda} A_T^x n_T^x}{\Pr(X_T = x)\, x!\, (A_T + B_T)^x} = 1. \qquad \text{(S.5.4)} \]

Since $n_T A_T/(A_T + B_T)$ converges to λ, we have

\[ \lim_{T \to \infty} \frac{n_T^x A_T^x}{(A_T + B_T)^x} = \lambda^x. \qquad \text{(S.5.5)} \]

Together (S.5.4) and (S.5.5) imply that

\[ \lim_{T \to \infty} \frac{e^{-\lambda} \lambda^x / x!}{\Pr(X_T = x)} = 1. \]

The numerator of this last expression is Pr(Y = x), which completes the proof.
18. First write

\[ \frac{n_T A_T}{B_T} - \frac{n_T A_T}{A_T + B_T} = \frac{n_T A_T^2}{B_T (A_T + B_T)} = n_T \cdot \frac{A_T}{B_T} \cdot \frac{A_T}{A_T + B_T}. \qquad \text{(S.5.6)} \]

For the "if" part, assume that $n_T A_T/B_T$ converges to λ. Since $n_T$ goes to ∞, then $A_T/B_T$ goes to 0,
which implies that $A_T/(A_T + B_T)$ (which is smaller) goes to 0. In the final expression in (S.5.6),
the product of the first two factors goes to λ by assumption, and the third factor goes to 0, so
$n_T A_T/(A_T + B_T)$ converges to the same thing as $n_T A_T/B_T$, namely λ. For the "only if" part,
assume that $n_T A_T/(A_T + B_T)$ converges to λ. It follows that $A_T/(A_T + B_T) = 1/(1 + B_T/A_T)$ goes to
0, hence $A_T/B_T$ goes to 0. In the last expression in (S.5.6), the product of the first and third factors
goes to λ by assumption, and the second factor goes to 0, hence $n_T A_T/B_T$ converges to the same thing
as $n_T A_T/(A_T + B_T)$, namely λ.


5.5 The Negative Binomial Distributions 


Commentary 


This section ends with a discussion of how to extend the definition of negative binomial distribution by 
making use of the extended definition of binomial coefficients from Sec. 5.3. 

If one is using the statistical software R, then the functions dnbinom, pnbinom, and qnbinom give the
p.f., the c.d.f., and the quantile function of negative binomial distributions. The syntax is that the first
argument is the argument of the function, and the next two are r and p in the notation of the text. The
function rnbinom gives a random sample of negative binomial random variables. The first argument is how
many you want, and the next two are r and p. All of the solutions that require the calculation of negative
binomial probabilities can be done using these functions. There are also functions dgeom, pgeom, qgeom,
and rgeom that compute similar features of geometric distributions. Just remove the "r" argument.


Solutions to Exercises 


1. (a) Two particular days in a row have independent draws, and each draw has probability 0.01 of
producing triples. So, the probability that two particular days in a row will both have triples is
$10^{-4}$.


(b) Since a particular day and the next day are independent, the conditional probability of triples on 
the next day is 0.01 conditional on whatever happens on the first day. 


2. (a) The number of tails will have the negative binomial distribution with parameters r = 5 and
p = 1/30. By Eq. (5.5.7),

\[ E(X) = \frac{r(1-p)}{p} = 5(29) = 145. \]

(b) By Eq. (5.5.7), $\mathrm{Var}(X) = \frac{r(1-p)}{p^2} = 4350$.


3. (a) Let X denote the number of tails that are obtained before five heads are obtained, and let Y denote 
the total number of tosses that are required. Then Y = X + 5. Therefore, E(Y) = E(X) +5. It 
follows from Exercise 2(a) that E(Y) = 150. 


(b) Since Y = X + 5, we have Var(Y) = Var(X). Therefore, it follows from Exercise 2(b) that
Var(Y) = 4350.
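
The arithmetic in Exercises 2 and 3 can be reproduced exactly with rational numbers. This is a small sketch; the function names are hypothetical, and the formulas are the mean and variance r(1 - p)/p and r(1 - p)/p^2 used above.

```python
from fractions import Fraction


def negative_binomial_mean(r, p):
    """Mean number of failures before the r-th success."""
    return r * (1 - p) / p


def negative_binomial_variance(r, p):
    """Variance of the number of failures before the r-th success."""
    return r * (1 - p) / p**2


p = Fraction(1, 30)
mean_tails = negative_binomial_mean(5, p)
variance_tails = negative_binomial_variance(5, p)
mean_tosses = mean_tails + 5  # total tosses = tails + the five heads

print(mean_tails, variance_tails, mean_tosses)
```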
4. (a) The number of failures $X_A$ obtained by player A before he obtains r successes will have the negative
binomial distribution with parameters r and p. The total number of throws required by player A
will be $Y_A = X_A + r$. Therefore,

\[ E(Y_A) = E(X_A) + r = \frac{r(1-p)}{p} + r = \frac{r}{p}. \]

The number of failures $X_B$ obtained by player B before he obtains mr successes will have the
negative binomial distribution with parameters mr and mp. The total number of throws required
by player B will be $Y_B = X_B + mr$. Therefore,

\[ E(Y_B) = E(X_B) + mr = \frac{mr(1 - mp)}{mp} + mr = \frac{r}{p}. \]

(b)

\[ \mathrm{Var}(Y_A) = \mathrm{Var}(X_A) = \frac{r(1-p)}{p^2} = \frac{r}{p} \left( \frac{1}{p} - 1 \right) \quad \text{and} \]
\[ \mathrm{Var}(Y_B) = \mathrm{Var}(X_B) = \frac{mr(1 - mp)}{(mp)^2} = \frac{r}{p} \left( \frac{1}{mp} - 1 \right). \]

Therefore, $\mathrm{Var}(Y_B) < \mathrm{Var}(Y_A)$.


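5. By Eq. (5.5.6), the m.g.f. of $X_i$ is

\[ \psi_i(t) = \left( \frac{p}{1 - (1-p)e^t} \right)^{r_i} \quad \text{for } t < \log\left( \frac{1}{1-p} \right). \]

Therefore, the m.g.f. of $X_1 + \cdots + X_k$ is

\[ \psi(t) = \prod_{i=1}^{k} \psi_i(t) = \left( \frac{p}{1 - (1-p)e^t} \right)^{r_1 + \cdots + r_k} \quad \text{for } t < \log\left( \frac{1}{1-p} \right). \]

Since ψ(t) is the m.g.f. of the negative binomial distribution with parameters $r_1 + \cdots + r_k$ and p, that
must be the distribution of $X_1 + \cdots + X_k$.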


6. For x = 0,1,2,...,

\[ \Pr(X = x) = p(1-p)^x. \]

If we let x = 2i, then as i runs through all the integers 0,1,2,..., the value of 2i will run through all
the even integers 0,2,4,.... Therefore,

\[ \Pr(X \text{ is an even integer}) = \sum_{i=0}^{\infty} p(1-p)^{2i} = p \sum_{i=0}^{\infty} ([1-p]^2)^i = p \cdot \frac{1}{1 - (1-p)^2}. \]

7. $\Pr(X \ge k) = \sum_{x=k}^{\infty} p(1-p)^x = p(1-p)^k \sum_{x=k}^{\infty} (1-p)^{x-k}$. If we let i = x - k, then

\[ \Pr(X \ge k) = p(1-p)^k \sum_{i=0}^{\infty} (1-p)^i = p(1-p)^k \cdot \frac{1}{1 - [1-p]} = (1-p)^k. \]


8. We have

\[ \Pr(X = k + t \mid X \ge k) = \frac{\Pr(X = k + t \text{ and } X \ge k)}{\Pr(X \ge k)} = \frac{\Pr(X = k + t)}{\Pr(X \ge k)}. \]

By Eq. (5.5.3), $\Pr(X = k + t) = p(1-p)^{k+t}$. By Exercise 7, $\Pr(X \ge k) = (1-p)^k$. Therefore,

\[ \Pr(X = k + t \mid X \ge k) = p(1-p)^t = \Pr(X = t). \]
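
The memoryless property derived in Exercises 7 and 8 can be illustrated numerically. The sketch below uses hypothetical function names and an arbitrary p = 0.3; the conditional p.f. does not depend on k.

```python
def geometric_pf(x, p):
    """Pr(X = x) for the geometric distribution: x failures before a success."""
    return p * (1 - p) ** x


def geometric_tail(k, p):
    """Pr(X >= k), which equals (1 - p)^k by Exercise 7."""
    return (1 - p) ** k


def conditional_pf(t, k, p):
    """Pr(X = k + t | X >= k)."""
    return geometric_pf(k + t, p) / geometric_tail(k, p)


p = 0.3
for k in (0, 1, 5, 20):
    print(k, conditional_pf(3, k, p), geometric_pf(3, p))
```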


9. Since the components are connected in series, the system will function properly only as long as every
component functions properly. Let $X_i$ denote the number of periods that component i functions
properly, for i = 1,...,n, and let X denote the number of periods that the system functions properly. Then for
any nonnegative integer x,

\[ \Pr(X \ge x) = \Pr(X_1 \ge x, \ldots, X_n \ge x) = \Pr(X_1 \ge x) \cdots \Pr(X_n \ge x), \]

because the n components are independent. By Exercise 7,

\[ \Pr(X_i \ge x) = (1 - p_i)^x. \]

Therefore, $\Pr(X \ge x) = \prod_{i=1}^{n} (1 - p_i)^x$. It follows that

\[ \Pr(X = x) = \Pr(X \ge x) - \Pr(X \ge x + 1) = \prod_{i=1}^{n} (1 - p_i)^x - \prod_{i=1}^{n} (1 - p_i)^{x+1} = \left( 1 - \prod_{i=1}^{n} (1 - p_i) \right) \left( \prod_{i=1}^{n} (1 - p_i) \right)^{x}. \]

It can be seen that this is the p.f. of the geometric distribution with $p = 1 - \prod_{i=1}^{n} (1 - p_i)$.


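10. By the assumptions of the problem we have p = 1 - λ/r and r → ∞. To simplify some of the formulas,
let q = 1 - p so that q = λ/r. It now follows from Eq. (5.5.1) that

\[ f(x \mid r, p) = \binom{r + x - 1}{x} p^r q^x = \frac{(r+x-1)(r+x-2)\cdots r}{x!} \left( 1 - \frac{\lambda}{r} \right)^{r} \left( \frac{\lambda}{r} \right)^{x} \]
\[ = \frac{\lambda^x}{x!} \left[ \frac{r+x-1}{r} \cdot \frac{r+x-2}{r} \cdots \frac{r}{r} \right] \left( 1 - \frac{\lambda}{r} \right)^{r}. \]

As r → ∞, each of the x factors inside the brackets goes to 1 and $(1 - \lambda/r)^r$ goes to exp(-λ).
Hence, $f(x \mid r, p) \to \frac{\lambda^x}{x!} \exp(-\lambda) = f(x \mid \lambda)$.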


11. According to Exercise 10 in Sec. 5.3,

\[ \binom{-r}{x} = (-1)^x \binom{r + x - 1}{x}. \]

This makes

\[ \binom{-r}{x} p^r (-[1-p])^x = \binom{r + x - 1}{x} p^r (1-p)^x, \]

which is the proper form of the negative binomial p.f. for x = 0,1,2,....


12. The joint p.f./p.d.f. of X and P is f(p) times the geometric p.f. with parameter p, that is

\[ p(1-p)^x \cdot 10(1-p)^9 = 10\,p(1-p)^{x+9}, \quad \text{for } x = 0,1,\ldots \text{ and } 0 < p < 1. \qquad \text{(S.5.7)} \]

The marginal p.f. of X at x = 12 is the integral of (S.5.7) over p with x = 12 substituted:

\[ \int_0^1 10\,p(1-p)^{21}\, dp = 10 \int_0^1 p^{21}(1-p)\, dp = 10 \left( \frac{1}{22} - \frac{1}{23} \right) = \frac{10}{506}. \]

The conditional p.d.f. of P given X = 12 is (S.5.7), with x = 12, divided by this last value:

\[ g(p \mid 12) = 506\,p(1-p)^{21}, \quad \text{for } 0 < p < 1. \]


13. (a) The memoryless property says that, for all k, t ≥ 0,

\[ \frac{\Pr(X = k + t)}{1 - F(t - 1)} = \Pr(X = k). \]

(The above version switches the use of k and t from Theorem 5.5.5.) If we sum both sides of this
over k = h, h + 1, ..., we get

\[ \frac{1 - F(t + h - 1)}{1 - F(t - 1)} = 1 - F(h - 1). \]

(b) By definition, ℓ(t + h) = log[1 - F(t + h - 1)]. From part (a), we have

\[ 1 - F(t + h - 1) = [1 - F(t - 1)][1 - F(h - 1)]. \]

Hence

\[ \ell(t + h) = \log[1 - F(t - 1)] + \log[1 - F(h - 1)] = \ell(t) + \ell(h). \]

(c) We prove this by induction. Clearly ℓ(1) = 1 × ℓ(1), so the result holds for t = 1. Assume that the
result holds for all t ≤ t_0. Then ℓ(t_0 + 1) = ℓ(t_0) + ℓ(1) by part (b). By the induction hypothesis,
ℓ(t_0) = t_0 ℓ(1), hence ℓ(t_0 + 1) = (t_0 + 1)ℓ(1), and the result holds for t = t_0 + 1.

(d) Since ℓ(1) = log[1 - F(0)], we have ℓ(1) < 0. Let p = 1 - exp[ℓ(1)], which is between 0 and 1. For
every integer x ≥ 1, we have, from part (c) and the definition of ℓ, that

\[ F(x - 1) = 1 - \exp[\ell(x)] = 1 - \exp[x\ell(1)] = 1 - (1-p)^x. \]

Setting t = x - 1 for x ≥ 1, we get

\[ F(t) = 1 - (1-p)^{t+1}, \quad \text{for } t = 0,1,\ldots. \qquad \text{(S.5.8)} \]

It is easy to verify that (S.5.8) is the c.d.f. of the geometric distribution with parameter p.
5.6 The Normal Distributions


Commentary 


In addition to introducing the family of normal distributions, we also describe the family of lognormal dis- 
tributions. These distributions arise frequently in engineering and financial applications. (Examples 5.6.9 
and 5.6.10 give two such cases.) It is true that lognormal distributions are nothing more than simple trans- 
formations of normal distributions. However, at this point in their study, many students will not yet be 
sufficiently comfortable with transformations to be able to derive these distributions and their properties 
without a little help. 

If one is using the statistical software R, then the functions dnorm, pnorm, and qnorm give the p.d.f.,
the c.d.f., and the quantile function of normal distributions. The syntax is that the first argument is the
argument of the function, and the next two are the mean and standard deviation. The function rnorm gives
a random sample of normal random variables. The first argument is how many you want, and the next two
are the mean and standard deviation. All of the solutions that require the calculation of normal probabilities
and quantiles can be done using these functions instead of tables. There are also functions dlnorm, plnorm,
qlnorm, and rlnorm that compute similar features for lognormal distributions.


Solutions to Exercises 


1. By the symmetry of the standard normal distribution around 0, the 0.5 quantile must be 0. The 0.75
quantile is found by locating 0.75 in the Φ(x) column of the standard normal table and interpolating in
the x column. We find Φ(0.67) = 0.7486 and Φ(0.68) = 0.7517. Interpolating gives the 0.75 quantile as
0.6745. By symmetry, the 0.25 quantile is -0.6745. Similarly we find the 0.9 quantile by interpolation
using Φ(1.28) = 0.8997 and Φ(1.29) = 0.9015. The 0.9 quantile is then 1.282 and the 0.1 quantile is
-1.282.
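
The interpolation step can be written out explicitly and compared with a numerically computed quantile. This sketch uses the table entries quoted above; `interpolate_quantile` is a hypothetical helper, and `statistics.NormalDist` from the Python standard library stands in for the table.

```python
from statistics import NormalDist


def interpolate_quantile(q, x_lo, phi_lo, x_hi, phi_hi):
    """Linear interpolation in the x column between two table entries."""
    return x_lo + (q - phi_lo) / (phi_hi - phi_lo) * (x_hi - x_lo)


q75_table = interpolate_quantile(0.75, 0.67, 0.7486, 0.68, 0.7517)
q90_table = interpolate_quantile(0.90, 1.28, 0.8997, 1.29, 0.9015)

q75_exact = NormalDist().inv_cdf(0.75)
q90_exact = NormalDist().inv_cdf(0.90)

print(round(q75_table, 4), round(q75_exact, 4))
print(round(q90_table, 4), round(q90_exact, 4))
```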


2. Let Z = (X - 1)/2. Then Z has the standard normal distribution.


(a) Pr(X < 3) = Pr(Z < 1) = Φ(1) = 0.8413.

(b) Pr(X > 1.5) = Pr(Z > 0.25) = 1 - Φ(0.25) = 0.4013.

(c) Pr(X = 1) = 0, because X has a continuous distribution.

(d) Pr(2 < X < 5) = Pr(0.5 < Z < 2) = Φ(2) - Φ(0.5) = 0.2858.

(e) Pr(X > 0) = Pr(Z > -0.5) = Pr(Z < 0.5) = Φ(0.5) = 0.6915.

(f) Pr(-1 < X < 0.5) = Pr(-1 < Z < -0.25) = Pr(0.25 < Z < 1) = Φ(1) - Φ(0.25) = 0.2426.

(g) Pr(|X| < 2) = Pr(-2 < X < 2) = Pr(-1.5 < Z < 0.5)
= Pr(Z < 0.5) - Pr(Z < -1.5) = Pr(Z < 0.5) - Pr(Z > 1.5)
= Φ(0.5) - [1 - Φ(1.5)] = 0.6247.

(h) Pr(1 < -2X + 3 < 8) = Pr(-2 < -2X < 5) = Pr(-2.5 < X < 1)
= Pr(-1.75 < Z < 0) = Pr(0 < Z < 1.75)
= Φ(1.75) - Φ(0) = 0.4599.
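
These probabilities can be checked without the table by evaluating the c.d.f. of X directly. The sketch assumes, as the standardization Z = (X - 1)/2 indicates, that X has mean 1 and standard deviation 2; the variable names are chosen for this example only.

```python
from statistics import NormalDist

X = NormalDist(mu=1, sigma=2)

part_a = X.cdf(3)                # Pr(X < 3)
part_b = 1 - X.cdf(1.5)          # Pr(X > 1.5)
part_d = X.cdf(5) - X.cdf(2)     # Pr(2 < X < 5)
part_g = X.cdf(2) - X.cdf(-2)    # Pr(|X| < 2)
part_h = X.cdf(1) - X.cdf(-2.5)  # Pr(1 < -2X + 3 < 8)

for value in (part_a, part_b, part_d, part_g, part_h):
    print(round(value, 4))
```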
3. If X denotes the temperature in degrees Fahrenheit and Y denotes the temperature in degrees Celsius,
then Y = 5(X - 32)/9. Since Y is a linear function of X, then Y will also have a normal distribution.
Also,

\[ E(Y) = \frac{5}{9}(68 - 32) = 20 \quad \text{and} \quad \mathrm{Var}(Y) = \left( \frac{5}{9} \right)^2 (16) = \frac{400}{81}. \]


4. The q quantile of the temperature in degrees Fahrenheit is $68 + 4\Phi^{-1}(q)$. Using Exercise 1, we have
$\Phi^{-1}(0.75) = 0.6745$ and $\Phi^{-1}(0.25) = -0.6745$. So, the 0.25 quantile is 65.302, and the 0.75 quantile is
70.698.


5. Let $A_i$ be the event that chip i lasts at most 290 hours. We want the probability of $\cup_{i=1}^{3} A_i^c$, whose
probability is

\[ 1 - \Pr\left( \cap_{i=1}^{3} A_i \right) = 1 - \prod_{i=1}^{3} \Pr(A_i). \]

Since the lifetime of each chip has the normal distribution with mean 300 and standard deviation 10,
each $A_i$ has probability


Φ([290 - 300]/10) = Φ(-1) = 1 - 0.8413 = 0.1587.


So the probability we want is $1 - 0.1587^3 = 0.9960$.


6. By comparing the given m.g.f. with the m.g.f. of a normal distribution presented in Eq. (5.6.5), we can
see that, for the given m.g.f., μ = 0 and σ^2 = 2.


7. If X is a measurement having the specified normal distribution, and if Z = (X - 120)/2, then Z will
have the standard normal distribution. Therefore, the probability that a particular measurement will
lie in the given interval is


p = Pr(116 < X < 118) = Pr(-2 < Z < -1) = Pr(1 < Z < 2) = Φ(2) - Φ(1) = 0.1360.


The probability that all three measurements will lie in the interval is p^3.


8. Except for a constant factor, this integrand has the form of the p.d.f. of a normal distribution for which
μ = 0 and σ^2 = 1/6. Therefore, if we multiply the integrand by

\[ \frac{1}{(2\pi)^{1/2}\sigma} = \left( \frac{3}{\pi} \right)^{1/2}, \]

we obtain the p.d.f. of a normal distribution and we know that the integral of this p.d.f. over the entire
real line must be equal to 1. Therefore,

\[ \int_{-\infty}^{\infty} \exp(-3x^2)\, dx = \left( \frac{\pi}{3} \right)^{1/2}. \]

Finally, since the integrand is symmetric with respect to x = 0, the integral over the positive half of
the real line must be equal to the integral over the negative half of the real line. Hence,

\[ \int_{0}^{\infty} \exp(-3x^2)\, dx = \frac{1}{2} \left( \frac{\pi}{3} \right)^{1/2}. \]


9. The total length of the rod is X = A + B + C - 4. Since X is a linear combination of A, B, and C, it
will also have the normal distribution with


E(X) = 20 + 14 + 26 - 4 = 56


and Var(X) = 0.04 + 0.01 + 0.04 = 0.09. If we let Z = (X - 56)/0.3, then Z will have the standard
normal distribution. Hence,


Pr(55.7 < X < 56.3) = Pr(-1 < Z < 1) = 2Φ(1) - 1 = 0.6827.


10. We know that $E(\bar{X}_n) = \mu$ and $\mathrm{Var}(\bar{X}_n) = \sigma^2/n = 4/25$. Hence, if we let
$Z = (\bar{X}_n - \mu)/(2/5) = (5/2)(\bar{X}_n - \mu)$, then Z will have the standard normal distribution. Hence,

\[ \Pr(|\bar{X}_n - \mu| \le 1) = \Pr(|Z| \le 2.5) = 2\Phi(2.5) - 1 = 0.9876. \]


11. If we let $Z = \sqrt{n}(\bar{X}_n - \mu)/2$, then Z will have the standard normal distribution. Therefore,

\[ \Pr(|\bar{X}_n - \mu| < 0.1) = \Pr(|Z| < 0.05\sqrt{n}) = 2\Phi(0.05\sqrt{n}) - 1. \]

This value will be at least 0.9 if $2\Phi(0.05\sqrt{n}) - 1 \ge 0.9$ or $\Phi(0.05\sqrt{n}) \ge 0.95$. It is found from a table
of the values of Φ that we must therefore have $0.05\sqrt{n} \ge 1.645$. The smallest integer n which satisfies
this inequality is n = 1083.
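
The sample-size calculation generalizes to other tolerances and coverage levels. The following is a sketch under the same normal model; `smallest_sample_size` and its parameter names are hypothetical, and the quantile comes from `statistics.NormalDist` rather than from the table.

```python
from math import ceil
from statistics import NormalDist


def smallest_sample_size(half_width, sigma, coverage):
    """Smallest n with Pr(|sample mean - mu| < half_width) >= coverage."""
    z = NormalDist().inv_cdf((1 + coverage) / 2)  # e.g. the 0.95 quantile
    return ceil((z * sigma / half_width) ** 2)


n_required = smallest_sample_size(half_width=0.1, sigma=2, coverage=0.9)
print(n_required)
```

With the values used in the solution above (half-width 0.1, σ = 2, coverage 0.9), this reproduces n = 1083.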


12. (a) The general shape is as shown in Fig. S.5.1.


Figure S.5.1: Figure for Exercise 12 of Sec. 5.6.


(b) The sketch remains the same with the scale changed on the x-axis so that the points -1 and
0 become -5 and -2, respectively. It turns out that the point x = 1 remains fixed in this
transformation.


Let X denote the diameter of the bolt and let Y denote the diameter of the nut. The Y —X will have 
the normal distribution for which 


E(Y — X) = 2.02 -2=0.02 
and 
Var(Y — X) = 0.0016 + 0.0009 = 0.0025. 
If we let Z = (Y — X — 0.02)/0.05, then Z will have the standard normal distribution. Therefore, 


Pr(0 < Y — X < 0.05) = Pr(—-0.4 < Z < 0.6) = 6(0.6) — [1 — ®(0.4)] = 0.3812. 
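The table lookups in Exercises 9 and 11 can be checked numerically. The following is only a sketch, assuming Python's standard-library `statistics.NormalDist` as a stand-in for the table of $\Phi$; the variable names are illustrative and not part of the text.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Exercise 9: X is normal with mean 56 and standard deviation 0.3.
rod = NormalDist(mu=56, sigma=0.3)
p_rod = rod.cdf(56.3) - rod.cdf(55.7)


def coverage(n):
    """Pr(|sample mean - mu| < 0.1) when sigma = 2 (Exercise 11)."""
    return 2 * standard_normal.cdf(0.05 * n ** 0.5) - 1


# Smallest sample size whose coverage is at least 0.9.
n_min = 1
while coverage(n_min) < 0.9:
    n_min += 1

print(round(p_rod, 4), n_min)
```

Under these assumptions the search should land on the same sample size as the solution, since it uses the same inequality $0.05\sqrt{n} \ge \Phi^{-1}(0.95)$.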
14. Let $\overline{X}$ denote the average of the two scores from university A and let $\overline{Y}$ denote the average of the three scores from university B. Then $\overline{X}$ has the normal distribution for which

$$E(\overline{X}) = 625 \quad\text{and}\quad \mathrm{Var}(\overline{X}) = \frac{100}{2} = 50.$$

Also, $\overline{Y}$ has the normal distribution for which

$$E(\overline{Y}) = 600 \quad\text{and}\quad \mathrm{Var}(\overline{Y}) = \frac{150}{3} = 50.$$

Therefore $\overline{X} - \overline{Y}$ has the normal distribution for which

$$E(\overline{X} - \overline{Y}) = 625 - 600 = 25 \quad\text{and}\quad \mathrm{Var}(\overline{X} - \overline{Y}) = 50 + 50 = 100.$$

It follows that if we let $Z = (\overline{X} - \overline{Y} - 25)/10$, then $Z$ will have the standard normal distribution. Hence,

$$\Pr(\overline{X} - \overline{Y} > 0) = \Pr(Z > -2.5) = \Pr(Z < 2.5) = \Phi(2.5) = 0.9938.$$

15. Let $f_1(x)$ denote the p.d.f. of $X$ if the person has glaucoma and let $f_2(x)$ denote the p.d.f. of $X$ if the person does not have glaucoma. Furthermore, let $A_1$ denote the event that the person has glaucoma and let $A_2 = A_1^c$ denote the event that the person does not have glaucoma. Then

$$\Pr(A_1) = 0.1, \quad \Pr(A_2) = 0.9,$$

$$f_1(x) = \frac{1}{(2\pi)^{1/2}} \exp\left\{-\frac{1}{2}(x - 25)^2\right\} \quad\text{for } -\infty < x < \infty,$$

$$f_2(x) = \frac{1}{(2\pi)^{1/2}} \exp\left\{-\frac{1}{2}(x - 20)^2\right\} \quad\text{for } -\infty < x < \infty.$$

(a) $$\Pr(A_1 \mid X = x) = \frac{\Pr(A_1) f_1(x)}{\Pr(A_1) f_1(x) + \Pr(A_2) f_2(x)}.$$

(b) The value found in part (a) will be greater than 1/2 if and only if

$$\Pr(A_1) f_1(x) > \Pr(A_2) f_2(x).$$

All of the following inequalities are equivalent to this one:
(i) $\exp\{-(x - 25)^2/2\} > 9\exp\{-(x - 20)^2/2\}$
(ii) $-(x - 25)^2/2 > \log 9 - (x - 20)^2/2$
(iii) $(x - 20)^2 - (x - 25)^2 > 2\log 9$
(iv) $10x - 225 > 2\log 9$
(v) $x > 22.5 + \log(9)/5$.

16. The given joint p.d.f. is the joint p.d.f. of two random variables that are independent and each of which has the standard normal distribution. Therefore, $X + Y$ has the normal distribution for which

$$E(X + Y) = 0 + 0 = 0 \quad\text{and}\quad \mathrm{Var}(X + Y) = 1 + 1 = 2.$$

If we let $Z = (X + Y)/\sqrt{2}$, then $Z$ will have the standard normal distribution. Hence,

$$\Pr(-\sqrt{2} < X + Y < 2\sqrt{2}) = \Pr(-1 < Z < 2) = \Pr(Z < 2) - \Pr(Z < -1) = \Phi(2) - [1 - \Phi(1)] = 0.8186.$$
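The threshold in part (b) of Exercise 15 can be sanity-checked by evaluating the posterior probability from part (a) directly. This is a minimal sketch, assuming Python's `statistics.NormalDist` for the two normal p.d.f.'s; the function name `posterior_glaucoma` is hypothetical.

```python
import math
from statistics import NormalDist

f1 = NormalDist(mu=25, sigma=1).pdf  # p.d.f. of X given glaucoma
f2 = NormalDist(mu=20, sigma=1).pdf  # p.d.f. of X given no glaucoma


def posterior_glaucoma(x, prior=0.1):
    """Pr(A_1 | X = x) from Bayes' theorem, as in part (a)."""
    numerator = prior * f1(x)
    return numerator / (numerator + (1 - prior) * f2(x))


# Threshold from inequality (v) in part (b).
threshold = 22.5 + math.log(9) / 5

print(threshold, posterior_glaucoma(threshold))
```

At the threshold the posterior probability should equal 1/2 up to rounding, and it should exceed 1/2 for any larger $x$.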
17. If $Y = \log X$, then the p.d.f. of $Y$ is

$$g(y) = \frac{1}{(2\pi)^{1/2}\sigma} \exp\left\{-\frac{1}{2\sigma^2}(y - \mu)^2\right\} \quad\text{for } -\infty < y < \infty.$$

Since $\dfrac{dy}{dx} = \dfrac{1}{x}$, it now follows that the p.d.f. of $X$, for $x > 0$, is $f(x) = g(\log x)/x$.

18. Let $U = X/Y$ and, as a convenient device, let $V = Y$. If we exclude the possibility that $Y = 0$, the transformation from $X$ and $Y$ to $U$ and $V$ is then one-to-one. (Since $\Pr(Y = 0) = 0$, we can exclude this possibility.) The inverse transformation is

$$X = UV \quad\text{and}\quad Y = V.$$

Hence, the Jacobian is

$$J = \det\begin{pmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{pmatrix} = \det\begin{pmatrix} v & u \\ 0 & 1 \end{pmatrix} = v.$$

Since $X$ and $Y$ are independent and each has the standard normal distribution, their joint p.d.f. $f(x, y)$ is as given in Exercise 16. Therefore, the joint p.d.f. $g(u, v)$ of $U$ and $V$ will be

$$g(u, v) = f(uv, v)\,|v| = \frac{|v|}{2\pi} \exp\left\{-\frac{1}{2}(u^2 + 1)v^2\right\}.$$

To find the marginal p.d.f. $g_1(u)$ of $U$, we can now integrate $g(u, v)$ over all values of $v$. (The fact that the single point $v = 0$ was excluded does not affect the value of the integral over the entire real line.) We have

$$g_1(u) = \int_{-\infty}^{\infty} \frac{|v|}{2\pi} \exp\left\{-\frac{1}{2}(u^2 + 1)v^2\right\} dv = \int_{0}^{\infty} \frac{v}{\pi} \exp\left\{-\frac{1}{2}(u^2 + 1)v^2\right\} dv = \frac{1}{\pi(u^2 + 1)} \quad\text{for } -\infty < u < \infty.$$

It can now be seen that $g_1(u)$ is the p.d.f. of a Cauchy distribution as defined in Eq. (4.1.7).

19. The conditional p.d.f. of $X$ given $\mu$ is

$$g_1(x \mid \mu) = \frac{1}{(2\pi)^{1/2}} \exp(-(x - \mu)^2/2),$$

while the marginal p.d.f. of $\mu$ is $f_2(\mu) = 0.1$ for $5 < \mu < 15$. We need the marginal p.d.f. of $X$, which we get by integrating $\mu$ out of the joint p.d.f.

$$g_1(x \mid \mu) f_2(\mu) = \frac{0.1}{(2\pi)^{1/2}} \exp(-(x - \mu)^2/2), \quad\text{for } 5 < \mu < 15.$$
The integral is

$$f_1(x) = \int_{5}^{15} \frac{0.1}{(2\pi)^{1/2}} \exp(-(x - \mu)^2/2)\,d\mu = 0.1[\Phi(15 - x) - \Phi(5 - x)].$$

With $x = 8$, the value is $0.1[\Phi(7) - \Phi(-3)] = 0.0999$. This makes the conditional p.d.f. of $\mu$ given $X = 8$

$$g_2(\mu \mid 8) = \frac{1.0013}{(2\pi)^{1/2}} \exp(-(8 - \mu)^2/2), \quad\text{for } 5 < \mu < 15.$$

20. This probability is the probability that $\log(X) \le \log(6.05) = 1.80$, which equals

$$\Phi([1.80 - 3]/1.44^{1/2}) = \Phi(-1) = 0.1587.$$

21. Note that $\log(XY) = \log(X) + \log(Y)$. Since $X$ and $Y$ are independent with normal distributions, we have that $\log(XY)$ has the normal distribution with the sum of the means (4.6) and sum of the variances (10.5). This means that $XY$ has the lognormal distribution with parameters 4.6 and 10.5.

22. Since $\log(1/X) = -\log(X)$, we know that $-\log(X)$ has the normal distribution with mean $-\mu$ and variance $\sigma^2$. This means that $1/X$ has the lognormal distribution with parameters $-\mu$ and $\sigma^2$.

23. Since $\log(3X^{1/2}) = \log(3) + \log(X)/2$, we know that $\log(3X^{1/2})$ has the normal distribution with mean $\log(3) + 4.1/2 = 3.149$ and variance $8/4 = 2$. This means that $3X^{1/2}$ has the lognormal distribution with parameters 3.149 and 2.

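24. First expand the left side of the equation to get

$$\sum_{i=1}^{n} a_i(x - b_i)^2 + cx = cx + \sum_{i=1}^{n} [a_i x^2 - 2a_i b_i x + a_i b_i^2]. \tag{S.5.9}$$

Now collect all the squared and linear terms in $x$. The coefficient of $x^2$ is $\sum_{i=1}^{n} a_i$. The coefficient of $x$ is $c - 2\sum_{i=1}^{n} a_i b_i$. The constant term is $\sum_{i=1}^{n} a_i b_i^2$. This makes (S.5.9) equal to

$$x^2 \sum_{i=1}^{n} a_i + x\left[c - 2\sum_{i=1}^{n} a_i b_i\right] + \sum_{i=1}^{n} a_i b_i^2. \tag{S.5.10}$$

Next, expand each term on the right side of the original equation to produce

$$\left(\sum_{i=1}^{n} a_i\right)x^2 - 2x\left[\sum_{i=1}^{n} a_i b_i - c/2\right] + \frac{\left(\sum_{i=1}^{n} a_i b_i\right)^2 - c\sum_{i=1}^{n} a_i b_i + c^2/4}{\sum_{i=1}^{n} a_i} + \sum_{i=1}^{n} a_i b_i^2 - \frac{\left(\sum_{i=1}^{n} a_i b_i\right)^2}{\sum_{i=1}^{n} a_i} + \frac{c\sum_{i=1}^{n} a_i b_i - c^2/4}{\sum_{i=1}^{n} a_i}.$$

Combining like terms in this expression produces the same terms that are in (S.5.10).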
25. Divide the time interval of $u$ years into $n$ intervals of length $u/n$ each. At the end of $n$ such intervals, the principal gets multiplied by $(1 + ru/n)^n$. The limit of this as $n \to \infty$ is $\exp(ru)$.


26. The integral that defines the mean is 


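$$E(X) = \int_{-\infty}^{\infty} \frac{x}{(2\pi)^{1/2}} \exp\left[-\frac{x^2}{2}\right] dx.$$

The integrand is a function $f$ with the property that $f(-x) = -f(x)$. Since the range of integration is symmetric around 0, the integral is 0. The integral that defines the variance is then

$$\mathrm{Var}(X) = E(X^2) = \int_{-\infty}^{\infty} \frac{x^2}{(2\pi)^{1/2}} \exp\left[-\frac{x^2}{2}\right] dx.$$

In this integral, let $u = x$ and

$$dv = \frac{x}{(2\pi)^{1/2}} \exp\left[-\frac{x^2}{2}\right] dx.$$

It is easy to see that $du = dx$ and

$$v = -\frac{1}{(2\pi)^{1/2}} \exp\left[-\frac{x^2}{2}\right].$$

Integration by parts yields

$$\mathrm{Var}(X) = -\frac{x}{(2\pi)^{1/2}} \exp\left[-\frac{x^2}{2}\right]\Bigg|_{x=-\infty}^{\infty} + \int_{-\infty}^{\infty} \frac{1}{(2\pi)^{1/2}} \exp\left[-\frac{x^2}{2}\right] dx.$$

The first term on the right side equals 0 at both $\infty$ and $-\infty$. The remaining integral is 1 because it is the integral of the standard normal p.d.f. So $\mathrm{Var}(X) = 1$.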


5.7 The Gamma Distributions 


Commentary 


Gamma distributions are used in the derivation of the chi-square distribution in Sec. 8.2 and as conjugate 
prior distributions for various parameters. The gamma function arises in several integrals later in the text 
and is interesting in its own right as a generalization of the factorial function to noninteger arguments. 

If one is using the statistical software R, then the function gamma computes the gamma function, and lgamma computes the logarithm of the gamma function. They take only one argument. The functions dgamma, pgamma, and qgamma give the p.d.f., the c.d.f., and the quantile function of gamma distributions. The syntax is that the first argument is the argument of the function, and the next two are $\alpha$ and $\beta$ in the notation of the text. The function rgamma gives a random sample of gamma random variables. The first argument is how many you want, and the next two are $\alpha$ and $\beta$. All of the solutions that require the calculation of gamma probabilities and quantiles can be done using these functions. There are also functions dexp, pexp, qexp, and rexp that compute similar features for exponential distributions. Just remove the $\alpha$ parameter.


Solutions to Exercises 


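1. Let $f(x)$ denote the p.d.f. of $X$ and let $Y = cX$. Then $X = Y/c$. Since $dx = dy/c$, then for $y > 0$,

$$g(y) = \frac{1}{c} f\left(\frac{y}{c}\right) = \frac{1}{c}\,\frac{\beta^{\alpha}}{\Gamma(\alpha)} \left(\frac{y}{c}\right)^{\alpha - 1} \exp(-\beta(y/c)) = \frac{(\beta/c)^{\alpha}}{\Gamma(\alpha)}\, y^{\alpha - 1} \exp(-(\beta/c)y).$$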
2. The c.d.f. of the exponential distribution with parameter $\beta$ is $F(x) = 1 - \exp(-\beta x)$ for $x > 0$. The inverse of this is the quantile function $F^{-1}(p) = -\log(1 - p)/\beta$.
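The quantile function in Exercise 2 is what R's qexp computes, and it also gives a simple way to simulate exponential random variables. The sketch below is an illustration in Python rather than R; the function names and the inverse-c.d.f. sampling helper are assumptions added for the example, not part of the text.

```python
import math
import random


def exp_cdf(x, beta):
    """C.d.f. of the exponential distribution with parameter beta."""
    return 1.0 - math.exp(-beta * x) if x > 0 else 0.0


def exp_quantile(p, beta):
    """Quantile function F^{-1}(p) = -log(1 - p) / beta, for 0 <= p < 1."""
    return -math.log(1.0 - p) / beta


def sample_exponential(n, beta, seed=0):
    """Draw n values by applying the quantile function to uniform draws."""
    rng = random.Random(seed)
    return [exp_quantile(rng.random(), beta) for _ in range(n)]


median = exp_quantile(0.5, beta=2.0)  # equals log(2) / 2
print(median, exp_cdf(median, beta=2.0))
```

Applying `exp_cdf` to `exp_quantile(p, beta)` should return `p` up to rounding, which confirms that the two functions are inverses.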


3. The three p.d.f.'s are in Fig. S.5.2.

Figure S.5.2: Figure for Exercise 3 of Sec. 5.7. [Plots omitted; three panels, each showing $f(x)$.]


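4. $$f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} \exp(-\beta x) \quad\text{for } x > 0,$$

$$f'(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, (\alpha - 1 - \beta x)\, x^{\alpha - 2} \exp(-\beta x) \quad\text{for } x > 0.$$

If $\alpha \le 1$, then $f'(x) < 0$ for $x > 0$. Therefore, the maximum value of $f(x)$ occurs at $x = 0$. If $\alpha > 1$, then $f'(x) = 0$ for $x = (\alpha - 1)/\beta$ and it can be verified that $f(x)$ is actually a maximum at this value of $x$.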


5. All three p.d.f.'s are in Fig. S.5.3.


6. Each $X_i$ has the gamma distribution with parameters 1 and $\beta$. Therefore, by Theorem 5.7.7, the sum $\sum_{i=1}^{n} X_i$ has the gamma distribution with parameters $n$ and $\beta$. Finally, by Exercise 1, $\overline{X}_n = \sum_{i=1}^{n} X_i / n$ has the gamma distribution with parameters $n$ and $n\beta$.


7. Let $A_i = \{X_i > t\}$ for $i = 1, 2, 3$. The event that at least one $X_i$ is greater than $t$ is $\bigcup_{i=1}^{3} A_i$. We could use the formula in Theorem 1.10.1, or we could use that $\Pr\left(\bigcup_{i=1}^{3} A_i\right) = 1 - \Pr\left(\bigcap_{i=1}^{3} A_i^c\right)$. The latter is easier because the $X_i$ are mutually independent and identically distributed.

$$\Pr\left(\bigcap_{i=1}^{3} A_i^c\right) = \Pr(A_1^c)^3 = [1 - \exp(-\beta t)]^3.$$

So, the probability we want is $1 - [1 - \exp(-\beta t)]^3$.


Figure S.5.3: Figure for Exercise 5 of Sec. 5.7.

8. For any number $y > 0$,

$$\Pr(Y > y) = \Pr(X_1 > y, \ldots, X_k > y) = \Pr(X_1 > y) \cdots \Pr(X_k > y) = \exp(-\beta_1 y) \cdots \exp(-\beta_k y) = \exp(-(\beta_1 + \cdots + \beta_k)y),$$

which is the probability that an exponential random variable with parameter $\beta_1 + \cdots + \beta_k$ is greater than $y$. Hence, $Y$ has that exponential distribution.

9. Let $Y$ denote the length of life of the system. Then by Exercise 8, $Y$ has the exponential distribution with parameter $0.001 + 0.003 + 0.006 = 0.01$. Therefore,

$$\Pr(Y > 100) = \exp(-100(0.01)) = \frac{1}{e}.$$

10. Since the mean of the exponential distribution is $\mu$, the parameter is $\beta = 1/\mu$. Therefore, the distribution of the time until the system fails is an exponential distribution with parameter $n\beta = n/\mu$. The mean of this distribution is $1/(n\beta) = \mu/n$ and the variance is $1/(n\beta)^2 = (\mu/n)^2$.

11. The length of time $Y_1$ until one component fails has the exponential distribution with parameter $n\beta$. Therefore, $E(Y_1) = 1/(n\beta)$. The additional length of time $Y_2$ until a second component fails has the exponential distribution with parameter $(n - 1)\beta$. Therefore, $E(Y_2) = 1/[(n - 1)\beta]$. Similarly, $E(Y_3) = 1/[(n - 2)\beta]$. The total time until three components fail is $Y_1 + Y_2 + Y_3$ and

$$E(Y_1 + Y_2 + Y_3) = \left(\frac{1}{n} + \frac{1}{n - 1} + \frac{1}{n - 2}\right)\frac{1}{\beta}.$$

12. The length of time until the system fails will be $Y_1 + Y_2$, where these variables were defined in Exercise 11. Therefore,

$$E(Y_1 + Y_2) = \frac{1}{n\beta} + \frac{1}{(n - 1)\beta} = \left(\frac{1}{n} + \frac{1}{n - 1}\right)\mu.$$

Also, the variables $Y_1$ and $Y_2$ are independent, because the distribution of $Y_2$ is always the same exponential distribution regardless of the value of $Y_1$. Therefore,

$$\mathrm{Var}(Y_1 + Y_2) = \mathrm{Var}(Y_1) + \mathrm{Var}(Y_2) = \left[\frac{1}{n^2} + \frac{1}{(n - 1)^2}\right]\mu^2.$$
13. The time $Y_1$ until one of the students completes the examination has the exponential distribution with parameter $5\beta = 5/80 = 1/16$. Therefore,

$$\Pr(Y_1 < 40) = 1 - \exp(-40/16) = 1 - \exp(-5/2) = 0.9179.$$

14. The time $Y_2$ after one student completes the examination until a second student completes it has the exponential distribution with parameter $4\beta = 4/80 = 1/20$. Therefore,

$$\Pr(Y_2 < 35) = 1 - \exp(-35/20) = 1 - \exp(-7/4) = 0.8262.$$

15. No matter when the first student completes the examination, the second student to complete the examination will do so at least 10 minutes later than the first student if $Y_2 > 10$. Similarly, the third student to complete the examination will do so at least 10 minutes later than the second student if $Y_3 > 10$. Furthermore, the variables $Y_1$, $Y_2$, $Y_3$, $Y_4$, and $Y_5$ are independent. Therefore, the probability that no two students will complete the examination within 10 minutes of each other is

$$\begin{aligned}
\Pr(Y_2 > 10, \ldots, Y_5 > 10) &= \Pr(Y_2 > 10) \cdots \Pr(Y_5 > 10) \\
&= \exp(-(10)4\beta)\exp(-(10)3\beta)\exp(-(10)2\beta)\exp(-10\beta) \\
&= \exp(-40/80)\exp(-30/80)\exp(-20/80)\exp(-10/80) \\
&= \exp(-5/4) = 0.2865.
\end{aligned}$$

16. If $Y = \log(X/x_0)$, then $X = x_0 \exp(Y)$. Also, $dx = x_0 \exp(y)\,dy$ and $x > x_0$ if and only if $y > 0$. Therefore, for $y > 0$,

$$g(y) = f(x_0 \exp(y) \mid x_0, \alpha)\, x_0 \exp(y) = \alpha e^{-\alpha y}.$$
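The three numerical answers in Exercises 13 to 15 follow from the same fact: when $k$ students are still working, the waiting time to the next completion is exponential with parameter $k\beta$, where $\beta = 1/80$. A minimal Python sketch of that arithmetic (variable names are illustrative) is:

```python
import math

beta = 1 / 80  # parameter of a single student's completion time

# Exercise 13: first completion among 5 students occurs before 40 minutes.
p13 = 1 - math.exp(-5 * beta * 40)

# Exercise 14: second completion occurs within 35 minutes of the first.
p14 = 1 - math.exp(-4 * beta * 35)

# Exercise 15: each of the gaps Y_2, ..., Y_5 exceeds 10 minutes.
# The gap while k students remain is exponential with parameter k * beta.
p15 = math.prod(math.exp(-10 * k * beta) for k in (4, 3, 2, 1))

print(round(p13, 4), round(p14, 4), round(p15, 4))
```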


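17.
$$\begin{aligned}
E[(X - \mu)^{2n}] &= \int_{-\infty}^{\infty} (x - \mu)^{2n}\, \frac{1}{(2\pi)^{1/2}\sigma} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\} dx \\
&= \frac{2}{(2\pi)^{1/2}\sigma} \int_{\mu}^{\infty} (x - \mu)^{2n} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\} dx.
\end{aligned}$$

Let $y = (x - \mu)^2$. Then $dx = dy/(2y^{1/2})$ and the above integral can be rewritten as

$$\frac{2}{(2\pi)^{1/2}\sigma} \int_{0}^{\infty} y^{n} \exp\left\{-\frac{y}{2\sigma^2}\right\} \frac{dy}{2y^{1/2}} = \frac{1}{(2\pi)^{1/2}\sigma} \int_{0}^{\infty} y^{n - 1/2} \exp\left\{-\frac{y}{2\sigma^2}\right\} dy.$$

The integrand in this integral is the p.d.f. of a gamma distribution with parameters $\alpha = n + 1/2$ and $\beta = 1/(2\sigma^2)$, except for the constant factor

$$\frac{\beta^{\alpha}}{\Gamma(\alpha)} = \frac{1}{(2\sigma^2)^{n + 1/2}\,\Gamma(n + 1/2)}.$$

Since the integral of the p.d.f. of the gamma distribution must be equal to 1, it follows that

$$\int_{0}^{\infty} y^{n - 1/2} \exp\left\{-\frac{y}{2\sigma^2}\right\} dy = (2\sigma^2)^{n + 1/2}\,\Gamma(n + 1/2).$$

From Eqs. (5.7.6) and (5.7.9), $\Gamma\left(n + \frac{1}{2}\right) = \left(n - \frac{1}{2}\right)\left(n - \frac{3}{2}\right) \cdots \left(\frac{1}{2}\right)\pi^{1/2}$. Therefore,

$$\begin{aligned}
E[(X - \mu)^{2n}] &= \frac{1}{(2\pi)^{1/2}\sigma}\,(2\sigma^2)^{n + 1/2} \left(n - \frac{1}{2}\right)\left(n - \frac{3}{2}\right) \cdots \left(\frac{1}{2}\right)\pi^{1/2} \\
&= 2^{n} \left(n - \frac{1}{2}\right)\left(n - \frac{3}{2}\right) \cdots \left(\frac{1}{2}\right)\sigma^{2n} \\
&= (2n - 1)(2n - 3) \cdots (1)\,\sigma^{2n}.
\end{aligned}$$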


18. For the exponential distribution with parameter $\beta$,

$$f(x) = \beta\exp(-\beta x)$$

and

$$1 - F(x) = \Pr(X > x) = \exp(-\beta x).$$

Therefore, $h(x) = \beta$ for $x > 0$.

19. Let $Y = X^b$. Then $X = Y^{1/b}$ and $dx = \frac{1}{b}\, y^{(1/b) - 1}\,dy$. Therefore, for $y > 0$,

$$g(y) = f(y^{1/b} \mid a, b)\,\frac{1}{b}\, y^{(1/b) - 1} = \frac{1}{a^b} \exp(-y/a^b).$$

20. If $X$ has the Weibull distribution with parameters $a$ and $b$, then the c.d.f. of $X$ is

$$F(x) = \int_{0}^{x} \frac{b}{a^b}\, t^{b - 1} \exp(-(t/a)^b)\,dt = \left[-\exp(-(t/a)^b)\right]_{0}^{x} = 1 - \exp(-(x/a)^b).$$

Therefore,

$$h(x) = \frac{f(x)}{1 - F(x)} = \frac{b}{a^b}\, x^{b - 1}.$$

If $b > 1$, then $h(x)$ is an increasing function of $x$ for $x > 0$, and if $b < 1$, then $h(x)$ is a decreasing function of $x$ for $x > 0$.


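21. (a) The mean of $1/X$ is

$$\int_{0}^{\infty} \frac{1}{x}\,\frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} \exp(-\beta x)\,dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_{0}^{\infty} x^{\alpha - 2} \exp(-\beta x)\,dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\frac{\Gamma(\alpha - 1)}{\beta^{\alpha - 1}} = \frac{\beta}{\alpha - 1}.$$

(b) The mean of $1/X^2$ is

$$\int_{0}^{\infty} \frac{1}{x^2}\,\frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} \exp(-\beta x)\,dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_{0}^{\infty} x^{\alpha - 3} \exp(-\beta x)\,dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\frac{\Gamma(\alpha - 2)}{\beta^{\alpha - 2}} = \frac{\beta^2}{(\alpha - 1)(\alpha - 2)}.$$

This makes the variance of $1/X$ equal to

$$\frac{\beta^2}{(\alpha - 1)(\alpha - 2)} - \left(\frac{\beta}{\alpha - 1}\right)^2 = \frac{\beta^2}{(\alpha - 1)^2(\alpha - 2)}.$$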
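22. The conditional p.d.f. of $\lambda$ given $X = x$ can be obtained from Bayes' theorem for random variables (Theorem 3.6.4). We know

$$g_1(x \mid \lambda) = \exp(-\lambda t)\,\frac{(\lambda t)^x}{x!}, \quad\text{for } x = 0, 1, \ldots,$$

$$f_2(\lambda) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \lambda^{\alpha - 1} \exp(-\lambda\beta), \quad\text{for } \lambda > 0.$$

The marginal p.f. of $X$ is

$$f_1(x) = \frac{t^x \beta^{\alpha}}{x!\,\Gamma(\alpha)} \int_{0}^{\infty} \lambda^{\alpha + x - 1} \exp(-\lambda[\beta + t])\,d\lambda = \frac{t^x \beta^{\alpha}\,\Gamma(\alpha + x)}{x!\,\Gamma(\alpha)(\beta + t)^{\alpha + x}}.$$

So, the conditional p.d.f. of $\lambda$ given $X = x$ is

$$g_2(\lambda \mid x) = \frac{(\beta + t)^{\alpha + x}}{\Gamma(\alpha + x)}\, \lambda^{\alpha + x - 1} \exp(-\lambda[\beta + t]), \quad\text{for } \lambda > 0,$$

which is easily recognized as the p.d.f. of a gamma distribution with parameters $\alpha + x$ and $\beta + t$.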
23. The memoryless property means that $\Pr(X > t + h \mid X > t) = \Pr(X > h)$.

(a) In terms of the c.d.f. the memoryless property means

$$\frac{1 - F(t + h)}{1 - F(t)} = 1 - F(h).$$

(b) From (a) we obtain $[1 - F(h)][1 - F(t)] = [1 - F(t + h)]$. Taking logarithms of both sides yields $\ell(h) + \ell(t) = \ell(t + h)$, where $\ell(x) = \log[1 - F(x)]$.

(c) Apply the result in part (b) with $h$ and $t$ both replaced by $t/m$. We obtain $\ell(2t/m) = 2\ell(t/m)$. Repeat with $t$ replaced by $2t/m$ and $h = t/m$. The result is $\ell(3t/m) = 3\ell(t/m)$. After $k - 1$ such applications, we obtain

$$\ell(kt/m) = k\ell(t/m). \tag{S.5.11}$$

In particular, when $k = m$, we get $\ell(t) = m\ell(t/m)$ or $\ell(t/m) = \ell(t)/m$. Substituting this into (S.5.11) we obtain $\ell(kt/m) = (k/m)\ell(t)$.

(d) Let $c > 0$ and let $c_1, c_2, \ldots$ be a sequence of rational numbers that converges to $c$. Since $\ell$ is a continuous function, $\ell(c_n t) \to \ell(ct)$. But $\ell(c_n t) = c_n \ell(t)$ by part (c) since $c_n$ is rational. It follows that $c_n \ell(t) \to \ell(ct)$. But, we know that $c_n \ell(t) \to c\ell(t)$. So, $c\ell(t) = \ell(ct)$.

(e) Apply part (d) with $c = 1/t$ to obtain $\ell(t)/t = \ell(1)$, a constant.

(f) Let $\beta = -\ell(1)$, which is positive because $\ell(1) = \log[1 - F(1)] < 0$. According to part (e), $\ell(t) = -\beta t$ for all $t > 0$. Then $\log[1 - F(x)] = -\beta x$ for $x > 0$. Solving for $F(x)$ gives $F(x) = 1 - \exp(-\beta x)$, which is the c.d.f. of the exponential distribution with parameter $\beta = -\ell(1)$.


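24. Let $\psi$ be the m.g.f. of $W_u$. The mean of $S_u$ is

$$E(S_u) = S_0 E(\exp(\mu u + W_u)) = S_0 \exp(\mu u)\psi(1).$$

(a) Since $W_u$ has the gamma distribution with parameters $\alpha u$ and $\beta > 1$, the m.g.f. is $\psi(t) = (\beta/[\beta - t])^{\alpha u}$. This makes the mean

$$E(S_u) = S_0 \exp(\mu u)\left(\frac{\beta}{\beta - 1}\right)^{\alpha u}.$$

So $\exp(-ru)E(S_u) = S_0$ if and only if

$$\exp((\mu - r)u)\left(\frac{\beta}{\beta - 1}\right)^{\alpha u} = 1.$$

Solving this equation for $\mu$ yields

$$\mu = r - \alpha\log\left(\frac{\beta}{\beta - 1}\right).$$

(b) Once again, we use the function

$$h(x) = \begin{cases} x - q & \text{if } x \ge q, \\ 0 & \text{if } x < q. \end{cases}$$

The value of the option at time $u$ is $h(S_u)$. Notice that $S_u \ge q$ if and only if $W_u \ge \log(q/S_0) - \mu u = c$, as defined in the exercise. Then the present value of the option is

$$\begin{aligned}
\exp(-ru)E[h(S_u)] &= \exp(-ru) \int_{c}^{\infty} [S_0 \exp(\mu u + w) - q]\,\frac{\beta^{\alpha u}}{\Gamma(\alpha u)}\, w^{\alpha u - 1} \exp(-\beta w)\,dw \\
&= S_0 \exp([\mu - r]u)\,\frac{\beta^{\alpha u}}{\Gamma(\alpha u)} \int_{c}^{\infty} w^{\alpha u - 1} \exp(-w[\beta - 1])\,dw - q\exp(-ru)\,\frac{\beta^{\alpha u}}{\Gamma(\alpha u)} \int_{c}^{\infty} w^{\alpha u - 1} \exp(-\beta w)\,dw \\
&= S_0\,\frac{(\beta - 1)^{\alpha u}}{\Gamma(\alpha u)} \int_{c}^{\infty} w^{\alpha u - 1} \exp(-(\beta - 1)w)\,dw - q\exp(-ru)R(c\beta) \\
&= S_0 R(c[\beta - 1]) - q\exp(-ru)R(c\beta).
\end{aligned}$$

(c) We plug the values $u = 1$, $q = S_0$, $r = 0.06$, $\alpha = 1$, and $\beta = 10$ into the previous formula to get

$$c = \log(10/9) - 0.06 = 0.0454,$$

$$S_0\left[R(0.0454 \times 9) - e^{-0.06} R(0.0454 \times 10)\right] = 0.0665\,S_0.$$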
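Part (c) can be reproduced with a few lines of arithmetic. The sketch below assumes that $R$ is one minus the c.d.f. of the gamma distribution with parameters $\alpha u$ and 1, which is what the change of variables in part (b) implies; with $\alpha u = 1$ this reduces to $R(x) = \exp(-x)$. The function name is hypothetical.

```python
import math


def survival_gamma_1_1(x):
    """Assumed form of R when alpha * u = 1: R(x) = exp(-x) for x > 0."""
    return math.exp(-x) if x > 0 else 1.0


r, alpha, beta, u = 0.06, 1.0, 10.0, 1.0

# Part (a): the value of mu that makes exp(-r u) E(S_u) equal to S_0.
mu = r - alpha * math.log(beta / (beta - 1))

# Part (b) with q = S_0: c = log(q / S_0) - mu * u = -mu * u.
c = -mu * u

# Option value divided by S_0.
price_per_s0 = (survival_gamma_1_1(c * (beta - 1))
                - math.exp(-r * u) * survival_gamma_1_1(c * beta))

print(round(c, 4), round(price_per_s0, 4))
```

For other values of $\alpha u$ one would need a routine for the gamma c.d.f., such as pgamma in R, in place of the closed form used here.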


5.8 The Beta Distributions 


Commentary 


Beta distributions arise as conjugate priors for the parameters of the Bernoulli, binomial, geometric, and 
negative binomial distributions. They also appear in several exercises later in the text, either because of 
their relationship to the t and F' distributions (Exercise 1 in Sec. 8.4 and Exercise 6 in Sec. 9.7) or as 
examples of numerical calculation of M.L.E.’s (Exercise 10 in Sec. 7.6) or calculation of sufficient statistics 
(Exercises 24(h), 24(i) in Sec. 7.3, Exercise 7 in Sec. 7.7, and Exercises 2 and 7(c) in Sec. 7.8). The derivation 
of the p.d.f. of the beta distribution relies on material from Sec. 3.9 (particularly Jacobians) which the 
instructor might have skipped earlier in the course. 

If one is using the statistical software R, then the function beta computes the beta function, and lbeta computes the logarithm of the beta function. They take only the two necessary arguments. The functions dbeta, pbeta, and qbeta give the p.d.f., the c.d.f., and the quantile function of beta distributions. The syntax is that the first argument is the argument of the function, and the next two are $\alpha$ and $\beta$ in the notation of the text. The function rbeta gives a random sample of beta random variables. The first argument is how many you want, and the next two are $\alpha$ and $\beta$. All of the solutions that require the calculation of beta probabilities and quantiles can be done using these functions.


Solutions to Exercises 


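1. The c.d.f. of the beta distribution with parameters $\alpha > 0$ and $\beta = 1$ is

$$F(x) = \begin{cases} 0 & \text{for } x \le 0, \\ x^{\alpha} & \text{for } 0 < x < 1, \\ 1 & \text{for } x \ge 1. \end{cases}$$

Setting this equal to $p$ and solving for $x$ yields $F^{-1}(p) = p^{1/\alpha}$.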


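2. $$f'(x \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,[(\alpha - 1)(1 - x) - (\beta - 1)x]\, x^{\alpha - 2}(1 - x)^{\beta - 2}.$$

Therefore, $f'(x \mid \alpha, \beta) = 0$ at $x = (\alpha - 1)/(\alpha + \beta - 2)$. It can be verified that if $\alpha > 1$ and $\beta > 1$, then $f(x \mid \alpha, \beta)$ is actually a maximum for this value of $x$.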


3. The vertical scale is to be chosen in each part of Fig. S.5.4 so that the area under the curve is 1. The
figure in (h) is the mirror image of the figure in (g) with respect to x = 1/2. 


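4. Let $Y = 1 - X$. Then $X = 1 - Y$. Therefore, $|dx/dy| = 1$ and, for $0 < y < 1$,

$$g(y) = f(1 - y) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,(1 - y)^{\alpha - 1} y^{\beta - 1}.$$

This is the p.d.f. of the beta distribution with the values of $\alpha$ and $\beta$ interchanged.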


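5.
$$\begin{aligned}
E[X^r(1 - X)^s] &= \int_{0}^{1} x^r(1 - x)^s\,\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha - 1}(1 - x)^{\beta - 1}\,dx \\
&= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_{0}^{1} x^{\alpha + r - 1}(1 - x)^{\beta + s - 1}\,dx \\
&= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \frac{\Gamma(\alpha + r)\Gamma(\beta + s)}{\Gamma(\alpha + \beta + r + s)} \\
&= \frac{\Gamma(\alpha + r)}{\Gamma(\alpha)} \cdot \frac{\Gamma(\beta + s)}{\Gamma(\beta)} \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha + \beta + r + s)} \\
&= \frac{\alpha(\alpha + 1) \cdots (\alpha + r - 1)\,\beta(\beta + 1) \cdots (\beta + s - 1)}{(\alpha + \beta)(\alpha + \beta + 1) \cdots (\alpha + \beta + r + s - 1)}.
\end{aligned}$$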


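6. The joint p.d.f. of $X$ and $Y$ will be the product of their marginal p.d.f.'s. Therefore, for $x > 0$ and $y > 0$,

$$\begin{aligned}
f(x, y) &= \frac{\beta^{\alpha_1}}{\Gamma(\alpha_1)}\, x^{\alpha_1 - 1} \exp(-\beta x)\,\frac{\beta^{\alpha_2}}{\Gamma(\alpha_2)}\, y^{\alpha_2 - 1} \exp(-\beta y) \\
&= \frac{\beta^{\alpha_1 + \alpha_2}}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\, x^{\alpha_1 - 1} y^{\alpha_2 - 1} \exp(-\beta(x + y)).
\end{aligned}$$

Also, $X = UV$ and $Y = (1 - U)V$. Therefore, the Jacobian is

$$J = \det\begin{pmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{pmatrix} = \det\begin{pmatrix} v & u \\ -v & 1 - u \end{pmatrix} = v.$$

As $x$ and $y$ vary over all positive values, $u$ will vary over the interval (0, 1) and $v$ will vary over all positive values. Hence, for $0 < u < 1$ and $v > 0$, the joint p.d.f. of $U$ and $V$ will be

$$g(u, v) = f(uv, (1 - u)v)\,v = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\, u^{\alpha_1 - 1}(1 - u)^{\alpha_2 - 1} \cdot \frac{\beta^{\alpha_1 + \alpha_2}}{\Gamma(\alpha_1 + \alpha_2)}\, v^{\alpha_1 + \alpha_2 - 1} \exp(-\beta v).$$

It can be seen that this joint p.d.f. has been factored into the product of the p.d.f. of a beta distribution with parameters $\alpha_1$ and $\alpha_2$ and the p.d.f. of a gamma distribution with parameters $\alpha_1 + \alpha_2$ and $\beta$. Therefore, $U$ and $V$ are independent, the distribution of $U$ is the specified beta distribution, and the distribution of $V$ is the specified gamma distribution.

Figure S.5.4: Figure for Exercise 3 of Sec. 5.8.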
7. Since $X_1$ and $X_2$ each have the gamma distribution with parameters $\alpha = 1$ and $\beta$, it follows from Exercise 6 that the distribution of $X_1/(X_1 + X_2)$ will be a beta distribution with parameters $\alpha = 1$ and $\beta = 1$. This beta distribution is the uniform distribution on the interval (0, 1).


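8. (a) Let $A$ denote the event that the item will be defective. Then

$$\Pr(A) = \int_{0}^{1} \Pr(A \mid x) f(x)\,dx = \int_{0}^{1} x f(x)\,dx = E(X) = \frac{\alpha}{\alpha + \beta}.$$

(b) Let $B$ denote the event that both items will be defective. Then

$$\Pr(B) = \int_{0}^{1} \Pr(B \mid x) f(x)\,dx = \int_{0}^{1} x^2 f(x)\,dx = E(X^2) = \frac{\alpha(\alpha + 1)}{(\alpha + \beta)(\alpha + \beta + 1)}.$$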


9. Prior to observing the sample, the mean of $P$ is $\alpha/(\alpha + \beta) = 0.05$, which means that $\alpha = \beta/19$. If we use the result in the note that follows Example 5.8.3, the distribution of $P$ after finding 10 defectives in a sample of size 10 would be beta with parameters $\alpha + 10$ and $\beta$, whose mean is $(\alpha + 10)/(\alpha + \beta + 10) = 0.9$. This means that $\alpha = 9\beta - 10$. So $9\beta - 10 = \beta/19$ and $\beta = 19/17$, so $\alpha = 1/17$. The distribution of $P$ is then a beta distribution with parameters 1/17 and 19/17.
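The two linear conditions in Exercise 9 can be solved and checked in exact arithmetic. This is a small sketch using Python's `fractions` module; the variable names are illustrative.

```python
from fractions import Fraction

# Conditions from the solution: alpha = beta / 19 and alpha = 9 * beta - 10.
# Equating them gives beta * (9 - 1/19) = 10.
beta = Fraction(10) / (9 - Fraction(1, 19))
alpha = beta / 19

prior_mean = alpha / (alpha + beta)
posterior_mean = (alpha + 10) / (alpha + beta + 10)

print(alpha, beta, prior_mean, posterior_mean)
```

The prior mean should come out to 1/20 and the mean after observing 10 defectives in 10 items to 9/10, matching the two requirements of the exercise.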


10. The distribution of P is a beta distribution with parameters 1 and 1. Applying the note after Exam- 
ple 5.8.3 with n = 25 and x = 6, the conditional distribution of P after observing the data is a beta 
distribution with parameters 7 and 20. 


5.9 The Multinomial Distributions 


Commentary 


The family of multinomial distributions is the only named family of discrete multivariate distributions in the 
text. It arises in finite population sampling problems, but does not figure in the remainder of the text. 

If one is using the statistical software R, then the function dmultinom gives the joint p.f. of a multinomial 
vector. The syntax is that the first argument is the argument of the function and must be a vector of the 
appropriate length with nonnegative integer coordinates. The next argument must be specified as prob= 
followed by the vector of probabilities, which must be a vector of the same length as the first argument. 
The function rmultinom gives a random sample of multinomial random vectors. The first argument is how 
many you want, the next argument specifies what the sum of the coordinates of every vector must be (n 
in the notation of the text), and the third argument is prob as above. All of the solutions that require the 
calculation of multinomial probabilities can be done using these functions.


Solutions to Exercises 


1. Let $Y = X_1 + \cdots + X_\ell$. We shall show that $Y$ has the binomial distribution with parameters $n$ and $p_1 + \cdots + p_\ell$. Let $Z_1, \ldots, Z_n$ be i.i.d. random variables with the p.f.

$$f(z) = \begin{cases} p_i & \text{for } z = i,\ i = 1, \ldots, k, \\ 0 & \text{otherwise.} \end{cases}$$

For each $i = 1, \ldots, k$ and each $j = 1, \ldots, n$, define

$$A_{ij} = \{Z_j = i\},$$

$$W_{ij} = \begin{cases} 1 & \text{if } A_{ij} \text{ occurs,} \\ 0 & \text{if not.} \end{cases}$$

Finally, define $V_i = \sum_{j=1}^{n} W_{ij}$ for $i = 1, \ldots, k$. It follows from the discussion in the text that $(X_1, \ldots, X_k)$ has the same distribution as $(V_1, \ldots, V_k)$. Hence $Y$ has the same distribution as $U = V_1 + \cdots + V_\ell$. But

$$U = V_1 + \cdots + V_\ell = \sum_{i=1}^{\ell} \sum_{j=1}^{n} W_{ij} = \sum_{j=1}^{n} \sum_{i=1}^{\ell} W_{ij}.$$

Define $U_j = \sum_{i=1}^{\ell} W_{ij}$. It is easy to see that $U_j = 1$ if $\bigcup_{i=1}^{\ell} A_{ij}$ occurs and $U_j = 0$ if not. Also $\Pr\left(\bigcup_{i=1}^{\ell} A_{ij}\right) = p_1 + \cdots + p_\ell$. Hence, $U_1, \ldots, U_n$ are i.i.d. random variables each having a Bernoulli distribution with parameter $p_1 + \cdots + p_\ell$. Since $U = \sum_{j=1}^{n} U_j$, we know that $U$ has the binomial distribution with parameters $n$ and $p_1 + \cdots + p_\ell$.


2. The probability that a given observed value will be less than $a_1$ is $p_1 = F(a_1) = 0.3$, the probability that it will be between $a_1$ and $a_2$ is $p_2 = F(a_2) - F(a_1) = 0.5$, and the probability that it will be greater than $a_2$ is $p_3 = 1 - F(a_2) = 0.2$. Therefore, the numbers of the 25 observations in each of these three intervals will have the multinomial distribution with parameters $n = 25$ and $p = (p_1, p_2, p_3)$. Therefore, the required probability is

$$\frac{25!}{6!\,10!\,9!}\,(0.3)^{6}(0.5)^{10}(0.2)^{9}.$$

3. Let $X_1$ denote the number of times that the number 1 appears, let $X_2$ denote the number of times that the number 4 appears, and let $X_3$ denote the number of times that a number other than 1 or 4 appears. Then the vector $(X_1, X_2, X_3)$ has the multinomial distribution with parameters $n = 5$ and $p = (1/6, 1/6, 4/6)$. Therefore,

$$\begin{aligned}
\Pr(X_1 = X_2) &= \Pr(X_1 = 0, X_2 = 0, X_3 = 5) + \Pr(X_1 = 1, X_2 = 1, X_3 = 3) + \Pr(X_1 = 2, X_2 = 2, X_3 = 1) \\
&= \left(\frac{4}{6}\right)^5 + \frac{5!}{1!\,1!\,3!}\left(\frac{1}{6}\right)\left(\frac{1}{6}\right)\left(\frac{4}{6}\right)^3 + \frac{5!}{2!\,2!\,1!}\left(\frac{1}{6}\right)^2\left(\frac{1}{6}\right)^2\left(\frac{4}{6}\right) \\
&= \frac{1024}{6^5} + \frac{1280}{6^5} + \frac{120}{6^5} = \frac{2424}{6^5}.
\end{aligned}$$

4. Let $X_3$ denote the number of rolls for which the number 5 appears. If $X_1 = 20$ and $X_2 = 15$, then it must also be true that $X_3 = 5$. The vector $(X_1, X_2, X_3)$ has the multinomial distribution with parameters

$$\begin{aligned}
n &= 40, \\
q_1 &= p_2 + p_4 + p_6 = 0.30 + 0.05 + 0.07 = 0.42, \\
q_2 &= p_1 + p_3 = 0.11 + 0.22 = 0.33, \\
q_3 &= p_5 = 0.25.
\end{aligned}$$

Therefore,

$$\Pr(X_1 = 20 \text{ and } X_2 = 15) = \Pr(X_1 = 20, X_2 = 15, X_3 = 5) = \frac{40!}{20!\,15!\,5!}\,(0.42)^{20}(0.33)^{15}(0.25)^{5}.$$
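The multinomial probabilities in Exercises 2 to 4 can be evaluated with R's dmultinom, as noted in the commentary. For readers without R, the following Python sketch implements the multinomial p.f. directly and reproduces the exact fraction in Exercise 3; the function name `multinomial_pmf` is hypothetical.

```python
from fractions import Fraction
from math import factorial


def multinomial_pmf(counts, probs):
    """Multinomial p.f. with n = sum(counts), computed in exact arithmetic."""
    coefficient = factorial(sum(counts))
    for count in counts:
        coefficient //= factorial(count)  # each division is exact
    value = Fraction(coefficient)
    for count, prob in zip(counts, probs):
        value *= Fraction(prob) ** count
    return value


# Exercise 3: five rolls; categories are "1", "4", and "anything else".
probs = (Fraction(1, 6), Fraction(1, 6), Fraction(4, 6))
p_equal = sum(multinomial_pmf((k, k, 5 - 2 * k), probs) for k in range(3))

print(p_equal, float(p_equal))
```

Using `Fraction` keeps the three terms exact, so the sum can be compared with $2424/6^5$ without rounding error.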
5. The number $X$ of freshmen or sophomores selected will have the binomial distribution with parameters $n = 15$ and $p = 0.16 + 0.14 = 0.30$. Therefore, it is found from the table in the back of the book that

$$\Pr(X \ge 8) = .0348 + .0116 + .0030 + .0006 + .0001 = .0501.$$
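The table lookup in Exercise 5 can be replaced by a direct sum of binomial probabilities. This is a minimal sketch in Python (the helper name `binomial_pmf` is illustrative); the result differs from the table sum only through rounding of the individual table entries.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Binomial p.f. with parameters n and p, evaluated at k."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Upper tail for n = 15 and p = 0.30: sum of the p.f. over k = 8, ..., 15.
tail = sum(binomial_pmf(k, 15, 0.30) for k in range(8, 16))

print(round(tail, 4))
```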


6. By Eq. (5.9.3),

$$\begin{aligned}
E(X_3) &= 15(0.38) = 5.7, \\
E(X_4) &= 15(0.32) = 4.8, \\
\mathrm{Var}(X_3) &= 15(0.38)(0.62) = 3.534, \\
\mathrm{Var}(X_4) &= 15(0.32)(0.68) = 3.264.
\end{aligned}$$

By Eq. (5.9.3),

$$\mathrm{Cov}(X_3, X_4) = -15(0.38)(0.32) = -1.824.$$

Hence,

$$E(X_3 - X_4) = 5.7 - 4.8 = 0.9$$

and

$$\mathrm{Var}(X_3 - X_4) = 3.534 + 3.264 - 2(-1.824) = 10.446.$$


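7. For any nonnegative integers $x_1, \ldots, x_k$ such that $\sum_{i=1}^{k} x_i = n$,

$$\Pr\left(X_1 = x_1, \ldots, X_k = x_k \,\Big|\, \sum_{i=1}^{k} X_i = n\right) = \frac{\Pr(X_1 = x_1, \ldots, X_k = x_k)}{\Pr\left(\sum_{i=1}^{k} X_i = n\right)}.$$

Since $X_1, \ldots, X_k$ are independent,

$$\Pr(X_1 = x_1, \ldots, X_k = x_k) = \Pr(X_1 = x_1) \cdots \Pr(X_k = x_k).$$

Since $X_i$ has the Poisson distribution with mean $\lambda_i$,

$$\Pr(X_i = x_i) = \frac{\exp(-\lambda_i)\lambda_i^{x_i}}{x_i!}.$$

Also, by Theorem 5.4.4, the distribution of $\sum_{i=1}^{k} X_i$ will be a Poisson distribution with mean $\lambda = \sum_{i=1}^{k} \lambda_i$. Therefore,

$$\Pr\left(\sum_{i=1}^{k} X_i = n\right) = \frac{\exp(-\lambda)\lambda^{n}}{n!}.$$

It follows that

$$\Pr\left(X_1 = x_1, \ldots, X_k = x_k \,\Big|\, \sum_{i=1}^{k} X_i = n\right) = \frac{n!}{x_1! \cdots x_k!} \prod_{i=1}^{k} \left(\frac{\lambda_i}{\lambda}\right)^{x_i}.$$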
8. Let the data be called $X = (X_1, X_2, X_3)$, with $X_1$ being the number of working parts, $X_2$ being the number of impaired parts, and $X_3$ being the number of defective parts. The conditional distribution of $X$ given $p$ is a multinomial distribution with parameters 10 and $p$. So, the conditional p.f. of the observed data is

$$g(8, 2, 0 \mid p) = \binom{10}{8, 2, 0} p_1^{8} p_2^{2}.$$

The joint p.f./p.d.f. of $X$ and $p$ is the product of this with the p.d.f. of $p$:

$$\binom{10}{8, 2, 0} p_1^{8} p_2^{2} \cdot 12 p_1^{2} = 540\, p_1^{10} p_2^{2}.$$

To find the conditional p.d.f. of $p$ given $X = (8, 2, 0)$, we need to divide this expression by the marginal p.f. of $X$, which is the integral of this last expression over all $(p_1, p_2)$ such that $p_i > 0$ and $p_1 + p_2 < 1$. This integral can be written as

$$\int_{0}^{1} \int_{0}^{1 - p_1} 540\, p_1^{10} p_2^{2}\,dp_2\,dp_1 = \int_{0}^{1} 180\, p_1^{10}(1 - p_1)^{3}\,dp_1 = 180\,\frac{\Gamma(11)\Gamma(4)}{\Gamma(15)} = 0.0450.$$

For the second equality, we used Theorem 5.8.1. So, the conditional p.d.f. of $p$ given $X = (8, 2, 0)$ is

$$\begin{cases} 12012\, p_1^{10} p_2^{2} & \text{if } 0 < p_1, p_2 < 1 \text{ and } p_1 + p_2 < 1, \\ 0 & \text{otherwise.} \end{cases}$$
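The normalizing constant in Exercise 8 can be checked exactly, since $\Gamma(m) = (m - 1)!$ for positive integers $m$. The following is only a sketch in Python with illustrative variable names.

```python
from fractions import Fraction
from math import factorial

# Marginal p.f. of the data: 180 * Gamma(11) * Gamma(4) / Gamma(15),
# using Gamma(m) = (m - 1)! for positive integers m.
marginal = Fraction(180 * factorial(10) * factorial(3), factorial(14))

# Constant multiplying p_1^10 * p_2^2 in the conditional p.d.f.
constant = 540 / marginal

print(float(marginal), constant)
```

The marginal should round to 0.0450 and the constant should equal 12012 exactly.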


5.10 The Bivariate Normal Distributions 


Commentary 


The joint distribution of the least squares estimators in a simple linear regression model (Sec. 11.3) is a 
bivariate normal distribution, as is the posterior distribution of the regression parameters in a Bayesian 
analysis of simple linear regression (Sec. 11.4). It also arises in the regression fallacy (Exercise 19 in Sec. 11.2 
and Exercise 8 in Sec. 11.9) and as another theoretical avenue for introducing regression concepts (Exercises 2 
and 3 in Sec. 11.9). The derivation of the bivariate normal p.d.f. relies on Jacobians from Sec. 3.9 which the 
instructor might have skipped earlier in the course. 


Solutions to Exercises 


1. The conditional distribution of the height of the wife given that the height of the husband is 72 inches is a normal distribution with mean $66.8 + 0.68 \times 2(72 - 70)/2 = 68.16$ and variance $(1 - 0.68^2)2^2 = 2.1504$. The 0.95 quantile of this distribution is

$$68.16 + 2.1504^{1/2}\,\Phi^{-1}(0.95) = 68.16 + 1.4664 \times 1.645 = 70.57.$$


2. Let $X_1$ denote the student's score on test A and let $X_2$ denote his score on test B. The conditional distribution of $X_2$ given that $X_1 = 80$ is a normal distribution with mean

$$90 + (0.8)(16)\left(\frac{80 - 85}{10}\right) = 83.6$$

and variance $(1 - 0.64)(256) = 92.16$. Therefore, given that $X_1 = 80$, the random variable $Z = (X_2 - 83.6)/9.6$ will have the standard normal distribution. It follows that

$$\Pr(X_2 > 90 \mid X_1 = 80) = \Pr\left(Z > \frac{2}{3}\right) = 1 - \Phi\left(\frac{2}{3}\right) = 0.2524.$$
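Exercises 1 and 2 both use the conditional mean and variance of one coordinate of a bivariate normal vector given the other. The sketch below packages those two formulas in a hypothetical helper, `conditional_normal`, and uses Python's `statistics.NormalDist` in place of a normal table; small differences from the printed answers come from table rounding.

```python
from statistics import NormalDist


def conditional_normal(mu_given, mu_target, sd_given, sd_target, rho, value):
    """Conditional distribution of the target variable given the other.

    mean = mu_target + rho * sd_target * (value - mu_given) / sd_given
    variance = (1 - rho^2) * sd_target^2
    """
    mean = mu_target + rho * sd_target * (value - mu_given) / sd_given
    variance = (1 - rho ** 2) * sd_target ** 2
    return NormalDist(mu=mean, sigma=variance ** 0.5)


# Exercise 1: wife's height given that the husband's height is 72 inches.
wife = conditional_normal(70, 66.8, 2, 2, 0.68, 72)
q95 = wife.inv_cdf(0.95)

# Exercise 2: score on test B given a score of 80 on test A.
test_b = conditional_normal(85, 90, 10, 16, 0.8, 80)
p_above_90 = 1 - test_b.cdf(90)

print(round(q95, 2), round(p_above_90, 4))
```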
3. The sum $X_1 + X_2$ will have the normal distribution with mean $85 + 90 = 175$ and variance $(10)^2 + (16)^2 + 2(0.8)(10)(16) = 612$. Therefore, $Z = (X_1 + X_2 - 175)/24.7386$ will have the standard normal distribution. It follows that

$$\Pr(X_1 + X_2 > 200) = \Pr(Z > 1.0106) = 1 - \Phi(1.0106) = 0.1562.$$

4. The difference $X_1 - X_2$ will have the normal distribution with mean $85 - 90 = -5$ and variance $(10)^2 + (16)^2 - 2(0.8)(10)(16) = 100$. Therefore, $Z = (X_1 - X_2 + 5)/10$ will have the standard normal distribution. It follows that

$$\Pr(X_1 > X_2) = \Pr(X_1 - X_2 > 0) = \Pr(Z > 0.5) = 1 - \Phi(0.5) = 0.3085.$$

5. The predicted value should be the mean of the conditional distribution of $X_1$ given that $X_2 = 100$. This value is

$$85 + (0.8)(10)\left(\frac{100 - 90}{16}\right) = 90.$$

The M.S.E. for this prediction is the variance of the conditional distribution, which is $(1 - 0.64)100 = 36$.

6. $\mathrm{Var}(X_1 + bX_2) = \sigma_1^2 + b^2\sigma_2^2 + 2b\rho\sigma_1\sigma_2$. This is a quadratic function of $b$. By differentiating with respect to $b$ and setting the derivative equal to 0, we obtain the value $b = -\rho\sigma_1/\sigma_2$.

7. Since $E(X_1 \mid X_2) = 3.7 - 0.15X_2$, it follows from Eq. (5.10.8) that

(i) $\mu_1 - \rho\dfrac{\sigma_1}{\sigma_2}\mu_2 = 3.7$,

(ii) $\rho\dfrac{\sigma_1}{\sigma_2} = -0.15$.

Since $E(X_2 \mid X_1) = 0.4 - 0.6X_1$, it follows from Eq. (5.10.6) that

(iii) $\mu_2 - \rho\dfrac{\sigma_2}{\sigma_1}\mu_1 = 0.4$,

(iv) $\rho\dfrac{\sigma_2}{\sigma_1} = -0.6$.

Finally, since $\mathrm{Var}(X_2 \mid X_1) = 3.64$, it follows that

(v) $(1 - \rho^2)\sigma_2^2 = 3.64$.

By multiplying (ii) and (iv) we find that $\rho^2 = 0.09$. Therefore, $\rho = \pm 0.3$. Since the right side of (ii) is negative, $\rho$ must be negative also. Hence, $\rho = -0.3$. It now follows from (v) that $\sigma_2^2 = 4$. Hence, $\sigma_2 = 2$ and it is found from (ii) that $\sigma_1 = 1$. By using the values we have obtained, we can rewrite (i) and (iii) as follows:

(i) $\mu_1 + 0.15\mu_2 = 3.7$,

(iii) $0.6\mu_1 + \mu_2 = 0.4$.

By solving these two simultaneous linear equations, we find that $\mu_1 = 4$ and $\mu_2 = -2$.
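The chain of deductions in Exercise 7 can be retraced numerically. The sketch below follows the same order as the solution (first $\rho$, then $\sigma_2$ and $\sigma_1$, then the two means by Cramer's rule); the variable names are illustrative and not from the text.

```python
import math

slope_1_on_2 = -0.15  # rho * sigma_1 / sigma_2, from (ii)
slope_2_on_1 = -0.6   # rho * sigma_2 / sigma_1, from (iv)
cond_var_2 = 3.64     # (1 - rho^2) * sigma_2^2, from (v)

# (ii) times (iv) gives rho^2; rho takes the sign of the slopes.
rho = -math.sqrt(slope_1_on_2 * slope_2_on_1)

sigma_2 = math.sqrt(cond_var_2 / (1 - rho ** 2))
sigma_1 = slope_1_on_2 * sigma_2 / rho

# Solve mu_1 + 0.15 mu_2 = 3.7 and 0.6 mu_1 + mu_2 = 0.4 by Cramer's rule.
det = 1 * 1 - 0.15 * 0.6
mu_1 = (3.7 * 1 - 0.15 * 0.4) / det
mu_2 = (1 * 0.4 - 0.6 * 3.7) / det

print(rho, sigma_1, sigma_2, mu_1, mu_2)
```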


8. The value of $f(x_1, x_2)$ will be a maximum when the exponent inside the curly braces is a maximum. In turn, this exponent will be a maximum when the expression inside the square brackets is a minimum. If we let

$$a_1 = \frac{x_1 - \mu_1}{\sigma_1} \quad\text{and}\quad a_2 = \frac{x_2 - \mu_2}{\sigma_2},$$

then this expression is

$$a_1^2 - 2\rho a_1 a_2 + a_2^2.$$

We shall now show that this expression must be nonnegative. We have

$$0 \le (|a_1| - |a_2|)^2 = a_1^2 + a_2^2 - 2|a_1 a_2| \le a_1^2 + a_2^2 - 2|\rho a_1 a_2|,$$

since $|\rho| < 1$. Furthermore, $|\rho a_1 a_2| \ge \rho a_1 a_2$. Hence,

$$0 \le a_1^2 + a_2^2 - 2\rho a_1 a_2.$$

The minimum possible value of $a_1^2 - 2\rho a_1 a_2 + a_2^2$ is therefore 0, and this value is attained when $a_1 = 0$ and $a_2 = 0$ or, equivalently, when $x_1 = \mu_1$ and $x_2 = \mu_2$.

9. Let $a_1$ and $a_2$ be as defined in Exercise 8. If $f(x_1, x_2) = k$, then $a_1^2 - 2\rho a_1 a_2 + a_2^2 = b^2$, where $b$ is a particular positive constant. Suppose first that $\rho = 0$ and $\sigma_1 = \sigma_2 = \sigma$. Then this equation has the form

$$(x_1 - \mu_1)^2 + (x_2 - \mu_2)^2 = b^2\sigma^2.$$

This is the equation of a circle with center at $(\mu_1, \mu_2)$ and radius $b\sigma$. Suppose next that $\rho = 0$ and $\sigma_1 \ne \sigma_2$. Then the equation has the form

$$\frac{(x_1 - \mu_1)^2}{\sigma_1^2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2} = b^2.$$

This is the equation of an ellipse for which the center is $(\mu_1, \mu_2)$ and the major and minor axes are parallel to the $x_1$ and $x_2$ axes. Suppose finally that $\rho \ne 0$. It was shown in Exercise 8 that $a_1^2 - 2\rho a_1 a_2 + a_2^2 \ge 0$ for all values of $a_1$ and $a_2$. It therefore follows from the methods of analytic geometry and elementary calculus that the set of points which satisfy the equation

$$\frac{(x_1 - \mu_1)^2}{\sigma_1^2} - 2\rho\,\frac{x_1 - \mu_1}{\sigma_1} \cdot \frac{x_2 - \mu_2}{\sigma_2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2} = b^2$$

will be an ellipse for which the center is $(\mu_1, \mu_2)$ and for which the major and minor axes are rotated so that they are not parallel to the $x_1$ and $x_2$ axes.

10. Let $\Delta = \det\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$. Since $\Delta \ne 0$, the transformation from $X_1$ and $X_2$ to $Y_1$ and $Y_2$ is a one-to-one transformation, for which the inverse transformation is:

$$X_1 = \frac{1}{\Delta}[a_{22}(Y_1 - b_1) - a_{12}(Y_2 - b_2)],$$

$$X_2 = \frac{1}{\Delta}[-a_{21}(Y_1 - b_1) + a_{11}(Y_2 - b_2)].$$

The joint p.d.f. of $Y_1$ and $Y_2$ can therefore be obtained by replacing $x_1$ and $x_2$ in $f(x_1, x_2)$ by their expressions in terms of $y_1$ and $y_2$, and then multiplying the result by the constant $1/|\Delta|$. After a great deal of algebra the exponent in this joint p.d.f. can be put into the following form:

$$-\frac{1}{2(1 - r^2)}\left[\left(\frac{y_1 - m_1}{s_1}\right)^2 - 2r\left(\frac{y_1 - m_1}{s_1}\right)\left(\frac{y_2 - m_2}{s_2}\right) + \left(\frac{y_2 - m_2}{s_2}\right)^2\right],$$

where

$$\begin{aligned}
m_1 &= E(Y_1) = a_{11}\mu_1 + a_{12}\mu_2 + b_1, \\
m_2 &= E(Y_2) = a_{21}\mu_1 + a_{22}\mu_2 + b_2, \\
s_1^2 &= \mathrm{Var}(Y_1) = a_{11}^2\sigma_1^2 + a_{12}^2\sigma_2^2 + 2a_{11}a_{12}\rho\sigma_1\sigma_2, \\
s_2^2 &= \mathrm{Var}(Y_2) = a_{21}^2\sigma_1^2 + a_{22}^2\sigma_2^2 + 2a_{21}a_{22}\rho\sigma_1\sigma_2, \\
r &= \frac{\mathrm{Cov}(Y_1, Y_2)}{s_1 s_2} = \frac{1}{s_1 s_2}[a_{11}a_{21}\sigma_1^2 + (a_{11}a_{22} + a_{12}a_{21})\rho\sigma_1\sigma_2 + a_{12}a_{22}\sigma_2^2].
\end{aligned}$$

It can then be concluded that this joint p.d.f. is the p.d.f. of a bivariate normal distribution for which the means are $m_1$ and $m_2$, the variances are $s_1^2$ and $s_2^2$, and the correlation is $r$.


11. By Exercise 10, the joint distribution of $X_1 + X_2$ and $X_1 - X_2$ is a bivariate normal distribution. By Exercise 9 of Sec. 4.6, these two variables are uncorrelated. Therefore, they are also independent.


12. (a) For the first species, the mean of $a_1X_1 + a_2X_2$ is $201a_1 + 118a_2$, while the variance is
$$15.2^2a_1^2 + 6.6^2a_2^2 + 2 \times 15.2 \times 6.6 \times 0.64\,a_1a_2.$$
The square root of this is the standard deviation, $(231.04a_1^2 + 43.56a_2^2 + 128.41a_1a_2)^{1/2}$. For the second species, the mean is $187a_1 + 131a_2$. The standard deviation will be the same as for the first species because the values of $\sigma_1$, $\sigma_2$ and $\rho$ are the same for both species.

(b) At first, it looks like we need a two-dimensional maximization. However, it is clear that the ratio in question, namely,
$$\frac{-14a_1 + 13a_2}{(231.04a_1^2 + 43.56a_2^2 + 128.41a_1a_2)^{1/2}}, \qquad \text{(S.5.12)}$$
will have the same value if we multiply both $a_1$ and $a_2$ by the same positive constant. We could then assume that the pair $(a_1, a_2)$ lies on a circle and hence reduce the maximization to a one-dimensional problem. Alternatively, we could assume that $a_1 + a_2 = 1$ and then find the maximum of the square of (S.5.12). (We would also have to check the one extra case in which $a_1 = -a_2$ to see if that produced a larger value.) We shall use this second approach. If we replace $a_2$ by $1 - a_1$, we need to find the maximum of
$$\frac{(13 - 27a_1)^2}{231.04a_1^2 + 43.56(1 - a_1)^2 + 128.41a_1(1 - a_1)}.$$
The derivative of this is the ratio of two polynomials, the denominator of which is always positive. So, the derivative is 0 when the numerator is 0. The numerator of the derivative is $13 - 27a_1$ times a linear function of $a_1$. The two roots of the numerator are 0.4815 and $-0.5878$. The first root produces the value 0 for (S.5.12), while the second produces the value 3.456. All pairs with $a_1 = -a_2$ lead to the values $\pm 2.233$. So $a_1 = -0.5878$ and $a_2 = 1.5878$ provide the maximum of (S.5.12).

13. The exponent of a bivariate normal p.d.f. can be expressed as $-[ax^2 + by^2 + cxy + ex + gy + h]$, where
$$a = \frac{1}{2\sigma_1^2(1 - \rho^2)},$$
$$b = \frac{1}{2\sigma_2^2(1 - \rho^2)},$$
$$c = -\frac{\rho}{\sigma_1\sigma_2(1 - \rho^2)},$$
$$e = -\frac{\mu_1}{\sigma_1^2(1 - \rho^2)} + \frac{\mu_2\rho}{\sigma_1\sigma_2(1 - \rho^2)},$$
$$g = -\frac{\mu_2}{\sigma_2^2(1 - \rho^2)} + \frac{\mu_1\rho}{\sigma_1\sigma_2(1 - \rho^2)},$$
and $h$ is irrelevant because $\exp(-h)$ just provides an additional constant factor that we are ignoring anyway. The only restrictions that the bivariate normal p.d.f. puts on the numbers $a$, $b$, $c$, $e$, and $g$ are that $a, b > 0$ and whatever is equivalent to $|\rho| < 1$. It is easy to see that, so long as $a, b > 0$, we will have $|\rho| < 1$ if and only if $ab > (c/2)^2$. Hence, every set of numbers that satisfies these inequalities corresponds to a bivariate normal p.d.f. Assuming that these inequalities are satisfied, we can solve the above equations to find the parameters of the bivariate normal distribution:
$$\rho = -\frac{c/2}{(ab)^{1/2}},$$
$$\sigma_1^2 = \frac{1}{2a - c^2/[2b]},$$
$$\sigma_2^2 = \frac{1}{2b - c^2/[2a]},$$
$$\mu_1 = \frac{cg - 2be}{4ab - c^2},$$
$$\mu_2 = \frac{ce - 2ag}{4ab - c^2}.$$

14. The marginal p.d.f. of $X$ is
$$f_1(x) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right),$$
where $\mu$ and $\sigma^2$ are the mean and variance of $X$. The conditional p.d.f. of $Y$ given $X = x$ is
$$g_2(y|x) = \frac{1}{(2\pi\tau^2)^{1/2}}\exp\left(-\frac{1}{2\tau^2}(y - ax - b)^2\right).$$
The joint p.d.f. of $(X, Y)$ is the product of these two,
$$f(x, y) = \frac{1}{2\pi\sigma\tau}\exp\left(-\left[a'x^2 + b'y^2 + cxy + ex + gy + h\right]\right),$$
where
$$a' = \frac{1}{2\sigma^2} + \frac{a^2}{2\tau^2},$$
$$b' = \frac{1}{2\tau^2},$$
$$c = -\frac{a}{\tau^2},$$
$$e = -\frac{\mu}{\sigma^2} + \frac{ab}{\tau^2},$$
$$g = -\frac{b}{\tau^2},$$
and $h$ is irrelevant since we are going to apply the result from Exercise 13. Clearly $a'$ and $b'$ are positive. We only need to check that $a'b' > (c/2)^2$. Notice that
$$a'b' = \frac{1}{4\sigma^2\tau^2} + \frac{a^2}{4\tau^4} = (c/2)^2 + \frac{1}{4\sigma^2\tau^2},$$
so the conditions of Exercise 13 are met.


15. (a) Let $Y = \sum_{j \ne i} X_j$. Since $X_1, \ldots, X_n$ are independent, we know that $Y$ is independent of $X_i$. Since $Y$ is the sum of independent normal random variables, it has a normal distribution. The mean and variance of $Y$ are easily seen to be $(n - 1)\mu$ and $(n - 1)\sigma^2$ respectively. Since $Y$ and $X_i$ are independent, all pairs of linear combinations of them have a bivariate normal distribution. Now write
$$X_i = 1X_i + 0Y,$$
$$\overline{X}_n = \frac{1}{n}X_i + \frac{1}{n}Y.$$
Clearly, both $X_i$ and $\overline{X}_n$ have mean $\mu$, and we already know that $X_i$ has variance $\sigma^2$ while $\overline{X}_n$ has variance $\sigma^2/n$. The correlation can be computed from the covariance of the two linear combinations:
$$\mathrm{Cov}\left(1X_i + 0Y,\ \frac{1}{n}X_i + \frac{1}{n}Y\right) = \frac{1}{n}\sigma^2.$$
The correlation is then $(\sigma^2/n)/[\sigma^2\sigma^2/n]^{1/2} = 1/n^{1/2}$.

(b) The conditional distribution of $X_i$ given $\overline{X}_n = \overline{x}_n$ is a normal distribution with mean equal to
$$\mu + \frac{1}{n^{1/2}}\cdot\frac{\sigma}{\sigma/n^{1/2}}(\overline{x}_n - \mu) = \overline{x}_n.$$
The conditional variance is
$$\left(1 - \frac{1}{n}\right)\sigma^2.$$


5.11 Supplementary Exercises 


Solutions to Exercises 


1. Let $g_1(x|p)$ be the conditional p.f. of $X$ given $P = p$, which is the binomial p.f. with parameters $n$ and $p$. Let $f_2(p)$ be the marginal p.d.f. of $P$, which is the beta p.d.f. with parameters 1 and 1, also known as the uniform p.d.f. on the interval $[0, 1]$. According to the law of total probability for random variables, the marginal p.f. of $X$ is
$$f_1(x) = \int_0^1 g_1(x|p)f_2(p)\,dp = \int_0^1 \binom{n}{x}p^x(1 - p)^{n-x}\,dp = \binom{n}{x}\frac{x!(n - x)!}{(n + 1)!} = \frac{1}{n + 1},$$
for $x = 0, \ldots, n$. In the above, we used Theorem 5.8.1 and the fact that $\Gamma(k + 1) = k!$ for each integer $k$.
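The closed form can be confirmed with exact rational arithmetic. This is a small sketch under the assumption that the beta integral is evaluated by the factorial formula used above; the function name `marginal_pf` and the choice $n = 7$ are illustrative only.

```python
from fractions import Fraction
from math import comb, factorial

def marginal_pf(n, x):
    """Pr(X = x) when X given P = p is binomial(n, p) and P is uniform on [0, 1]."""
    # Beta integral: int_0^1 p^x (1 - p)^(n - x) dp = x! (n - x)! / (n + 1)!
    beta_integral = Fraction(factorial(x) * factorial(n - x), factorial(n + 1))
    return comb(n, x) * beta_integral

n = 7  # illustrative choice, not taken from the exercise
values = [marginal_pf(n, x) for x in range(n + 1)]
print(values)
```

Every entry equals $1/(n+1)$, so the marginal distribution of $X$ is uniform on the integers $0, \ldots, n$.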


2. The random variable $U = 3X + 2Y - 6Z$ has the normal distribution with mean 0 and variance $3^2 + 2^2 + 6^2 = 49$. Therefore, $Z = U/7$ has the standard normal distribution. The required probability is
$$\Pr(U < -7) = \Pr(Z < -1) = 1 - \Phi(1) = .1587.$$

3. Since $\mathrm{Var}(X) = E(X) = \lambda_1$ and $\mathrm{Var}(Y) = E(Y) = \lambda_2$, it follows that $\lambda_1 + \lambda_2 = 5$. Hence, $X + Y$ has the Poisson distribution with mean 5 and
$$\Pr(X + Y < 2) = \Pr(X + Y = 0) + \Pr(X + Y = 1) = \exp(-5) + 5\exp(-5) = .0067 + .0337 = .0404.$$
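The Poisson arithmetic in Exercise 3 is easy to reproduce. The helper below is a minimal sketch (the name `poisson_pf` is an assumption of this example) that evaluates the Poisson p.f. directly from its definition.

```python
import math

def poisson_pf(x, mean):
    """Poisson p.f. at the nonnegative integer x."""
    return math.exp(-mean) * mean**x / math.factorial(x)

mean_sum = 5.0  # mean of X + Y, as found in the solution
prob_less_than_2 = poisson_pf(0, mean_sum) + poisson_pf(1, mean_sum)
print(round(prob_less_than_2, 4))
```

The same helper can be reused for the table lookups in the later Poisson exercises of this section.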


4. It can be found from the table of the standard normal distribution that 116 must be .84 standard deviations to the left of the mean and 328 must be 1.28 standard deviations to the right of the mean. Hence,
$$\mu - .84\sigma = 116,$$
$$\mu + 1.28\sigma = 328.$$
Solving these equations, we obtain $\mu = 200$ and $\sigma = 100$, $\sigma^2 = 10{,}000$.


5. The event $\{\overline{X} < 1/2\}$ can occur only if all four observations are 0, which has probability $(\exp(-\lambda))^4$, or three of the observations are 0 and the other is 1, which has probability $4(\lambda\exp(-\lambda))(\exp(-\lambda))^3$. Hence, the total probability is as given in this exercise.


6. If $X$ has the exponential distribution with parameter $\beta$, then
$$.25 = \Pr(X > 1000) = \exp(-\beta(1000)).$$
Hence, $\beta = \dfrac{\log 4}{1000}$ and $E(X) = \dfrac{1}{\beta} = \dfrac{1000}{\log 4}$.

7. It follows from Exercise 18 of Sec. 4.9 that
$$E[(X - \mu)^3] = E(X^3) - 3\mu\sigma^2 - \mu^3.$$
Because of the symmetry of the normal distribution with respect to $\mu$, the left side of this relation is 0. Hence,
$$E(X^3) = 3\mu\sigma^2 + \mu^3.$$


8. $\overline{X}$ and $\overline{Y}$ have independent normal distributions with the same mean $\mu$, and $\mathrm{Var}(\overline{X}) = 144/16 = 9$, $\mathrm{Var}(\overline{Y}) = 400/25 = 16$. Hence, $\overline{X} - \overline{Y}$ has the normal distribution with mean 0 and variance $9 + 16 = 25$. Thus, $Z = (\overline{X} - \overline{Y})/5$ has the standard normal distribution. It follows that the required probability is $\Pr(|Z| < 1) = .6826$.


9. The number of men that arrive during the one-minute period has the Poisson distribution with mean 2. The number of women is independent of the number of men and has the Poisson distribution with mean 1. Therefore, the total number of people $X$ that arrive has the Poisson distribution with mean 3. From the table in the back of the book it is found that
$$\Pr(X \le 4) = .0498 + .1494 + .2240 + .2240 + .1680 = .8152.$$

10. Let $\psi(t)$ denote the m.g.f. of each $X_i$. Then
$$\psi_Y(t) = E(\exp(tY)) = E[\exp(tX_1 + \cdots + tX_N)]$$
$$= E\{E[\exp(tX_1 + \cdots + tX_N) \mid N]\}$$
$$= E\{[\psi(t)]^N\}$$
$$= \sum_{x=0}^{\infty}[\psi(t)]^x\,\frac{\exp(-\lambda)\lambda^x}{x!}$$
$$= \exp(-\lambda)\sum_{x=0}^{\infty}\frac{[\lambda\psi(t)]^x}{x!}$$
$$= \exp(-\lambda)\exp(\lambda\psi(t)) = \exp\{\lambda[\psi(t) - 1]\}.$$

11. The probability that at least one of the two children will be successful on a given Sunday is $(1/3) + (1/5) - (1/3)(1/5) = 7/15$. Therefore, from the geometric distribution, the expected number of Sundays until a successful launch is achieved is $15/7$.

12. For any positive integer $n$, the event $X > n$ will occur if and only if the first $n$ tosses are either all heads or all tails. Therefore,
$$\Pr(X > n) = \left(\frac{1}{2}\right)^n + \left(\frac{1}{2}\right)^n = \left(\frac{1}{2}\right)^{n-1},$$
and, for $n = 2, 3, \ldots$,
$$\Pr(X = n) = \Pr(X > n - 1) - \Pr(X > n) = \left(\frac{1}{2}\right)^{n-2} - \left(\frac{1}{2}\right)^{n-1} = \left(\frac{1}{2}\right)^{n-1}.$$
Hence,
$$f(x) = \begin{cases} \left(\frac{1}{2}\right)^{x-1} & \text{for } x = 2, 3, 4, \ldots, \\ 0 & \text{otherwise.} \end{cases}$$

13. By the Poisson approximation, the distribution of $X$ is approximately Poisson with mean $120(1/36) = 10/3$. The probability that such a Poisson random variable equals 3 is $\exp(-10/3)(10/3)^3/3! = 0.2202$. (The actual binomial probability is 0.2229.)

14. It was shown in Sec. 3.9 that the p.d.f.'s of $Y_1$, $Y_n$, and $W$, respectively, are as follows:
$$g_1(y) = n(1 - y)^{n-1} \quad \text{for } 0 < y < 1,$$
$$g_n(y) = ny^{n-1} \quad \text{for } 0 < y < 1,$$
$$h_1(w) = n(n - 1)w^{n-2}(1 - w) \quad \text{for } 0 < w < 1.$$
Each of these is the p.d.f. of a beta distribution. For $g_1$, $\alpha = 1$ and $\beta = n$. For $g_n$, $\alpha = n$ and $\beta = 1$. For $h_1$, $\alpha = n - 1$ and $\beta = 2$.

15. (a) $\Pr(T_1 > t) = \Pr(X = 0)$, where $X$ is the number of occurrences between time 0 and time $t$. Since $X$ has the Poisson distribution with mean $5t$, it follows that $\Pr(T_1 > t) = \exp(-5t)$. Hence, $T_1$ has the exponential distribution with parameter $\beta = 5$.


(b) $T_k$ is the sum of $k$ i.i.d. random variables, each of which has the exponential distribution given in part (a). Therefore, the distribution of $T_k$ is a gamma distribution with parameters $\alpha = k$ and $\beta = 5$.

(c) Let $X_i$ denote the time following the $i$th occurrence until the $(i + 1)$st occurrence. Then the random variables $X_1, \ldots, X_{k-1}$ are i.i.d., each of which has the exponential distribution given in part (a). Since $t$ is measured in hours in that distribution, the required probability is
$$\Pr\left(X_i > \frac{1}{3} \text{ for } i = 1, \ldots, k - 1\right) = (\exp(-5/3))^{k-1}.$$

16. We can express $T_5$ as $T_1 + V$, where $V$ is the time required after one of the components has failed until the other four have failed. By the memoryless property of the exponential distribution, we know that $T_1$ and $V$ are independent. Therefore,
$$\mathrm{Cov}(T_1, T_5) = \mathrm{Cov}(T_1, T_1 + V) = \mathrm{Cov}(T_1, T_1) + \mathrm{Cov}(T_1, V) = \mathrm{Var}(T_1) + 0 = \frac{1}{25\beta^2},$$
since $T_1$ has the exponential distribution with parameter $5\beta$.

17. $$\Pr(X_1 > kX_2) = \int_0^\infty \Pr(X_1 > kX_2 \mid X_2 = x)f_2(x)\,dx = \int_0^\infty \exp(-\beta_1 kx)\,\beta_2\exp(-\beta_2 x)\,dx = \frac{\beta_2}{k\beta_1 + \beta_2}.$$

18. Since the sample size is small relative to the size of the population, the distribution of the number $X$ of people in the sample who are watching will have essentially the binomial distribution with $n = 200$ and $p = 15000/500000 = .03$, even if sampling is done without replacement. This binomial distribution is closely approximated by the Poisson distribution with mean $\lambda = np = 6$. Hence, from the table in the back of the text,
$$\Pr(X < 4) = .0025 + .0149 + .0446 + .0892 = .1512.$$

19. It follows from Eq. (5.3.8) that
$$\mathrm{Var}(\overline{X}) = \frac{1}{n^2}\mathrm{Var}(X) = \frac{p(1 - p)}{n}\cdot\frac{T - n}{T - 1},$$
where $T$ is the population size, $p$ is the proportion of persons in the population who have the characteristic, and $n = 100$. Since $p(1 - p) \le 1/4$ for $0 \le p \le 1$ and $(T - n)/(T - 1) \le 1$ for all values of $T$, it follows that
$$\mathrm{Var}(\overline{X}) \le \frac{1}{400}.$$
Hence, the standard deviation is $\le 1/20 = .05$.

20. Consider the event that fewer than $r$ successes are obtained in the first $n$ Bernoulli trials. The left side represents the probability of this event in terms of the binomial distribution. But the event also means that more than $n$ trials are going to be required in order to obtain $r$ successes, which means that more than $n - r$ failures are going to be obtained before $r$ successes are obtained. The right side expresses this probability in terms of the negative binomial distribution.

21. Consider the event that there are at least $k$ occurrences between time 0 and time $t$. The number $X$ of occurrences in this interval has the specified Poisson distribution, so the left side represents the probability of this event. But the event also means that the total waiting time $Y$ until the $k$th event occurs is $\le t$. It follows from part (b) of Exercise 15 that $Y$ has the specified gamma distribution. Hence, the right side also expresses the probability of this same event.
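The identity in Exercise 21 can also be checked numerically. The sketch below compares a Poisson tail probability with a gamma c.d.f. obtained by midpoint-rule integration. The values of `k`, `rate`, and `t` are illustrative assumptions rather than values from the exercise, and the helper names are hypothetical.

```python
import math

def poisson_tail(k, mean):
    """Pr(X >= k) for X Poisson with the given mean."""
    return 1.0 - sum(math.exp(-mean) * mean**x / math.factorial(x) for x in range(k))

def gamma_cdf(t, alpha, beta, steps=20000):
    """Pr(Y <= t) for Y gamma(alpha, beta), integer alpha >= 1, by the midpoint rule."""
    h = t / steps
    const = beta**alpha / math.factorial(alpha - 1)
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += const * y ** (alpha - 1) * math.exp(-beta * y)
    return total * h

k, rate, t = 3, 5.0, 0.8  # illustrative values, not taken from the exercise
left = poisson_tail(k, rate * t)   # at least k occurrences in [0, t]
right = gamma_cdf(t, k, rate)      # waiting time for the k-th occurrence is <= t
print(left, right)
```

The two numbers agree to within the accuracy of the numerical integration, as the argument above predicts.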
22. It follows from the definition of $h(x)$ that
$$\int_0^x h(t)\,dt = \int_0^x \frac{f(t)}{1 - F(t)}\,dt = -\log[1 - F(x)].$$
Therefore,
$$\exp\left[-\int_0^x h(t)\,dt\right] = 1 - F(x).$$


23. (a) It follows from Theorem 5.9.2 that
$$\rho(X_i, X_j) = \frac{\mathrm{Cov}(X_i, X_j)}{[\mathrm{Var}(X_i)\mathrm{Var}(X_j)]^{1/2}} = \frac{-10p_ip_j}{10[p_i(1 - p_i)p_j(1 - p_j)]^{1/2}} = -\left(\frac{p_i}{1 - p_i}\cdot\frac{p_j}{1 - p_j}\right)^{1/2}.$$
(b) $\rho(X_i, X_j)$ is most negative when $p_i$ and $p_j$ have their largest values; i.e., for $i = 1$ ($p_1 = .4$) and $j = 2$ ($p_2 = .3$).
(c) $\rho(X_i, X_j)$ is closest to 0 when $p_i$ and $p_j$ have their smallest values; i.e., for $i = 3$ ($p_3 = .2$) and $j = 4$ ($p_4 = .1$).


24. It follows from Theorem 5.10.5 that $X_1 - 3X_2$ will have the normal distribution with mean $\mu_1 - 3\mu_2$ and variance $\sigma_1^2 + 9\sigma_2^2 - 6\rho\sigma_1\sigma_2$.


25. Since $X$ has a normal distribution and the conditional distribution of $Y$ given $X$ is also normal with a mean that is a linear function of $X$ and constant variance, it follows that $X$ and $Y$ jointly have a bivariate normal distribution. Hence, $Y$ has a normal distribution. From Eq. (5.10.6), $2X - 3 = \mu_2 + \rho\sigma_2X$. Hence, $\mu_2 = -3$ and $\rho\sigma_2 = 2$. Also, $(1 - \rho^2)\sigma_2^2 = 12$. Therefore $\sigma_2^2 = 16$ and $\rho = 1/2$. Thus, $Y$ has the normal distribution with mean $-3$ and variance 16, and $\rho(X, Y) = 1/2$.


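26. We shall use the relation
$$E(X_1^2X_2) = E[E(X_1^2X_2 \mid X_2)] = E[X_2E(X_1^2 \mid X_2)].$$
But
$$E(X_1^2 \mid X_2) = \mathrm{Var}(X_1 \mid X_2) + [E(X_1 \mid X_2)]^2 = (1 - \rho^2)\sigma_1^2 + \left(\mu_1 + \rho\frac{\sigma_1}{\sigma_2}X_2\right)^2.$$
Hence,
$$X_2E(X_1^2 \mid X_2) = (1 - \rho^2)\sigma_1^2X_2 + \mu_1^2X_2 + 2\rho\mu_1\frac{\sigma_1}{\sigma_2}X_2^2 + \left(\rho\frac{\sigma_1}{\sigma_2}\right)^2X_2^3.$$
The required value $E(X_1^2X_2)$ is the expectation of this quantity. But since $X_2$ has the normal distribution with $E(X_2) = 0$, it follows that $E(X_2^2) = \sigma_2^2$ and $E(X_2^3) = 0$. Hence,
$$E(X_1^2X_2) = 2\rho\mu_1\sigma_1\sigma_2.$$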


Chapter 6 


Large Random Samples 


6.1 Introduction


Solutions to Exercises 


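1. The p.d.f. of $Y = X_1 + X_2$ is
$$g(y) = \begin{cases} y & \text{if } 0 < y \le 1, \\ 2 - y & \text{if } 1 < y < 2, \\ 0 & \text{otherwise.} \end{cases}$$
It follows easily from the fact that $\overline{X}_2 = Y/2$ that the p.d.f. of $\overline{X}_2$ is
$$h(x) = \begin{cases} 4x & \text{if } 0 < x \le 1/2, \\ 4 - 4x & \text{if } 1/2 < x < 1, \\ 0 & \text{otherwise.} \end{cases}$$
We easily compute
$$\Pr(|X_1 - 0.5| < 0.1) = 0.6 - 0.4 = 0.2,$$
$$\Pr(|\overline{X}_2 - 0.5| < 0.1) = \int_{0.4}^{0.5} 4x\,dx + \int_{0.5}^{0.6}(4 - 4x)\,dx = 2(0.5^2 - 0.4^2) + 4(0.6 - 0.5) - 2(0.6^2 - 0.5^2) = 0.36.$$
The reason that $\overline{X}_2$ has higher probability of being close to 0.5 is that its p.d.f. is much higher near 0.5 than is the uniform p.d.f. of $X_1$ (twice as high right at 0.5).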
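Both probabilities can be verified exactly. The sketch below uses rational arithmetic and the fact that the trapezoid rule is exact on any interval where the p.d.f. is linear; the helper names are illustrative and not part of the original solution.

```python
from fractions import Fraction

def pdf_mean_of_two(x):
    """p.d.f. of the average of two independent uniform(0, 1) random variables."""
    if 0 < x <= Fraction(1, 2):
        return 4 * x
    if Fraction(1, 2) < x < 1:
        return 4 - 4 * x
    return Fraction(0)

def trapezoid(f, a, b):
    """Integral of f over [a, b]; exact when f is linear on [a, b]."""
    return (f(a) + f(b)) / 2 * (b - a)

lo, mid, hi = Fraction(2, 5), Fraction(1, 2), Fraction(3, 5)
prob_single = hi - lo  # the uniform p.d.f. equals 1 on (0, 1)
prob_mean = trapezoid(pdf_mean_of_two, lo, mid) + trapezoid(pdf_mean_of_two, mid, hi)
print(prob_single, prob_mean)
```

The interval is split at 0.5 because the triangular p.d.f. changes slope there.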


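2. The distribution of $\overline{X}_n$ is (by Corollary 5.6.2) the normal distribution with mean $\mu$ and variance $\sigma^2/n$. By Theorem 5.6.6,
$$\Pr(|\overline{X}_n - \mu| \le c) = \Pr(\overline{X}_n - \mu \le c) - \Pr(\overline{X}_n - \mu < -c) = \Phi\left(\frac{c}{\sigma/n^{1/2}}\right) - \Phi\left(\frac{-c}{\sigma/n^{1/2}}\right). \qquad \text{(S.6.1)}$$
As $n \to \infty$, $c/(\sigma/n^{1/2}) \to \infty$ and $-c/(\sigma/n^{1/2}) \to -\infty$. It follows from Property 3.3.2 of all c.d.f.'s that (S.6.1) goes to 1 as $n \to \infty$.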


3. To do this by hand we would have to add all of the binomial probabilities corresponding to $W = 80, \ldots, 120$. Most statistical software will do this calculation automatically. The result is 0.9964. It looks like the probability is increasing to 1.

6.2 The Law of Large Numbers


Commentary 

The discussion of the strong law of large numbers at the end of the section might be suitable only for the 
more mathematically inclined students. 

Solutions to Exercises 


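1. Let $\epsilon > 0$. We need to show that
$$\lim_{n\to\infty}\Pr(|X_n - 0| \ge \epsilon) = 0. \qquad \text{(S.6.2)}$$
Since $X_n \ge 0$, we have $|X_n - 0| \ge \epsilon$ if and only if $X_n \ge \epsilon$. By the Markov inequality, $\Pr(X_n \ge \epsilon) \le \mu_n/\epsilon$. Since $\lim_{n\to\infty}\mu_n = 0$, Eq. (S.6.2) holds.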
2. By the Markov inequality,
$$E(X) \ge 10\Pr(X \ge 10) = 2.$$


3. By the Chebyshev inequality,
$$\mathrm{Var}(X) \ge 9\Pr(|X - \mu| \ge 3) = 9\Pr(X \le 7 \text{ or } X \ge 13) = 9(0.2 + 0.3) = \frac{9}{2}.$$


4. Consider a distribution which is concentrated on the three points $\mu$, $\mu + 3\sigma$, and $\mu - 3\sigma$. Let $\Pr(X = \mu) = p_1$, $\Pr(X = \mu + 3\sigma) = p_2$, and $\Pr(X = \mu - 3\sigma) = p_3$. If we are to have $E(X) = \mu$, then we must have $p_2 = p_3$. Let $p$ denote the common value of $p_2$ and $p_3$. Then $p_1 = 1 - 2p$, because $p_1 + p_2 + p_3 = 1$. Now
$$\mathrm{Var}(X) = E[(X - \mu)^2] = 9\sigma^2(p) + 9\sigma^2(p) + 0(1 - 2p) = 18\sigma^2p.$$
Since we must have $\mathrm{Var}(X) = \sigma^2$, we must choose $p = 1/18$. Therefore, the only distribution which is concentrated on the three points $\mu$, $\mu + 3\sigma$, and $\mu - 3\sigma$, and for which $E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$, is the one with $p_1 = 8/9$ and $p_2 = p_3 = 1/18$. It can now be verified that for this distribution we have
$$\Pr(|X - \mu| \ge 3\sigma) = \Pr(X = \mu + 3\sigma) + \Pr(X = \mu - 3\sigma) = \frac{1}{18} + \frac{1}{18} = \frac{1}{9}.$$
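The three-point distribution can be checked exactly, confirming that the tail probability equals the Chebyshev bound $\sigma^2/(3\sigma)^2$. This is a minimal sketch; the choices $\mu = 0$ and $\sigma = 1$ are illustrative assumptions, since the argument holds for any $\mu$ and any $\sigma > 0$.

```python
from fractions import Fraction

mu, sigma = Fraction(0), Fraction(1)  # illustrative values only
p = Fraction(1, 18)
# Support points mapped to their probabilities.
dist = {mu: 1 - 2 * p, mu + 3 * sigma: p, mu - 3 * sigma: p}

mean = sum(x * w for x, w in dist.items())
variance = sum((x - mean) ** 2 * w for x, w in dist.items())
tail = sum(w for x, w in dist.items() if abs(x - mean) >= 3 * sigma)
chebyshev_bound = variance / (3 * sigma) ** 2
print(mean, variance, tail, chebyshev_bound)
```

Here the tail probability and the bound coincide at $1/9$, so the Chebyshev inequality holds with equality for this distribution.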


5. By the Chebyshev inequality,
$$\Pr(|\overline{X}_n - \mu| \le 2\sigma) \ge 1 - \frac{1}{4n}.$$
Therefore, we must have $1 - \dfrac{1}{4n} \ge 0.99$ or $n \ge 25$.
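The sample-size arithmetic in Exercises 5 and 6 follows one pattern: solve $1 - \sigma^2/(n\epsilon^2) \ge$ target for $n$. The sketch below encodes that pattern with exact fractions. The function name `chebyshev_min_n` is hypothetical, and for Exercise 6 the variance 4 and half-width $1/2$ are inferred from the bound $1 - 16/n$ rather than quoted from the exercise.

```python
from fractions import Fraction
from math import ceil

def chebyshev_min_n(variance, eps, target):
    """Smallest n with 1 - variance / (n * eps**2) >= target."""
    return ceil(variance / (eps**2 * (1 - target)))

# Exercise 5: eps = 2 * sigma, so only the ratio variance / eps^2 = 1/4 matters.
n_ex5 = chebyshev_min_n(Fraction(1), Fraction(2), Fraction(99, 100))
# Exercise 6: variance 4 and eps = 1/2 are inferred from the bound 1 - 16/n.
n_ex6 = chebyshev_min_n(Fraction(4), Fraction(1, 2), Fraction(8, 10))
print(n_ex5, n_ex6)
```

Using `Fraction` avoids floating-point rounding pushing the ceiling up by one when the threshold is an exact integer.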


6. By the Chebyshev inequality,
$$\Pr(6 \le \overline{X}_n \le 7) = \Pr\left(|\overline{X}_n - \mu| \le \frac{1}{2}\right) \ge 1 - \frac{16}{n}.$$
Therefore, we must have $1 - \dfrac{16}{n} \ge 0.8$ or $n \ge 80$.

7. By the Markov inequality,
$$\Pr(|X - \mu| \ge t) = \Pr(|X - \mu|^k \ge t^k) \le \frac{E(|X - \mu|^k)}{t^k}.$$

8. (a) In this example $E(Q_n) = 0.3$ and $\mathrm{Var}(Q_n) = (0.3)(0.7)/n = 0.21/n$. Therefore,
$$\Pr(0.2 \le Q_n \le 0.4) = \Pr(|Q_n - E(Q_n)| \le 0.1) \ge 1 - \frac{0.21}{n(0.01)} = 1 - \frac{21}{n}.$$
Therefore, we must have $1 - \dfrac{21}{n} \ge 0.75$ or $n \ge 84$.

(b) Let $X_n$ denote the total number of items in the sample that are of poor quality. Then $X_n = nQ_n$ and
$$\Pr(0.2 \le Q_n \le 0.4) = \Pr(0.2n \le X_n \le 0.4n).$$
Since $X_n$ has a binomial distribution with parameters $n$ and $p = 0.3$, the value of this probability can be determined for various values of $n$ from the table of the binomial distribution given in the back of the book. For $n = 15$, it is found that
$$\Pr(0.2n \le X_n \le 0.4n) = \Pr(3 \le X_n \le 6) = 0.7419.$$
For $n = 20$, it is found that
$$\Pr(0.2n \le X_n \le 0.4n) = \Pr(4 \le X_n \le 8) = 0.7796.$$
Since this probability must be at least 0.75, we must have $n = 20$, although it is possible that some value between $n = 15$ and $n = 20$ will also satisfy the required condition.

9. $E(Z_n) = n^2\cdot\dfrac{1}{n} + 0\cdot\left(1 - \dfrac{1}{n}\right) = n$. Hence, $\lim_{n\to\infty}E(Z_n) = \infty$. Also, for any given $\epsilon > 0$,
$$\Pr(|Z_n| < \epsilon) = \Pr(Z_n = 0) = 1 - \frac{1}{n}.$$
Hence, $\lim_{n\to\infty}\Pr(|Z_n| < \epsilon) = 1$, which means that $Z_n \xrightarrow{p} 0$.

10. By Exercise 5 of Sec. 4.3,
$$E[(Z_n - b)^2] = [E(Z_n) - b]^2 + \mathrm{Var}(Z_n).$$
Therefore, the limit of the left side will be 0 if and only if the limit of each of the two terms on the right side is 0. Moreover, $\lim_{n\to\infty}[E(Z_n) - b]^2 = 0$ if and only if $\lim_{n\to\infty}E(Z_n) = b$.

11. Suppose that the sequence $Z_1, Z_2, \ldots$ converges to $b$ in the quadratic mean. Since
$$|Z_n - b| \le |Z_n - E(Z_n)| + |E(Z_n) - b|,$$
then for any value of $\epsilon > 0$,
$$\Pr(|Z_n - b| < \epsilon) \ge \Pr(|Z_n - E(Z_n)| + |E(Z_n) - b| < \epsilon) = \Pr(|Z_n - E(Z_n)| < \epsilon - |E(Z_n) - b|).$$
By Exercise 10, we know that $\lim_{n\to\infty}E(Z_n) = b$. Therefore, for sufficiently large values of $n$, it will be true that $\epsilon - |E(Z_n) - b| > 0$. Hence, by the Chebyshev inequality, the final probability will be at least as large as
$$1 - \frac{\mathrm{Var}(Z_n)}{[\epsilon - |E(Z_n) - b|]^2}.$$
Again, by Exercise 10,
$$\lim_{n\to\infty}\mathrm{Var}(Z_n) = 0 \quad\text{and}\quad \lim_{n\to\infty}[\epsilon - |E(Z_n) - b|]^2 = \epsilon^2.$$
Therefore,
$$\lim_{n\to\infty}\Pr(|Z_n - b| < \epsilon) \ge \lim_{n\to\infty}\left\{1 - \frac{\mathrm{Var}(Z_n)}{[\epsilon - |E(Z_n) - b|]^2}\right\} = 1,$$
which means that $Z_n \xrightarrow{p} b$.


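12. We know that $E(\overline{X}_n) = \mu$ and $\mathrm{Var}(\overline{X}_n) = \sigma^2/n$. Therefore, $\lim_{n\to\infty}E(\overline{X}_n) = \mu$ and $\lim_{n\to\infty}\mathrm{Var}(\overline{X}_n) = 0$. The desired result now follows from Exercise 10.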


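13. (a) For any value of $n$ large enough so that $1/n < \epsilon$, we have
$$\Pr(|Z_n| < \epsilon) = \Pr\left(Z_n = \frac{1}{n}\right) = 1 - \frac{1}{n^2}.$$
Therefore, $\lim_{n\to\infty}\Pr(|Z_n| < \epsilon) = 1$, which means that $Z_n \xrightarrow{p} 0$.

(b) $E(Z_n) = \dfrac{1}{n}\left(1 - \dfrac{1}{n^2}\right) + n\left(\dfrac{1}{n^2}\right) = \dfrac{2}{n} - \dfrac{1}{n^3}$. Therefore, $\lim_{n\to\infty}E(Z_n) = 0$. It follows from Exercise 10 that the only possible value for the constant $c$ is $c = 0$, and there will be convergence to this value if and only if $\lim_{n\to\infty}\mathrm{Var}(Z_n) = 0$. But
$$E(Z_n^2) = \frac{1}{n^2}\left(1 - \frac{1}{n^2}\right) + n^2\cdot\frac{1}{n^2} = 1 + \frac{1}{n^2} - \frac{1}{n^4}.$$
Hence, $\mathrm{Var}(Z_n) = 1 + \dfrac{1}{n^2} - \dfrac{1}{n^4} - \left(\dfrac{2}{n} - \dfrac{1}{n^3}\right)^2$ and $\lim_{n\to\infty}\mathrm{Var}(Z_n) = 1$.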
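The moment calculations in this solution can be tabulated exactly. The sketch below assumes the two-point distribution implied by the formulas above, namely $\Pr(Z_n = 1/n) = 1 - 1/n^2$ and $\Pr(Z_n = n) = 1/n^2$; the function name `moments` is illustrative.

```python
from fractions import Fraction

def moments(n):
    """Exact mean and variance of Z_n under the assumed two-point distribution."""
    small, large = Fraction(1, n), Fraction(n)
    p_large = Fraction(1, n * n)
    mean = small * (1 - p_large) + large * p_large
    second = small**2 * (1 - p_large) + large**2 * p_large
    return mean, second - mean**2

table = {n: moments(n) for n in (10, 100, 1000)}
for n, (mean, var) in table.items():
    print(n, float(mean), float(var))
```

The means shrink toward 0 while the variances approach 1, which is why $Z_n$ converges to 0 in probability but not in quadratic mean.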

14. Let $X$ have p.f. equal to $f$. Assume that $\mathrm{Var}(X) > 0$ (otherwise it is surely less than 1/4). First, suppose that $X$ has only two possible values, 0 and 1. Let $p = \Pr(X = 1)$. Then $E(X) = E(X^2) = p$ and $\mathrm{Var}(X) = p - p^2$. The largest possible value of $p - p^2$ occurs when $p = 1/2$, and the value is 1/4. So $\mathrm{Var}(X) \le 1/4$ if $X$ only has the two possible values 0 and 1. For the remainder of the proof, we shall show that if $X$ has any possible values strictly between 0 and 1, then there is another random variable $Y$ taking only the values 0 and 1 and with $\mathrm{Var}(Y) \ge \mathrm{Var}(X)$. So, assume that $X$ takes at least one value strictly between 0 and 1. Let $\mu = E(X)$. Without loss of generality, assume that one of those possible values is between 0 and $\mu$. (Otherwise replace $X$ by $1 - X$, which has the same variance.) Let $x_1, x_2, \ldots$ be the values such that $x_i \le \mu$ and $f(x_i) > 0$. Define a new random variable
$$X^* = \begin{cases} 0 & \text{if } X \le \mu, \\ X & \text{if } X > \mu. \end{cases}$$
The p.f. of $X^*$ is
$$f^*(x) = \begin{cases} f(x) & \text{for all } x > \mu, \\ \sum_i f(x_i) & \text{for } x = 0, \\ 0 & \text{otherwise.} \end{cases}$$
The mean of $X^*$ is $\mu^* = \mu - \sum_i x_if(x_i)$. The mean of $X^{*2}$ is $E(X^2) - \sum_i x_i^2f(x_i)$. So, the variance of $X^*$ is
$$\mathrm{Var}(X^*) = E(X^2) - \sum_i x_i^2f(x_i) - \left[\mu - \sum_i x_if(x_i)\right]^2$$
$$= \mathrm{Var}(X) - \sum_i x_i^2f(x_i) + 2\mu\sum_i x_if(x_i) - \left[\sum_i x_if(x_i)\right]^2. \qquad \text{(S.6.3)}$$
Since $x_i \le \mu$ for each $i$, we have
$$-\sum_i x_i^2f(x_i) + 2\mu\sum_i x_if(x_i) \ge \sum_i x_i^2f(x_i). \qquad \text{(S.6.4)}$$
Let $t = \sum_i f(x_i) > 0$. Then
$$g(x) = \begin{cases} f(x)/t & \text{for } x \in \{x_1, x_2, \ldots\}, \\ 0 & \text{otherwise,} \end{cases}$$
is a p.f. Let $Z$ be a random variable whose p.f. is $g$. Then
$$E(Z) = \frac{1}{t}\sum_i x_if(x_i),$$
$$\mathrm{Var}(Z) = \frac{1}{t}\sum_i x_i^2f(x_i) - \frac{1}{t^2}\left[\sum_i x_if(x_i)\right]^2.$$
Since $\mathrm{Var}(Z) \ge 0$ and $t \le 1$, we have
$$\sum_i x_i^2f(x_i) \ge \frac{1}{t}\left[\sum_i x_if(x_i)\right]^2 \ge \left[\sum_i x_if(x_i)\right]^2.$$
Combine this with (S.6.3) and (S.6.4) to see that $\mathrm{Var}(X^*) \ge \mathrm{Var}(X)$. If $f^*(x) > 0$ for some $x$ strictly between 0 and 1, replace $X^*$ by $1 - X^*$ and repeat the above process to produce the desired random variable $Y$.

15. We need to prove that, for every $\epsilon > 0$,
$$\lim_{n\to\infty}\Pr(|g(Z_n) - g(b)| < \epsilon) = 1.$$
Let $\epsilon > 0$. Since $g$ is continuous at $b$, there exists $\delta$ such that $|z - b| < \delta$ implies that $|g(z) - g(b)| < \epsilon$. Also, since $Z_n \xrightarrow{p} b$, we know that
$$\lim_{n\to\infty}\Pr(|Z_n - b| < \delta) = 1.$$
But $\{|Z_n - b| < \delta\} \subset \{|g(Z_n) - g(b)| < \epsilon\}$. So
$$\Pr(|g(Z_n) - g(b)| < \epsilon) \ge \Pr(|Z_n - b| < \delta). \qquad \text{(S.6.5)}$$
Since the right side of (S.6.5) goes to 1 as $n \to \infty$, so does the left side.


16. The argument here is similar to that given in Exercise 15. Let $\epsilon > 0$. Since $g$ is continuous at $(b, c)$, there exists $\delta$ such that $\sqrt{(z - b)^2 + (y - c)^2} < \delta$ implies $|g(z, y) - g(b, c)| < \epsilon$. Also, $|z - b| < \delta/\sqrt{2}$ and $|y - c| < \delta/\sqrt{2}$ together imply $\sqrt{(z - b)^2 + (y - c)^2} < \delta$. Let $B_n = \{|Z_n - b| < \delta/\sqrt{2}\}$ and $C_n = \{|Y_n - c| < \delta/\sqrt{2}\}$. It follows that
$$B_n \cap C_n \subset \{|g(Z_n, Y_n) - g(b, c)| < \epsilon\}. \qquad \text{(S.6.6)}$$
We can write
$$\Pr(B_n \cap C_n) = 1 - \Pr([B_n \cap C_n]^c) = 1 - \Pr(B_n^c \cup C_n^c) \ge 1 - \Pr(B_n^c) - \Pr(C_n^c) = \Pr(B_n) + \Pr(C_n) - 1.$$
Combining this with (S.6.6), we get
$$\Pr(|g(Z_n, Y_n) - g(b, c)| < \epsilon) \ge \Pr(B_n) + \Pr(C_n) - 1.$$
Since $Z_n \xrightarrow{p} b$ and $Y_n \xrightarrow{p} c$, we know that both $\Pr(B_n)$ and $\Pr(C_n)$ go to 1 as $n \to \infty$. Hence $\Pr(|g(Z_n, Y_n) - g(b, c)| < \epsilon)$ goes to 1 as well.


17. (a) The mean of $X$ is $np$, and the mean of $Y$ is $np/k$. Since $Z = kY$, the mean of $Z$ is $knp/k = np$.

(b) The variance of $X$ is $np(1 - p)$, and the variance of $Y$ is $n(p/k)(1 - p/k)$. So, the variance of $Z = kY$ is $k^2$ times the variance of $Y$, i.e.,
$$\mathrm{Var}(Z) = k^2n(p/k)(1 - p/k) = knp(1 - p/k).$$
If $p$ is small, then both $1 - p$ and $1 - p/k$ will be close to 1, and $\mathrm{Var}(Z)$ is approximately $knp$ while the variance of $X$ is approximately $np$.

(c) In Fig. 6.1, each bar has height equal to 0.01 times a binomial random variable with parameters 100 and the probability that $X_1$ is in the interval under the bar. In Fig. 6.2, each bar has height equal to 0.02 times a binomial random variable with parameters 100 and the probability that $X_1$ is in the interval under the bar. The bars in Fig. 6.2 have approximately one-half of the probability of the bars in Fig. 6.1, but their heights have been multiplied by 2. By part (b), we expect the heights in Fig. 6.2 to have approximately twice the variance of the heights in Fig. 6.1.


18. The result is trivial if the m.g.f. is infinite for all $s > 0$. So, assume that the m.g.f. is finite for at least some $s > 0$. For every $t$ and every $s > 0$ such that the m.g.f. is finite, we can write
$$\Pr(X > t) = \Pr(\exp(sX) > \exp(st)) \le \frac{E(\exp(sX))}{\exp(st)} = \psi(s)\exp(-st),$$
where the inequality follows from the Markov inequality. Since $\Pr(X > t) \le \psi(s)\exp(-st)$ for every $s$, it follows that $\Pr(X > t) \le \min_s \psi(s)\exp(-st)$.

19. (a) First, insert s from (6.2.15) into the expression in (6.2.14). We get


_ Il+u 1— := 
n [loin + (—* +u) tog {OP -»)} -o | = ae) : 


up+1—p 
The last term can be rewritten as 
{ up+1—p \ 
— log ¢ 1 — ———___— 
(l+u)p+l—p 
The result is then 


n (<= +u) log { -»)} +log{(1+u)p+1 -}| : 


= —log(p) + log {(1 + u)p+1—p}. 


This is easily recognized as n times the logarithm of (6.2.16). 
(b) For all u, q is given by (6.2.16). For u = 0, q = (1—p)“-?)/”. Since 0 < 1—p < land (1—p)/p > 0, 
we have 0 < q < 1 when u=0. For general wu, let x = p(1+u) +1-—~p and rewrite 


“fF 1- + 
log(q) = log(p + ) + log Ce pps) 


Since x is a linear increasing function of u, if we show that log(q) is decreasing in x, then q is 
decreasing in u. The derivative of log(q) with respect to x is 
1 1- Zi 
iP sige p)(p + x) 
u(p+ax)  p x 

The first term is negative, and the second term is negative at u = 0 (x = 1). To be sure that the 
sum is always negative, examine the second term more closely. The derivative of the second term 
is 


1 ( 1 -) —1 
— | — - -] = —— . < 0 
D\prr (p+ 2x) 
Hence, the derivative is always negative, and q is less than 1 for all wu. 
20. We already have the m.g.f. of $Y$ in (6.2.9). We can multiply it by $e^{-sn/10}$ and minimize over $s > 0$. Before minimizing, take the logarithm:
$$\log[\psi(s)e^{-sn/10}] = n\left\{\log(1/2) + \log[\exp(s) + 1] - \frac{3s}{5}\right\}. \qquad \text{(S.6.7)}$$
The derivative of this logarithm is
$$n\left\{\frac{\exp(s)}{\exp(s) + 1} - \frac{3}{5}\right\}.$$
The derivative is 0 at $s = \log(3/2)$, and the second derivative is positive there, so $s = \log(3/2)$ provides the minimum. The minimum value of (S.6.7) is $-0.02014n$, and the Chernoff bound is $\exp(-0.02014n) = (0.98)^n$ for $\Pr(Y > n/10)$. Similarly, for $\Pr(-Y > n/10)$, we need to minimize
$$\log[\psi(-s)e^{-sn/10}] = n\left\{\log(1/2) + \log[\exp(-s) + 1] + \frac{2s}{5}\right\}. \qquad \text{(S.6.8)}$$
The derivative is
$$n\left\{\frac{-\exp(-s)}{\exp(-s) + 1} + \frac{2}{5}\right\},$$
which equals 0 at $s = \log(3/2)$. The minimum value of (S.6.8) is again $-0.02014n$, and the Chernoff bound for the entire probability is $2(0.98)^n$, a bit smaller than in the example.
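The minimization in (S.6.7) can be checked numerically. The sketch below evaluates the per-observation log bound at $s = \log(3/2)$ and compares it with a crude grid search; the function name and the grid range are illustrative assumptions, and the expression coded is the bracketed term of (S.6.7) as reconstructed from the stated derivative and minimum.

```python
import math

def log_bound_per_n(s):
    """(1/n) * log[psi(s) * exp(-s*n/10)], the bracketed term in (S.6.7)."""
    return math.log(0.5) + math.log(math.exp(s) + 1.0) - 3.0 * s / 5.0

s_star = math.log(1.5)
min_value = log_bound_per_n(s_star)
# Crude grid check that no s in (0, 3] does better than s_star.
grid_min = min(log_bound_per_n(i / 1000.0) for i in range(1, 3001))
base = math.exp(min_value)  # the Chernoff bound is base ** n
print(min_value, grid_min, base)
```

The grid minimum never falls below the value at $s = \log(3/2)$, and the resulting base is close to 0.98.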


21. (a) The m.g.f. of the exponential distribution with parameter 1 is $1/(1 - s)$ for $s < 1$, hence the m.g.f. of $Y_n$ is $1/(1 - s)^n$ for $s < 1$. The Chernoff bound is the minimum (over $s > 0$) of $e^{-nus}/(1 - s)^n$. The logarithm of this is $-n[us + \log(1 - s)]$, which is minimized at $s = (u - 1)/u$, which is positive if and only if $u > 1$. The Chernoff bound is $[u\exp(1 - u)]^n$.

(b) If $u < 1$, then the expression in Theorem 6.2.7 is minimized over $s > 0$ near $s = 0$, which provides a useless bound of 1 for $\Pr(Y_n > nu)$.


22. (a) The numbers $(k - 1)k/2$ for $k = 1, 2, \ldots$ form a strictly increasing sequence starting at 0. Hence, every integer $n$ falls between a unique pair of these numbers. So, $k_n$ is the value of $k$ such that $n$ is larger than $(k - 1)k/2$ but no larger than $k(k + 1)/2$.

(b) Clearly $j_n$ is the excess of $n$ over the lower bound in part (a), hence $j_n$ runs from 1 up to the difference between the bounds, which is easily seen to be $k_n$.

(c) The intervals where $h_n$ equals 1 are defined to be disjoint for $j_n = 1, \ldots, k_n$, and they cover the whole interval $[0, 1)$. Hence, for each $x$, $h_n(x) = 1$ for one and only one of these intervals, which correspond to the values of $n$ between the bounds in part (a).

(d) For every $x \in [0, 1)$, $h_n(x) = 1$ for one $n$ between the bounds in part (a). Since there are infinitely many values of $k_n$, $h_n(x) = 1$ infinitely often for every $x \in [0, 1)$, and $\Pr(X \in [0, 1)) = 1$.

(e) For every $\epsilon > 0$, $|Z_n - 0| > \epsilon$ whenever $Z_n = 1$. Since $\Pr(Z_n = 1 \text{ infinitely often}) = 1$, the probability is 1 that $Z_n$ fails to converge to 0. Hence, the probability is 0 that $Z_n$ does converge to 0.

(f) Notice that $h_n(x) = 1$ on an interval of length $1/k_n$. Hence, for each $n$, $\Pr(|Z_n - 0| > \epsilon) = 1/k_n$, which goes to 0. So, $Z_n \xrightarrow{p} 0$.


23. Each $Z_n$ has the Bernoulli distribution with parameter $1/k_n$, hence $E[(Z_n - 0)^2] = 1/k_n$, which goes to 0.


24. (a) By construction, $\{Z_n \text{ converges to } 0\} = \{X > 0\}$. Since $\Pr(X > 0) = 1$, we have $Z_n$ converges to 0 with probability 1.

(b) $E[(Z_n - 0)^2] = E(Z_n^2) = n^4/n = n^3$, which does not go to 0.


6.3 The Central Limit Theorem


Commentary 


The delta method is introduced as a practical application of the central limit theorem. The examples of the 
delta method given in this section are designed to help pave the way for some approximate confidence interval 
calculations that arise in Sec. 8.5. The delta method also helps in calculating the approximate distributions 
of some summaries of simulations that arise in Sec. 12.2. This section ends with two theoretical topics that 
might be of interest only to the more mathematically inclined students. The first is a central limit theorem 
for random variables that don’t have identical distributions. The second is an outline of the proof of the i.i.d. 
central limit theorem that makes use of moment generating functions.

Solutions to Exercises

1. The length of rope produced in one hour $X$ has a mean of $60 \times 4 = 240$ feet and a standard deviation of $60^{1/2} \times 5 = 38.73$ inches, which is 3.23 feet. The probability that $X \ge 250$ is approximately the probability that a normal random variable with mean 240 and standard deviation 3.23 is at least 250, namely $1 - \Phi([250 - 240]/3.23) = 1 - \Phi(3.1) = 0.001$.


2. The total number of people $X$ from the suburbs attending the concert can be regarded as the sum of 1200 independent random variables, each of which has a Bernoulli distribution with parameter $p = 1/4$. Therefore, the distribution of $X$ will be approximately a normal distribution with mean $1200(1/4) = 300$ and variance $1200(1/4)(3/4) = 225$. If we let $Z = (X - 300)/15$, then the distribution of $Z$ will be approximately a standard normal distribution. Hence,
$$\Pr(X < 270) = \Pr(Z < -2) \approx 1 - \Phi(2) = 0.0227.$$


3. Since the variance of a Poisson distribution is equal to the mean, the number of defects on any bolt has mean 5 and variance 5. Therefore, the distribution of the average number $\overline{X}_n$ on the 125 bolts will be approximately the normal distribution with mean 5 and variance $5/125 = 1/25$. If we let $Z = (\overline{X}_n - 5)/(1/5)$, then the distribution of $Z$ will be approximately a standard normal distribution. Hence,
$$\Pr(\overline{X}_n < 5.5) = \Pr(Z < 2.5) \approx \Phi(2.5) = 0.9938.$$


4. The distribution of $Z = \sqrt{n}(\overline{X}_n - \mu)/3$ will be approximately the standard normal distribution. Therefore,

$$\Pr(|\overline{X}_n - \mu| < 0.3) = \Pr(|Z| < 0.1\sqrt{n}) \approx 2\Phi(0.1\sqrt{n}) - 1.$$

But $2\Phi(0.1\sqrt{n}) - 1 \geq 0.95$ if and only if $\Phi(0.1\sqrt{n}) \geq (1 + 0.95)/2 = 0.975$, and this inequality is satisfied
if and only if $0.1\sqrt{n} \geq 1.96$ or, equivalently, $n \geq 384.16$. Hence, the smallest possible value of $n$ is 385.


5. The distribution of the proportion $\overline{X}_n$ of defective items in the sample will be approximately the
normal distribution with mean 0.1 and variance $(0.1)(0.9)/n = 0.09/n$. Therefore, the distribution of
$Z = \sqrt{n}(\overline{X}_n - 0.1)/0.3$ will be approximately the standard normal distribution. It follows that

$$\Pr(\overline{X}_n < 0.13) = \Pr(Z < 0.1\sqrt{n}) \approx \Phi(0.1\sqrt{n}).$$

For this value to be at least 0.99, we must have $0.1\sqrt{n} \geq 2.327$ or, equivalently, $n \geq 541.5$. Hence, the
smallest possible value of $n$ is 542.
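
The sample sizes in Exercises 4 and 5 both come from solving $0.1\sqrt{n} \geq z$ for a standard normal quantile $z$. A minimal sketch of that calculation follows, assuming Python 3.8 or later; the helper `min_n` is a hypothetical name introduced only for this illustration, and it uses exact quantiles rather than the rounded table values 1.96 and 2.327.

```python
from math import ceil
from statistics import NormalDist


def min_n(z_needed: float, scale: float) -> int:
    """Smallest integer n satisfying scale * sqrt(n) >= z_needed."""
    return ceil((z_needed / scale) ** 2)


std_normal = NormalDist()

# Exercise 4: need Phi(0.1 * sqrt(n)) >= 0.975.
n_ex4 = min_n(std_normal.inv_cdf(0.975), 0.1)

# Exercise 5: need Phi(0.1 * sqrt(n)) >= 0.99.
n_ex5 = min_n(std_normal.inv_cdf(0.99), 0.1)

print(n_ex4, n_ex5)
```

Both results match the values obtained above from the normal table.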


6. The distribution of the total number of times $X$ that the target is hit will be approximately the normal
distribution with mean $10(0.3) + 15(0.2) + 20(0.1) = 8$ and variance $10(0.3)(0.7) + 15(0.2)(0.8) +
20(0.1)(0.9) = 6.3$. Therefore, the distribution of $Z = (X - 8)/\sqrt{6.3} = (X - 8)/2.51$ will be approximately
a standard normal distribution. It follows that

$$\Pr(X \geq 12) = \Pr(Z \geq 1.5936) \approx 1 - \Phi(1.5936) = 0.0555.$$


7. The mean of a random digit $X$ is

$$E(X) = \frac{1}{10}(0 + 1 + \cdots + 9) = 4.5.$$

Also,

$$E(X^2) = \frac{1}{10}(0^2 + 1^2 + \cdots + 9^2) = \frac{1}{10}\cdot\frac{(9)(10)(19)}{6} = 28.5.$$

Therefore, $\mathrm{Var}(X) = 28.5 - (4.5)^2 = 8.25$. The distribution of the average $\overline{X}_n$ of 16 random digits
will therefore be approximately the normal distribution with mean 4.5 and variance $8.25/16 = 0.5156$.
Hence, the distribution of

$$Z = \frac{\overline{X}_n - 4.5}{\sqrt{0.5156}} = \frac{\overline{X}_n - 4.5}{0.7181}$$

will be approximately a standard normal distribution. It follows that

$$\Pr(4 \leq \overline{X}_n \leq 6) = \Pr(-0.6963 \leq Z \leq 2.0888) = \Phi(2.0888) - [1 - \Phi(0.6963)] = 0.9816 - 0.2431 = 0.7385.$$


8. The distribution of the total amount X of 36 drinks will be approximately the normal distribution with 
mean 36(2) = 72 and variance 36(1/4) = 9. Therefore, the distribution of Z = (X — 72)/3 will be 
approximately a standard normal distribution. It follows that 


$$\Pr(X < 63) = \Pr(Z < -3) = 1 - \Phi(3) = 0.0013.$$


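9. (a) By Eq. (6.2.4),

$$\Pr\left(|\overline{X}_n - \mu| \geq \frac{\sigma}{4}\right) \leq \frac{\sigma^2}{n}\cdot\frac{16}{\sigma^2} = \frac{16}{25}.$$

Therefore, $\Pr\left(|\overline{X}_n - \mu| \leq \sigma/4\right) \geq 1 - 16/25 = 0.36$.
(b) The distribution of

$$Z = \frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} = \frac{5}{\sigma}(\overline{X}_n - \mu)$$

will be approximately a standard normal distribution. Therefore,

$$\Pr\left(|\overline{X}_n - \mu| \leq \frac{\sigma}{4}\right) = \Pr\left(|Z| \leq \frac{5}{4}\right) \approx 2\Phi(1.25) - 1 = 0.7887.$$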


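10. (a) As in part (a) of Exercise 9,

$$\Pr\left(|\overline{X}_n - \mu| \leq \frac{\sigma}{4}\right) \geq 1 - \frac{16}{n}.$$

Now $1 - 16/n \geq 0.99$ if and only if $n \geq 1600$.
(b) As in part (b) of Exercise 9,

$$\Pr\left(|\overline{X}_n - \mu| \leq \frac{\sigma}{4}\right) = \Pr\left(|Z| \leq \frac{\sqrt{n}}{4}\right) \approx 2\Phi\left(\frac{\sqrt{n}}{4}\right) - 1.$$

Now $2\Phi(\sqrt{n}/4) - 1 \geq 0.99$ if and only if $\Phi(\sqrt{n}/4) \geq 0.995$. This inequality will be satisfied if and
only if $\sqrt{n}/4 \geq 2.576$ (the 0.995 quantile of the standard normal distribution) or, equivalently, $n \geq 106.2$.
Therefore, the smallest possible sample size is 107.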


11. For a student chosen at random, the number of parents $X$ who will attend the graduation ceremony has
mean $\mu = 0/3 + 1/3 + 2/3 = 1$ and variance $\sigma^2 = E[(X - \mu)^2] = (0-1)^2/3 + (1-1)^2/3 + (2-1)^2/3 = 2/3$.
Therefore, the distribution of the total number of parents $W$ who attend the ceremony will
be approximately the normal distribution with mean $(600)(1) = 600$ and variance $600(2/3) = 400$.
Therefore, the distribution of $Z = (W - 600)/20$ will be approximately a standard normal distribution.
It follows that

$$\Pr(W \leq 650) = \Pr(Z \leq 2.5) \approx \Phi(2.5) = 0.9938.$$


12. The m.g.f. of the binomial distribution with parameters $n$ and $p_n$ is $\psi_n(t) = (p_n \exp(t) + 1 - p_n)^n$. If
$np_n \to \lambda$,

$$\lim_{n\to\infty} \psi_n(t) = \lim_{n\to\infty}\left(1 + \frac{np_n}{n}[\exp(t) - 1]\right)^n.$$

This converges to $\exp(\lambda[e^t - 1])$, which is the m.g.f. of the Poisson distribution with mean $\lambda$.


13. We are asking for the asymptotic distribution of $g(\overline{X}_n)$, where $g(x) = x^3$. The distribution of $\overline{X}_n$ is
normal with mean $\theta$ and variance $\sigma^2/n$. According to the delta method, the asymptotic distribution of
$g(\overline{X}_n)$ should be the normal distribution with mean $g(\theta) = \theta^3$ and variance $(\sigma^2/n)[g'(\theta)]^2 = 9\theta^4\sigma^2/n$.


14. First, note that $Y_n = \sum_{i=1}^n X_i^2/n$ has asymptotically the normal distribution with mean $\sigma^2$ and variance
$2\sigma^4/n$. Here, we have used the fact that $E(X_i^2) = \sigma^2$ and $E(X_i^4) = 3\sigma^4$, so that $\mathrm{Var}(X_i^2) = 2\sigma^4$.

(a) Let $g(x) = 1/x$. Then $g'(x) = -1/x^2$. So, the asymptotic distribution of $g(Y_n)$ is the normal
distribution with mean $1/\sigma^2$ and variance $(2\sigma^4/n)/\sigma^8 = 2/[n\sigma^4]$.

(b) Let $h(\mu) = 2\mu^2$. If the asymptotic mean of $Y_n$ is $\mu$, the asymptotic variance of $Y_n$ is $h(\mu)/n$. So,
a variance stabilizing transformation is

$$\alpha(\mu) = \int_a^\mu \frac{dx}{2^{1/2}x} = \frac{\log(\mu)}{2^{1/2}},$$

where we have taken $a = 1$ to make the integral finite. So the asymptotic distribution of
$\log(Y_n)/2^{1/2}$ is the normal distribution with mean $2\log(\sigma)/2^{1/2}$ and variance $1/n$.


15. (a) Clearly, $Y_n \leq y$ if and only if $X_i \leq y$ for $i = 1,\ldots,n$. Hence,

$$\Pr(Y_n \leq y) = \Pr(X_1 \leq y)^n = \begin{cases} (y/\theta)^n & \text{if } 0 < y < \theta, \\ 0 & \text{if } y \leq 0, \\ 1 & \text{if } y \geq \theta. \end{cases}$$

(b) The c.d.f. of $Z_n$ is, for $z < 0$,

$$\Pr(Z_n \leq z) = \Pr(Y_n \leq \theta + z/n) = (1 + z/[n\theta])^n. \qquad \text{(S.6.9)}$$

Since $Z_n \leq 0$, the c.d.f. is 1 for $z \geq 0$. According to Theorem 5.3.3, the expression in (S.6.9)
converges to $\exp(z/\theta)$.

(c) Let $\alpha(y) = y^2$. Then $\alpha'(y) = 2y$. We have $n(Y_n - \theta)$ converging in distribution to the c.d.f. in
part (b). The delta method says that, for $\theta > 0$, $n(Y_n^2 - \theta^2)/[2\theta]$ converges in distribution to the
same c.d.f.
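
The limit claimed for (S.6.9) can be checked numerically. The sketch below evaluates the finite-$n$ c.d.f. for increasing $n$ and compares it with $\exp(z/\theta)$; the values $\theta = 2$ and $z = -1$ are arbitrary choices made for illustration and are not taken from the exercise.

```python
from math import exp


def cdf_zn(z: float, n: int, theta: float) -> float:
    """Finite-n value (1 + z/(n*theta))**n from (S.6.9), valid for -n*theta <= z < 0."""
    return (1 + z / (n * theta)) ** n


theta, z = 2.0, -1.0          # illustrative values only
limit = exp(z / theta)        # claimed limiting c.d.f. at z

for n in (10, 1_000, 100_000):
    print(n, cdf_zn(z, n, theta), limit)
```

The printed values approach the limit from below as $n$ grows.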
6.4 The Correction for Continuity


Solutions to Exercises 


1. The mean of $X_i$ is 1 and the mean of $X_i^2$ is 1.5. So, the variance of $X_i$ is 0.5. The central limit theorem
says that $Y = X_1 + \cdots + X_{30}$ has approximately the normal distribution with mean 30 and variance
15. We want the probability that $Y \leq 33$. Using the correction for continuity, we would assume that $Y$
has the normal distribution with mean 30 and variance 15 and compute the probability that $Y \leq 33.5$.
This is $\Phi([33.5 - 30]/15^{1/2}) = \Phi(0.904) = 0.8169$.


2. (a) $E(X) = 15(.3) = 4.5$ and $\sigma_X = [(15)(.3)(.7)]^{1/2} = 1.775$. Therefore,

$$\Pr(X = 4) = \Pr(3.5 \leq X \leq 4.5) = \Pr\left(\frac{3.5 - 4.5}{1.775} \leq Z \leq 0\right) = \Pr(-.5634 \leq Z \leq 0) \approx \Phi(.5634) - .5 \approx .214.$$

(b) The exact value is found from the table of binomial probabilities ($n = 15$, $p = 0.3$, $k = 4$) to be
.2186.
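
Both parts of Exercise 2 are easy to reproduce without tables. The following sketch computes the exact binomial probability with `math.comb` and the continuity-corrected normal approximation with `statistics.NormalDist` (Python 3.8 or later assumed); the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 15, 0.3, 4

# Exact binomial probability Pr(X = 4).
exact = comb(n, k) * p**k * (1 - p) ** (n - k)

# Normal approximation with the correction for continuity:
# Pr(X = 4) is approximated by Pr(3.5 <= X <= 4.5).
normal = NormalDist(n * p, sqrt(n * p * (1 - p)))
approx = normal.cdf(k + 0.5) - normal.cdf(k - 0.5)

print(f"exact = {exact:.4f}, approximation = {approx:.4f}")
```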


3. In the notation of Example 2,

$$\Pr(H > 495) = \Pr(H \geq 495.5) = \Pr\left(Z \geq \frac{495.5 - 450}{15}\right) \approx 1 - \Phi(3.033) \approx .0012.$$


4. We follow the notation of the solution to Exercise 2 of Sec. 6.3:

$$\Pr(X < 270) = \Pr(X \leq 269.5) = \Pr\left(Z \leq \frac{269.5 - 300}{15}\right) \approx 1 - \Phi(2.033) \approx .0210.$$


5. Let $X$ denote the total number of defects in the sample. Then $X$ has a Poisson distribution with mean
$5(125) = 625$, so $\sigma_X$ is $(625)^{1/2} = 25$. Hence,

$$\Pr(\overline{X}_n < 5.5) = \Pr[X < 125(5.5)] = \Pr(X < 687.5).$$

Since this final probability is just the value that would be used with the correction for continuity, the
probability to be found here is the same as that originally found in Exercise 3 of Sec. 6.3.


6. We follow the notation of the solution to Exercise 6 of Sec. 6.3:

$$\Pr(X \geq 12) = \Pr(X \geq 11.5) = \Pr\left(Z \geq \frac{11.5 - 8}{2.51}\right) \approx 1 - \Phi(1.394) \approx .082.$$


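7. Let $S$ denote the sum of the 16 digits. Then

$$E(S) = 16(4.5) = 72 \quad\text{and}\quad \sigma_S = [16(8.25)]^{1/2} = 11.49.$$

Hence,

$$\Pr(4 \leq \overline{X}_n \leq 6) = \Pr(64 \leq S \leq 96) = \Pr(63.5 \leq S \leq 96.5) = \Pr\left(\frac{63.5 - 72}{11.49} \leq Z \leq \frac{96.5 - 72}{11.49}\right) \approx \Phi(2.132) - \Phi(-.740) \approx .9835 - .2296 = .7539.$$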
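
The exact distribution of the sum of 16 random digits can be obtained by repeated convolution, which gives a reference value for judging the two approximations (.7385 without the correction in Sec. 6.3 and .7539 with it). This is only a sketch; the function name `pmf_sum_of_digits` is introduced here for illustration.

```python
def pmf_sum_of_digits(count: int) -> list:
    """p.f. of the sum of `count` independent digits, each uniform on 0..9."""
    pmf = [1.0]                          # the empty sum equals 0 with probability 1
    for _ in range(count):
        new = [0.0] * (len(pmf) + 9)
        for s, prob in enumerate(pmf):
            for digit in range(10):
                new[s + digit] += prob / 10
        pmf = new
    return pmf


pmf = pmf_sum_of_digits(16)

# Pr(4 <= average <= 6) is the same event as 64 <= S <= 96.
exact = sum(pmf[64:97])
print(f"exact Pr(64 <= S <= 96) = {exact:.4f}")
```

Comparing the printed value with the two normal approximations shows how much the correction for continuity helps for a sum of only 16 terms.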
6.5 Supplementary Exercises


Solutions to Exercises 


1. By the central limit theorem, the distribution of $X$ is approximately normal with mean $(120)(1/6) = 20$
and standard deviation $[120(1/6)(5/6)]^{1/2} = 4.082$. Let $Z = (X - 20)/4.082$. Then from the table of
the standard normal distribution we find that $\Pr(|Z| \leq 1.96) = .95$. Hence, $k = (1.96)(4.082) = 8.00$.


2. Because of the property of the Poisson distribution described in Theorem 5.4.4, the random variable X 
can be thought of as the sum of a large number of i.i.d. random variables, each of which has a Poisson 
distribution. Hence, the central limit theorem (Lindeberg and Lévy) implies the desired result. It can 
also be shown that the m.g.f. of X converges to the m.g.f. of the standard normal distribution. 


3. By the previous exercise, $X$ has approximately a normal distribution with mean 10 and standard
deviation $(10)^{1/2} = 3.162$. Thus, without the correction for continuity,

$$\Pr(8 \leq X \leq 12) = \Pr\left(\frac{8 - 10}{3.162} \leq Z \leq \frac{12 - 10}{3.162}\right) \approx \Phi(.6325) - \Phi(-.6325) = .473.$$

With the correction for continuity, we find

$$\Pr(7.5 \leq X \leq 12.5) = \Pr\left(-\frac{2.5}{3.162} \leq Z \leq \frac{2.5}{3.162}\right) \approx \Phi(.7906) - \Phi(-.7906) = .571.$$

The exact probability is found from the Poisson table to be

$$(.1126) + (.1251) + (.1251) + (.1137) + (.0948) = .571.$$

Thus, the approximation with the correction for continuity is almost perfect.
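
The three numbers compared in Exercise 3 can be recomputed directly. The sketch below sums the Poisson p.f. for the exact value and uses `statistics.NormalDist` (Python 3.8 or later assumed) for the two approximations; the variable names are illustrative.

```python
from math import exp, factorial, sqrt
from statistics import NormalDist

lam = 10

# Exact Poisson probability Pr(8 <= X <= 12).
exact = sum(exp(-lam) * lam**k / factorial(k) for k in range(8, 13))

# Normal approximations with mean 10 and variance 10.
normal = NormalDist(lam, sqrt(lam))
plain = normal.cdf(12) - normal.cdf(8)            # no correction
corrected = normal.cdf(12.5) - normal.cdf(7.5)    # correction for continuity

print(f"exact={exact:.4f} plain={plain:.4f} corrected={corrected:.4f}")
```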


4. If $X$ has p.d.f. $f(x)$, then

$$E(X^k) = \int_0^\infty x^k f(x)\,dx \geq \int_t^\infty x^k f(x)\,dx \geq t^k\int_t^\infty f(x)\,dx = t^k \Pr(X \geq t).$$


A similar proof holds if X has a discrete distribution. 


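5. The central limit theorem says that $\overline{X}_n$ has approximately the normal distribution with mean $p$ and
variance $p(1-p)/n$. A variance stabilizing transformation will be

$$\alpha(x) = \int_a^x [p(1-p)]^{-1/2}\,dp.$$

To perform this integral, transform to $z = p^{1/2}$, that is, $p = z^2$. Then $dp = 2z\,dz$ and

$$\alpha(x) = \int_{a^{1/2}}^{x^{1/2}} \frac{2\,dz}{(1 - z^2)^{1/2}}.$$

Next, transform so that $z = \sin(w)$ or $w = \arcsin(z)$. Then $dz = \cos(w)\,dw$ and

$$\alpha(x) = \int_{\arcsin a^{1/2}}^{\arcsin x^{1/2}} 2\,dw = 2\arcsin x^{1/2},$$

where we have chosen $a = 0$. Since a constant multiple of a variance stabilizing transformation is again
variance stabilizing, the transformation can be taken to be $\alpha(x) = \arcsin(x^{1/2})$.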
6. According to the central limit theorem, $\overline{X}_n$ has approximately the normal distribution with mean $\theta$
and variance $\theta^2/n$. A variance stabilizing transformation will be

$$\alpha(x) = \int_a^x \theta^{-1}\,d\theta = \log(x),$$

where we have used $a = 1$.


7. Let $F_n$ be the c.d.f. of $X_n$. The most direct proof is to show that $\lim_{n\to\infty} F_n(x) = F(x)$ for every point at
which $F$ is continuous. Since $F$ is the c.d.f. of an integer-valued distribution, the continuity points are
all non-integer values of $x$ together with those integer values of $x$ to which $F$ assigns probability 0. It is
clear that it suffices to prove that $\lim_{n\to\infty} F_n(x) = F(x)$ for every non-integer $x$, because continuity of $F$
from the right and the fact that $F$ is nondecreasing will take care of the integers with zero probability.
For each non-integer $x$, let $m_x$ be the largest integer such that $m_x < x$. Then


where the convergence follows because the sums are finite. 


8. We know that $\Pr(X_n = m) = \binom{k}{m}p_n^m(1 - p_n)^{k-m}$ for $m = 0,\ldots,k$ and all $n$. We also know that

$$\lim_{n\to\infty}\binom{k}{m}p_n^m(1 - p_n)^{k-m} = \binom{k}{m}p^m(1 - p)^{k-m},$$

for all $m$. By Exercise 7, $X_n$ converges in distribution to the binomial distribution with parameters $k$
and $p$.


9. Let $X_1,\ldots,X_{16}$ be the times required to serve the 16 customers. The parameter of the exponential
distribution is 1/3. According to Theorem 5.7.8, the mean and variance of each $X_i$ are 3 and 9
respectively. Let $Y = \sum_{k=1}^{16} X_k$ be the total time. The central limit theorem approximation to the
distribution of $Y$ is the normal distribution with mean $16 \times 3 = 48$ and variance $16 \times 9 = 144$. The
approximate probability that $Y > 60$ is

$$1 - \Phi\left(\frac{60 - 48}{(144)^{1/2}}\right) = 1 - \Phi(1) = 0.1587.$$

The actual distribution of $Y$ is the gamma distribution with parameters 16 and 1/3. Using the gamma
c.d.f., the probability is 0.1565.
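
The gamma tail probability in Exercise 9 can be computed without special functions, because a gamma distribution with integer shape $k$ and rate $\beta$ satisfies $\Pr(Y > t) = \Pr(N \leq k - 1)$, where $N$ has the Poisson distribution with mean $\beta t$. The sketch below uses this standard identity; the function name `erlang_tail` is an assumption made for the illustration.

```python
from math import exp, factorial
from statistics import NormalDist


def erlang_tail(shape: int, rate: float, t: float) -> float:
    """Pr(Y > t) for a gamma distribution with integer shape and given rate.

    Uses the identity Pr(Y > t) = Pr(Poisson(rate * t) <= shape - 1).
    """
    m = rate * t
    return sum(exp(-m) * m**k / factorial(k) for k in range(shape))


exact = erlang_tail(16, 1 / 3, 60)               # gamma with parameters 16 and 1/3
approx = 1 - NormalDist(48, 12).cdf(60)          # central limit theorem value

print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```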


10. The number of defects in 2000 square feet has the Poisson distribution with mean $2000 \times 0.01 = 20$.
The central limit theorem approximation is the normal distribution with mean 20 and variance 20.
Without correction for continuity, the approximate probability of at least 15 defects is

$$1 - \Phi\left(\frac{15 - 20}{(20)^{1/2}}\right) = 1 - \Phi(-1.1180) = 0.8682.$$

With the continuity correction, we get

$$1 - \Phi\left(\frac{14.5 - 20}{(20)^{1/2}}\right) = 1 - \Phi(-1.2298) = 0.8906.$$

The actual Poisson probability is 0.8951.


11. (a) The gamma distribution with parameters $n$ and 3 is the distribution of the sum of $n$ i.i.d.
exponential random variables with parameter 3. If $n$ is large, the central limit theorem should apply
to approximate the distribution of the sum of $n$ exponentials.

(b) The mean and variance of each exponential random variable are 1/3 and 1/9 respectively. The
distribution of the sum of $n$ of these has approximately the normal distribution with mean $n/3$
and variance $n/9$.

12. (a) The negative binomial distribution with parameters $n$ and 0.2 is the distribution of the sum of $n$ i.i.d.
geometric random variables with parameter 0.2. If $n$ is large, the central limit theorem should
apply to approximate the distribution of the sum of $n$ geometrics.

(b) The mean and variance of each geometric random variable are $0.8/0.2 = 4$ and $0.8/(0.2)^2 = 20$.
The distribution of the sum of $n$ of these has approximately the normal distribution with mean
$4n$ and variance $20n$.
Chapter 7


Estimation 


7.1 Statistical Inference 


Commentary 


Many students find statistical inference much more difficult to comprehend than elementary probability 
theory. For this reason, many examples of statistical inference problems have been introduced in the early 
chapters of this text. This will give instructors the opportunity to point back to relatively easy-to-understand 
examples that the students have already learned as a preview of what is to come. In addition to the examples 
mentioned in Sec. 7.1, some additional examples are Examples 2.3.3-2.3.5, 3.6.9, 3.7.14, 3.7.18, 4.8.9-4.8.10, 
and 5.8.1—5.8.2. In addition, the discussion of M.S.E. and M.A.E. in Sec. 4.5 and the discussion of the variance 
of the sample mean in Sec. 6.2 contain inferential ideas. Most of these are examples of Bayesian inference 
because the most common part of a Bayesian inference is the calculation of a conditional distribution or a 
conditional mean. 


Solutions to Exercises 


1. The random variables of interest are the observables $X_1, X_2, \ldots$ and the hypothetically observable
(parameter) $P$. The $X_i$'s are i.i.d. Bernoulli with parameter $p$ given $P = p$.


2. The statistical inferences mentioned in Example 7.1.3 are computing the conditional distribution of $P$
given observed data, computing the conditional mean of $P$ given the data, and computing the M.S.E.
of predictions of $P$ both before and after observing data.


3. The random variables of interest are the observables $Z_1, Z_2, \ldots$, the times at which successive particles
hit the target, and $\beta$, the hypothetically observable (parameter) rate of the Poisson process. The hit
times occur according to a Poisson process with rate $\beta$ conditional on $\beta$. Other random variables of
interest are the observable inter-arrival times $Y_1 = Z_1$ and $Y_k = Z_k - Z_{k-1}$ for $k \geq 2$.


4. The random variables of interest are the observable heights $X_1,\ldots,X_n$, the hypothetically observable
mean (parameter) $\mu$, and the sample mean $\overline{X}_n$. The $X_i$'s are modeled as normal random variables with
mean $\mu$ and variance 9 given $\mu$.


5. The statement that the interval $(\overline{X}_n - 0.98, \overline{X}_n + 0.98)$ has probability 0.95 of containing $\mu$ is an
inference.


6. The random variables of interest are the observable number $X$ of Mexican-American grand jurors and
the hypothetically observable (parameter) $P$. The conditional distribution of $X$ given $P = p$ is the
binomial distribution with parameters 220 and $p$. Also, $P$ has the beta distribution with parameters $\alpha$
and $\beta$, which have not yet been specified.


7. The random variables of interest are $Y$, the hypothetically observable number of oocysts in $t$ liters, the
hypothetically observable indicators $X_1, X_2, \ldots$ of whether each oocyst is counted, $X$ the observable
count of oocysts, the probability (parameter) $p$ of each oocyst being counted, and the (parameter) $\lambda$
the rate of oocysts per liter. We model $Y$ as a Poisson random variable with mean $t\lambda$ given $\lambda$. We
model $X_1,\ldots,X_y$ as i.i.d. Bernoulli random variables with parameter $p$ given $p$ and given $Y = y$. We
define $X = X_1 + \cdots + X_Y$.


7.2 Prior and Posterior Distributions 


Commentary 


This section introduces some common terminology that is used in Bayesian inference. The concepts should all 
be familiar already to the students under other names. The prior distribution is just a marginal distribution 
while the posterior distribution is just a conditional distribution. The likelihood function might seem strange
since it is a conditional density for the data given $\theta$ but thought of as a function of $\theta$ after the data have
been observed.


Solutions to Exercises 


1. We still have $y = 16178$, the sum of the five observed values. The posterior distribution of $\theta$ is now the
gamma distribution with parameters 6 and 21178. So,

$$f(x_6 \mid x) = \int_0^\infty 7.518 \times 10^{23}\,\theta^5 \exp(-21178\theta)\,\theta\exp(-x_6\theta)\,d\theta = 7.518 \times 10^{23}\int_0^\infty \theta^6\exp(-\theta[21178 + x_6])\,d\theta = 7.518 \times 10^{23}\,\frac{\Gamma(7)}{(21178 + x_6)^7} = \frac{5.413 \times 10^{26}}{(21178 + x_6)^7},$$

for $x_6 > 0$. We can now compute $\Pr(X_6 > 3000 \mid x)$ as

$$\Pr(X_6 > 3000 \mid x) = \int_{3000}^\infty \frac{5.413 \times 10^{26}}{(21178 + x_6)^7}\,dx_6 = \frac{5.413 \times 10^{26}}{6 \times 24178^6} = 0.4516.$$
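
The predictive tail probability in Exercise 1 has a simple closed form: if the lifetime is exponential with parameter $\theta$ and the posterior of $\theta$ is the gamma distribution with parameters $\alpha$ and $\beta$, then $\Pr(X > t \mid x) = [\beta/(\beta + t)]^\alpha$, which is what the integral above reduces to. The sketch below evaluates it; the function name `predictive_tail` is introduced here for illustration only.

```python
def predictive_tail(alpha: float, beta: float, t: float) -> float:
    """Pr(X > t | data) for an exponential lifetime with a gamma(alpha, beta) posterior."""
    return (beta / (beta + t)) ** alpha


# Posterior parameters 6 and 21178 from the solution above.
prob = predictive_tail(6, 21178, 3000)

# Normalizing constant of the posterior p.d.f.: 21178^6 / Gamma(6), with Gamma(6) = 120.
const = 21178**6 / 120

print(f"Pr(X6 > 3000 | x) = {prob:.4f}, posterior constant = {const:.4g}")
```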


2. The joint p.f. of the eight observations is given by Eq. (7.2.11). Since $n = 8$ and $y = 2$ in this exercise,

$$f_n(x \mid \theta) = \theta^2(1 - \theta)^6.$$

Therefore,

$$\xi(0.1 \mid x) = \Pr(\theta = 0.1 \mid x) = \frac{\xi(0.1)f_n(x \mid 0.1)}{\xi(0.1)f_n(x \mid 0.1) + \xi(0.2)f_n(x \mid 0.2)} = \frac{(0.7)(0.1)^2(0.9)^6}{(0.7)(0.1)^2(0.9)^6 + (0.3)(0.2)^2(0.8)^6} = 0.5418.$$

It follows that $\xi(0.2 \mid x) = 1 - \xi(0.1 \mid x) = 0.4582$.
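
The posterior probabilities in Exercise 2 follow from Bayes' theorem on a two-point parameter space. A minimal sketch is given below; the helper `posterior` is a hypothetical name, and the prior probabilities 0.7 and 0.3 are the ones appearing in the displayed calculation.

```python
def posterior(prior: dict, likelihood) -> dict:
    """Posterior probabilities on a finite parameter set via Bayes' theorem."""
    joint = {theta: w * likelihood(theta) for theta, w in prior.items()}
    total = sum(joint.values())
    return {theta: w / total for theta, w in joint.items()}


prior = {0.1: 0.7, 0.2: 0.3}

# Likelihood for n = 8 observations with y = 2 defectives.
post = posterior(prior, lambda theta: theta**2 * (1 - theta) ** 6)

print({theta: round(w, 4) for theta, w in post.items()})
```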
3. Let $X$ denote the number of defects on the selected roll of tape. Then for any given value of $\lambda$, the p.f.
of $X$ is

$$f(x \mid \lambda) = \frac{\exp(-\lambda)\lambda^x}{x!} \quad \text{for } x = 0, 1, 2, \ldots.$$

Therefore,

$$\xi(1.0 \mid X = 3) = \Pr(\lambda = 1.0 \mid X = 3) = \frac{\xi(1.0)f(3 \mid 1.0)}{\xi(1.0)f(3 \mid 1.0) + \xi(1.5)f(3 \mid 1.5)}.$$

From the table of the Poisson distribution in the back of the book it is found that

$$f(3 \mid 1.0) = 0.0613 \quad\text{and}\quad f(3 \mid 1.5) = 0.1255.$$

Therefore, $\xi(1.0 \mid X = 3) = 0.2456$ and $\xi(1.5 \mid X = 3) = 1 - \xi(1.0 \mid X = 3) = 0.7544$.


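4. If $\alpha$ and $\beta$ denote the parameters of the gamma distribution, then we must have

$$\frac{\alpha}{\beta} = 10 \quad\text{and}\quad \frac{\alpha}{\beta^2} = 5.$$

Therefore, $\alpha = 20$ and $\beta = 2$. Hence, the prior p.d.f. of $\theta$ is as follows, for $\theta > 0$:

$$\xi(\theta) = \frac{2^{20}}{\Gamma(20)}\theta^{19}\exp(-2\theta).$$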


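5. If $\alpha$ and $\beta$ denote the parameters of the beta distribution, then we must have

$$\frac{\alpha}{\alpha + \beta} = \frac{1}{3} \quad\text{and}\quad \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} = \frac{2}{90}.$$

Since $\frac{\alpha}{\alpha + \beta} = \frac{1}{3}$, it follows that $\frac{\beta}{\alpha + \beta} = \frac{2}{3}$. Therefore,

$$\frac{\alpha\beta}{(\alpha + \beta)^2} = \frac{\alpha}{\alpha + \beta}\cdot\frac{\beta}{\alpha + \beta} = \frac{1}{3}\cdot\frac{2}{3} = \frac{2}{9}.$$

It now follows from the second equation that $\frac{2}{9(\alpha + \beta + 1)} = \frac{2}{90}$ and, hence, that $\alpha + \beta + 1 = 10$.
Therefore, $\alpha + \beta = 9$ and it follows from the first equation that $\alpha = 3$ and $\beta = 6$. Hence, the prior
p.d.f. of $\theta$ is as follows, for $0 < \theta < 1$:

$$\xi(\theta) = \frac{\Gamma(9)}{\Gamma(3)\Gamma(6)}\theta^2(1 - \theta)^5.$$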


6. The conditions of this exercise are precisely the conditions of Example 7.2.7 with $n = 8$ and $y = 3$.
Therefore, the posterior distribution of $\theta$ is a beta distribution with parameters $\alpha = 4$ and $\beta = 6$.

7. Since $f_n(x \mid \theta)$ is given by Eq. (7.2.11) with $n = 8$ and $y = 3$, then

$$f_n(x \mid \theta)\xi(\theta) = 2\theta^3(1 - \theta)^6.$$

When we compare this expression with Eq. (5.8.3), we see that it has the same form as the p.d.f. of a
beta distribution with parameters $\alpha = 4$ and $\beta = 7$. Therefore, this beta distribution is the posterior
distribution of $\theta$.
8. By Eq. (7.2.14),

$$\xi(\theta \mid x_1) \propto f(x_1 \mid \theta)\xi(\theta),$$

and by Eq. (7.2.15),

$$\xi(\theta \mid x_1, x_2) \propto f(x_2 \mid \theta)\xi(\theta \mid x_1).$$

Hence,

$$\xi(\theta \mid x_1, x_2) \propto f(x_1 \mid \theta)f(x_2 \mid \theta)\xi(\theta).$$

By continuing in this way, we find that

$$\xi(\theta \mid x_1, x_2, x_3) \propto f(x_3 \mid \theta)\xi(\theta \mid x_1, x_2) \propto f(x_1 \mid \theta)f(x_2 \mid \theta)f(x_3 \mid \theta)\xi(\theta).$$

Ultimately, we will find that

$$\xi(\theta \mid x_1, \ldots, x_n) \propto f(x_1 \mid \theta)\cdots f(x_n \mid \theta)\xi(\theta).$$

From Eq. (7.2.4) it follows that, in vector notation, this relation can be written as

$$\xi(\theta \mid x) \propto f_n(x \mid \theta)\xi(\theta),$$

which is precisely the relation (7.2.10). Hence, when the appropriate factor is introduced on the right
side of this relation so that the proportionality symbol can be replaced by an equality, $\xi(\theta \mid x)$ will be
equal to the expression given in Eq. (7.2.7).


9. It follows from Exercise 8 that if the experiment yields a total of three defectives and five nondefectives,
the posterior distribution will be the same regardless of whether the eight items were selected in one
batch or one at a time in accordance with some stopping rule. Therefore, the posterior distribution in
this exercise will be the same beta distribution as that obtained in Exercise 6.


10. In this exercise

$$f(x \mid \theta) = \begin{cases} 1 & \text{for } \theta - \frac{1}{2} < x < \theta + \frac{1}{2}, \\ 0 & \text{otherwise}, \end{cases}$$

and

$$\xi(\theta) = \begin{cases} \frac{1}{10} & \text{for } 10 < \theta < 20, \\ 0 & \text{otherwise}. \end{cases}$$

The condition that $\theta - 1/2 < x < \theta + 1/2$ is the same as the condition that $x - 1/2 < \theta < x + 1/2$.
Therefore, $f(x \mid \theta)\xi(\theta)$ will be positive only for values of $\theta$ which satisfy both the requirement that
$x - 1/2 < \theta < x + 1/2$ and the requirement that $10 < \theta < 20$. Since $X = 12$ in this exercise, $f(x \mid \theta)\xi(\theta)$
is positive only for $11.5 < \theta < 12.5$. Furthermore, since $f(x \mid \theta)\xi(\theta)$ is constant over this interval, the
posterior p.d.f. $\xi(\theta \mid x)$ will also be constant over this interval. In other words, the posterior distribution
of $\theta$ must be a uniform distribution on this interval.
11. Let $y_1$ denote the smallest and let $y_6$ denote the largest of the six observations. Then the joint p.d.f.
of the six observations is

$$f_n(x \mid \theta) = \begin{cases} 1 & \text{for } \theta - \frac{1}{2} < y_1 < y_6 < \theta + \frac{1}{2}, \\ 0 & \text{otherwise}. \end{cases}$$

The condition that $\theta - \frac{1}{2} < y_1 < y_6 < \theta + \frac{1}{2}$ is the same as the condition that $y_6 - 1/2 < \theta < y_1 + 1/2$.
Since $\xi(\theta)$ is again as given in Exercise 10, it follows that $f_n(x \mid \theta)\xi(\theta)$ will be positive only for values of
$\theta$ which satisfy both the requirement that $y_6 - 1/2 < \theta < y_1 + 1/2$ and the requirement that $10 < \theta < 20$. Since
$y_1 = 10.9$ and $y_6 = 11.7$ in this exercise, $f_n(x \mid \theta)\xi(\theta)$ is positive only for $11.2 < \theta < 11.4$. Furthermore,
since $f_n(x \mid \theta)\xi(\theta)$ is constant over this interval, the posterior p.d.f. $\xi(\theta \mid x)$ will also be constant over
the interval. In other words, the posterior distribution of $\theta$ must be a uniform distribution on this
interval.


7.3 Conjugate Prior Distributions 


Commentary 


This section introduces some convenient prior distributions that make Bayesian inferences mathematically 
tractable. The instructor can remind the student that numerical methods are available for performing 
Bayesian inferences even when other prior distributions are used. Mathematical tractability is useful when 
introducing a new concept so that attention can focus on the meaning and interpretation of the new concept 
rather than the numerical methods required to perform the calculations. Although conjugate priors for the 
parameter of the uniform distribution are not discussed in the body of the section, Exercises 17 and 18
illustrate how the general concept extends to these distributions. 


Solutions to Exercises 


1. The posterior mean of $\theta$ will be

$$\frac{100 \times 0 + 20v_0^2 \times 0.125}{100 + 20v_0^2} = 0.12.$$

We can solve this equation for $v_0^2$ by multiplying both sides by $100 + 20v_0^2$ and collecting terms. The
result is $v_0^2 = 120$.


2. If we let $\gamma = (y + 1)/(y + z + 2)$, then $1 - \gamma = (z + 1)/(y + z + 2)$ and $V = \gamma(1 - \gamma)/(y + z + 3)$. The
maximum value of $\gamma(1 - \gamma)$ is 1/4, and is attained when $\gamma = 1/2$. Therefore, $V \leq 1/[4(y + z + 3)]$. It
now follows that if $1/[4(y + z + 3)] \leq 0.01$, then $V \leq 0.01$. But the first inequality will be satisfied if
$y + z \geq 22$. Since $y + z$ is the total number of items that have been selected, it follows that this number
need not exceed 22.


3. Since the observed number of defective items is 3 and the observed number of nondefective items is 97,
it follows from Theorem 7.3.1 that the posterior distribution of $\theta$ is a beta distribution with parameters
$2 + 3 = 5$ and $200 + 97 = 297$.
4. Let $\alpha_1$ and $\beta_1$ denote the parameters of the posterior beta distribution, and let $\gamma = \alpha_1/(\alpha_1 + \beta_1)$. Then
$\gamma$ is the mean of the posterior distribution and we are told that $\gamma = 2/51$. The variance of the posterior
distribution is

$$\frac{\alpha_1\beta_1}{(\alpha_1 + \beta_1)^2(\alpha_1 + \beta_1 + 1)} = \frac{\alpha_1}{\alpha_1 + \beta_1}\cdot\frac{\beta_1}{\alpha_1 + \beta_1}\cdot\frac{1}{\alpha_1 + \beta_1 + 1} = \gamma(1 - \gamma)\frac{1}{\alpha_1 + \beta_1 + 1} = \frac{2}{51}\cdot\frac{49}{51}\cdot\frac{1}{\alpha_1 + \beta_1 + 1} = \frac{98}{(51)^2}\cdot\frac{1}{\alpha_1 + \beta_1 + 1}.$$

From the value of this variance given in the exercise it is now evident that $\alpha_1 + \beta_1 + 1 = 103$. Hence,
$\alpha_1 + \beta_1 = 102$ and $\alpha_1 = \gamma(\alpha_1 + \beta_1) = 2(102)/51 = 4$. In turn, it follows that $\beta_1 = 102 - 4 = 98$.
Since the posterior distribution is a beta distribution, it follows from Theorem 7.3.1 that the prior
distribution must have been a beta distribution with parameters $\alpha$ and $\beta$ such that $\alpha + 3 = \alpha_1$ and
$\beta + 97 = \beta_1$. Therefore, $\alpha = \beta = 1$. But the beta distribution for which $\alpha = \beta = 1$ is the uniform
distribution on the interval [0,1].


5. By Theorem 7.3.2, the posterior distribution will be the gamma distribution for which the parameters
are $3 + \sum_{i=1}^n x_i = 3 + 13 = 16$ and $1 + n = 1 + 5 = 6$.

6. The number of defects on a 1200-foot roll of tape has the same distribution as the total number of
defects on twelve 100-foot rolls, and it is assumed that the number of defects on a 100-foot roll has
the Poisson distribution with mean $\theta$. By Theorem 7.3.2, the posterior distribution of $\theta$ is the gamma
distribution for which the parameters are $2 + 4 = 6$ and $10 + 12 = 22$.
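
The conjugate updates used in Exercises 3, 5, and 6 are simple parameter additions. The sketch below writes them as two small functions; the function names are illustrative and the prior parameters are the ones appearing in the solutions above.

```python
def beta_update(alpha, beta, defectives, nondefectives):
    """Beta prior with Bernoulli data: add the counts to the two parameters."""
    return alpha + defectives, beta + nondefectives


def gamma_poisson_update(alpha, beta, total_count, n_obs):
    """Gamma prior with Poisson data: add the total count and the number of observations."""
    return alpha + total_count, beta + n_obs


print(beta_update(2, 200, 3, 97))           # Exercise 3
print(gamma_poisson_update(3, 1, 13, 5))    # Exercise 5
print(gamma_poisson_update(2, 10, 4, 12))   # Exercise 6
```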


7. In the notation of Theorem 7.3.3, we have $\sigma^2 = 4$, $\mu_0 = 68$, $v_0^2 = 1$, $n = 10$, and $\overline{x}_n = 69.5$. Therefore,
the posterior distribution of $\theta$ is the normal distribution with mean $\mu_1 = 967/14$ and variance $v_1^2 = 2/7$.
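
The posterior mean and variance in Exercise 7 can be verified with exact rational arithmetic. The sketch below assumes the standard normal-normal update with known sampling variance, $\mu_1 = (\sigma^2\mu_0 + nv_0^2\overline{x}_n)/(\sigma^2 + nv_0^2)$ and $v_1^2 = \sigma^2v_0^2/(\sigma^2 + nv_0^2)$; the function and argument names are introduced for illustration.

```python
from fractions import Fraction


def normal_update(mu0, v0_sq, sigma_sq, n, xbar):
    """Posterior mean and variance for a normal mean with known sampling variance."""
    denom = sigma_sq + n * v0_sq
    mu1 = (sigma_sq * mu0 + n * v0_sq * xbar) / denom
    v1_sq = sigma_sq * v0_sq / denom
    return mu1, v1_sq


mu1, v1_sq = normal_update(
    mu0=Fraction(68), v0_sq=Fraction(1), sigma_sq=Fraction(4), n=10, xbar=Fraction(139, 2)
)
print(mu1, v1_sq, float(mu1))
```

Using `Fraction` keeps the results in the same exact form as the solution, and the decimal value of the mean is the 69.07 used in Exercise 8.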


8. Since the p.d.f. of a normal distribution attains its maximum value at the mean of the distribution
and then drops off on each side of the mean, among all intervals of length 1 unit, the interval that is
centered at the mean will contain the most probability. Therefore, the answer in part (a) is the interval
centered at the mean of the prior distribution of $\theta$ and the answer in part (b) is the interval centered at
the mean of the posterior distribution of $\theta$. In part (c), if the distribution of $\theta$ is specified by its prior
distribution, then $Z = \theta - 68$ will have a standard normal distribution. Therefore,

$$\Pr(67.5 \leq \theta \leq 68.5) = \Pr(-0.5 \leq Z \leq 0.5) = 2\Phi(0.5) - 1 = 0.3830.$$

Similarly, if the distribution of $\theta$ is specified by its posterior distribution, then $Z = (\theta - \mu_1)/v_1 =
(\theta - 69.07)/0.5345$ will have a standard normal distribution. Therefore,

$$\Pr(68.57 \leq \theta \leq 69.57 \mid x) = \Pr(-0.9355 \leq Z \leq 0.9355) = 2\Phi(0.9355) - 1 = 0.6506.$$


9. Since the posterior distribution of $\theta$ is normal, the prior distribution of $\theta$ must also have been normal.
Furthermore, from Eqs. (7.3.1) and (7.3.2), we obtain the relations:

$$8 = \frac{\mu_0 + (20)(10)v_0^2}{1 + 20v_0^2}$$

and

$$\frac{1}{25} = \frac{v_0^2}{1 + 20v_0^2}.$$

It follows that $v_0^2 = 1/5$ and $\mu_0 = 0$.


10. In this exercise, $\sigma^2 = 4$ and $v_0^2 = 1$. Therefore, by Eq. (7.3.2)

$$v_1^2 = \frac{4}{4 + n}.$$

It follows that $v_1^2 \leq 0.01$ if and only if $n \geq 396$.


11. In this exercise, $\sigma^2 = 4$ and $n = 100$. Therefore, by Eq. (7.3.2),

$$v_1^2 = \frac{4v_0^2}{4 + 100v_0^2} = \frac{1}{25 + (1/v_0^2)} < \frac{1}{25}.$$

Since the variance of the posterior distribution is less than 1/25, the standard deviation must be less
than 1/5.


12. Let $\alpha$ and $\beta$ denote the parameters of the prior gamma distribution of $\theta$. Then $\alpha/\beta = 0.2$ and
$\alpha/\beta^2 = 1$. Therefore, $\beta = 0.2$ and $\alpha = 0.04$. Furthermore, the total time required to serve the sample
of 20 customers is $y = 20(3.8) = 76$. Therefore, by Theorem 7.3.4, the posterior distribution of $\theta$ is the
gamma distribution for which the parameters are $0.04 + 20 = 20.04$ and $0.2 + 76 = 76.2$.


13. The mean of the gamma distribution with parameters $\alpha$ and $\beta$ is $\alpha/\beta$ and the standard deviation is
$\alpha^{1/2}/\beta$. Therefore, the coefficient of variation is $\alpha^{-1/2}$. Since the coefficient of variation of the prior
gamma distribution of $\theta$ is 2, it follows that $\alpha = 1/4$ in the prior distribution. Furthermore, it now
follows from Theorem 7.3.4 that the coefficient of variation of the posterior gamma distribution of $\theta$ is
$(\alpha + n)^{-1/2} = (n + 1/4)^{-1/2}$. This value will be less than 0.1 if and only if $n > 99.75$. Thus, the required
sample size is $n \geq 100$.


14. Consider a single observation $X$ from a negative binomial distribution with parameters $r$ and $p$, where
the value of $r$ is known and the value of $p$ is unknown. Then the p.f. of $X$ has the form $f(x \mid p) \propto p^r q^x$.
If the prior distribution of $p$ is the beta distribution with parameters $\alpha$ and $\beta$, then the prior p.d.f. $\xi(p)$
has the form $\xi(p) \propto p^{\alpha - 1}q^{\beta - 1}$. Therefore, the posterior p.d.f. $\xi(p \mid x)$ has the form

$$\xi(p \mid x) \propto \xi(p)f(x \mid p) \propto p^{\alpha + r - 1}q^{\beta + x - 1}.$$

This expression can be recognized as being, except for a constant factor, the p.d.f. of the beta
distribution with parameters $\alpha + r$ and $\beta + x$. Since this distribution will be the prior distribution of $p$ for
future observations, it follows that the posterior distribution after any number of observations will also
be a beta distribution.


15. (a) Let $y = 1/\theta$. Then $\theta = 1/y$ and $d\theta = -dy/y^2$. Hence,

$$\int_0^\infty \xi(\theta)\,d\theta = \int_0^\infty \frac{\beta^\alpha}{\Gamma(\alpha)}y^{\alpha - 1}\exp(-\beta y)\,dy = 1.$$

(b) If an observation $X$ has a normal distribution with a known value of the mean $\mu$ and an unknown
value of the variance $\theta$, then the p.d.f. of $X$ has the form

$$f(x \mid \theta) \propto \frac{1}{\theta^{1/2}}\exp\left[-\frac{(x - \mu)^2}{2\theta}\right].$$

Also, the prior p.d.f. of $\theta$ has the form

$$\xi(\theta) \propto \theta^{-(\alpha + 1)}\exp(-\beta/\theta).$$

Therefore, the posterior p.d.f. $\xi(\theta \mid x)$ has the form

$$\xi(\theta \mid x) \propto \xi(\theta)f(x \mid \theta) \propto \theta^{-(\alpha + 3/2)}\exp\left\{-\left[\beta + \frac{1}{2}(x - \mu)^2\right]\cdot\frac{1}{\theta}\right\}.$$

Hence, the posterior p.d.f. of $\theta$ has the same form as $\xi(\theta)$ with $\alpha$ replaced by $\alpha + 1/2$ and $\beta$
replaced by $\beta + \frac{1}{2}(x - \mu)^2$. Since this distribution will be the prior distribution of $\theta$ for future
observations, it follows that the posterior distribution after any number of observations will also
belong to the same family of distributions.


16. If $X$ has the normal distribution with a known value of the mean $\mu$ and an unknown value of the
standard deviation $\sigma$, then the p.d.f. of $X$ has the form

$$f(x \mid \sigma) \propto \frac{1}{\sigma}\exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right].$$

Therefore, if the prior p.d.f. $\xi(\sigma)$ has the form

$$\xi(\sigma) \propto \sigma^{-a}\exp(-b/\sigma^2),$$

then the posterior p.d.f. of $\sigma$ will also have the same form, with $a$ replaced by $a + 1$ and $b$ replaced by
$b + (x - \mu)^2/2$. It remains to determine the precise form of $\xi(\sigma)$. If we let $y = 1/\sigma^2$, then $\sigma = y^{-1/2}$
and $d\sigma = -dy/(2y^{3/2})$. Therefore,

$$\int_0^\infty \sigma^{-a}\exp(-b/\sigma^2)\,d\sigma = \frac{1}{2}\int_0^\infty y^{(a - 3)/2}\exp(-by)\,dy.$$

The integral will be finite if $a > 1$ and $b > 0$, and its value will be

$$\frac{1}{2}\cdot\frac{\Gamma\left[\frac{1}{2}(a - 1)\right]}{b^{(a - 1)/2}}.$$

Hence, for $a > 1$ and $b > 0$, the following function will be a p.d.f. for $\sigma > 0$:

$$\xi(\sigma) = \frac{2b^{(a - 1)/2}}{\Gamma\left[\frac{1}{2}(a - 1)\right]}\sigma^{-a}\exp(-b/\sigma^2).$$

Finally, we can obtain a more standard form for this p.d.f. by replacing $a$ and $b$ by $\alpha = (a - 1)/2$ and
$\beta = b$. Then

$$\xi(\sigma) = \frac{2\beta^\alpha}{\Gamma(\alpha)}\sigma^{-(2\alpha + 1)}\exp(-\beta/\sigma^2) \quad \text{for } \sigma > 0.$$

The family of distributions for which the p.d.f. has this form, for all values of $\alpha > 0$ and $\beta > 0$, will be
a conjugate family of prior distributions.
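17. The joint p.d.f. of the three observations is

$$f(x_1, x_2, x_3 \mid \theta) = \begin{cases} 1/\theta^3 & \text{for } 0 < x_i < \theta \ (i = 1, 2, 3), \\ 0 & \text{otherwise}. \end{cases}$$

Therefore, the posterior p.d.f. $\xi(\theta \mid x_1, x_2, x_3)$ will be positive only if $\theta \geq 4$, as required by the prior
p.d.f., and also $\theta > 8$, the largest of the three observed values. Hence, for $\theta > 8$,

$$\xi(\theta \mid x_1, x_2, x_3) \propto \xi(\theta)f(x_1, x_2, x_3 \mid \theta) \propto 1/\theta^7.$$

Since

$$\int_8^\infty \frac{1}{\theta^7}\,d\theta = \frac{1}{6(8)^6},$$

it follows that

$$\xi(\theta \mid x_1, x_2, x_3) = \begin{cases} 6(8^6)/\theta^7 & \text{for } \theta > 8, \\ 0 & \text{for } \theta \leq 8. \end{cases}$$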


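18. Suppose that the prior distribution of $\theta$ is the Pareto distribution with parameters $x_0$ and $\alpha$ ($x_0 > 0$
and $\alpha > 0$). Then the prior p.d.f. $\xi(\theta)$ has the form

$$\xi(\theta) \propto 1/\theta^{\alpha + 1} \quad \text{for } \theta \geq x_0.$$

If $X_1,\ldots,X_n$ form a random sample from a uniform distribution on the interval $[0, \theta]$, then

$$f_n(x \mid \theta) \propto 1/\theta^n \quad \text{for } \theta > \max\{x_1,\ldots,x_n\}.$$

Hence, the posterior p.d.f. of $\theta$ has the form

$$\xi(\theta \mid x) \propto \xi(\theta)f_n(x \mid \theta) \propto 1/\theta^{\alpha + n + 1},$$

for $\theta > \max\{x_0, x_1,\ldots,x_n\}$, and $\xi(\theta \mid x) = 0$ for $\theta \leq \max\{x_0, x_1,\ldots,x_n\}$. This posterior p.d.f. can
now be recognized as also being the Pareto distribution, with parameters $\max\{x_0, x_1,\ldots,x_n\}$ and $\alpha + n$.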


Commentary: Exercise 17 provides a numerical illustration of the general result presented in Exercise 18. 


The joint p.d.f. of X1,...,X, has the following form, for 0 < 2; < 1(i=1,...,n): 


n d-1 % 0 
g” (11 «| x 9” (11 «| 
i=1 i=l 


0” exp Q» log x). 


i=1 


fn(a@ | 8) 


The prior p.d.f. of 0 has the form 


&(0) x 0°! exp(—£6). 
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Hence, the posterior p.d.f. of 0 has the form 


(8 | @) 0 €(0) f(a | @) x O°" exp - ¢ ~ J log x] J | 
c=] 


This expression can be recognized as being, except for a constant factor, the p.d.f. of the gamma 
distribution with parameters a, = a+n and 3, = 8—)7r_, logz;. Therefore, the mean of the posterior 
distribution is a,/G, and the variance is a; /?. 
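
The Pareto update in Exercises 17 and 18 can be sketched in Python. This is only an illustrative sketch: the function names `pareto_posterior` and `pareto_pdf` are hypothetical, the prior parameters $x_0 = 4$ and $\alpha = 3$ are inferred from the solution to Exercise 17, and the three data values are made up apart from their maximum, 8, which is the only feature of the sample that the solution uses.

```python
import math

def pareto_posterior(x0, alpha, data):
    """Update a Pareto(x0, alpha) prior for theta given uniform[0, theta] data."""
    return max(x0, *data), alpha + len(data)

def pareto_pdf(theta, x0, alpha):
    """Pareto p.d.f.: alpha * x0**alpha / theta**(alpha + 1) for theta > x0."""
    return alpha * x0**alpha / theta**(alpha + 1) if theta > x0 else 0.0

# Hypothetical sample whose largest value is 8, as in Exercise 17.
x0_post, alpha_post = pareto_posterior(4, 3, [8, 5, 2])
print(x0_post, alpha_post)                      # posterior parameters
print(pareto_pdf(10.0, x0_post, alpha_post))    # should equal 6(8^6)/10^7
```

Under these assumptions the posterior parameters are 8 and 6, so the density agrees with $6(8^6)/\theta^7$ for $\theta > 8$.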


20. The mean lifetime conditional on $\beta$ is $1/\beta$. The mean lifetime is then the mean of $1/\beta$. Prior to
observing the data, the distribution of $\beta$ is the gamma distribution with parameters $a$ and $b$, so the
mean of $1/\beta$ is $b/(a-1)$ according to Exercise 21 in Sec. 5.7. After observing the data, the distribution
of $\beta$ is the gamma distribution with parameters $a + 10$ and $b + 60$, so the mean of $1/\beta$ is $(b+60)/(a+9)$.
So, we must solve the following equations:

\[ \frac{b}{a-1} = 4, \qquad \frac{b+60}{a+9} = 5. \]

These equations convert easily to the equations $b = 4a - 4$ and $b = 5a - 15$. So $a = 11$ and $b = 40$.
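
As a quick check of the algebra, the two equations can be solved exactly with rational arithmetic. The helper name `solve_gamma_prior` and its argument names are hypothetical; the sketch assumes, as in the solution, that the prior mean of $1/\beta$ is $b/(a-1)$ and the posterior mean is $(b + \text{total})/(a + n - 1)$.

```python
from fractions import Fraction

def solve_gamma_prior(prior_mean, post_mean, n_obs, total):
    """Solve b/(a-1) = prior_mean and (b+total)/(a+n_obs-1) = post_mean for (a, b)."""
    m0, m1 = Fraction(prior_mean), Fraction(post_mean)
    # Substituting b = m0*(a - 1) into the second equation gives a linear equation in a.
    a = (total - m0 - m1 * (n_obs - 1)) / (m1 - m0)
    b = m0 * (a - 1)
    return a, b

a, b = solve_gamma_prior(4, 5, 10, 60)
print(a, b)
```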


21. The posterior p.d.f. is proportional to the likelihood $\theta^n \exp\left( -\theta \sum_{i=1}^n x_i \right)$ times $1/\theta$. This product can
be written as $\theta^{n-1} \exp(-\theta n \bar{x}_n)$. As a function of $\theta$ this is recognizable as the p.d.f. of the gamma
distribution with parameters $n$ and $n\bar{x}_n$. The mean of this posterior distribution is then $n/[n\bar{x}_n] = 1/\bar{x}_n$.


22. The posterior p.d.f. is proportional to the likelihood since the prior "p.d.f." is constant. The likelihood
is proportional to

\[ \exp\left[ -\frac{20}{2(60)} \left( \theta - (-0.95) \right)^2 \right], \]

using the same reasoning as in the proof of Theorem 7.3.3 of the text. As a function of $\theta$ this is easily
recognized as being proportional to the p.d.f. of the normal distribution with mean $-0.95$ and variance
$60/20 = 3$. The posterior probability that $\theta > 1$ is then

\[ 1 - \Phi\left( \frac{1 - (-0.95)}{3^{1/2}} \right) = 1 - \Phi(1.126) = 0.1301. \]
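
The tail probability can be checked numerically. The sketch below builds the standard normal c.d.f. from `math.erf` (the helper name `normal_cdf` is hypothetical) and plugs in the posterior mean $-0.95$ and variance $3$ found above.

```python
import math

def normal_cdf(z):
    """Standard normal c.d.f. expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

post_mean = -0.95
post_var = 60 / 20
z = (1 - post_mean) / math.sqrt(post_var)   # standardized value of theta = 1
prob = 1 - normal_cdf(z)                    # posterior Pr(theta > 1)
print(round(z, 3), round(prob, 4))
```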


23. (a) Let the prior p.d.f. of $\theta$ be $\xi_{\alpha,\beta}(\theta)$. Suppose that $X_1, \ldots, X_n$ are i.i.d. with conditional p.d.f. $f(x \mid \theta)$
given $\theta$, where $f$ is as stated in the exercise. The posterior p.d.f. after observing these data is

\[ \xi(\theta \mid x) = \frac{a(\theta)^{\alpha+n} \exp\left[ c(\theta) \left\{ \beta + \sum_{i=1}^n d(x_i) \right\} \right]}{\int_\Omega a(\theta)^{\alpha+n} \exp\left[ c(\theta) \left\{ \beta + \sum_{i=1}^n d(x_i) \right\} \right] d\theta}. \tag{S.7.1} \]

Eq. (S.7.1) is of the form of $\xi_{\alpha',\beta'}(\theta)$ with $\alpha' = \alpha + n$ and $\beta' = \beta + \sum_{i=1}^n d(x_i)$. The integral in
the denominator of (S.7.1) must be finite with probability 1 (as a function of $x_1, \ldots, x_n$) because
$\prod_{i=1}^n b(x_i)$ times this denominator is the marginal (joint) p.d.f. of $X_1, \ldots, X_n$.


(b) This was essentially the calculation done in part (a).




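24. In each part of this exercise we shall first present the p.d.f. or the p.f. $f$, and then we shall identify the
functions $a$, $b$, $c$, and $d$ in the form for an exponential family given in Exercise 23.

(a) $f(x \mid p) = p^x (1-p)^{1-x} = (1-p) \left( \frac{p}{1-p} \right)^x$. Therefore, $a(p) = 1 - p$, $b(x) = 1$, $c(p) = \log\left( \frac{p}{1-p} \right)$,
$d(x) = x$.

(b) $f(x \mid \theta) = \frac{\exp(-\theta)\theta^x}{x!}$. Therefore, $a(\theta) = \exp(-\theta)$, $b(x) = 1/x!$, $c(\theta) = \log \theta$, $d(x) = x$.

(c) $f(x \mid p) = \binom{r+x-1}{x} p^r (1-p)^x$. Therefore, $a(p) = p^r$, $b(x) = \binom{r+x-1}{x}$, $c(p) = \log(1-p)$,
$d(x) = x$.

(d) \[ f(x \mid \mu) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left( -\frac{x^2}{2\sigma^2} \right) \exp\left( -\frac{\mu^2}{2\sigma^2} \right) \exp\left( \frac{\mu x}{\sigma^2} \right). \]
Therefore, $a(\mu) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left( -\frac{\mu^2}{2\sigma^2} \right)$, $b(x) = \exp\left( -\frac{x^2}{2\sigma^2} \right)$, $c(\mu) = \frac{\mu}{\sigma^2}$, $d(x) = x$.

(e) $f(x \mid \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right]$. Therefore, $a(\sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}}$, $b(x) = 1$, $c(\sigma^2) = -\frac{1}{2\sigma^2}$,
$d(x) = (x - \mu)^2$.

(f) $f(x \mid \alpha) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} \exp(-\beta x)$. Therefore, $a(\alpha) = \frac{\beta^\alpha}{\Gamma(\alpha)}$, $b(x) = \exp(-\beta x)$, $c(\alpha) = \alpha - 1$,
$d(x) = \log x$.

(g) $f(x \mid \beta)$ in this part is the same as the p.d.f. in part (f). Therefore, $a(\beta) = \beta^\alpha$, $b(x) = \frac{x^{\alpha-1}}{\Gamma(\alpha)}$,
$c(\beta) = -\beta$, $d(x) = x$.

(h) $f(x \mid \alpha) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1}$. Therefore, $a(\alpha) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)}$, $b(x) = \frac{(1-x)^{\beta-1}}{\Gamma(\beta)}$, $c(\alpha) = \alpha - 1$,
$d(x) = \log x$.

(i) $f(x \mid \beta)$ in this part is the same as the p.d.f. given in part (h). Therefore, $a(\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\beta)}$,
$b(x) = \frac{x^{\alpha-1}}{\Gamma(\alpha)}$, $c(\beta) = \beta - 1$, $d(x) = \log(1 - x)$.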


25. For every $\theta$, the p.d.f. (or p.f.) $f(x \mid \theta)$ for an exponential family is strictly positive for all $x$ such that
$b(x) > 0$. That is, the set of $x$ for which $f(x \mid \theta) > 0$ is the same for all $\theta$. This is not true for uniform
distributions, where the set of $x$ such that $f(x \mid \theta) > 0$ is $[0, \theta]$.


26. The same reasoning applies as in the previous exercise. For uniform distributions, the set of $x$ such that
$f(x \mid \theta) > 0$ depends on $\theta$. For exponential families, the set of $x$ such that $f(x \mid \theta) > 0$ is the same for all
$\theta$.




7.4 Bayes Estimators 


Commentary 


We introduce the fundamental concepts of Bayesian decision theory. The use of a loss function arises again 
in Bayesian hypothesis testing in Sec. 9.8. This section ends with foundational discussion of the limitations 
of Bayes estimators. This material is included for those instructors who want their students to have both a 
working and a critical understanding of the topic. 

If you are using the statistical software R, the function mentioned in Example 7.4.5 to compute the
median of a beta distribution is qbeta with the first argument equal to 0.5 and the next two equal to $\alpha + y$
and $\beta + n - y$, in the notation of the example.


Solutions to Exercises 


1. The posterior distribution of $\theta$ would be the beta distribution with parameters 2 and 1. The mean
of the posterior distribution is 2/3, which would be the Bayes estimate under squared error loss. The
median of the posterior distribution would be the Bayes estimate under absolute error loss. To find the
median, write the c.d.f. as

\[ F(\theta) = \int_0^\theta 2t\, dt = \theta^2, \]

for $0 < \theta < 1$. The quantile function is then $F^{-1}(p) = p^{1/2}$, so the median is $(1/2)^{1/2} = 0.7071$.
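
When the quantile function has no closed form, the median can be found by inverting the c.d.f. numerically. The sketch below uses plain bisection on the c.d.f. $F(\theta) = \theta^2$ from this exercise; the function name `quantile_by_bisection` is hypothetical, and the method assumes only that the c.d.f. is continuous and increasing on the search interval.

```python
def quantile_by_bisection(cdf, p, lo=0.0, hi=1.0, tol=1e-10):
    """Invert a continuous, increasing c.d.f. on [lo, hi] by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid      # the p quantile lies to the right of mid
        else:
            hi = mid      # the p quantile lies at or to the left of mid
    return (lo + hi) / 2

# Posterior c.d.f. of the beta distribution with parameters 2 and 1.
median = quantile_by_bisection(lambda t: t**2, 0.5)
print(round(median, 4))
```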


2. The posterior distribution of $\theta$ is the beta distribution with parameters $5 + 1 = 6$ and $10 + 19 = 29$.
The mean of this distribution is $6/(6 + 29) = 6/35$. Therefore, the Bayes estimate of $\theta$ is 6/35.


3. If $y$ denotes the number of defective items in the sample, then the posterior distribution of $\theta$ will be
the beta distribution with parameters $5 + y$ and $10 + 20 - y = 30 - y$. The variance $V$ of this beta
distribution is

\[ V = \frac{(5 + y)(30 - y)}{(35)^2 (36)}. \]

Since the Bayes estimate of $\theta$ is the mean $\mu$ of the posterior distribution, the mean squared error of
this estimate is $E[(\theta - \mu)^2 \mid x]$, which is the variance $V$ of the posterior distribution.


(a) $V$ will attain its maximum at a value of $y$ for which $(5+y)(30-y)$ is a maximum. By differentiating
with respect to $y$ and setting the derivative equal to 0, we find that the maximum is attained when
$y = 12.5$. Since the number of defective items $y$ must be an integer, the maximum of $V$ will be
attained for $y = 12$ or $y = 13$. When these values are substituted into $(5 + y)(30 - y)$, it is found
that they both yield the same value.

(b) Since $(5+y)(30-y)$ is a quadratic function of $y$ and the coefficient of $y^2$ is negative, its minimum
value over the interval $0 \leq y \leq 20$ will be attained at one of the endpoints of the interval. It is
found that the value for $y = 0$ is smaller than the value for $y = 20$.
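
Both parts can be confirmed by enumerating the 21 possible values of $y$. The function name `posterior_variance` and its default arguments (prior parameters 5 and 10, sample size 20) are taken from the solution above but are otherwise just an illustrative sketch.

```python
def posterior_variance(y, a=5, b=10, n=20):
    """Variance of the beta posterior with parameters a + y and b + n - y."""
    alpha, beta = a + y, b + n - y
    return alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

variances = {y: posterior_variance(y) for y in range(21)}
largest = max(variances.values())
argmax = [y for y, v in variances.items() if v == largest]   # ties are kept
argmin = min(variances, key=variances.get)
print(argmax, argmin)
```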


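4. Suppose that the parameters of the prior beta distribution of $\theta$ are $\alpha$ and $\beta$. Then $\mu_0 = \alpha/(\alpha + \beta)$. As
shown in Example 7.4.3, the mean of the posterior distribution of $\theta$ is

\[ \frac{\alpha + \sum_{i=1}^n X_i}{\alpha + \beta + n} = \frac{\alpha + \beta}{\alpha + \beta + n}\, \mu_0 + \frac{n}{\alpha + \beta + n}\, \bar{X}_n. \]

Hence, $\gamma_n = n/(\alpha + \beta + n)$ and $\gamma_n \to 1$ as $n \to \infty$.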


5. It was shown in Exercise 5 of Sec. 7.3 that the posterior distribution of $\theta$ is the gamma distribution
with parameters $\alpha = 16$ and $\beta = 6$. The Bayes estimate of $\theta$ is the mean of this distribution and is
equal to $16/6 = 8/3$.


6. Suppose that the parameters of the prior gamma distribution of $\theta$ are $\alpha$ and $\beta$. Then $\mu_0 = \alpha/\beta$. The
posterior distribution of $\theta$ was given in Theorem 7.3.2. The mean of this posterior distribution is

\[ \frac{\alpha + \sum_{i=1}^n X_i}{\beta + n} = \frac{\beta}{\beta + n}\, \mu_0 + \frac{n}{\beta + n}\, \bar{X}_n. \]

Hence, $\gamma_n = n/(\beta + n)$ and $\gamma_n \to 1$ as $n \to \infty$.


7. The Bayes estimator is the mean of the posterior distribution of $\theta$, as given in Exercise 6. Since $\theta$ is
the mean of the Poisson distribution, it follows from the law of large numbers that $\bar{X}_n$ converges to $\theta$
in probability as $n \to \infty$. It now follows from Exercise 6 that, since $\gamma_n \to 1$, the Bayes estimators will
also converge to $\theta$ in probability as $n \to \infty$. Hence, the Bayes estimators form a consistent sequence of
estimators of $\theta$.


8. It was shown in Exercise 7 of Sec. 7.3 that the posterior distribution of $\theta$ is the normal distribution
with mean 69.07 and variance 0.286.

(a) The Bayes estimate is the mean of this distribution and is equal to 69.07.

(b) The Bayes estimate is the median of the posterior distribution and is therefore again equal to
69.07.


9. For any given values in the random sample, the Bayes estimate of $\theta$ is the mean of the posterior
distribution of $\theta$. Therefore, the mean squared error of the estimate will be the variance of the posterior
distribution of $\theta$. It was shown in Exercise 10 of Sec. 7.3 that this variance will be 0.01 or less for $n \geq 396$.


10. It was shown in Exercise 12 of Sec. 7.3 that the posterior distribution of $\theta$ will be a gamma distribution
with parameters $\alpha = 20.04$ and $\beta = 76.2$. The Bayes estimate is the mean of this distribution and is
equal to $20.04/76.2 = 0.263$.


11. Let $X_1, \ldots, X_n$ denote the observations in the random sample, and let $\alpha$ and $\beta$ denote the parameters
of the prior gamma distribution of $\theta$. It was shown in Theorem 7.3.4 that the posterior distribution of
$\theta$ will be the gamma distribution with parameters $\alpha + n$ and $\beta + n\bar{X}_n$. The Bayes estimator, which is
the mean of this posterior distribution, is therefore

\[ \frac{\alpha + n}{\beta + n\bar{X}_n} = \frac{1 + (\alpha/n)}{\bar{X}_n + (\beta/n)}. \]

Since the mean of the exponential distribution is $1/\theta$, it follows from the law of large numbers that
$\bar{X}_n$ will converge in probability to $1/\theta$ as $n \to \infty$. It follows, therefore, that the Bayes estimators will
converge in probability to $\theta$ as $n \to \infty$. Hence, the Bayes estimators form a consistent sequence of
estimators of $\theta$.


12. (a) A's prior distribution for $\theta$ is the beta distribution with parameters $\alpha = 2$ and $\beta = 1$. Therefore,
A's posterior distribution for $\theta$ is the beta distribution with parameters $2 + 710 = 712$ and $1 + 290 =
291$. B's prior distribution for $\theta$ is a beta distribution with parameters $\alpha = 4$ and $\beta = 1$.
Therefore, B's posterior distribution for $\theta$ is the beta distribution with parameters $4 + 710 = 714$
and $1 + 290 = 291$.
(b) A's Bayes estimate of $\theta$ is $712/(712+291) = 712/1003$. B's Bayes estimate of $\theta$ is $714/(714+291) =
714/1005$.

(c) If $y$ denotes the number in the sample who were in favor of the proposition, then A's posterior
distribution for $\theta$ will be the beta distribution with parameters $2 + y$ and $1 + 1000 - y = 1001 - y$,
and B's posterior distribution will be a beta distribution with parameters $4 + y$ and $1 + 1000 - y =
1001 - y$. Therefore, A's Bayes estimate of $\theta$ will be $(2 + y)/1003$ and B's Bayes estimate of $\theta$ will
be $(4 + y)/1005$. But

\[ \left| \frac{4+y}{1005} - \frac{2+y}{1003} \right| = \frac{2(1001 - y)}{(1005)(1003)}. \]

This difference is a maximum when $y = 0$, but even then its value is only

\[ \frac{2(1001)}{(1005)(1003)} < \frac{2}{1000}. \]


13. If $\theta$ has the Pareto distribution with parameters $\alpha > 1$ and $x_0 > 0$, then

\[ E(\theta) = \int_{x_0}^\infty \theta\, \frac{\alpha x_0^\alpha}{\theta^{\alpha+1}}\, d\theta = \frac{\alpha}{\alpha - 1}\, x_0. \]

It was shown in Exercise 18 of Sec. 7.3 that the posterior distribution of $\theta$ will be a Pareto distribution
with parameters $\alpha + n$ and $\max\{x_0, X_1, \ldots, X_n\}$. The Bayes estimator is the mean of this posterior
distribution and is, therefore, equal to $(\alpha + n) \max\{x_0, X_1, \ldots, X_n\}/(\alpha + n - 1)$.


14. Since $\psi = \theta^2$, the posterior distribution of $\psi$ can be derived from the posterior distribution of $\theta$. The
Bayes estimator $\hat{\psi}$ will then be the mean $E(\psi)$ of the posterior distribution of $\psi$. But $E(\psi) = E(\theta^2)$,
where the first expectation is calculated with respect to the posterior distribution of $\psi$ and the second
with respect to the posterior distribution of $\theta$. Since $\hat{\theta}$ is the mean of the posterior distribution of $\theta$, it
is also true that $\hat{\theta} = E(\theta)$. Finally, since the posterior distribution of $\theta$ is a continuous distribution, it
follows from the hint given in this exercise that

\[ \hat{\psi} = E(\theta^2) > [E(\theta)]^2 = \hat{\theta}^2. \]


15. Let $a_0$ be a $1/(1+c)$ quantile of the posterior distribution, and let $a_1$ be some other value. Assume
that $a_0 < a_1$. The proof for $a_0 > a_1$ is similar. Let $g(\theta \mid x)$ denote the posterior p.d.f. The posterior
mean of the loss for action $a$ is

\[ h(a) = c \int_{-\infty}^{a} (a - \theta) g(\theta \mid x)\, d\theta + \int_{a}^{\infty} (\theta - a) g(\theta \mid x)\, d\theta. \]

We shall now show that $h(a_1) \geq h(a_0)$, with strict inequality if $a_1$ is not a $1/(1+c)$ quantile.

\[ h(a_1) - h(a_0) = \int_{-\infty}^{a_0} c(a_1 - a_0) g(\theta \mid x)\, d\theta + \int_{a_0}^{a_1} [c a_1 + a_0 - (1+c)\theta]\, g(\theta \mid x)\, d\theta + \int_{a_1}^{\infty} (a_0 - a_1) g(\theta \mid x)\, d\theta. \tag{S.7.2} \]

The first integral in (S.7.2) equals $c(a_1 - a_0)/(1+c)$ because $a_0$ is a $1/(1+c)$ quantile of the posterior
distribution, and the posterior distribution is continuous. The second integral in (S.7.2) is at least as
large as $(a_0 - a_1) \Pr(a_0 < \theta \leq a_1 \mid x)$ since $-(1+c)\theta \geq -(1+c)a_1$ for all $\theta$ in that integral. In fact, the
integral will be strictly larger than $(a_0 - a_1) \Pr(a_0 < \theta \leq a_1 \mid x)$ if this probability is positive. The last
integral in (S.7.2) equals $(a_0 - a_1) \Pr(\theta > a_1 \mid x)$. So

\[ h(a_1) - h(a_0) \geq c\, \frac{a_1 - a_0}{1 + c} + (a_0 - a_1) \Pr(\theta > a_0 \mid x) = 0. \tag{S.7.3} \]

The equality follows from the fact that $\Pr(\theta > a_0 \mid x) = c/(1+c)$. The inequality in (S.7.3) will be strict
if and only if $\Pr(a_0 < \theta \leq a_1 \mid x) > 0$, which occurs if and only if $a_1$ is not another $1/(1+c)$ quantile.
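
The conclusion of Exercise 15 can be illustrated numerically. As an assumption made purely for illustration, take the posterior to be the exponential distribution with rate 1, for which the posterior mean loss has the closed form $h(a) = c(a - 1 + e^{-a}) + e^{-a}$ for $a > 0$. The grid search below (with hypothetical names `expected_loss`, `grid`, `best`) should then land next to the $1/(1+c)$ quantile, which solves $1 - e^{-a} = 1/(1+c)$.

```python
import math

def expected_loss(a, c):
    """Posterior mean loss h(a) when the posterior is exponential with rate 1.

    The loss is c*(a - theta) for theta < a and (theta - a) for theta >= a.
    """
    return c * (a - 1 + math.exp(-a)) + math.exp(-a)

c = 3.0
grid = [i / 10000 for i in range(1, 50001)]          # candidate actions in (0, 5]
best = min(grid, key=lambda a: expected_loss(a, c))  # numerical minimizer of h
quantile = -math.log(c / (1 + c))                    # the 1/(1+c) quantile
print(best, quantile)
```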


7.5 Maximum Likelihood Estimators 


Commentary 


Although maximum likelihood is a popular method of estimation, it can be valuable for the more capable 
students to see some limitations that are described at the end of this section. These limitations arise only in 
more complicated situations than those that are typically encountered in practice. This material is probably 
not suitable for students with a limited mathematical background who are learning statistical inference for 
the first time. 


Solutions to Exercises 


1. We can easily compute

\[ E(Y) = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x}_n, \]

\[ E(Y^2) = \frac{1}{n} \sum_{i=1}^n x_i^2. \]


2. It was shown in Example 7.5.4 that the M.L.E. is $\bar{x}_n$. In this exercise, $\bar{x}_n = 58/70 = 29/35$.


3. The likelihood function for the given sample is $p^{58}(1 - p)^{12}$. Among all values of $p$ in the interval
$1/2 \leq p \leq 2/3$, this function is a maximum when $p = 2/3$.


4. Let $y$ denote the sum of the observations in the sample. Then the likelihood function is $p^y (1 - p)^{n-y}$.
If $y = 0$, this function is a decreasing function of $p$. Since $p = 0$ is not a value in the parameter space,
there is no M.L.E. Similarly, if $y = n$, then the likelihood function is an increasing function of $p$. Since
$p = 1$ is not a value in the parameter space, there is no M.L.E.


5. Let $y$ denote the sum of the observed values $x_1, \ldots, x_n$. Then the likelihood function is

\[ f_n(x \mid \theta) = \frac{\exp(-n\theta)\theta^y}{\prod_{i=1}^n (x_i!)}. \]

(a) If $y > 0$ and we let $L(\theta) = \log f_n(x \mid \theta)$, then

\[ \frac{\partial}{\partial\theta} L(\theta) = -n + \frac{y}{\theta}. \]

The maximum of $L(\theta)$ will be attained at the value of $\theta$ for which this derivative is equal to 0. In
this way, we find that $\hat{\theta} = y/n = \bar{x}_n$.

(b) If $y = 0$, then $f_n(x \mid \theta)$ is a decreasing function of $\theta$. Since $\theta = 0$ is not a value in the parameter
space, there is no M.L.E.


6. Let $\theta = \sigma^2$. Then the likelihood function is

\[ f_n(x \mid \theta) = \frac{1}{(2\pi\theta)^{n/2}} \exp\left[ -\frac{1}{2\theta} \sum_{i=1}^n (x_i - \mu)^2 \right]. \]

If we let $L(\theta) = \log f_n(x \mid \theta)$, then

\[ \frac{\partial}{\partial\theta} L(\theta) = -\frac{n}{2\theta} + \frac{1}{2\theta^2} \sum_{i=1}^n (x_i - \mu)^2. \]

The maximum of $L(\theta)$ will be attained at a value of $\theta$ for which this derivative is equal to 0. In this
way, we find that

\[ \hat{\theta} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2. \]


7. Let $y$ denote the sum of the observed values $x_1, \ldots, x_n$. Then the likelihood function is

\[ f_n(x \mid \beta) = \beta^n \exp(-\beta y). \]

If we let $L(\beta) = \log f_n(x \mid \beta)$, then

\[ \frac{\partial}{\partial\beta} L(\beta) = \frac{n}{\beta} - y. \]

The maximum of $L(\beta)$ will be attained at the value of $\beta$ for which this derivative is equal to 0. Therefore,
$\hat{\beta} = n/y = 1/\bar{x}_n$.


8. Let $y$ denote the sum of the observed values $x_1, \ldots, x_n$. Then the likelihood function is

\[ f_n(x \mid \theta) = \begin{cases} \exp(n\theta - y) & \text{for } \min\{x_1, \ldots, x_n\} > \theta, \\ 0 & \text{otherwise.} \end{cases} \]

(a) For each value of $x$, $f_n(x \mid \theta)$ will be a maximum when $\theta$ is made as large as possible subject
to the strict inequality $\theta < \min\{x_1, \ldots, x_n\}$. Therefore, the value $\theta = \min\{x_1, \ldots, x_n\}$ cannot be
used and there is no M.L.E.

(b) Suppose that the p.d.f. given in this exercise is replaced by the following equivalent p.d.f., in which
strict and weak inequalities have been changed:

\[ f(x \mid \theta) = \begin{cases} \exp(\theta - x) & \text{for } x \geq \theta, \\ 0 & \text{for } x < \theta. \end{cases} \]

Then the likelihood function $f_n(x \mid \theta)$ will be nonzero for $\theta \leq \min\{x_1, \ldots, x_n\}$ and the M.L.E. will
be $\hat{\theta} = \min\{x_1, \ldots, x_n\}$.


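9. If $0 < x_i < 1$ for $i = 1, \ldots, n$, then the likelihood function will be as follows:

\[ f_n(x \mid \theta) = \theta^n \left( \prod_{i=1}^n x_i \right)^{\theta - 1}. \]

If we let $L(\theta) = \log f_n(x \mid \theta)$, then

\[ \frac{\partial}{\partial\theta} L(\theta) = \frac{n}{\theta} + \sum_{i=1}^n \log x_i. \]

Therefore, $\hat{\theta} = -n/\sum_{i=1}^n \log x_i$. It should be noted that $\hat{\theta} > 0$.


10. The likelihood function is

\[ f_n(x \mid \theta) = \frac{1}{2^n} \exp\left\{ -\sum_{i=1}^n |x_i - \theta| \right\}. \]

Therefore, the M.L.E. of $\theta$ will be the value that minimizes $\sum_{i=1}^n |x_i - \theta|$. The solution to this minimization
problem was given in the solution to Exercise 10 of Sec. 4.5.


11. The p.d.f. of each observation can be written as follows:

\[ f(x \mid \theta_1, \theta_2) = \begin{cases} \dfrac{1}{\theta_2 - \theta_1} & \text{for } \theta_1 \leq x \leq \theta_2, \\ 0 & \text{otherwise.} \end{cases} \]

Therefore, the likelihood function is

\[ f_n(x \mid \theta_1, \theta_2) = \frac{1}{(\theta_2 - \theta_1)^n} \]

for $\theta_1 \leq \min\{x_1, \ldots, x_n\} \leq \max\{x_1, \ldots, x_n\} \leq \theta_2$, and $f_n(x \mid \theta_1, \theta_2) = 0$ otherwise. Hence, $f_n(x \mid \theta_1, \theta_2)$
will be a maximum when $\theta_2 - \theta_1$ is made as small as possible. Since the smallest possible value of $\theta_2$ is
$\max\{x_1, \ldots, x_n\}$ and the largest possible value of $\theta_1$ is $\min\{x_1, \ldots, x_n\}$, these values are the M.L.E.'s.


12. The likelihood function is

\[ f_n(x \mid \theta_1, \ldots, \theta_k) = \theta_1^{n_1} \cdots \theta_k^{n_k}. \]

If we let $L(\theta_1, \ldots, \theta_k) = \log f_n(x \mid \theta_1, \ldots, \theta_k)$ and let $\theta_k = 1 - \sum_{i=1}^{k-1} \theta_i$, then

\[ \frac{\partial L(\theta_1, \ldots, \theta_k)}{\partial\theta_i} = \frac{n_i}{\theta_i} - \frac{n_k}{\theta_k} \quad \text{for } i = 1, \ldots, k-1. \]

If each of these derivatives is set equal to 0, we obtain the relations

\[ \frac{\theta_1}{n_1} = \frac{\theta_2}{n_2} = \cdots = \frac{\theta_k}{n_k}. \]

If we let $\theta_i = \alpha n_i$ for $i = 1, \ldots, k$, then

\[ 1 = \sum_{i=1}^k \theta_i = \alpha \sum_{i=1}^k n_i = \alpha n. \]

Hence $\alpha = 1/n$. It follows that $\hat{\theta}_i = n_i/n$ for $i = 1, \ldots, k$.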
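
The result of Exercise 12 can be spot-checked by comparing the log-likelihood at $\hat{\theta}_i = n_i/n$ with its value at nearby probability vectors. The counts below are hypothetical, and the helper `log_likelihood` is only a sketch of $L(\theta_1, \ldots, \theta_k) = \sum_i n_i \log \theta_i$.

```python
import math

def log_likelihood(theta, counts):
    """Multinomial log-likelihood, up to an additive constant: sum of n_i * log(theta_i)."""
    return sum(n * math.log(t) for n, t in zip(counts, theta))

counts = [5, 3, 2]                      # hypothetical cell counts n_1, n_2, n_3
n = sum(counts)
mle = [c / n for c in counts]           # theta_i-hat = n_i / n

# Move a little probability mass from one cell to another and compare.
eps = 0.01
perturbed = [mle[0] + eps, mle[1] - eps, mle[2]]
print(log_likelihood(mle, counts) > log_likelihood(perturbed, counts))
```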
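13. It follows from Eq. (5.10.2) (with $x_1$ and $x_2$ now replaced by $x$ and $y$) that the likelihood function is

\[ f_n(x, y \mid \mu_1, \mu_2) \propto \exp\left\{ -\frac{1}{2(1-\rho^2)} \sum_{i=1}^n \left[ \left( \frac{x_i - \mu_1}{\sigma_1} \right)^2 - 2\rho \left( \frac{x_i - \mu_1}{\sigma_1} \right) \left( \frac{y_i - \mu_2}{\sigma_2} \right) + \left( \frac{y_i - \mu_2}{\sigma_2} \right)^2 \right] \right\}. \]

If we let $L(\mu_1, \mu_2) = \log f_n(x, y \mid \mu_1, \mu_2)$, then

\[ \frac{\partial L(\mu_1, \mu_2)}{\partial\mu_1} = \frac{1}{1-\rho^2} \left[ \frac{1}{\sigma_1^2} \left( \sum_{i=1}^n x_i - n\mu_1 \right) - \frac{\rho}{\sigma_1\sigma_2} \left( \sum_{i=1}^n y_i - n\mu_2 \right) \right], \]

\[ \frac{\partial L(\mu_1, \mu_2)}{\partial\mu_2} = \frac{1}{1-\rho^2} \left[ \frac{1}{\sigma_2^2} \left( \sum_{i=1}^n y_i - n\mu_2 \right) - \frac{\rho}{\sigma_1\sigma_2} \left( \sum_{i=1}^n x_i - n\mu_1 \right) \right]. \]

When these derivatives are set equal to 0, the unique solution is $\mu_1 = \bar{x}_n$ and $\mu_2 = \bar{y}_n$. Hence, these
values are the M.L.E.'s.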


7.6 Properties of Maximum Likelihood Estimators 


Commentary 


The material on sampling plans at the end of this section is a bit more subtle than the rest of the section, 
and should only be introduced to students who are capable of a deeper understanding of the material. 

If you are using the software R, the digamma function mentioned in Example 7.6.4 can be computed with
the function digamma which takes only one argument. The trigamma function mentioned in Example 7.6.6
can be computed with the function trigamma which takes only one argument. R also has several functions like
nlm and optim for minimizing general functions. The required arguments to nlm are the name of another R
function with a vector argument over which the minimization is done, and a starting value for the argument.
If the function has additional arguments that remain fixed during the minimization, those can be listed after
the starting vector, but they must be named explicitly. For optim, the first two arguments are reversed.
Both functions have an optional argument hessian which, if set to TRUE, will tell the function to compute a
matrix of numerical second partial derivatives. For example, if we want to minimize a function f(x,y) over
x with y fixed at c(3,1.2) starting from x=x0, we could use
optim(x0,f,y=c(3,1.2)). If we wish to maximize a function g, we can define f to be -g and pass that to
either optim or nlm.


Solutions to Exercises 


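1. The M.L.E. of $\exp(-1/\theta)$ is $\exp(-1/\hat{\theta})$, where $\hat{\theta} = -n/\sum_{i=1}^n \log(x_i)$ is the M.L.E. of $\theta$. That is, the
M.L.E. of $\exp(-1/\theta)$ is

\[ \exp\left( \frac{1}{n} \sum_{i=1}^n \log(x_i) \right) = \exp\left( \log \left[ \prod_{i=1}^n x_i \right]^{1/n} \right) = \left( \prod_{i=1}^n x_i \right)^{1/n}. \]


2. The standard deviation of the Poisson distribution with mean $\theta$ is $\sigma = \theta^{1/2}$. Therefore, $\hat{\sigma} = \hat{\theta}^{1/2}$. It
was found in Exercise 5 of Sec. 7.5 that $\hat{\theta} = \bar{X}_n$.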
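3. The median of an exponential distribution with parameter $\beta$ is the number $m$ such that

\[ \int_0^m \beta \exp(-\beta x)\, dx = \frac{1}{2}. \]

Therefore, $m = (\log 2)/\beta$, and it follows that $\hat{m} = (\log 2)/\hat{\beta}$. It was shown in Exercise 7 of Sec. 7.5
that $\hat{\beta} = 1/\bar{X}_n$.

4. The probability that a given lamp will fail in a period of $T$ hours is $p = 1 - \exp(-\beta T)$, and the
probability that exactly $x$ lamps will fail is $\binom{n}{x} p^x (1-p)^{n-x}$. It was shown in Example 7.5.4 that
$\hat{p} = x/n$. Since $\beta = -\log(1 - p)/T$, it follows that $\hat{\beta} = -\log(1 - x/n)/T$.


5. Since the mean of the uniform distribution is $\mu = (a + b)/2$, it follows that $\hat{\mu} = (\hat{a} + \hat{b})/2$. It was shown
in Exercise 11 of Sec. 7.5 that $\hat{a} = \min\{X_1, \ldots, X_n\}$ and $\hat{b} = \max\{X_1, \ldots, X_n\}$.


6. The distribution of $Z = (X - \mu)/\sigma$ will be a standard normal distribution. Therefore,

\[ 0.95 = \Pr(X < \theta) = \Pr\left( Z < \frac{\theta - \mu}{\sigma} \right) = \Phi\left( \frac{\theta - \mu}{\sigma} \right). \]

Hence, from a table of the values of $\Phi$ it is found that $(\theta - \mu)/\sigma = 1.645$. Since $\theta = \mu + 1.645\sigma$, it
follows that $\hat{\theta} = \hat{\mu} + 1.645\hat{\sigma}$. By example 6.5.4, we have $\hat{\mu} = \bar{X}_n$ and $\hat{\sigma} = \left[ \sum_{i=1}^n (X_i - \bar{X}_n)^2/n \right]^{1/2}$.


7. \[ \nu = \Pr(X > 2) = \Pr\left( Z > \frac{2 - \mu}{\sigma} \right) = 1 - \Phi\left( \frac{2 - \mu}{\sigma} \right) = \Phi\left( \frac{\mu - 2}{\sigma} \right). \]

Therefore, $\hat{\nu} = \Phi((\hat{\mu} - 2)/\hat{\sigma})$.

8. Let $\delta = \Gamma'(\alpha)/\Gamma(\alpha)$. Then $\hat{\delta} = \Gamma'(\hat{\alpha})/\Gamma(\hat{\alpha})$. It follows from Eq. (7.6.5) that $\hat{\delta} = \sum_{i=1}^n (\log X_i)/n$.

9. If we let $y = \sum_{i=1}^n x_i$, then the likelihood function is

\[ f_n(x \mid \alpha, \beta) = \frac{\beta^{n\alpha}}{[\Gamma(\alpha)]^n} \left( \prod_{i=1}^n x_i \right)^{\alpha - 1} \exp(-\beta y). \]

If we now let $L(\alpha, \beta) = \log f_n(x \mid \alpha, \beta)$, then

\[ L(\alpha, \beta) = n\alpha \log \beta - n \log \Gamma(\alpha) + (\alpha - 1) \log\left( \prod_{i=1}^n x_i \right) - \beta y. \]

Since $\hat{\alpha}$ and $\hat{\beta}$ must satisfy the equation $\partial L(\alpha, \beta)/\partial\beta = 0$ [as well as the equation $\partial L(\alpha, \beta)/\partial\alpha = 0$], it
follows that $\hat{\alpha}/\hat{\beta} = y/n = \bar{x}_n$.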
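10. The likelihood function is

\[ f_n(x \mid \alpha, \beta) = \left[ \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \right]^n \left( \prod_{i=1}^n x_i \right)^{\alpha - 1} \left[ \prod_{i=1}^n (1 - x_i) \right]^{\beta - 1}. \]

If we let $L(\alpha, \beta) = \log f_n(x \mid \alpha, \beta)$, then

\[ L(\alpha, \beta) = n \log \Gamma(\alpha + \beta) - n \log \Gamma(\alpha) - n \log \Gamma(\beta) + (\alpha - 1) \sum_{i=1}^n \log x_i + (\beta - 1) \sum_{i=1}^n \log(1 - x_i). \]

Hence,

\[ \frac{\partial L(\alpha, \beta)}{\partial\alpha} = n\, \frac{\Gamma'(\alpha + \beta)}{\Gamma(\alpha + \beta)} - n\, \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} + \sum_{i=1}^n \log x_i \]

and

\[ \frac{\partial L(\alpha, \beta)}{\partial\beta} = n\, \frac{\Gamma'(\alpha + \beta)}{\Gamma(\alpha + \beta)} - n\, \frac{\Gamma'(\beta)}{\Gamma(\beta)} + \sum_{i=1}^n \log(1 - x_i). \]

The estimates $\hat{\alpha}$ and $\hat{\beta}$ must satisfy the equations $\partial L(\alpha, \beta)/\partial\alpha = 0$ and $\partial L(\alpha, \beta)/\partial\beta = 0$. Therefore,
$\hat{\alpha}$ and $\hat{\beta}$ must also satisfy the equation $\partial L(\alpha, \beta)/\partial\alpha = \partial L(\alpha, \beta)/\partial\beta$. This equation reduces to the one
given in the exercise.


11. Let $Y_n = \max\{X_1, \ldots, X_n\}$. It was shown in Example 7.5.7 that $\hat{\theta} = Y_n$. Therefore, for $\varepsilon > 0$,

\[ \Pr(|\hat{\theta} - \theta| < \varepsilon) = \Pr(Y_n > \theta - \varepsilon) = 1 - \left( \frac{\theta - \varepsilon}{\theta} \right)^n. \]

It follows that $\lim_{n \to \infty} \Pr(|\hat{\theta} - \theta| < \varepsilon) = 1$. Therefore, $\hat{\theta} \to \theta$ in probability.


12. We know that $\hat{\beta} = 1/\bar{X}_n$. Also, since the mean of the exponential distribution is $\mu = 1/\beta$, it follows
from the law of large numbers that $\bar{X}_n \to 1/\beta$ in probability. Hence, $\hat{\beta} \to \beta$ in probability.


13. Let $Z_i = -\log X_i$ for $i = 1, \ldots, n$. Then by Exercise 9 of Sec. 7.5, $\hat{\theta} = 1/\bar{Z}_n$. If $X_i$ has the p.d.f.
$f(x \mid \theta)$ specified in that exercise, then the p.d.f. $g(z \mid \theta)$ of $Z_i$ will be as follows, for $z > 0$:

\[ g(z \mid \theta) = f(\exp(-z) \mid \theta) \left| \frac{dx}{dz} \right| = \theta (\exp(-z))^{\theta - 1} \exp(-z) = \theta \exp(-\theta z). \]

Therefore, $Z_i$ has an exponential distribution with parameter $\theta$. It follows that $E(Z_i) = 1/\theta$. Furthermore,
since $X_1, \ldots, X_n$ form a random sample from a distribution for which the p.d.f. is $f(x \mid \theta)$, it
follows that $Z_1, \ldots, Z_n$ will have the same joint distribution as a random sample from an exponential
distribution with parameter $\theta$. Therefore, by the law of large numbers, $\bar{Z}_n \to 1/\theta$ in probability. It follows that
$\hat{\theta} \to \theta$ in probability.


14. The M.L.E. $\hat{p}$ is equal to the proportion of butterflies in the sample that have the special marking,
regardless of the sampling plan. Therefore, (a) $\hat{p} = 5/43$ and (b) $\hat{p} = 3/58$.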
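
The consistency argument in Exercise 11 above rests on an explicit probability, so it is easy to tabulate how fast it approaches 1. The function name `prob_within` and the values $\theta = 1$, $\varepsilon = 0.05$ are hypothetical choices for illustration; the formula assumes $0 < \varepsilon < \theta$.

```python
def prob_within(eps, theta, n):
    """Pr(|max(X_1..X_n) - theta| < eps) for a uniform[0, theta] sample, 0 < eps < theta."""
    return 1 - ((theta - eps) / theta) ** n

sample_sizes = (10, 100, 1000)
probs = [prob_within(0.05, 1.0, n) for n in sample_sizes]
for n, p in zip(sample_sizes, probs):
    print(n, round(p, 4))
```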


15. As explained in this section, the likelihood function for the 21 observations is equal to the joint p.d.f.
of the 20 observations for which the exact value is known, multiplied by the probability $\exp(-15/\mu)$
that the 21st observation is greater than 15. If we let $y$ denote the sum of the first 20 observations,
then the likelihood function is

\[ \frac{1}{\mu^{20}} \exp(-y/\mu) \exp(-15/\mu). \]

Since $y = (20)(6) = 120$, this likelihood function reduces to

\[ \frac{1}{\mu^{20}} \exp(-135/\mu). \]

The value of $\mu$ which maximizes this likelihood function is $\hat{\mu} = 6.75$.


16. The likelihood function determined by any observed value $x$ of $X$ is $\theta^3 x^2 \exp(-\theta x)/2$. The likelihood
function determined by any observed value $y$ of $Y$ is $(2\theta)^y \exp(-2\theta)/y!$. Therefore, when $X = 2$ and
$Y = 3$, each of these functions is proportional to $\theta^3 \exp(-2\theta)$. The M.L.E. obtained by either statistician
will be the value of $\theta$ which maximizes this expression. That value is $\hat{\theta} = 3/2$.


17. The likelihood function determined by any observed value $x$ of $X$ is $\binom{10}{x} p^x (1-p)^{10-x}$. By Eq. (5.5.1)
the likelihood function determined by any observed value $y$ of $Y$ is $\binom{3+y}{y} p^4 (1-p)^y$. Therefore, when
$X = 4$ and $Y = 6$, each of these likelihood functions is proportional to $p^4 (1-p)^6$. The M.L.E. obtained
by either statistician will be the value of $p$ which maximizes this expression. That value is $\hat{p} = 2/5$.


18. The mean of a Bernoulli random variable with parameter $p$ is $p$. Hence, the method of moments
estimator is the sample mean, which is also the M.L.E.


19. The mean of an exponential random variable with parameter $\beta$ is $1/\beta$, so the method of moments
estimator is one over the sample mean, which is also the M.L.E.


20. The mean of a Poisson random variable is $\theta$, hence the method of moments estimator of $\theta$ is the sample
mean, which is also the M.L.E.


21. The M.L.E. of the mean is the sample mean, which is the method of moments estimator. The M.L.E. of
$\sigma^2$ is the mean of the $X_i^2$'s minus the square of the sample mean, which is also the method of moments
estimator of the variance.


22. The mean of $X_i$ is $\theta/2$, so the method of moments estimator is $2\bar{X}_n$. The M.L.E. is the maximum of
the $X_i$ values.


23. (a) The means of $X_i$ and $X_i^2$ are respectively $\alpha/(\alpha + \beta)$ and $\alpha(\alpha + 1)/[(\alpha + \beta)(\alpha + \beta + 1)]$. We set
these equal to the sample moments $\bar{x}_n$ and $\overline{x^2}_n$, and solve for $\alpha$ and $\beta$. After some tedious algebra,
we get

\[ \hat{\alpha} = \frac{\bar{x}_n (\bar{x}_n - \overline{x^2}_n)}{\overline{x^2}_n - \bar{x}_n^2}, \qquad \hat{\beta} = \frac{(1 - \bar{x}_n)(\bar{x}_n - \overline{x^2}_n)}{\overline{x^2}_n - \bar{x}_n^2}. \]

(b) The M.L.E. involves derivatives of the gamma function and the products $\prod_{i=1}^n x_i$ and $\prod_{i=1}^n (1 - x_i)$.
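
The method of moments formulas in part (a) can be sanity-checked by feeding in the exact moments of a known beta distribution and confirming that the parameters are recovered. The function name `beta_method_of_moments` is hypothetical, and the test case (a beta distribution with parameters 2 and 3, whose first two moments are $2/5$ and $1/5$) is chosen only for illustration.

```python
def beta_method_of_moments(m1, m2):
    """Return (alpha, beta) whose beta distribution has mean m1 and second moment m2."""
    scale = (m1 - m2) / (m2 - m1 ** 2)   # this common factor equals alpha + beta
    return m1 * scale, (1 - m1) * scale

# Exact moments of the beta distribution with alpha = 2, beta = 3:
# mean 2/5 and second moment (2)(3)/[(5)(6)] = 1/5.
alpha_hat, beta_hat = beta_method_of_moments(2 / 5, 1 / 5)
print(alpha_hat, beta_hat)
```

In practice `m1` and `m2` would be the sample moments $\bar{x}_n$ and $\overline{x^2}_n$ rather than exact values.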


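24. The p.d.f. of each $(X_i, Y_i)$ pair can be factored as

\[ \frac{1}{(2\pi)^{1/2}\sigma_1} \exp\left( -\frac{1}{2\sigma_1^2} (x_i - \mu_1)^2 \right) \frac{1}{(2\pi)^{1/2}\sigma_{2.1}} \exp\left( -\frac{1}{2\sigma_{2.1}^2} (y_i - \alpha - \beta x_i)^2 \right), \tag{S.7.4} \]

where the new parameters are defined in the exercise. The product of $n$ factors of the form (S.7.4) can
be factored into the product of the $n$ first factors times the product of the $n$ second factors, each of
which can be maximized separately because there are no parameters in common. The product of the
first factors is the same as the likelihood of a sample of normal random variables, and the M.L.E.'s are
$\hat{\mu}_1$ and $\hat{\sigma}_1^2$ as stated in the exercise. The product of the second factors is slightly more complicated than
the likelihood from a sample of normal random variables, but not much more so. Take the logarithm
to get

\[ -\frac{n}{2} [\log(2\pi) + \log(\sigma_{2.1}^2)] - \frac{1}{2\sigma_{2.1}^2} \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2. \tag{S.7.5} \]

Taking the partial derivatives with respect to $\alpha$ and $\beta$ yields

\[ \frac{\partial}{\partial\alpha} = \frac{1}{\sigma_{2.1}^2} \sum_{i=1}^n (y_i - \alpha - \beta x_i), \]

\[ \frac{\partial}{\partial\beta} = \frac{1}{\sigma_{2.1}^2} \sum_{i=1}^n x_i (y_i - \alpha - \beta x_i). \]

Setting the first line equal to 0 and solving for $\alpha$ yields

\[ \alpha = \bar{y}_n - \beta \bar{x}_n. \tag{S.7.6} \]

Plug (S.7.6) into the second of the partial derivatives to get (after a bit of algebra)

\[ \hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^n (x_i - \bar{x}_n)^2}. \tag{S.7.7} \]

Substitute (S.7.7) back into (S.7.6) to get

\[ \hat{\alpha} = \bar{y}_n - \hat{\beta} \bar{x}_n. \]

Next, take the partial derivative of (S.7.5) with respect to $\sigma_{2.1}^2$ to get

\[ -\frac{n}{2\sigma_{2.1}^2} + \frac{1}{2\sigma_{2.1}^4} \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2. \tag{S.7.8} \]

Now, substitute both $\hat{\alpha}$ and $\hat{\beta}$ into (S.7.8) and solve for $\sigma_{2.1}^2$. The result is

\[ \hat{\sigma}_{2.1}^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i)^2. \]

Finally, we can solve for the M.L.E.'s of the original parameters. We already have $\hat{\mu}_1$ and $\hat{\sigma}_1^2$. The
equation $\alpha = \mu_2 - \rho\sigma_2\mu_1/\sigma_1$ can be rewritten $\alpha = \mu_2 - \beta\mu_1$. It follows that

\[ \hat{\mu}_2 = \hat{\alpha} + \hat{\beta}\hat{\mu}_1 = \bar{y}_n. \]

The equation $\beta = \rho\sigma_2/\sigma_1$ can be rewritten $\rho\sigma_2 = \beta\sigma_1$. Plugging this into $\sigma_{2.1}^2 = (1 - \rho^2)\sigma_2^2$ yields
$\sigma_{2.1}^2 = \sigma_2^2 - \beta^2\sigma_1^2$. Hence,

\[ \hat{\sigma}_2^2 = \hat{\sigma}_{2.1}^2 + \hat{\beta}^2 \hat{\sigma}_1^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i)^2 + \frac{1}{n}\, \frac{\left[ \sum_{i=1}^n (y_i - \bar{y}_n)(x_i - \bar{x}_n) \right]^2}{\sum_{i=1}^n (x_i - \bar{x}_n)^2} = \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y}_n)^2, \]

where the final equality is tedious but straightforward algebra. Finally,

\[ \hat{\rho} = \frac{\hat{\beta}\hat{\sigma}_1}{\hat{\sigma}_2} = \frac{\sum_{i=1}^n (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\left[ \sum_{i=1}^n (x_i - \bar{x}_n)^2 \right]^{1/2} \left[ \sum_{i=1}^n (y_i - \bar{y}_n)^2 \right]^{1/2}}. \]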
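
The "tedious but straightforward algebra" in Exercise 24 can be checked numerically by computing the M.L.E.'s through the reparametrization and comparing with the direct formulas. The data below are hypothetical and the function name `bivariate_normal_mles` is only a label for this sketch.

```python
import math

def bivariate_normal_mles(xs, ys):
    """M.L.E.'s computed through the (alpha, beta, sigma_{2.1}^2) parametrization."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    beta = sxy / sxx                          # (S.7.7)
    alpha = ybar - beta * xbar                # (S.7.6) with beta-hat plugged in
    var_21 = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys)) / n
    var_1 = sxx / n
    var_2 = var_21 + beta ** 2 * var_1        # sigma_2^2 = sigma_{2.1}^2 + beta^2 sigma_1^2
    rho = beta * math.sqrt(var_1 / var_2)     # rho = beta * sigma_1 / sigma_2
    return {"mu1": xbar, "mu2": alpha + beta * xbar, "var1": var_1, "var2": var_2, "rho": rho}

xs = [1.0, 2.0, 3.0, 4.0, 5.0]                # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
est = bivariate_normal_mles(xs, ys)
ybar = sum(ys) / len(ys)
direct_var2 = sum((y - ybar) ** 2 for y in ys) / len(ys)
print(est["var2"], direct_var2)
```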


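25. When we observe only the first $n - k$ $Y_i$'s, the M.L.E.'s of $\mu_1$ and $\sigma_1^2$ are not affected. The M.L.E.'s of
$\alpha$, $\beta$ and $\sigma_{2.1}^2$ are just as in the previous exercise but with $n$ replaced by $n - k$. The M.L.E.'s of $\mu_2$, $\sigma_2^2$
and $\rho$ are obtained by substituting $\hat{\alpha}$, $\hat{\beta}$ and $\hat{\sigma}_{2.1}^2$ into the three equations from Exercise 24:

\[ \hat{\mu}_2 = \hat{\alpha} + \hat{\beta}\hat{\mu}_1, \]
\[ \hat{\sigma}_2^2 = \hat{\sigma}_{2.1}^2 + \hat{\beta}^2\hat{\sigma}_1^2, \]
\[ \hat{\rho} = \frac{\hat{\beta}\hat{\sigma}_1}{\hat{\sigma}_2}. \]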

7.7 Sufficient Statistics 


Commentary 


The concept of sufficient statistics is fundamental to much of the traditional theory of statistical inference. 
However, it plays little or no role in the most common practice of statistics. For the most popular distributional
models for real data, the most obvious data summaries are sufficient statistics. In Bayesian inference,
the posterior distribution is automatically a function of every sufficient statistic, so one does not even have 
to think about sufficiency in Bayesian inference. For these reasons, the material in Secs. 7.7—7.9 should only 
be covered in courses that place a great deal of emphasis on the mathematical theory of statistics. 


Solutions to Exercises 


In Exercises 1-11, let $t$ denote the value of the statistic $T$ when the observed values of $X_1, \ldots, X_n$ are
$x_1, \ldots, x_n$. In each exercise, we shall show that $T$ is a sufficient statistic by showing that the joint p.f. or
joint p.d.f. can be factored as in Eq. (7.7.1).


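1. The joint p.f. is

\[ f_n(x \mid p) = p^t (1 - p)^{n-t}. \]


2. The joint p.f. is

\[ f_n(x \mid p) = p^n (1 - p)^t. \]


3. The joint p.f. is

\[ f_n(x \mid p) = \left[ \prod_{i=1}^n \binom{r + x_i - 1}{x_i} \right] [p^{nr} (1 - p)^t]. \]

Since the expression inside the first set of square brackets does not depend on the parameter $p$, it follows
that $T$ is a sufficient statistic for $p$.


4. The joint p.d.f. is

\[ f_n(x \mid \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{t}{2\sigma^2} \right\}. \]


5. The joint p.d.f. is

\[ f_n(x \mid \beta) = \left[ \frac{1}{[\Gamma(\alpha)]^n} \left( \prod_{i=1}^n x_i \right)^{\alpha - 1} \right] \{ \beta^{n\alpha} \exp(-n\beta t) \}. \]


6. The joint p.d.f. in this exercise is the same as that given in Exercise 5. However, since the unknown
parameter is now $\alpha$ instead of $\beta$, the appropriate factorization is now as follows:

\[ f_n(x \mid \alpha) = \left\{ \exp\left( -\beta \sum_{i=1}^n x_i \right) \right\} \left\{ \frac{\beta^{n\alpha}}{[\Gamma(\alpha)]^n}\, t^{\alpha - 1} \right\}. \]


7. The joint p.d.f. is

\[ f_n(x \mid \alpha) = \left[ \frac{1}{[\Gamma(\beta)]^n} \left( \prod_{i=1}^n (1 - x_i) \right)^{\beta - 1} \right] \left\{ \left[ \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)} \right]^n t^{\alpha - 1} \right\}. \]


8. The joint p.f. is

\[ f_n(x \mid \theta) = \begin{cases} \dfrac{1}{\theta^n} & \text{for } t \leq \theta, \\ 0 & \text{otherwise.} \end{cases} \]

If the function $h(t, \theta)$ is defined as in Example 7.7.5, with the values of $t$ and $\theta$ now restricted to positive
integers, then it follows that

\[ f_n(x \mid \theta) = \frac{1}{\theta^n}\, h(t, \theta). \]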
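
The factorization in Exercise 1 says that the Bernoulli likelihood depends on the data only through $t = \sum x_i$. A small sketch makes this concrete: two samples with the same sum give the same likelihood at every $p$. The samples and the helper name `bernoulli_likelihood` are hypothetical.

```python
import math

def bernoulli_likelihood(xs, p):
    """Joint p.f. of i.i.d. Bernoulli(p) observations, multiplied out term by term."""
    out = 1.0
    for x in xs:
        out *= p if x == 1 else 1 - p
    return out

sample_a = [1, 1, 0, 0, 1]      # t = 3
sample_b = [0, 1, 1, 1, 0]      # a different sample with the same t = 3
grid = [0.1, 0.3, 0.5, 0.9]
same = all(
    math.isclose(bernoulli_likelihood(sample_a, p), bernoulli_likelihood(sample_b, p))
    for p in grid
)
print(same)
```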


10. 


11. 


12. 


13. 


14. 


15. 


16. 
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. The joint p.d-f. is 


h(t, b) 
(a |b) = ee 
ful |b) = Go 
where fh is defined in Example 7.7.5. 
The joint p.d-f. is 


h(a, t) 


fr(@ | a) = baa’ 
where fh is defined in Example 7.7.5. 


The joint p.d.f. or joint p.f. is 


(ae | 0) = { Ty0 (x4 1} ato |” exp[c(0)t] }. 


The likelihood function is 


Cai 


eee 


for all x; > xo. 


(8.7.9) 


(a) If zo is known, a is the parameter, and (8.7.9) has the form u(x)u[r(a),a], with u(a) = 1 if all 
x; > 2 and 0 if not, r(x) = []#_, 2, and vt, a] = axe" /t°T1. So T]#_, X; is a sufficient statistic. 

(b) If a is known, xp is the parameter, and (8.7.9) has the form u(x)v[r(x), xo], with u(a) = 
a TT, 2Je, r(w) = min{r,,...,2n}, and vit,z9] = 1 if t > xo and O if not. Hence 
min{X1,...,Xn} is a sufficient statistic. 


13. The statistic $T$ will be a sufficient statistic for $\theta$ if and only if $f_n(x \mid \theta)$ can be factored as in Eq. (7.7.1). However, since $r(x)$ can be expressed as a function of $r'(x)$, and conversely, there will be a factorization of the form given in Eq. (7.7.1) if and only if there is a similar factorization in which the function $v$ is a function of $r'(x)$ and $\theta$. Therefore, $T$ will be a sufficient statistic if and only if $T'$ is a sufficient statistic.

14. This result follows from previous exercises in two different ways. First, by Exercise 6, the statistic $T' = \prod_{i=1}^n X_i$ is a sufficient statistic. Hence, by Exercise 13, $T = \log T'$ is also a sufficient statistic. A second way to establish the same result is to note that, by Exercise 24(g) of Sec. 7.3, the gamma distributions form an exponential family with $d(x) = \log x$. Therefore, by Exercise 11, the statistic $T = \sum_{i=1}^n d(X_i)$ is a sufficient statistic.

15. It follows from Exercise 11 and Exercise 24(i) of Sec. 7.3 that the statistic $T' = \sum_{i=1}^n \log(1 - X_i)$ is a sufficient statistic. Since $T$ is a one-to-one function of $T'$, it follows from Exercise 13 that $T$ is also a sufficient statistic.


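16. Let $\xi(\theta)$ be a prior p.d.f. for $\theta$. The posterior p.d.f. of $\theta$ is, according to Bayes' theorem,

$$\xi(\theta \mid x) = \frac{f_n(x \mid \theta)\,\xi(\theta)}{\int f_n(x \mid \psi)\,\xi(\psi)\,d\psi} = \frac{u(x)\,v[r(x), \theta]\,\xi(\theta)}{\int u(x)\,v[r(x), \psi]\,\xi(\psi)\,d\psi} = \frac{v[r(x), \theta]\,\xi(\theta)}{\int v[r(x), \psi]\,\xi(\psi)\,d\psi},$$

where the second equality uses the factorization criterion. One can see that this last expression depends on $x$ only through $r(x)$.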
17. First, suppose that $T$ is sufficient. Then the likelihood function from observing $X = x$ is $u(x)v[r(x), \theta]$, which is proportional to $v[r(x), \theta]$. The likelihood from observing $T = t$ (when $t = r(x)$) is

$$\sum u(x)\,v[r(x), \theta] = v[t, \theta] \sum u(x), \tag{S.7.10}$$

where the sums in (S.7.10) are over all $x$ such that $t = r(x)$. Notice that the right side of (S.7.10) is proportional to $v[t, \theta] = v[r(x), \theta]$. So the two likelihoods are proportional. Next, suppose that the two likelihoods are proportional. That is, let $f(x \mid \theta)$ be the p.f. of $X$ and let $h(t \mid \theta)$ be the p.f. of $T$. If $t = r(x)$ then there exists $c(x)$ such that

$$f(x \mid \theta) = c(x)\,h(t \mid \theta).$$

Let $u(x) = c(x)$ and $v[t, \theta] = h(t \mid \theta)$, and apply the factorization criterion to see that $T$ is sufficient.


7.8 Jointly Sufficient Statistics 


Commentary 


Even those instructors who wish to cover the concept of a sufficient statistic in Sec. 7.7 may decide not to cover jointly sufficient statistics. This material is at a slightly more mathematical level than most of the text.


Solutions to Exercises 


In Exercises 1-4, let $t_1$ and $t_2$ denote the values of $T_1$ and $T_2$ when the observed values of $X_1,\ldots,X_n$ are $x_1,\ldots,x_n$. In each exercise, we shall show that $T_1$ and $T_2$ are jointly sufficient statistics by showing that the joint p.d.f. of $X_1,\ldots,X_n$ can be factored as in Eq. (7.8.1).


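1. The joint p.d.f. is

$$f_n(x \mid \alpha, \beta) = \frac{\beta^{n\alpha}}{[\Gamma(\alpha)]^n}\,t_1^{\alpha-1}\exp(-\beta t_2).$$

2. The joint p.d.f. is

$$f_n(x \mid \alpha, \beta) = \left[\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\right]^n t_1^{\alpha-1}\,t_2^{\beta-1}.$$

3. Let the function $h$ be as defined in Example 7.8.4. Then the joint p.d.f. can be written in the following form:

$$f_n(x \mid x_0, \alpha) = \frac{(\alpha x_0^{\alpha})^n}{t_2^{\alpha+1}}\,h(x_0, t_1).$$

4. Again let the function $h$ be as defined in Example 7.8.4. Then the joint p.d.f. can be written as follows:

$$f_n(x \mid \theta) = \frac{h(\theta, t_1)\,h(t_2, \theta+3)}{3^n}.$$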

5. The joint p.d.f. of the vectors $(X_i, Y_i)$, for $i = 1,\ldots,n$, was given in Eq. (5.10.2). The following relations hold:

$$\sum_{i=1}^n (x_i - \mu_1)^2 = \sum_{i=1}^n x_i^2 - 2\mu_1\sum_{i=1}^n x_i + n\mu_1^2,$$

$$\sum_{i=1}^n (y_i - \mu_2)^2 = \sum_{i=1}^n y_i^2 - 2\mu_2\sum_{i=1}^n y_i + n\mu_2^2,$$

$$\sum_{i=1}^n (x_i - \mu_1)(y_i - \mu_2) = \sum_{i=1}^n x_i y_i - \mu_2\sum_{i=1}^n x_i - \mu_1\sum_{i=1}^n y_i + n\mu_1\mu_2.$$

Because of these relations, it can be seen that the joint p.d.f. depends on the observed values of the $n$ vectors in the sample only through the values of the five statistics given in this exercise. Therefore, they are jointly sufficient statistics.
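6. The joint p.d.f. or joint p.f. is

$$f_n(x \mid \theta) = \left\{\prod_{j=1}^n b(x_j)\right\}[a(\theta)]^n \exp\left\{\sum_{i=1}^k c_i(\theta)\sum_{j=1}^n d_i(x_j)\right\}.$$

It follows from the factorization criterion that $T_1,\ldots,T_k$ are jointly sufficient statistics for $\theta$.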


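7. In each part of this exercise we shall first present the p.d.f. $f$, and then we shall identify the functions $a$, $b$, $c_1$, $d_1$, $c_2$, and $d_2$ in the form for a two-parameter exponential family given in Exercise 6.

(a) Let $\theta = (\mu, \sigma^2)$. Then $f(x \mid \theta)$ is as given in the solution to Exercise 24(d) of Sec. 7.3. Therefore,
$a(\theta) = \dfrac{1}{(2\pi\sigma^2)^{1/2}}\exp\left(-\dfrac{\mu^2}{2\sigma^2}\right)$, $b(x) = 1$, $c_1(\theta) = -\dfrac{1}{2\sigma^2}$, $d_1(x) = x^2$, $c_2(\theta) = \dfrac{\mu}{\sigma^2}$, $d_2(x) = x$.

(b) Let $\theta = (\alpha, \beta)$. Then $f(x \mid \theta)$ is as given in the solution to Exercise 24(f) of Sec. 7.3. Therefore,
$a(\theta) = \dfrac{\beta^{\alpha}}{\Gamma(\alpha)}$, $b(x) = 1$, $c_1(\theta) = \alpha - 1$, $d_1(x) = \log x$, $c_2(\theta) = -\beta$, $d_2(x) = x$.

(c) Let $\theta = (\alpha, \beta)$. Then $f(x \mid \theta)$ is as given in the solution to Exercise 24(h) of Sec. 7.3. Therefore,
$a(\theta) = \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$, $b(x) = 1$, $c_1(\theta) = \alpha - 1$, $d_1(x) = \log x$, $c_2(\theta) = \beta - 1$, and $d_2(x) = \log(1 - x)$.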


8. The M.L.E. of $\beta$ is $n/\sum_{i=1}^n X_i$. (See Exercise 7 in Sec. 7.5.) This is a one-to-one function of the sufficient statistic found in Exercise 5 of Sec. 7.7. Hence, the M.L.E. is sufficient. This makes it minimal sufficient.

9. By Example 7.5.4, $\hat{p} = \bar{X}_n$. By Exercise 1 of Sec. 7.7, $\hat{p}$ is a sufficient statistic. Therefore, $\hat{p}$ is a minimal sufficient statistic.

10. By Example 7.5.7, $\hat{\theta} = \max\{X_1,\ldots,X_n\}$. By Example 7.7.5, $\hat{\theta}$ is a sufficient statistic. Therefore, $\hat{\theta}$ is a minimal sufficient statistic.

11. By Example 7.8.5, the order statistics are minimal jointly sufficient statistics. Therefore, the M.L.E. of $\theta$, all by itself, cannot be a sufficient statistic. (We know from Example 7.6.5 that there is no simple expression for this M.L.E., so we cannot solve this exercise by first deriving the M.L.E. and then checking to see whether it is a sufficient statistic.)
12. If we let $T = \max\{X_1,\ldots,X_n\}$, let $t$ denote the observed value of $T$, and let the function $h$ be as defined in Example 7.8.4, then the likelihood function can be written as follows:

$$f_n(x \mid \theta) = \frac{2^n\left(\prod_{i=1}^n x_i\right)}{\theta^{2n}}\,h(t, \theta).$$

This function will be a maximum when $\theta$ is chosen as small as possible, subject to the constraint that $h(t, \theta) = 1$. Therefore, the M.L.E. of $\theta$ is $\hat{\theta} = t$. The median $m$ of the distribution will be the value such that

$$\int_0^m f(x \mid \theta)\,dx = \frac{1}{2}.$$

Hence, $m = \theta/\sqrt{2}$. It follows from the invariance principle that the M.L.E. of $m$ is $\hat{m} = \hat{\theta}/\sqrt{2} = t/\sqrt{2}$.

By applying the factorization criterion to $f_n(x \mid \theta)$, it can be seen that the statistic $T$ is a sufficient statistic for $\theta$. Therefore, the statistic $T/\sqrt{2}$, which is the M.L.E. of $m$, is also a sufficient statistic for $\theta$.


13. By Exercise 11 of Sec. 7.5, $\hat{a} = \min\{X_1,\ldots,X_n\}$ and $\hat{b} = \max\{X_1,\ldots,X_n\}$. By Example 7.8.4, $\hat{a}$ and $\hat{b}$ are jointly sufficient statistics. Therefore, $\hat{a}$ and $\hat{b}$ are minimal jointly sufficient statistics.

14. It can be shown that the values of the five M.L.E.'s given in Exercise 24 of Sec. 7.6 can be derived from the values of the five statistics given in Exercise 5 of this section by a one-to-one transformation. Since the five statistics in Exercise 5 are jointly sufficient statistics, the five M.L.E.'s are also jointly sufficient statistics. Hence, the M.L.E.'s will be minimal jointly sufficient statistics.

15. The Bayes estimator of $p$ is given by Eq. (7.4.5). Since $\sum_{i=1}^n X_i$ is a sufficient statistic for $p$, the Bayes estimator is also a sufficient statistic for $p$. Hence, this estimator will be a minimal sufficient statistic.

16. It follows from Theorem 7.3.2 that the Bayes estimator of $\lambda$ is $(\alpha + \sum_{i=1}^n X_i)/(\beta + n)$. Since $\sum_{i=1}^n X_i$ is a sufficient statistic for $\lambda$, the Bayes estimator is also a sufficient statistic for $\lambda$. Hence, this estimator will be a minimal sufficient statistic.

17. The Bayes estimator of $\mu$ is given by Eq. (7.4.6). Since $\bar{X}_n$ is a sufficient statistic for $\mu$, the Bayes estimator is also a sufficient statistic. Hence, this estimator will be a minimal sufficient statistic.


7.9 Improving an Estimator


Commentary 


If you decided to cover the material in Secs. 7.7 and 7.8, this section gives one valuable application of that 
material, Theorem 7.9.1 of Blackwell and Rao. This section ends with some foundational discussion of the 
use of sufficient statistics. This material is included for those instructors who want their students to have 
both a working and a critical understanding of the topic. 


Solutions to Exercises 


1. The statistic $Y_n = \sum_{i=1}^n X_i^2$ is a sufficient statistic for $\theta$. Since the value of the estimator $\delta_1$ cannot be determined from the value of $Y_n$ alone, $\delta_1$ is inadmissible.

2. A sufficient statistic in this example is $\max\{X_1,\ldots,X_n\}$. Since $2\bar{X}_n$ is not a function of the sufficient statistic, it cannot be admissible.


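3. The mean of the uniform distribution on the interval $[0, \theta]$ is $\theta/2$ and the variance is $\theta^2/12$. Therefore, $E_\theta(\bar{X}_n) = \theta/2$ and $\mathrm{Var}_\theta(\bar{X}_n) = \theta^2/(12n)$. In turn, it now follows that

$$E_\theta(\delta_1) = \theta \quad \text{and} \quad \mathrm{Var}_\theta(\delta_1) = \frac{\theta^2}{3n}.$$

Hence, for $\theta > 0$,

$$R(\theta, \delta_1) = E_\theta[(\delta_1 - \theta)^2] = \mathrm{Var}_\theta(\delta_1) = \frac{\theta^2}{3n}.$$

(a) It follows from the discussion given in Sec. 3.9 that the p.d.f. of $Y_n$ is as follows:

$$g(y \mid \theta) = \begin{cases} \dfrac{n y^{n-1}}{\theta^n} & \text{for } 0 \le y \le \theta, \\ 0 & \text{otherwise.} \end{cases}$$

Hence, for $\theta > 0$,

$$R(\theta, \delta_2) = E_\theta[(Y_n - \theta)^2] = \int_0^\theta (y - \theta)^2\,\frac{n y^{n-1}}{\theta^n}\,dy = \frac{2\theta^2}{(n+1)(n+2)}.$$

(b) If $n = 2$, $R(\theta, \delta_1) = R(\theta, \delta_2) = \theta^2/6$.

(c) Suppose $n \ge 3$. Then $R(\theta, \delta_2) < R(\theta, \delta_1)$ for any given value of $\theta > 0$ if and only if $2/[(n+1)(n+2)] < 1/(3n)$ or, equivalently, if and only if $6n < (n+1)(n+2) = n^2 + 3n + 2$. Hence, $R(\theta, \delta_2) < R(\theta, \delta_1)$ if and only if $n^2 - 3n + 2 > 0$ or, equivalently, if and only if $(n-2)(n-1) > 0$. Since this inequality is satisfied for all values of $n \ge 3$, it follows that $R(\theta, \delta_2) < R(\theta, \delta_1)$ for every value of $\theta > 0$. Hence, $\delta_2$ dominates $\delta_1$.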
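The comparison of the two risk functions can be checked with exact rational arithmetic. The following is a minimal sketch, assuming the closed forms $\theta^2/(3n)$ and $2\theta^2/[(n+1)(n+2)]$ derived in this solution; the function names are illustrative and do not come from the text.

```python
from fractions import Fraction

def risk_delta1(n, theta=1):
    """M.S.E. of delta_1 = 2 * (sample mean): theta^2 / (3n)."""
    return Fraction(theta) ** 2 / (3 * n)

def risk_delta2(n, theta=1):
    """M.S.E. of delta_2 = Y_n (sample maximum): 2 theta^2 / ((n+1)(n+2))."""
    return 2 * Fraction(theta) ** 2 / ((n + 1) * (n + 2))

# Tabulate both risks at theta = 1 for small sample sizes.
for n in range(1, 8):
    print(n, risk_delta1(n), risk_delta2(n))
```

Because both risks are proportional to $\theta^2$, a comparison at $\theta = 1$ settles the comparison for every $\theta > 0$.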


For any constant $c$,

$$R(\theta, cY_n) = E_\theta[(cY_n - \theta)^2] = c^2 E_\theta(Y_n^2) - 2c\theta E_\theta(Y_n) + \theta^2 = \left(\frac{n}{n+2}\,c^2 - \frac{2n}{n+1}\,c + 1\right)\theta^2.$$

Hence, for any given value of $n$ and any given value of $\theta > 0$, $R(\theta, cY_n)$ will be a minimum when $c$ is chosen so that the coefficient of $\theta^2$ in this expression is a minimum. By differentiating with respect to $c$, we find that the minimizing value of $c$ is $c = (n+2)/(n+1)$. Hence, the estimator $(n+2)Y_n/(n+1)$ dominates every other estimator of the form $cY_n$.
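The minimization over $c$ can be verified exactly. This is a small sketch under the assumption that $E_\theta(Y_n) = n\theta/(n+1)$ and $E_\theta(Y_n^2) = n\theta^2/(n+2)$, as follows from the p.d.f. of $Y_n$; the helper names are hypothetical.

```python
from fractions import Fraction

def coefficient(c, n):
    """Coefficient of theta^2 in R(theta, c * Y_n)."""
    c = Fraction(c)
    return Fraction(n, n + 2) * c**2 - Fraction(2 * n, n + 1) * c + 1

def best_c(n):
    """Root of the derivative of the quadratic above: (n + 2) / (n + 1)."""
    return Fraction(n + 2, n + 1)

n = 5
print(best_c(n), coefficient(best_c(n), n), coefficient(1, n))
```

Setting $c = 1$ recovers the risk coefficient of $Y_n$ itself, so the last two printed values compare the best rescaled estimator with the unscaled maximum.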


It was shown in Exercise 6 of Sec. 7.7 that $\prod_{i=1}^n X_i$ is a sufficient statistic in this problem. Since the value of $\bar{X}_n$ cannot be determined from the value of the sufficient statistic alone, $\bar{X}_n$ is inadmissible.

(a) Since the value of $\delta$ is always 3, $R(\theta, \delta) = (\theta - 3)^2$.

(b) Since $R(3, \delta) = 0$, no other estimator $\delta_1$ can dominate $\delta$ unless it is also true that $R(3, \delta_1) = 0$. But the only way that the M.S.E. of an estimator $\delta_1$ can be 0 is for the estimator $\delta_1$ to be equal to 3 with probability 1. In other words, the estimator $\delta_1$ must be the same as the estimator $\delta$. Therefore, $\delta$ is not dominated by any other estimator and it is admissible.
In other words, the estimator that always estimates the value of $\theta$ to be 3 is admissible because it is the best estimator to use if $\theta$ happens to be equal to 3. Of course, it is a poor estimator to use if $\theta$ happens to be different from 3.
8. It was shown in Example 7.7.2 that $\sum_{i=1}^n X_i$ is a sufficient statistic in this problem. Since the proportion $\hat{\beta}$ of observations that have the value 0 cannot be determined from the value of the sufficient statistic alone, $\hat{\beta}$ is inadmissible.


9. Suppose that $X$ has a continuous distribution for which the p.d.f. is $f$. Then

$$E(X) = \int_{-\infty}^0 x f(x)\,dx + \int_0^\infty x f(x)\,dx.$$

Suppose first that $E(X) \le 0$. Then

$$|E(X)| = -E(X) = \int_{-\infty}^0 (-x) f(x)\,dx - \int_0^\infty x f(x)\,dx \le \int_{-\infty}^0 (-x) f(x)\,dx + \int_0^\infty x f(x)\,dx = \int_{-\infty}^\infty |x| f(x)\,dx = E(|X|).$$

A similar proof can be given if $X$ has a discrete distribution or a more general type of distribution, or if $E(X) > 0$.

Alternatively, the result is immediate from Jensen's inequality, Theorem 4.2.5.


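10. We shall follow the steps of the proof of Theorem 7.9.1. It follows from Exercise 9 that

$$E_\theta(|\delta - \theta| \mid T) \ge |E_\theta(\delta - \theta \mid T)| = |E_\theta(\delta \mid T) - \theta| = |\delta_0 - \theta|.$$

Therefore,

$$E_\theta(|\delta_0 - \theta|) \le E_\theta[E_\theta(|\delta - \theta| \mid T)] = E_\theta(|\delta - \theta|).$$

11. Since $\hat{\theta}$ is the M.L.E. of $\theta$, we know from the discussion in Sec. 7.8 that $\hat{\theta}$ is a function of $T$ alone. Since $\hat{\theta}$ is already a function of $T$, taking the conditional expectation $E(\hat{\theta} \mid T)$ will not affect $\hat{\theta}$. Hence, $\delta_0 = E(\hat{\theta} \mid T) = \hat{\theta}$.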


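12. Since $X_1$ must be either 0 or 1,

$$E(X_1 \mid T = t) = \Pr(X_1 = 1 \mid T = t).$$

If $t = 0$ then every $X_i$ must be 0. Therefore, $E(X_1 \mid T = 0) = 0$.

Suppose now that $t > 0$. Then

$$\Pr(X_1 = 1 \mid T = t) = \frac{\Pr(X_1 = 1 \text{ and } T = t)}{\Pr(T = t)} = \frac{\Pr\left(X_1 = 1 \text{ and } \sum_{i=2}^n X_i = t - 1\right)}{\Pr(T = t)}.$$

Since $X_1$ and $\sum_{i=2}^n X_i$ are independent,

$$\Pr\left(X_1 = 1 \text{ and } \sum_{i=2}^n X_i = t - 1\right) = \Pr(X_1 = 1)\Pr\left(\sum_{i=2}^n X_i = t - 1\right) = p\binom{n-1}{t-1}p^{t-1}(1-p)^{n-t} = \binom{n-1}{t-1}p^t(1-p)^{n-t}.$$

Also, $\Pr(T = t) = \binom{n}{t}p^t(1-p)^{n-t}$. It follows that

$$\Pr(X_1 = 1 \mid T = t) = \frac{\binom{n-1}{t-1}}{\binom{n}{t}} = \frac{t}{n}.$$

Therefore, for any value of $T$, $E(X_1 \mid T) = T/n = \bar{X}_n$.

A more elegant way to solve this exercise is as follows: By symmetry, $E(X_1 \mid T) = E(X_2 \mid T) = \cdots = E(X_n \mid T)$. Therefore, $nE(X_1 \mid T) = \sum_{i=1}^n E(X_i \mid T)$. But

$$\sum_{i=1}^n E(X_i \mid T) = E\left(\sum_{i=1}^n X_i \,\Big|\, T\right) = E(T \mid T) = T.$$

Hence, $E(X_1 \mid T) = T/n$.

13. We shall carry out the analysis for $Y_1$. The analysis for every other value of $i$ is similar. Since $Y_1$ must be 0 or 1,

$$E(Y_1 \mid T = t) = \Pr(Y_1 = 1 \mid T = t) = \Pr(X_1 = 0 \mid T = t) = \frac{\Pr(X_1 = 0 \text{ and } T = t)}{\Pr(T = t)} = \frac{\Pr\left(X_1 = 0 \text{ and } \sum_{i=2}^n X_i = t\right)}{\Pr(T = t)}.$$

The random variables $X_1$ and $\sum_{i=2}^n X_i$ are independent, $X_1$ has a Poisson distribution with mean $\theta$, and $\sum_{i=2}^n X_i$ has a Poisson distribution with mean $(n-1)\theta$. Therefore,

$$\Pr\left(X_1 = 0 \text{ and } \sum_{i=2}^n X_i = t\right) = \Pr(X_1 = 0)\Pr\left(\sum_{i=2}^n X_i = t\right) = \exp(-\theta)\cdot\frac{\exp(-(n-1)\theta)[(n-1)\theta]^t}{t!}.$$

Also, since $T$ has a Poisson distribution with mean $n\theta$,

$$\Pr(T = t) = \frac{\exp(-n\theta)(n\theta)^t}{t!}.$$

It now follows that $E(Y_1 \mid T = t) = ([n-1]/n)^t$.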
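The conditional probability just derived can be checked numerically. The sketch below assumes $n \ge 2$ independent Poisson observations with mean $\theta$; the function names are illustrative only.

```python
import math

def poisson_pmf(k, mean):
    """Poisson p.f. evaluated at the nonnegative integer k."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def cond_prob_x1_zero(t, n, theta):
    """Pr(X_1 = 0 | T = t), computed as joint probability over Pr(T = t)."""
    joint = poisson_pmf(0, theta) * poisson_pmf(t, (n - 1) * theta)
    return joint / poisson_pmf(t, n * theta)

# The ratio should not depend on theta and should match ((n - 1) / n) ** t.
print(cond_prob_x1_zero(3, 5, 1.7), (4 / 5) ** 3)
```

That the answer is free of $\theta$ is what one expects when conditioning on a sufficient statistic.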
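14. If $Y_i$ is defined as in Exercise 12 for $i = 1,\ldots,n$, then $\hat{\beta} = \sum_{i=1}^n Y_i/n$. Also, we know from Exercise 12 that $E(Y_i \mid T) = ([n-1]/n)^T$ for $i = 1,\ldots,n$. Therefore,

$$E(\hat{\beta} \mid T) = E\left(\frac{1}{n}\sum_{i=1}^n Y_i \,\Big|\, T\right) = \frac{1}{n}\sum_{i=1}^n E(Y_i \mid T) = \frac{1}{n}\cdot n\left(\frac{n-1}{n}\right)^T = \left(\frac{n-1}{n}\right)^T.$$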


15. Let $\hat{\theta}$ be the M.L.E. of $\theta$. Then the M.L.E. of $\exp(\theta + 0.125)$ is $\exp(\hat{\theta} + 0.125)$. The M.L.E. of $\theta$ is $\bar{X}_n$, so the M.L.E. of $\exp(\theta + 0.125)$ is $\exp(\bar{X}_n + 0.125)$. The M.S.E. of an estimator of the form $\exp(\bar{X}_n + c)$ is

$$E\left[(\exp[\bar{X}_n + c] - \exp[\theta + 0.125])^2\right]$$
$$= \mathrm{Var}(\exp[\bar{X}_n + c]) + \left[E(\exp[\bar{X}_n + c]) - \exp(\theta + 0.125)\right]^2$$
$$= \exp(2\theta + 0.25/n + 2c)[\exp(0.25/n) - 1] + [\exp(\theta + 0.125/n + c) - \exp(\theta + 0.125)]^2$$
$$= \exp(2\theta)\{\exp(0.25/n + 2c)[\exp(0.25/n) - 1] + \exp(0.25/n + 2c) - 2\exp(0.125[1 + 1/n] + c) + \exp(0.25)\}$$
$$= \exp(2\theta)\left[\exp(2c)\exp(0.5/n) - 2\exp(c)\exp(0.125[1 + 1/n]) + \exp(0.25)\right].$$

Let $a = \exp(c)$ in this last expression. Then we can minimize the M.S.E. simultaneously for all $\theta$ by minimizing

$$a^2\exp(0.5/n) - 2a\exp(0.125[1 + 1/n]) + \exp(0.25).$$

The minimum occurs at $a = \exp(0.125 - 0.375/n)$, so $c = 0.125 - 0.375/n$.
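A quick numerical check of the minimizing value of $c$ is sketched below. It assumes the final form of the M.S.E. coefficient, with the additive constant written as $\exp(0.25)$, the square of $\exp(0.125)$; that constant does not affect the location of the minimum. The function names are hypothetical.

```python
import math

def mse_coefficient(c, n):
    """Factor multiplying exp(2 * theta) in the M.S.E. of exp(sample mean + c)."""
    return (math.exp(2 * c + 0.5 / n)
            - 2 * math.exp(c + 0.125 * (1 + 1 / n))
            + math.exp(0.25))

def best_c(n):
    """Minimizer obtained by setting a = exp(c) and minimizing the quadratic in a."""
    return 0.125 - 0.375 / n

n = 10
c_star = best_c(n)
# Compare the coefficient at the minimizer, at nearby values, and at the M.L.E. choice c = 0.125.
for c in (c_star - 0.01, c_star, c_star + 0.01, 0.125):
    print(round(c, 4), mse_coefficient(c, n))
```

The comparison with $c = 0.125$ shows that the M.L.E. $\exp(\bar{X}_n + 0.125)$ is not the best estimator of this form for finite $n$.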


16. $p = \Pr(X_1 = 1 \mid \theta) = \exp(-\theta)\theta$. The M.L.E. of $\theta$ is the number of arrivals divided by the observation time, namely $\bar{X}_n$. So, the M.L.E. of $p$ is $\exp(-\bar{X}_n)\bar{X}_n$. In Example 7.9.2, $T/n = \bar{X}_n$. If $n$ is large, then $T$ should also be large so that $(1 - 1/n)^T \approx \exp(-T/n)$ according to Theorem 5.3.3.


7.10 Supplementary Exercises 


Solutions to Exercises 


1. (a) The prior distribution of $\theta$ is the beta distribution with parameters 1 and 1, so the posterior distribution of $\theta$ is the beta distribution with parameters $1 + 10 = 11$ and $1 + 25 - 10 = 16$.

(b) With squared error loss, the estimate to use is the posterior mean, which is 11/27 in this case.

2. We know that the M.L.E. of $\theta$ is $\hat{\theta} = \bar{X}_n$. Hence, by the invariance property described in Sec. 7.6, the M.L.E. of $\theta^2$ is $\bar{X}_n^2$.


3. The prior distribution of $\theta$ is the beta distribution with $\alpha = 3$ and $\beta = 4$, so it follows from Theorem 7.3.1 that the posterior distribution is the beta distribution with $\alpha' = 3 + 3 = 6$ and $\beta' = 4 + 7 = 11$. The Bayes estimate is the mean of this posterior distribution, namely 6/17.


4. Since the joint p.d.f. of the observations is equal to $1/\theta^n$ provided that $\theta \le x_i \le 2\theta$ for $i = 1,\ldots,n$, the M.L.E. will be the smallest value of $\theta$ that satisfies these restrictions. Since we can rewrite the restrictions in the form

$$\frac{1}{2}\max\{x_1,\ldots,x_n\} \le \theta \le \min\{x_1,\ldots,x_n\},$$

it follows that the smallest possible value of $\theta$ is

$$\hat{\theta} = \frac{1}{2}\max\{x_1,\ldots,x_n\}.$$


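5. The joint p.d.f. of $X_1$ and $X_2$ is

$$\frac{1}{2\pi\sigma_1\sigma_2}\exp\left[-\frac{1}{2\sigma_1^2}(x_1 - b_1\mu)^2 - \frac{1}{2\sigma_2^2}(x_2 - b_2\mu)^2\right].$$

If we let $L(\mu)$ denote the logarithm of this expression, and solve the equation $dL(\mu)/d\mu = 0$, we find that

$$\hat{\mu} = \frac{\sigma_2^2 b_1 x_1 + \sigma_1^2 b_2 x_2}{\sigma_2^2 b_1^2 + \sigma_1^2 b_2^2}.$$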


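6. Since $\Gamma(\alpha + 1) = \alpha\Gamma(\alpha)$, it follows that $\Gamma'(\alpha + 1) = \alpha\Gamma'(\alpha) + \Gamma(\alpha)$. Hence,

$$\psi(\alpha + 1) = \frac{\Gamma'(\alpha + 1)}{\Gamma(\alpha + 1)} = \frac{\alpha\Gamma'(\alpha)}{\Gamma(\alpha + 1)} + \frac{\Gamma(\alpha)}{\Gamma(\alpha + 1)} = \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} + \frac{1}{\alpha} = \psi(\alpha) + \frac{1}{\alpha}.$$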


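7. The joint p.d.f. of $X_1, X_2, X_3$ is

$$f(x \mid \theta) = \frac{1}{\theta}\exp\left(-\frac{x_1}{\theta}\right)\cdot\frac{1}{2\theta}\exp\left(-\frac{x_2}{2\theta}\right)\cdot\frac{1}{3\theta}\exp\left(-\frac{x_3}{3\theta}\right) = \frac{1}{6\theta^3}\exp\left[-\left(x_1 + \frac{x_2}{2} + \frac{x_3}{3}\right)\frac{1}{\theta}\right].$$

(a) By solving the equation $\partial\log(f)/\partial\theta = 0$, we find that

$$\hat{\theta} = \frac{1}{3}\left(X_1 + \frac{1}{2}X_2 + \frac{1}{3}X_3\right).$$

(b) In terms of $\psi$, the joint p.d.f. of $X_1, X_2, X_3$ is

$$f(x \mid \psi) = \frac{\psi^3}{6}\exp\left[-\left(x_1 + \frac{x_2}{2} + \frac{x_3}{3}\right)\psi\right].$$

Since the prior p.d.f. of $\psi$ is

$$\xi(\psi) \propto \psi^{\alpha-1}\exp(-\beta\psi),$$

it follows that the posterior p.d.f. is

$$\xi(\psi \mid x) \propto f(x \mid \psi)\,\xi(\psi) \propto \psi^{\alpha+2}\exp\left[-\left(\beta + x_1 + \frac{x_2}{2} + \frac{x_3}{3}\right)\psi\right].$$

Hence, the posterior distribution of $\psi$ is the gamma distribution with parameters $\alpha + 3$ and $\beta + x_1 + x_2/2 + x_3/3$.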
8. The joint p.f. of $X_2,\ldots,X_{n+1}$ is the product of $n$ factors. If $X_i = x_i$ and $X_{i+1} = x_{i+1}$, then the $i$th factor will be the transition probability of going from state $x_i$ to state $x_{i+1}$ $(i = 1,\ldots,n)$. Hence, each transition from $s_1$ to $s_1$ will introduce the factor $\theta$, each transition from $s_1$ to $s_2$ will introduce the factor $1 - \theta$, and every other transition will introduce either the constant factor 3/4 or the constant factor 1/4. Hence, if $A$ denotes the number of transitions from $s_1$ to $s_1$ among the $n$ transitions and $B$ denotes the number from $s_1$ to $s_2$, then the joint p.f. of $X_2,\ldots,X_{n+1}$ has the form $(\text{const.})\,\theta^A(1 - \theta)^B$. Therefore, this joint p.f., or likelihood function, is maximized when

$$\hat{\theta} = \frac{A}{A + B}.$$


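9. The posterior p.d.f. of $\theta$ given $X = x$ satisfies the relation

$$\xi(\theta \mid x) \propto f(x \mid \theta)\,\xi(\theta) \propto \exp(-\theta), \quad \text{for } \theta > x.$$

Hence,

$$\xi(\theta \mid x) = \begin{cases} \exp(x - \theta) & \text{for } \theta > x, \\ 0 & \text{otherwise.} \end{cases}$$

(a) The Bayes estimator is the mean of this posterior distribution, $\hat{\theta} = x + 1$.

(b) The Bayes estimator is the median of this posterior distribution, $\hat{\theta} = x + \log 2$.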


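10. In this exercise, $\theta$ must lie in the interval $1/3 \le \theta \le 2/3$. Hence, as in Exercise 3 of Sec. 7.5, the M.L.E. of $\theta$ is

$$\hat{\theta} = \begin{cases} \bar{X}_n & \text{if } \dfrac{1}{3} \le \bar{X}_n \le \dfrac{2}{3}, \\[1ex] \dfrac{1}{3} & \text{if } \bar{X}_n < \dfrac{1}{3}, \\[1ex] \dfrac{2}{3} & \text{if } \bar{X}_n > \dfrac{2}{3}. \end{cases}$$

It then follows from Sec. 7.6 that $\hat{\beta} = 3\hat{\theta} - 1$.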


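11. Under these conditions, $X$ has a binomial distribution with parameters $n$ and $\theta = \frac{1}{2}\cdot\frac{1}{2} + \frac{1}{2}p = \frac{1}{4} + \frac{1}{2}p$. Since $0 \le p \le 1$, it follows that $1/4 \le \theta \le 3/4$. Hence, as in Exercise 3 of Sec. 7.5, the M.L.E. of $\theta$ is

$$\hat{\theta} = \begin{cases} \dfrac{X}{n} & \text{if } \dfrac{1}{4} \le \dfrac{X}{n} \le \dfrac{3}{4}, \\[1ex] \dfrac{1}{4} & \text{if } \dfrac{X}{n} < \dfrac{1}{4}, \\[1ex] \dfrac{3}{4} & \text{if } \dfrac{X}{n} > \dfrac{3}{4}. \end{cases}$$

It then follows from Theorem 7.6.1 that the M.L.E. of $p$ is $\hat{p} = 2(\hat{\theta} - 1/4)$.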


12. The prior distribution of $\theta$ is the Pareto distribution with parameters $x_0 = 1$ and $\alpha = 1$. Therefore, it follows from Exercise 18 of Sec. 7.3 that the posterior distribution of $\theta$ will be a Pareto distribution with parameters $\alpha + n$ and $\max\{x_0, x_1,\ldots,x_n\}$. In this exercise $n = 4$ and $\max\{x_0, x_1,\ldots,x_n\} = 1$. Hence, the posterior Pareto distribution has parameters $\alpha = 5$ and $x_0 = 1$. The Bayes estimate of $\theta$ will be the mean of this posterior distribution, namely

$$\hat{\theta} = \int_1^\infty \theta\,\frac{5}{\theta^6}\,d\theta = \frac{5}{4}.$$

13. The Bayes estimate of $\theta$ will be the median of the posterior Pareto distribution. This will be the value $m$ such that

$$\frac{1}{2} = \int_m^\infty \frac{5}{\theta^6}\,d\theta = \frac{1}{m^5}.$$

Hence, $\hat{\theta} = m = 2^{1/5}$.


14. The joint p.d.f. of $X_1,\ldots,X_n$ can be written in the form

$$f_n(x \mid \beta, \theta) = \beta^n\exp\left(n\beta\theta - \beta\sum_{i=1}^n x_i\right)$$

for $\min\{x_1,\ldots,x_n\} \ge \theta$, and $f_n(x \mid \beta, \theta) = 0$ otherwise. Hence, by the factorization criterion, $\sum_{i=1}^n X_i$ and $\min\{X_1,\ldots,X_n\}$ is a pair of jointly sufficient statistics and so is any other pair of statistics that is a one-to-one function of this pair.


15. The joint p.d.f. of the observations is $\alpha^n x_0^{n\alpha}/\left(\prod_{i=1}^n x_i\right)^{\alpha+1}$ for $\min\{x_1,\ldots,x_n\} \ge x_0$. This p.d.f. is maximized when $x_0$ is made as large as possible. Thus,

$$\hat{x}_0 = \min\{X_1,\ldots,X_n\}.$$

16. Since $\alpha$ is known in Exercise 15, it follows from the factorization criterion, by a technique similar to that used in Example 7.7.5 or Exercise 12 of Sec. 7.8, that $\min\{X_1,\ldots,X_n\}$ is a sufficient statistic. Thus, from Theorem 7.8.3, since the M.L.E. $\hat{x}_0$ is a sufficient statistic, it is a minimal sufficient statistic.


17. It follows from Exercise 15 that $\hat{x}_0 = \min\{X_1,\ldots,X_n\}$ will again be the M.L.E. of $x_0$, since this value of $x_0$ maximizes the likelihood function regardless of the value of $\alpha$. If we substitute $\hat{x}_0$ for $x_0$ and let $L(\alpha)$ denote the logarithm of the resulting likelihood function, which was given in Exercise 15, then

$$L(\alpha) = n\log\alpha + n\alpha\log\hat{x}_0 - (\alpha + 1)\sum_{i=1}^n \log x_i$$

and

$$\frac{dL(\alpha)}{d\alpha} = \frac{n}{\alpha} + n\log\hat{x}_0 - \sum_{i=1}^n \log x_i.$$

Hence, by setting this expression equal to 0, we find that

$$\hat{\alpha} = \left(\frac{1}{n}\sum_{i=1}^n \log x_i - \log\hat{x}_0\right)^{-1}.$$
18. It can be shown that the pair of estimators $\hat{x}_0$ and $\hat{\alpha}$ found in Exercise 17 form a one-to-one transform of the pair of jointly sufficient statistics $T_1$ and $T_2$ given in Exercise 3 of Sec. 7.8. Hence, $\hat{x}_0$ and $\hat{\alpha}$ are themselves jointly sufficient statistics. It now follows from Sec. 7.8 that $\hat{x}_0$ and $\hat{\alpha}$ must be minimal jointly sufficient statistics.


19. The p.f. of $X$ is

$$f(x \mid n, p) = \binom{n}{x}p^x(1-p)^{n-x}.$$

The M.L.E. of $n$ will be the value that maximizes this expression for given values of $x$ and $p$. The ratio given in the hint to this exercise reduces to

$$R = \frac{n+1}{n+1-x}(1-p).$$

Since $R$ is a decreasing function of $n$, it follows that $f(x \mid n, p)$ will be maximized at the smallest value of $n$ for which $R \le 1$. After some algebra, it is found that $R \le 1$ if and only if $n \ge x/p - 1$. Hence, $\hat{n}$ will be the smallest integer greater than $x/p - 1$. If $x/p - 1$ is itself an integer, then $x/p - 1$ and $x/p$ are both M.L.E.'s.
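The rule for $\hat{n}$ can be checked by brute force. The following is a minimal sketch that assumes $x \ge 1$ and searches a finite range of $n$ (the bound `n_max` is an arbitrary illustrative choice); exact fractions are used so that ties between two maximizers are detected reliably.

```python
from fractions import Fraction
from math import comb

def binom_pf(x, n, p):
    """Binomial p.f. f(x | n, p); exact when p is a Fraction."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def mle_n(x, p, n_max=200):
    """All n in {x, ..., n_max} that maximize f(x | n, p), found by direct search."""
    values = {n: binom_pf(x, n, p) for n in range(max(x, 1), n_max + 1)}
    best = max(values.values())
    return [n for n, v in values.items() if v == best]

print(mle_n(3, Fraction(2, 5)))  # x/p - 1 = 6.5, not an integer
print(mle_n(3, Fraction(1, 4)))  # x/p - 1 = 11, an integer
```

In the second call, $x/p - 1$ is an integer, which is the case where the solution says two values of $n$ maximize the likelihood.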


20. The joint p.d.f. of $X_1$ and $X_2$ is $1/(4\theta^2)$ provided that each of the observations lies in either the interval $(0, \theta)$ or the interval $(2\theta, 3\theta)$. Thus, the M.L.E. of $\theta$ will be the smallest value of $\theta$ for which these restrictions are satisfied.

(a) If we take $3\hat{\theta} = 9$, or $\hat{\theta} = 3$, then $\hat{\theta}$ will be as small as possible, and the restrictions will be satisfied because both observed values will lie in the interval $(2\hat{\theta}, 3\hat{\theta})$.

(b) It is not possible that both $X_1$ and $X_2$ lie in the interval $(2\theta, 3\theta)$, because for that to be true it is necessary that $X_2/X_1 \le 3/2$. Here, however, $X_2/X_1 = 9/4$. Therefore, if we take $\hat{\theta} = 4$, then $\hat{\theta}$ will be as small as possible and the restrictions will be satisfied because $X_1$ will lie in the interval $(0, \hat{\theta})$ and $X_2$ will lie in $(2\hat{\theta}, 3\hat{\theta})$.

(c) It is not possible that both $X_1$ and $X_2$ lie in the interval $(2\theta, 3\theta)$ for the reason given in part (b). It is also not possible that $X_1$ lies in $(0, \theta)$ and $X_2$ lies in $(2\theta, 3\theta)$, because for that to be true it is necessary that $X_2/X_1 \ge 2$. Here, however, $X_2/X_1 = 9/5$. Hence, it must be true that both $X_1$ and $X_2$ lie in the interval $(0, \theta)$. Under this condition, the smallest possible value of $\theta$ is $\hat{\theta} = 9$.


21. The Bayes estimator of $\theta$ is the mean of the posterior distribution of $\theta$, and the expected loss or M.S.E. of this estimator is the variance of the posterior distribution. This variance, as given by Eq. (7.3.2), is

$$v_1^2 = \frac{(100)(25)}{100 + 25n} = \frac{100}{n + 4}.$$


Hence, n must be chosen to minimize 


By setting the first derivative equal to 0, it is found that the minimum occurs when n = 16. 


22. It was shown in Example 7.7.2 that $T = \sum_{i=1}^n X_i$ is a sufficient statistic in this problem. Since the sample variance is not a function of $T$ alone, it follows from Theorem 7.9.1 that it is inadmissible.


Chapter 8 


Sampling Distributions of Estimators 


8.1 The Sampling Distribution of a Statistic 


Solutions to Exercises 


1. The c.d.f. of $U = \max\{X_1,\ldots,X_n\}$ is

$$F(u) = \begin{cases} 0 & \text{for } u \le 0, \\ (u/\theta)^n & \text{for } 0 < u < \theta, \\ 1 & \text{for } u \ge \theta. \end{cases}$$

Since $U \le \theta$ with probability 1, the event that $|U - \theta| \le 0.1\theta$ is the same as the event that $U \ge 0.9\theta$. The probability of this is $1 - F(0.9\theta) = 1 - 0.9^n$. In order for this to be at least 0.95, we need $0.9^n \le 0.05$ or $n \ge \log(0.05)/\log(0.9) = 28.43$. So $n \ge 29$ is needed.
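The required sample size can be confirmed by direct search. This is a small sketch assuming the probability $1 - 0.9^n$ derived above; the function name and its parameters are illustrative.

```python
import math

def min_n_for_max(rel_error=0.1, prob=0.95):
    """Smallest n with 1 - (1 - rel_error)^n >= prob."""
    n = 1
    while 1 - (1 - rel_error) ** n < prob:
        n += 1
    return n

print(min_n_for_max())                 # smallest adequate sample size
print(math.log(0.05) / math.log(0.9))  # the bound used in the solution
```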


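2. It is known that $\bar{X}_n$ has the normal distribution with mean $\theta$ and variance $4/n$. Therefore,

$$E_\theta(|\bar{X}_n - \theta|^2) = \mathrm{Var}_\theta(\bar{X}_n) = 4/n,$$

and $4/n \le 0.1$ if and only if $n \ge 40$.

3. Once again, $\bar{X}_n$ has the normal distribution with mean $\theta$ and variance $4/n$. Hence, the random variable $Z = (\bar{X}_n - \theta)/(2/\sqrt{n})$ will have the standard normal distribution. Therefore,

$$E_\theta(|\bar{X}_n - \theta|) = \frac{2}{\sqrt{n}}E(|Z|) = \frac{2}{\sqrt{n}}\int_{-\infty}^\infty |z|\,\frac{1}{\sqrt{2\pi}}\exp(-z^2/2)\,dz = 2\sqrt{\frac{2}{n\pi}}\int_0^\infty z\exp(-z^2/2)\,dz = 2\sqrt{\frac{2}{n\pi}}.$$

But $2\sqrt{2/(n\pi)} \le 0.1$ if and only if $n \ge 800/\pi = 254.6$. Hence, $n$ must be at least 255.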
4. If $Z$ is defined as in the solution of Exercise 3, then

$$\Pr(|\bar{X}_n - \theta| \le 0.1) = \Pr(|Z| \le 0.05\sqrt{n}) = 2\Phi(0.05\sqrt{n}) - 1.$$

Therefore, this value will be at least 0.95 if and only if $\Phi(0.05\sqrt{n}) \ge 0.975$. It is found from a table of values of $\Phi$ that we must have $0.05\sqrt{n} \ge 1.96$. Therefore, we must have $n \ge 1536.64$ or, since $n$ must be an integer, $n \ge 1537$.
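The same calculation can be done without a table. The sketch below uses `statistics.NormalDist` from the Python standard library for the normal quantile; the function name and arguments are illustrative, with `sigma = 2` and `half_width = 0.1` taken from this exercise.

```python
from math import ceil
from statistics import NormalDist

def min_n(half_width, sigma, prob=0.95):
    """Smallest n with Pr(|sample mean - mean| <= half_width) >= prob for normal data."""
    z = NormalDist().inv_cdf((1 + prob) / 2)  # about 1.96 when prob = 0.95
    return ceil((z * sigma / half_width) ** 2)

print(min_n(0.1, 2))
```

The quantile computed here is slightly more precise than the tabled value 1.96, but the resulting integer sample size is the same.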
5. When $p = 0.2$, the random variable $Z_n = n\bar{X}_n$ will have a binomial distribution with parameters $n$ and $p = 0.2$, and

$$\Pr(|\bar{X}_n - p| \le 0.1) = \Pr(0.1n \le Z_n \le 0.3n).$$

The value of $n$ for which this probability will be at least 0.75 must be determined by trial and error from the table of the binomial distribution in the back of the book. For $n = 8$, the probability becomes

$$\Pr(0.8 \le Z_8 \le 2.4) = \Pr(Z_8 = 1) + \Pr(Z_8 = 2) = 0.3355 + 0.2936 = 0.6291.$$

For $n = 9$, we have

$$\Pr(0.9 \le Z_9 \le 2.7) = \Pr(Z_9 = 1) + \Pr(Z_9 = 2) = 0.3020 + 0.3020 = 0.6040.$$

For $n = 10$, we have

$$\Pr(1 \le Z_{10} \le 3) = \Pr(Z_{10} = 1) + \Pr(Z_{10} = 2) + \Pr(Z_{10} = 3) = 0.2684 + 0.3020 + 0.2013 = 0.7717.$$

Hence, $n = 10$ is sufficient.

It should be noted that although a sample size of $n = 10$ will meet the required conditions, a sample size of $n = 11$ will not meet the required conditions. For $n = 11$, we would have

$$\Pr(1.1 \le Z_{11} \le 3.3) = \Pr(Z_{11} = 2) + \Pr(Z_{11} = 3).$$

Thus, only two terms of the binomial distribution for $n = 11$ are included, whereas three terms of the binomial distribution for $n = 10$ were included.
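The trial-and-error search can be reproduced directly from the binomial p.f. instead of a table. This is a minimal sketch; the function name is hypothetical, and exact fractions are used so that the endpoints $0.1n$ and $0.3n$ are handled without rounding problems.

```python
from fractions import Fraction
from math import comb

def prob_within(n, p=Fraction(1, 5), tol=Fraction(1, 10)):
    """Pr(|Z_n / n - p| <= tol) for Z_n binomial(n, p), summed exactly."""
    total = sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(Fraction(k, n) - p) <= tol
    )
    return float(total)

for n in (8, 9, 10, 11):
    print(n, prob_within(n))
```

The printed values may differ from the tabled sums in the last digit because the table entries are rounded before being added.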


6. It is known that when $p = 0.2$, $E(\bar{X}_n) = p = 0.2$ and $\mathrm{Var}(\bar{X}_n) = (0.2)(0.8)/n = 0.16/n$. Therefore, $Z = (\bar{X}_n - 0.2)/(0.4/\sqrt{n})$ will have approximately a standard normal distribution. It now follows that

$$\Pr(|\bar{X}_n - p| \le 0.1) = \Pr(|Z| \le 0.25\sqrt{n}) \approx 2\Phi(0.25\sqrt{n}) - 1.$$

Therefore, this value will be at least 0.95 if and only if $\Phi(0.25\sqrt{n}) \ge 0.975$ or, equivalently, if and only if $0.25\sqrt{n} \ge 1.96$. This final relation is satisfied if and only if $n \ge 61.5$. Therefore, the sample size must be $n \ge 62$.


7. It follows from the results given in the solution to Exercise 6 that, when $p = 0.2$,

$$E_p(|\bar{X}_n - p|^2) = \mathrm{Var}(\bar{X}_n) = \frac{0.16}{n},$$

and $0.16/n \le 0.01$ if and only if $n \ge 16$.

8. For an arbitrary value of $p$,

$$E_p(|\bar{X}_n - p|^2) = \mathrm{Var}(\bar{X}_n) = \frac{p(1-p)}{n}.$$

This variance will be a maximum when $p = 1/2$, at which point its value is $1/(4n)$. Therefore, this variance will be not greater than 0.01 for all values of $p$ $(0 \le p \le 1)$ if and only if $1/(4n) \le 0.01$ or, equivalently, if and only if $n \ge 25$.


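9. The M.L.E. is $\hat{\theta} = n/T$, where $T$ was shown to have the gamma distribution with parameters $n$ and $\theta$. Let $G(\cdot)$ denote the c.d.f. of the sampling distribution of $T$. Let $H(\cdot)$ be the c.d.f. of the sampling distribution of $\hat{\theta}$. Then $H(t) = 0$ for $t \le 0$, and for $t > 0$,

$$H(t) = \Pr(\hat{\theta} \le t) = \Pr\left(\frac{n}{T} \le t\right) = \Pr\left(T \ge \frac{n}{t}\right) = 1 - G\left(\frac{n}{t}\right).$$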
8.2 The Chi-Square Distributions


Commentary 


If one is using the software R, then the functions dchisq, pchisq, and qchisq give the p.d.f., the c.d.f., and the quantile function of $\chi^2$ distributions. The syntax is that the first argument is the argument of the function, and the second is the degrees of freedom. The function rchisq gives a random sample of $\chi^2$ random variables. The first argument is how many you want, and the second is the degrees of freedom. All of the solutions that require the calculation of $\chi^2$ probabilities or quantiles can be done using these functions instead of tables.


Solutions to Exercises 


1. The distribution of $20T/0.09$ is the $\chi^2$ distribution with 20 degrees of freedom. We can write $\Pr(T \le c) = \Pr(20T/0.09 \le 20c/0.09)$. In order for this probability to be 0.9, we need $20c/0.09$ to equal the 0.9 quantile of the $\chi^2$ distribution with 20 degrees of freedom. That quantile is 28.41. Set $28.41 = 20c/0.09$ and solve for $c = 0.1278$.

2. The mode will be the value of $x$ at which the p.d.f. $f(x)$ is a maximum or, equivalently, the value of $x$ at which $\log f(x)$ is a maximum. We have

$$\log f(x) = (\text{const.}) + \left(\frac{m}{2} - 1\right)\log x - \frac{x}{2}.$$

If $m = 1$, this function is strictly decreasing and increases without bound as $x \to 0$. If $m = 2$, this function is strictly decreasing and attains its maximum value when $x = 0$. If $m \ge 3$, the value of $x$ at which the maximum is attained can be found by setting the derivative with respect to $x$ equal to 0. In this way it is found that $x = m - 2$.


3. The median of each distribution is found from the table of the $\chi^2$ distribution given at the end of the book.

[Figure S.8.1: First figure for Exercise 3 of Sec. 8.2. Panel (a) m = 1: p.d.f. plotted against x, with the mode, median, and mean marked.]

[Figure S.8.2: Second figure for Exercise 3 of Sec. 8.2. Panels include (b) m = 2 and (d) m = 4: p.d.f. plotted against x, with the mode, median, and mean marked.]
4. Let $r$ denote the radius of the circle. The point $(X, Y)$ will lie inside the circle if and only if $X^2 + Y^2 < r^2$. Also, $X^2 + Y^2$ has a $\chi^2$ distribution with two degrees of freedom. It is found from the table at the end of the book that $\Pr(X^2 + Y^2 \le 9.210) = 0.99$. Therefore, we must have $r^2 \ge 9.210$.

5. We must determine $\Pr(X^2 + Y^2 + Z^2 \le 1)$. Since $X^2 + Y^2 + Z^2$ has the $\chi^2$ distribution with three degrees of freedom, it is found from the table at the end of the book that the required probability is slightly less than 0.20.

6. We must determine the probability that, at time 2, $X^2 + Y^2 + Z^2 \le 16\sigma^2$. At time 2, each of the independent variables $X$, $Y$, and $Z$ will have a normal distribution with mean 0 and variance $2\sigma^2$. Therefore, each of the variables $X/(\sqrt{2}\sigma)$, $Y/(\sqrt{2}\sigma)$, and $Z/(\sqrt{2}\sigma)$ will have a standard normal distribution. Hence, $V = (X^2 + Y^2 + Z^2)/(2\sigma^2)$ will have a $\chi^2$ distribution with three degrees of freedom. It now follows that

$$\Pr(X^2 + Y^2 + Z^2 \le 16\sigma^2) = \Pr(V \le 8).$$

It can now be found from the table at the end of the book that this probability is slightly greater than 0.95.


7. By the probability integral transformation, we know that $T_i = F_i(X_i)$ has a uniform distribution on the interval $[0, 1]$. Now let $Z_i = -\log T_i$. We shall determine the p.d.f. $g$ of $Z_i$. The p.d.f. of $T_i$ is

$$f(t) = \begin{cases} 1 & \text{for } 0 < t < 1, \\ 0 & \text{otherwise.} \end{cases}$$

Since $T_i = \exp(-Z_i)$, we have $dt/dz = -\exp(-z)$. Therefore, for $z > 0$,

$$g(z) = f(\exp(-z))\left|\frac{dt}{dz}\right| = \exp(-z).$$

Thus, it is now seen that $Z_i$ has the exponential distribution with parameter $\beta = 1$ or, in other words, the gamma distribution with parameters $\alpha = 1$ and $\beta = 1$. Therefore, by Exercise 1 of Sec. 5.7, $2Z_i$ has the gamma distribution with parameters $\alpha = 1$ and $\beta = 1/2$. Finally, by Theorem 5.7.7, $\sum_{i=1}^n 2Z_i$ will have the gamma distribution with parameters $\alpha = n$ and $\beta = 1/2$ or, equivalently, the $\chi^2$ distribution with $2n$ degrees of freedom.


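8. It was shown in Sec. 3.9 that the p.d.f. of $W$ is as follows, for $0 < w < 1$:

\[ h_1(w) = n(n-1) w^{n-2} (1 - w). \]


Let $X = 2n(1 - W)$. Then $W = 1 - X/(2n)$ and $dw/dx = -1/(2n)$. Therefore, the p.d.f. $g_n(x)$ is as
follows, for $0 < x < 2n$:


\[ g_n(x) = h_1\left(1 - \frac{x}{2n}\right) \left| \frac{dw}{dx} \right|
 = n(n-1) \left(1 - \frac{x}{2n}\right)^{n-2} \left(\frac{x}{2n}\right) \left(\frac{1}{2n}\right)
 = \frac{1}{4} \left(\frac{n-1}{n}\right) x \left(1 - \frac{x}{2n}\right)^{n} \left(1 - \frac{x}{2n}\right)^{-2}. \]


Now, as $n \to \infty$,


\[ \frac{n-1}{n} \to 1 \quad \text{and} \quad \left(1 - \frac{x}{2n}\right)^{-2} \to 1. \]


Also, for any real number $t$, $(1 + t/n)^n \to \exp(t)$. Therefore, $(1 - x/(2n))^n \to \exp(-x/2)$. Hence, for
$x > 0$,


\[ g_n(x) \to \frac{1}{4} x \exp(-x/2). \]


This limit is the p.d.f. of the $\chi^2$ distribution with four degrees of freedom.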
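The convergence can be illustrated numerically. This is a minimal sketch rather than part of the solution; the names `g_n` and `chi2_pdf_4df` and the grid of evaluation points are choices made here for illustration.

```python
import math

def g_n(x: float, n: int) -> float:
    """P.d.f. of X = 2n(1 - W) from the solution, valid for 0 < x < 2n."""
    if not 0 < x < 2 * n:
        return 0.0
    w = 1 - x / (2 * n)
    # h_1(w) times |dw/dx| = 1/(2n)
    return n * (n - 1) * w ** (n - 2) * (1 - w) / (2 * n)

def chi2_pdf_4df(x: float) -> float:
    """P.d.f. of the chi-square distribution with 4 degrees of freedom."""
    return 0.25 * x * math.exp(-x / 2) if x > 0 else 0.0

for n in (10, 100, 10_000):
    gap = max(abs(g_n(x, n) - chi2_pdf_4df(x)) for x in (0.5, 2.0, 5.0, 10.0))
    print(n, gap)
```

The printed gap shrinks as n grows, which is what the limit argument predicts.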


9. It is known that $\bar{X}_n$ has the normal distribution with mean $\mu$ and variance $\sigma^2/n$. Therefore,
$(\bar{X}_n - \mu)/(\sigma/\sqrt{n})$ has a standard normal distribution and the square of this variable has the $\chi^2$ distribution
with one degree of freedom.


10. Each of the variables $X_1 + X_2 + X_3$ and $X_4 + X_5 + X_6$ will have the normal distribution with mean
0 and variance 3. Therefore, if each of them is divided by $\sqrt{3}$, each will have a standard normal
distribution. Therefore, the square of each will have the $\chi^2$ distribution with one degree of freedom
and the sum of these two squares will have the $\chi^2$ distribution with two degrees of freedom. In other
words, $Y/3$ will have the $\chi^2$ distribution with two degrees of freedom.


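11. The simplest way to determine the mean is to calculate $E(X^{1/2})$ directly, where $X$ has the $\chi^2$
distribution with $n$ degrees of freedom. Thus,


\[ E(X^{1/2}) = \int_0^\infty x^{1/2} \frac{1}{2^{n/2}\Gamma(n/2)} x^{(n/2)-1} \exp(-x/2)\,dx
 = \frac{1}{2^{n/2}\Gamma(n/2)} \int_0^\infty x^{(n-1)/2} \exp(-x/2)\,dx \]

\[ = \frac{1}{2^{n/2}\Gamma(n/2)}\, 2^{(n+1)/2}\, \Gamma[(n+1)/2] = \frac{\sqrt{2}\,\Gamma[(n+1)/2]}{\Gamma(n/2)}. \]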
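The closed form is easy to evaluate. The following sketch (the function name `chi_mean` is introduced here for illustration) computes it on the log scale and compares it with $\sqrt{n}$, which it approaches from below as $n$ grows.

```python
import math

def chi_mean(n: int) -> float:
    """Mean of X**0.5 when X is chi-square with n degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large n.
    log_ratio = math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
    return math.sqrt(2) * math.exp(log_ratio)

for n in (1, 2, 10, 1000):
    print(n, chi_mean(n), math.sqrt(n))
```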


12. For general $\sigma^2$,


\[ \Pr(Y \le 0.09) = \Pr\left(W \le \frac{10 \times 0.09}{\sigma^2}\right), \qquad \text{(S.8.1)} \]


where $W = 10Y/\sigma^2$ has the $\chi^2$ distribution with 10 degrees of freedom. The probability in (S.8.1) is
at least 0.9 if and only if $0.9/\sigma^2$ is at least the 0.9 quantile of the $\chi^2$ distribution with 10 degrees of
freedom. This quantile is 15.99, so $0.9/\sigma^2 \ge 15.99$ is equivalent to $\sigma^2 \le 0.0563$.


13. We already found that the distribution of $W = n\hat{\sigma}_0^2/\sigma^2$ is the $\chi^2$ distribution with $n$ degrees of freedom,
which is also the gamma distribution with parameters $n/2$ and 1/2. If we multiply a gamma random
variable by a constant, we change its distribution to another gamma distribution with the same first
parameter and the second parameter gets divided by the constant. (See Exercise 1 in Sec. 6.3.) Since
$\hat{\sigma}_0^2 = (\sigma^2/n)W$, we see that the distribution of $\hat{\sigma}_0^2$ is the gamma distribution with parameters $n/2$ and
$n/(2\sigma^2)$.


8.3 Joint Distribution of the Sample Mean and Sample Variance


Commentary 


This section contains some relatively mathematical results that rely on some matrix theory. We prove 
the statistical independence of the sample average and sample variance. We also derive the distribution 
of the sample variance. If your course does not focus on the mathematical details, then you can safely 
cite Theorem 8.3.1 and look at the examples without going through the orthogonal matrix results. The 
mathematical derivations rely on a calculation involving Jacobians (Sec. 3.9) which the instructor might have 
skipped earlier in the course. 


Solutions to Exercises 


1. We found that $U = n\hat{\sigma}^2/\sigma^2$ has the $\chi^2$ distribution with $n - 1$ degrees of freedom, which is also the
gamma distribution with parameters $(n - 1)/2$ and 1/2. If we multiply a gamma random variable by
a number $c$, we change the second parameter by dividing it by $c$. So, with $c = \sigma^2/n$, we find that
$cU = \hat{\sigma}^2$ has the gamma distribution with parameters $(n - 1)/2$ and $n/(2\sigma^2)$.


2. It can be verified that the matrices in (a), (b), and (e) are orthogonal because in each case the sum of 
the squares of the elements in each row is 1 and the sum of the products of the corresponding terms 
in any two different rows is 0. The matrix in (c) is not orthogonal because the sum of squares for the 
bottom row is not 1. The matrix in (d) is not orthogonal because the sum of the products for rows 1 
and 2 (or any other two rows) is not 0. 


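3. (a) Consider the matrix


\[ A = \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ a_1 & a_2 \end{bmatrix}. \]


For $A$ to be orthogonal, we must have $a_1^2 + a_2^2 = 1$ and $\frac{1}{\sqrt{2}} a_1 + \frac{1}{\sqrt{2}} a_2 = 0$. It follows from the
second equation that $a_1 = -a_2$ and, in turn, from the first equation that $a_1^2 = 1/2$. Hence, either
the pair of values $a_1 = 1/\sqrt{2}$ and $a_2 = -1/\sqrt{2}$ or the pair $a_1 = -1/\sqrt{2}$ and $a_2 = 1/\sqrt{2}$ will make
$A$ orthogonal.


(b) Consider the matrix


\[ A = \begin{bmatrix} 1/\sqrt{3} & 1/\sqrt{3} & 1/\sqrt{3} \\ a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \end{bmatrix}. \]


For $A$ to be orthogonal, we must have


\[ a_1^2 + a_2^2 + a_3^2 = 1 \]


and

\[ \frac{1}{\sqrt{3}} a_1 + \frac{1}{\sqrt{3}} a_2 + \frac{1}{\sqrt{3}} a_3 = 0. \]

Therefore, $a_3 = -a_1 - a_2$ and it follows from the first equation that


\[ a_1^2 + a_2^2 + (a_1 + a_2)^2 = 2a_1^2 + 2a_2^2 + 2a_1 a_2 = 1. \]


Any values of $a_1$ and $a_2$ satisfying this equation can be chosen. We shall use $a_1 = 2/\sqrt{6}$ and
$a_2 = -1/\sqrt{6}$. Then $a_3 = -1/\sqrt{6}$.
Finally, we must have $b_1^2 + b_2^2 + b_3^2 = 1$ as well as


\[ \frac{1}{\sqrt{3}} b_1 + \frac{1}{\sqrt{3}} b_2 + \frac{1}{\sqrt{3}} b_3 = 0 \]

and

\[ \frac{2}{\sqrt{6}} b_1 - \frac{1}{\sqrt{6}} b_2 - \frac{1}{\sqrt{6}} b_3 = 0. \]


This final pair of equations can be rewritten as

\[ b_2 + b_3 = -b_1 \quad \text{and} \quad b_2 + b_3 = 2b_1. \]


Therefore, $b_1 = 0$ and $b_2 = -b_3$. Since we must have $b_2^2 + b_3^2 = 1$, it follows that we can use either
$b_2 = 1/\sqrt{2}$ and $b_3 = -1/\sqrt{2}$ or $b_2 = -1/\sqrt{2}$ and $b_3 = 1/\sqrt{2}$. Thus, one orthogonal matrix is


\[ A = \begin{bmatrix} 1/\sqrt{3} & 1/\sqrt{3} & 1/\sqrt{3} \\ 2/\sqrt{6} & -1/\sqrt{6} & -1/\sqrt{6} \\ 0 & 1/\sqrt{2} & -1/\sqrt{2} \end{bmatrix}. \]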
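The row conditions used throughout this exercise (unit length for each row, zero inner product between different rows) can be checked mechanically. The sketch below applies them to the matrix found in part (b); the helper `is_orthogonal` and its tolerance are illustrative choices, not part of the original solution.

```python
import math

def is_orthogonal(a, tol=1e-12):
    """Return True if the rows of the square matrix `a` are orthonormal."""
    n = len(a)
    for i in range(n):
        for j in range(n):
            dot = sum(a[i][k] * a[j][k] for k in range(n))
            target = 1.0 if i == j else 0.0
            if abs(dot - target) > tol:
                return False
    return True

r2, r3, r6 = math.sqrt(2), math.sqrt(3), math.sqrt(6)
A = [
    [1 / r3, 1 / r3, 1 / r3],
    [2 / r6, -1 / r6, -1 / r6],
    [0.0, 1 / r2, -1 / r2],
]
print(is_orthogonal(A))
```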


4. The 3 × 3 matrix of coefficients of this transformation is


\[ A = \begin{bmatrix} 0.8 & 0.6 & 0 \\ (0.3)\sqrt{2} & -(0.4)\sqrt{2} & -(0.5)\sqrt{2} \\ (0.3)\sqrt{2} & -(0.4)\sqrt{2} & (0.5)\sqrt{2} \end{bmatrix}. \]


Since the matrix $A$ is orthogonal, it follows from Theorem 8.3.4 that $Y_1$, $Y_2$, and $Y_3$ are independent
and each has a standard normal distribution.


5. Let $Z_i = (X_i - \mu)/\sigma$ for $i = 1, 2$. Then $Z_1$ and $Z_2$ are independent and each has a standard normal
distribution. Next, let $Y_1 = (Z_1 + Z_2)/\sqrt{2}$ and $Y_2 = (Z_1 - Z_2)/\sqrt{2}$. Then the 2 × 2 matrix of coefficients
of this transformation is


\[ A = \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{bmatrix}. \]


Since the matrix $A$ is orthogonal, it follows from Theorem 8.3.4 that $Y_1$ and $Y_2$ are also independent
and each has a standard normal distribution. Finally, let $W_1 = X_1 + X_2$ and $W_2 = X_1 - X_2$. Then
$W_1 = \sqrt{2}\sigma Y_1 + 2\mu$ and $W_2 = \sqrt{2}\sigma Y_2$. Since $Y_1$ and $Y_2$ are independent, it now follows from Exercise 15
of Sec. 3.9 that $W_1$ and $W_2$ are also independent.


6. (a) Since $(X_i - \mu)/\sigma$ has a standard normal distribution for $i = 1, \ldots, n$, then $W = \sum_{i=1}^n (X_i - \mu)^2/\sigma^2$
has the $\chi^2$ distribution with $n$ degrees of freedom. The required probability can be rewritten as
follows:


\[ \Pr\left(\frac{n}{2} \le W \le 2n\right). \]


Thus, when $n = 16$, we must evaluate $\Pr(8 \le W \le 32) = \Pr(W \le 32) - \Pr(W \le 8)$, where $W$ has
the $\chi^2$ distribution with 16 degrees of freedom. It is found from the table at the end of the book
that $\Pr(W \le 32) = 0.99$ and $\Pr(W \le 8) = 0.05$.

(b) By Theorem 8.3.1, $V = \sum_{i=1}^n (X_i - \bar{X}_n)^2/\sigma^2$ has the $\chi^2$ distribution with $n - 1$ degrees of freedom.
The required probability can be rewritten as follows:


\[ \Pr\left(\frac{n}{2} \le V \le 2n\right). \]


Thus, when $n = 16$, we must evaluate $\Pr(8 \le V \le 32) = \Pr(V \le 32) - \Pr(V \le 8)$, where $V$ has the
$\chi^2$ distribution with 15 degrees of freedom. It is found from the table that $\Pr(V \le 32) = 0.993$
and $\Pr(V \le 8) = 0.079$.


7. (a) The random variable $V = n\hat{\sigma}^2/\sigma^2$ has a $\chi^2$ distribution with $n - 1$ degrees of freedom. The
required probability can be written in the form $\Pr(V \le 1.5n) \ge 0.95$. By trial and error, it is
found that for $n = 20$, $V$ has 19 degrees of freedom and $\Pr(V \le 30) < 0.95$. However, for $n = 21$,
$V$ has 20 degrees of freedom and $\Pr(V \le 31.5) > 0.95$.


(b) The required probability can be written in the form

\[ \Pr\left(\frac{n}{2} \le V \le \frac{3n}{2}\right) = \Pr\left(V \le \frac{3n}{2}\right) - \Pr\left(V \le \frac{n}{2}\right), \]


where $V$ again has the $\chi^2$ distribution with $n - 1$ degrees of freedom. By trial and error, it is
found that for $n = 12$, $V$ has 11 degrees of freedom and


\[ \Pr(V \le 18) - \Pr(V \le 6) = 0.915 - 0.130 < 0.8. \]

However, for $n = 13$, $V$ has 12 degrees of freedom and

\[ \Pr(V \le 19.5) - \Pr(V \le 6.5) = 0.919 - 0.113 > 0.8. \]


8. If $X$ has the $\chi^2$ distribution with 200 degrees of freedom, then it follows from Theorem 8.2.2 that $X$
can be represented as the sum of 200 independent and identically distributed random variables, each
of which has a $\chi^2$ distribution with one degree of freedom. Since $E(X) = 200$ and $\mathrm{Var}(X) = 400$,
it follows from the central limit theorem that $Z = (X - 200)/20$ will have approximately a standard
normal distribution. Therefore,


\[ \Pr(160 < X < 240) = \Pr(-2 < Z < 2) \approx 2\Phi(2) - 1 = 0.9546. \]
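The normal approximation can be reproduced with the standard library. This is a sketch under the same central limit theorem argument; the function name `chi2_interval_prob_clt` is hypothetical. The result agrees with the value above to three decimal places; any difference in the fourth decimal comes from rounding $\Phi(2)$ in the table.

```python
from statistics import NormalDist

def chi2_interval_prob_clt(df: int, lo: float, hi: float) -> float:
    """Normal approximation to Pr(lo < X < hi) for X chi-square with df degrees of freedom."""
    approx = NormalDist(mu=df, sigma=(2 * df) ** 0.5)  # E(X) = df, Var(X) = 2 df
    return approx.cdf(hi) - approx.cdf(lo)

p = chi2_interval_prob_clt(200, 160, 240)
print(round(p, 4))
```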


9. The sample mean and the sample variance are independent. Therefore, the information that the sample
variance is closer to $\sigma^2$ in one sample than it is in the other sample provides no information about which
of the two sample means will be closer to $\mu$. In other words, in either sample, the conditional distribution
of $\bar{X}_n$, given the observed value of the sample variance, is still the normal distribution with mean $\mu$
and variance $\sigma^2/n$.


8.4 The t Distributions 


Commentary 


In this section, we derive the p.d.f. of the t distribution. That portion of the section (entitled “Derivation
of the p.d.f.”) can be skipped by instructors who do not wish to focus on mathematical details. Indeed,
the derivation involves the use of Jacobians (Sec. 3.9) that the instructor might have skipped earlier in the
course.

If one is using the software R, then the functions dt, pt, and qt give the p.d.f., the c.d.f., and the quantile
function of t distributions. The syntax is that the first argument is the argument of the function, and the
second is the degrees of freedom. The function rt gives a random sample of t random variables. The first
argument is how many you want, and the second is the degrees of freedom. All of the solutions that require
the calculation of t probabilities or quantiles can be done using these functions instead of tables.


Solutions to Exercises


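1. \[ E(X^2) = c \int_{-\infty}^{\infty} x^2 \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} dx = 2c \int_0^{\infty} x^2 \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} dx, \]


where $c = \dfrac{\Gamma[(n+1)/2]}{(n\pi)^{1/2}\,\Gamma(n/2)}$. If $y$ is defined as in the hint for this exercise, then $x = \left(\dfrac{ny}{1-y}\right)^{1/2}$ and
$\dfrac{dx}{dy} = \dfrac{\sqrt{n}}{2}\, y^{-1/2} (1 - y)^{-3/2}$. Therefore,

\[ E(X^2) = \sqrt{n}\, c \int_0^1 \frac{ny}{1-y} \left(1 + \frac{y}{1-y}\right)^{-(n+1)/2} y^{-1/2} (1-y)^{-3/2}\,dy \]

\[ = n^{3/2} c \int_0^1 y^{1/2} (1-y)^{(n-4)/2}\,dy \]

\[ = n^{3/2} c\, \frac{\Gamma(3/2)\,\Gamma[(n-2)/2]}{\Gamma[(n+1)/2]} = n \pi^{-1/2} \left(\frac{\sqrt{\pi}}{2}\right) \frac{\Gamma[(n-2)/2]}{\Gamma(n/2)} \]

\[ = n \left(\frac{1}{2}\right) \frac{1}{(n-2)/2} = \frac{n}{n-2}. \]


Since $E(X) = 0$, it now follows that $\mathrm{Var}(X) = n/(n - 2)$.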
2. Since $\hat{\mu} = \bar{X}_n$ and $\hat{\sigma}^2 = S_n^2/n$, it follows from the definition of $U$ in Eq. (8.4.4) that


\[ \Pr(\hat{\mu} > \mu + k\hat{\sigma}) = \Pr\left(\frac{\bar{X}_n - \mu}{\hat{\sigma}} > k\right) = \Pr[U > k(n-1)^{1/2}]. \]


Since $U$ has the t distribution with $n - 1$ degrees of freedom and $n = 17$, we must choose $k$ such that
$\Pr(U > 4k) = 0.95$. It is found from a table of the t distribution with 16 degrees of freedom that
$\Pr(U < 1.746) = 0.95$. Hence, by symmetry, $\Pr(U > -1.746) = 0.95$. It now follows that $4k = -1.746$
and $k = -0.4365$.


3. $X_1 + X_2$ has the normal distribution with mean 0 and variance 2. Therefore, $Y = (X_1 + X_2)/\sqrt{2}$ has
a standard normal distribution. Also, $Z = X_3^2 + X_4^2 + X_5^2$ has the $\chi^2$ distribution with 3 degrees of
freedom, and $Y$ and $Z$ are independent. Therefore, $U = Y/(Z/3)^{1/2}$ has the t distribution with 3 degrees
of freedom. Thus, if we choose $c = \sqrt{3/2}$, the given random variable will be equal to $U$.


4. Let $y = x/2$. Then


\[ \int_{-\infty}^{2.5} \frac{dx}{(12 + x^2)^2} = \int_{-\infty}^{1.25} \frac{2\,dy}{(12 + 4y^2)^2}
 = \frac{1}{72} \int_{-\infty}^{1.25} \left(1 + \frac{y^2}{3}\right)^{-2} dy
 = \frac{\sqrt{3\pi}\,\Gamma(3/2)}{\Gamma(2)} \left(\frac{1}{72}\right) \int_{-\infty}^{1.25} g_3(y)\,dy, \]


where $g_3(y)$ is the p.d.f. of the t distribution with 3 degrees of freedom. It is found from the table of
this distribution that the value of the integral is 0.85. Hence, the desired value is


\[ \sqrt{3\pi} \left(\frac{\sqrt{\pi}}{2}\right) \left(\frac{1}{72}\right) (0.85) = \frac{\sqrt{3}\,\pi\,(0.85)}{144}. \]


5. Let $\bar{X}_2 = (X_1 + X_2)/2$ and $S_2^2 = \sum_{i=1}^2 (X_i - \bar{X}_2)^2$. Then


\[ W = \frac{(X_1 + X_2)^2}{(X_1 - X_2)^2} = \frac{2\bar{X}_2^2}{S_2^2}. \]


It follows from Eq. (8.4.4) that $U = \sqrt{2}\,\bar{X}_2/\sqrt{S_2^2}$ has the t distribution with one degree of freedom.
Since $W = U^2$, we have


\[ \Pr(W < 4) = \Pr(-2 < U < 2) = 2\Pr(U < 2) - 1. \]


It can be found from a table of the t distribution with one degree of freedom that $\Pr(U < 2)$ is just
slightly greater than 0.85. Hence, $\Pr(W < 4) \approx 0.70$.
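Because the t distribution with one degree of freedom is the standard Cauchy distribution, its c.d.f. has the closed form $1/2 + \arctan(u)/\pi$, so the table values in Exercise 5 can be checked directly. The sketch below uses that fact; the function name `t1_cdf` is introduced here for illustration.

```python
import math

def t1_cdf(u: float) -> float:
    """C.d.f. of the t distribution with one degree of freedom (standard Cauchy)."""
    return 0.5 + math.atan(u) / math.pi

p_u = t1_cdf(2.0)     # Pr(U < 2)
p_w = 2 * p_u - 1     # Pr(W < 4) = Pr(-2 < U < 2)
print(f"Pr(U < 2) = {p_u:.4f}, Pr(W < 4) = {p_w:.4f}")
```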


6. The distribution of $U = (20)^{1/2}(\bar{X}_{20} - \mu)/\sigma'$ is a t distribution with 19 degrees of freedom. Let $v$ be
the 0.95 quantile of this t distribution, namely 1.729. Then


\[ 0.95 = \Pr(U \le 1.729) = \Pr\left(\bar{X}_{20} \le \mu + [1.729/(20)^{1/2}]\,\sigma'\right). \]


It follows that we want $c = 1.729/(20)^{1/2} = 0.3866$.


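7. According to Theorem 5.7.4,


\[ \lim_{m \to \infty} \frac{(2\pi)^{1/2} (m + 1/2)^m \exp(-m - 1/2)}{\Gamma(m + 1/2)} = 1, \]

\[ \lim_{m \to \infty} \frac{(2\pi)^{1/2}\, m^{m - 1/2} \exp(-m)}{\Gamma(m)} = 1. \]


Taking the ratio of the above and dividing by $m^{1/2}$, we get


\[ \lim_{m \to \infty} \frac{\Gamma(m + 1/2)}{\Gamma(m)\, m^{1/2}}
 = \lim_{m \to \infty} \frac{(2\pi)^{1/2} (m + 1/2)^m \exp(-m - 1/2)}{(2\pi)^{1/2}\, m^{m - 1/2} \exp(-m)\, m^{1/2}}
 = \lim_{m \to \infty} \left(\frac{m + 1/2}{m}\right)^m \exp(-1/2) = 1, \]


where the last equality follows from Theorem 5.3.3 applied to $(1 + 1/(2m))^m$.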
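The limit can be observed numerically. The sketch below evaluates the ratio on the log scale to avoid overflow; the function name `gamma_ratio` is an illustrative choice.

```python
import math

def gamma_ratio(m: float) -> float:
    """Gamma(m + 1/2) / (Gamma(m) * m**0.5), computed on the log scale."""
    return math.exp(math.lgamma(m + 0.5) - math.lgamma(m) - 0.5 * math.log(m))

for m in (1, 10, 100, 10_000):
    print(m, gamma_ratio(m))
```

The printed values increase toward 1 as m grows.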


8. Let $f$ be the p.d.f. of $X$ and let $g$ be the p.d.f. of $Y$. Define


\[ h(c) = \Pr(-c < X < c) - \Pr(-c < Y < c) = \int_{-c}^{c} [f(x) - g(x)]\,dx. \qquad \text{(S.8.2)} \]


Suppose that $c_0$ can be chosen so that $f(x) > g(x)$ for all $-c_0 < x < c_0$ and $f(x) < g(x)$ for all $|x| > c_0$.
It should now be clear that $h(c_0) = \max_c h(c)$. To prove this, first let $c > c_0$. Then


\[ h(c) = h(c_0) + \int_{-c}^{-c_0} [f(x) - g(x)]\,dx + \int_{c_0}^{c} [f(x) - g(x)]\,dx. \]


Since $f(x) - g(x) < 0$ for all $x$ in these last two integrals, $h(c) < h(c_0)$. Similarly, if $0 < c < c_0$,


\[ h(c) = h(c_0) - \int_{-c_0}^{-c} [f(x) - g(x)]\,dx - \int_{c}^{c_0} [f(x) - g(x)]\,dx. \]


Since $f(x) - g(x) > 0$ for all $x$ in these last two integrals, $h(c) < h(c_0)$. Finally, notice that the standard
normal p.d.f. is greater than the t p.d.f. with five degrees of freedom for all $-c < x < c$ if $c = 1.63$ and
the normal p.d.f. is smaller than the t p.d.f. if $|x| > 1.63$.
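The crossing point of the two densities can be located by bisection. This is a sketch, not the method used in the original solution; the function names and the bracketing interval [1, 3] are assumptions made here (the normal p.d.f. is larger at 1 and smaller at 3).

```python
import math

def normal_pdf(x: float) -> float:
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def t_pdf(x: float, df: int) -> float:
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + x * x / df) ** (-(df + 1) / 2)

def crossing_point(df: int, lo: float = 1.0, hi: float = 3.0) -> float:
    """Bisection for the positive x where the normal and t p.d.f.s are equal."""
    diff = lambda x: normal_pdf(x) - t_pdf(x, df)
    for _ in range(60):
        mid = (lo + hi) / 2
        if diff(lo) * diff(mid) <= 0:
            hi = mid   # sign change lies in [lo, mid]
        else:
            lo = mid   # sign change lies in [mid, hi]
    return (lo + hi) / 2

c0 = crossing_point(5)
print(round(c0, 2))
```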


8.5 Confidence Intervals 


Commentary 


This section ends with an extended discussion of shortcomings of confidence intervals. The first paragraph 
on interpretation is fairly straightforward. Students at this level should be able to understand what the 
confidence statement is and is not saying. The long Example 8.5.11 illustrates how additional information that 
is available can be ignored in the confidence statement. Instructors should gauge the mathematical abilities 
of their students before discussing this example in detail. Although there is nothing more complicated than 
what has appeared earlier in the text, it does make use of multivariable calculus and some subtle reasoning. 

Many instructors will recognize the statistic Z in Example 8.5.11 as an ancillary. In many examples,
conditioning on an ancillary is one way of making confidence levels (and significance levels) more representative
of the amount of information available. The concept of ancillarity is beyond the scope of this text, and it
is not pursued in the example. The example merely raises the issue that available information like Z is not
necessarily taken into account in reporting a confidence coefficient. This makes the connection between the
statistical meaning of confidence and the colloquial meaning more tenuous.

If one is using the software R, remember that qnorm and qt compute quantiles of normal and t
distributions. These quantiles are ubiquitous in the construction of confidence intervals.


Solutions to Exercises 


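1. We need to show that


\[ \Pr\left[\bar{X}_n - \Phi^{-1}\left(\frac{1+\gamma}{2}\right) \frac{\sigma}{n^{1/2}} < \mu < \bar{X}_n + \Phi^{-1}\left(\frac{1+\gamma}{2}\right) \frac{\sigma}{n^{1/2}}\right] = \gamma. \qquad \text{(S.8.3)} \]


By subtracting $\bar{X}_n$ from all three sides of the above inequalities and then dividing all three sides by
$\sigma/n^{1/2} > 0$, we can rewrite the probability in (S.8.3) as


\[ \Pr\left[-\Phi^{-1}\left(\frac{1+\gamma}{2}\right) < \frac{\mu - \bar{X}_n}{\sigma/n^{1/2}} < \Phi^{-1}\left(\frac{1+\gamma}{2}\right)\right]. \]


The random variable $(\mu - \bar{X}_n)/(\sigma/n^{1/2})$ has a standard normal distribution no matter what $\mu$ and
$\sigma^2$ are. And the probability that a standard normal random variable is between $-\Phi^{-1}([1 + \gamma]/2)$ and
$\Phi^{-1}([1 + \gamma]/2)$ is $(1 + \gamma)/2 - [1 - (1 + \gamma)/2] = \gamma$.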


2. In this exercise, $\bar{X}_n = 3.0625$, $\sigma' = \left[\sum_{i=1}^n (X_i - \bar{X}_n)^2/(n-1)\right]^{1/2} = 0.5125$ and $\sigma'/n^{1/2} = 0.1812$.
Therefore, the shortest confidence interval for $\mu$ will have the form $3.0625 - 0.1812c < \mu < 3.0625 + 0.1812c$.
If a confidence coefficient $\gamma$ is to be used, then $c$ must satisfy the relation $\Pr(-c < U < c) = \gamma$, where
$U$ has the t distribution with $n - 1 = 7$ degrees of freedom. By symmetry,


\[ \Pr(-c < U < c) = \Pr(U < c) - \Pr(U < -c) = \Pr(U < c) - [1 - \Pr(U < c)] = 2\Pr(U < c) - 1. \]


As in the text, we find that $c$ must be the $(1 + \gamma)/2$ quantile of the t distribution with 7 degrees of
freedom.


(a) Here $\gamma = 0.90$, so $(1 + \gamma)/2 = 0.95$. It is found from a table of the t distribution with 7 degrees of
freedom that $c = 1.895$. Therefore, the confidence interval for $\mu$ has endpoints 3.0625 − (0.1812)
(1.895) = 2.719 and 3.0625 + (0.1812)(1.895) = 3.406.


(b) Here $\gamma = 0.95$, $(1 + \gamma)/2 = 0.975$, and $c = 2.365$. Therefore, the endpoints of the confidence
interval for $\mu$ are 3.0625 − (0.1812)(2.365) = 2.634 and 3.0625 + (0.1812)(2.365) = 3.491.


(c) Here $\gamma = 0.99$, $(1 + \gamma)/2 = 0.995$, and $c = 3.499$. Therefore, the endpoints of the interval are 2.428
and 3.697.


One obvious feature of this exercise, which should be emphasized, is that the larger the confidence
coefficient $\gamma$, the wider the confidence interval must be.
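The three intervals follow from one formula, so they can be recomputed together. The sketch below takes the sample mean, standard error, and t quantiles exactly as quoted in the solution; the helper `t_interval` is a name introduced for illustration.

```python
def t_interval(xbar: float, se: float, c: float) -> tuple[float, float]:
    """Endpoints xbar -/+ c * se of a t-based confidence interval."""
    return (xbar - c * se, xbar + c * se)

XBAR, SE = 3.0625, 0.1812          # values quoted in the solution
T7_QUANTILES = {0.90: 1.895, 0.95: 2.365, 0.99: 3.499}  # table values, 7 d.f.

intervals = {g: t_interval(XBAR, SE, c) for g, c in T7_QUANTILES.items()}
for gamma, (lo, hi) in intervals.items():
    print(f"gamma = {gamma}: ({lo:.3f}, {hi:.3f})")
```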


3. The endpoints of the confidence interval are $\bar{X}_n - c\sigma'/n^{1/2}$ and $\bar{X}_n + c\sigma'/n^{1/2}$. Therefore, $L = 2c\sigma'/n^{1/2}$
and $L^2 = 4c^2\sigma'^2/n$. Since


\[ W = \frac{\sum_{i=1}^n (X_i - \bar{X}_n)^2}{\sigma^2} \]


has the $\chi^2$ distribution with $n - 1$ degrees of freedom, $E(W) = n - 1$. Therefore, $E(\sigma'^2) = E(\sigma^2 W/[n - 1]) = \sigma^2$.
It follows that $E(L^2) = 4c^2\sigma^2/n$. As in the text, $c$ must be the $(1 + \gamma)/2$ quantile of the t
distribution with $n - 1$ degrees of freedom.


(a) Here, $(1 + \gamma)/2 = 0.975$. Therefore, from a table of the t distribution with $n - 1 = 4$ degrees of
freedom it is found that $c = 2.776$. Hence, $c^2 = 7.706$ and $E(L^2) = 4(7.706)\sigma^2/5 = 6.16\sigma^2$.

(b) For the t distribution with 9 degrees of freedom, $c = 2.262$. Hence, $E(L^2) = 2.05\sigma^2$.

(c) Here, $c = 2.045$ and $E(L^2) = 0.56\sigma^2$.


It should be noted from parts (a), (b), and (c) that for a fixed value of $\gamma$, $E(L^2)$ decreases as the
sample size $n$ increases.


(d) Here, $\gamma = 0.90$, so $(1 + \gamma)/2 = 0.95$. It is found that $c = 1.895$. Hence, $E(L^2) = 4(1.895)^2\sigma^2/8 =
1.80\sigma^2$.


(e) Here, $\gamma = 0.95$, so $(1 + \gamma)/2 = 0.975$ and $c = 2.365$. Hence, $E(L^2) = 2.80\sigma^2$.
(f) Here, $\gamma = 0.99$, so $(1 + \gamma)/2 = 0.995$ and $c = 3.499$. Hence, $E(L^2) = 6.12\sigma^2$.


It should be noted from parts (d), (e), and (f) that for a fixed sample size $n$, $E(L^2)$ increases as $\gamma$
increases.
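All six values come from $E(L^2)/\sigma^2 = 4c^2/n$, so they can be tabulated in a few lines. In the sketch below the quantiles are the table values quoted above; the sample sizes n = 10 for part (b) and n = 30 for part (c) are not stated in the solution and are inferred here from the degrees of freedom and the reported results.

```python
def expected_sq_length(c: float, n: int) -> float:
    """E(L^2) / sigma^2 = 4 c^2 / n for the t-based interval of length L."""
    return 4 * c**2 / n

# (quantile c from the t table, sample size n); n for (b) and (c) is inferred
cases = {
    "a": (2.776, 5), "b": (2.262, 10), "c": (2.045, 30),
    "d": (1.895, 8), "e": (2.365, 8), "f": (3.499, 8),
}
results = {k: expected_sq_length(c, n) for k, (c, n) in cases.items()}
for part, value in results.items():
    print(f"({part}) E(L^2) = {value:.2f} sigma^2")
```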


4. Since $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ has a standard normal distribution,

\[ \Pr\left[-1.96 < \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} < 1.96\right] = 0.95. \]


This relation can be rewritten in the form


\[ \Pr\left(\bar{X}_n - \frac{1.96\sigma}{\sqrt{n}} < \mu < \bar{X}_n + \frac{1.96\sigma}{\sqrt{n}}\right) = 0.95. \]

Therefore, the interval with endpoints $\bar{X}_n - 1.96\sigma/\sqrt{n}$ and $\bar{X}_n + 1.96\sigma/\sqrt{n}$ will be a confidence interval
for $\mu$ with confidence coefficient 0.95. The length of this interval will be $3.92\sigma/\sqrt{n}$. It now follows that
$3.92\sigma/\sqrt{n} < 0.01\sigma$ if and only if $\sqrt{n} > 392$. This means that $n > 153664$, or $n = 153665$ or more.


5. Since $\sum_{i=1}^n (X_i - \bar{X}_n)^2/\sigma^2$ has a $\chi^2$ distribution with $n - 1$ degrees of freedom, it is possible to find
constants $c_1$ and $c_2$ which satisfy the relation given in the hint for this exercise. (As explained in this
section, there are an infinite number of different pairs of values of $c_1$ and $c_2$ that might be used.) The
relation given in the hint can be rewritten in the form

\[ \Pr\left[\frac{1}{c_2} \sum_{i=1}^n (X_i - \bar{X}_n)^2 < \sigma^2 < \frac{1}{c_1} \sum_{i=1}^n (X_i - \bar{X}_n)^2\right] = \gamma. \]


Therefore, the interval with endpoints equal to the observed values of $\sum_{i=1}^n (X_i - \bar{X}_n)^2/c_2$ and
$\sum_{i=1}^n (X_i - \bar{X}_n)^2/c_1$ will be a confidence interval for $\sigma^2$ with confidence coefficient $\gamma$.


6. The exponential distribution with mean $\mu$ is the same as the gamma distribution with $\alpha = 1$ and
$\beta = 1/\mu$. Therefore, by Theorem 5.7.7, $\sum_{i=1}^n X_i$ will have the gamma distribution with parameters
$\alpha = n$ and $\beta = 1/\mu$. In turn, it follows from Exercise 1 of Sec. 5.7 that $\sum_{i=1}^n X_i/\mu$ has the gamma
distribution with parameters $\alpha = n$ and $\beta = 1$. It follows from Definition 8.2.1 that $2\sum_{i=1}^n X_i/\mu$ has
the $\chi^2$ distribution with $2n$ degrees of freedom. Constants $c_1$ and $c_2$ which satisfy the relation given
in the hint for this exercise will then each be 1/2 times some quantile of the $\chi^2$ distribution with $2n$
degrees of freedom. There are an infinite number of pairs of values of such quantiles, one corresponding
to each pair of numbers $q_1 > 0$ and $q_2 > 0$ such that $q_2 - q_1 = \gamma$. For example, with $q_1 = (1 - \gamma)/2$
and $q_2 = (1 + \gamma)/2$ we can let $c_i$ be 1/2 times the $q_i$ quantile of the $\chi^2$ distribution with $2n$ degrees of
freedom for $i = 1, 2$. It now follows that


\[ \Pr\left(\frac{1}{c_2} \sum_{i=1}^n X_i < \mu < \frac{1}{c_1} \sum_{i=1}^n X_i\right) = \gamma. \]


Therefore, the interval with endpoints equal to the observed values of $\sum_{i=1}^n X_i/c_2$ and $\sum_{i=1}^n X_i/c_1$ will be
a confidence interval for $\mu$ with confidence coefficient $\gamma$.


7. The average of the $n = 20$ values is $\bar{x}_n = 156.85$, and $\sigma' = 22.64$. The appropriate t distribution quantile
is $T_{19}^{-1}(0.95) = 1.729$. The endpoints of the confidence interval are then $156.85 \pm 22.64 \times 1.729/20^{1/2}$.
Completing the calculation, we get the interval (148.1, 165.6).


8. According to (8.5.15), $\Pr(|\bar{X}_2 - \theta| < 0.3 \mid Z = 0.9) = 1$, because $0.3 > (1 - 0.9)/2$. Since $Z = 0.9$, we
know that the interval between $X_1$ and $X_2$ covers a length of 0.9 in the interval $[\theta - 1/2, \theta + 1/2]$. Hence
$\bar{X}_2$ has to lie between


\[ \frac{\theta - 1/2 + (\theta - 1/2 + 0.9)}{2} = \theta - 0.05 \quad \text{and} \quad \frac{\theta - 1/2 + 0.1 + \theta + 1/2}{2} = \theta + 0.05. \]


Hence $\bar{X}_2$ must be within 0.05 of $\theta$, hence well within 0.3 of $\theta$.


9. (a) The interval between the smaller and larger values is (4.7, 5.3).


(b) The values of $\theta$ consistent with the observed data are those between 5.3 − 0.5 = 4.8 and 4.7 + 0.5 =
5.2.


(c) The interval in part (a) contains the set of all possible $\theta$ values, hence it is larger than the set of
possible $\theta$ values.


(d) The value of $Z$ is 5.3 − 4.7 = 0.6.
(e) According to (8.5.15),


\[ \Pr(|\bar{X}_2 - \theta| < 0.1 \mid Z = 0.6) = \frac{2 \times 0.1}{1 - 0.6} = 0.5. \]


10. (a) The likelihood function is


\[ f(x \mid \theta) = \begin{cases} 1 & \text{if } 4.8 < \theta < 5.2, \\ 0 & \text{otherwise.} \end{cases} \]


(See the solution to Exercise 9(b) to see how the numbers 4.8 and 5.2 arise.) The posterior p.d.f.
of $\theta$ is proportional to this likelihood times the prior p.d.f., hence the posterior p.d.f. is


\[ \xi(\theta \mid x) = \begin{cases} c \exp(-0.1\theta) & \text{if } 4.8 < \theta < 5.2, \\ 0 & \text{otherwise,} \end{cases} \]


where $c$ is a constant that makes this function into a p.d.f. The constant must satisfy


\[ c \int_{4.8}^{5.2} \exp(-0.1\theta)\,d\theta = 1. \]


Since the integral above equals 10[exp(−0.48) − exp(−0.52)] = 0.2426, we must have $c = 1/0.2426 =
4.122$.
(b) The observed value of $\bar{X}_2$ is $\bar{x}_2 = 5$. So, the posterior probability that $|\theta - \bar{x}_2| < 0.1$ is


\[ \int_{4.9}^{5.1} 4.122 \exp(-0.1\theta)\,d\theta = 41.22[\exp(-0.49) - \exp(-0.51)] = 0.5. \]


(c) Since the interval in part (a) of Exercise 9 contains the entire set of possible $\theta$ values, the posterior
probability that $\theta$ lies in that interval is 1.


(d) The posterior p.d.f. of $\theta$ is almost constant over the interval (4.8, 5.2), hence the c.d.f. will be
almost linear. The function in (8.5.15) is also linear. Indeed, for $c < 0.2$, the posterior probability
of $|\theta - 5| < c$ equals


\[ \int_{5-c}^{5+c} 4.122 \exp(-0.1\theta)\,d\theta = 41.22 \exp(-0.5)[\exp(0.1c) - \exp(-0.1c)] \approx 25 \times 2 \times 0.1c = 5c. \]

Since $z = 0.6$ in this example, $5c = 2c/(1 - z)$, the same as (8.5.15).
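The normalizing constant and the posterior probabilities in Exercise 10 can be checked with a short calculation. This is a sketch that takes the posterior kernel exp(−0.1θ) on (4.8, 5.2) from the solution; the names `RATE`, `kernel_integral`, and `posterior_prob` are introduced here for illustration.

```python
import math

RATE = 0.1             # exponent in the posterior kernel exp(-0.1 * theta)
LO, HI = 4.8, 5.2      # support of the posterior from Exercise 9(b)

def kernel_integral(a: float, b: float) -> float:
    """Integral of exp(-RATE * theta) over [a, b]."""
    return (math.exp(-RATE * a) - math.exp(-RATE * b)) / RATE

c = 1 / kernel_integral(LO, HI)          # normalizing constant

def posterior_prob(center: float, half_width: float) -> float:
    """Posterior probability that |theta - center| < half_width."""
    a, b = max(LO, center - half_width), min(HI, center + half_width)
    return c * kernel_integral(a, b) if a < b else 0.0

print(round(c, 3), round(posterior_prob(5.0, 0.1), 3))
```

Evaluating `posterior_prob(5.0, h)` for several half-widths h below 0.2 gives values close to 5h, the near-linearity noted in part (d).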


11. The variance stabilizing transformation is $\alpha(x) = \arcsin(x^{1/2})$, and the approximate distribution of
$\alpha(\bar{X}_n)$ is the normal distribution with mean $\alpha(p)$ and variance $1/n$. So,


\[ \Pr\left(\arcsin(\bar{X}_n^{1/2}) - \Phi^{-1}([1 + \gamma]/2)\,n^{-1/2} < \arcsin(p^{1/2}) < \arcsin(\bar{X}_n^{1/2}) + \Phi^{-1}([1 + \gamma]/2)\,n^{-1/2}\right) \approx \gamma. \]


This would make the interval with endpoints


\[ \arcsin(\bar{x}_n^{1/2}) \pm \Phi^{-1}([1 + \gamma]/2)\,n^{-1/2} \qquad \text{(S.8.4)} \]

an approximate coefficient $\gamma$ confidence interval for $\arcsin(p^{1/2})$. The transformation $\alpha(x)$ has an inverse
$\alpha^{-1}(y) = \sin^2(y)$ for $0 \le y \le \pi/2$. If the endpoints in (S.8.4) are between 0 and $\pi/2$, then the interval
with endpoints


\[ \sin^2\left(\arcsin(\bar{x}_n^{1/2}) \pm \Phi^{-1}([1 + \gamma]/2)\,n^{-1/2}\right) \qquad \text{(S.8.5)} \]


will be an approximate coefficient $\gamma$ confidence interval for $p$. If the lower endpoint in (S.8.4) is negative,
replace the lower endpoint in (S.8.5) by 0. If the upper endpoint in (S.8.4) is greater than $\pi/2$, replace
the upper endpoint in (S.8.5) by 1. With these modifications, the interval with the endpoints in (S.8.5)
becomes an approximate coefficient $\gamma$ confidence interval for $p$.


12. For this part of the proof, we define


\[ A = r(G^{-1}(\gamma_2), X), \qquad B = r(G^{-1}(\gamma_1), X). \]


If $r(v, x)$ is strictly decreasing in $v$ for each $x$, we have

\[ V(X, \theta) < c \text{ if and only if } g(\theta) > r(c, X). \qquad \text{(S.8.6)} \]

Let $c = G^{-1}(\gamma_i)$ in Eq. (S.8.6) for each of $i = 1, 2$ to obtain

\[ \Pr(g(\theta) > B) = \gamma_1, \qquad \Pr(g(\theta) > A) = \gamma_2. \qquad \text{(S.8.7)} \]

Because $V$ has a continuous distribution and $r$ is strictly decreasing,

\[ \Pr(A = g(\theta)) = \Pr(V(X, \theta) = G^{-1}(\gamma_2)) = 0, \]


and similarly $\Pr(B = g(\theta)) = 0$. The two equations in (S.8.7) combine to give $\Pr(A < g(\theta) < B) = \gamma$.


8.6 Bayesian Analysis of Samples from a Normal Distribution 


Commentary 


Obviously, this section should only be covered by those who are treating Bayesian topics. One might find it
useful to discuss the interpretation of the prior hyperparameters in terms of amount of information and prior
estimates. In this sense $\lambda_0$ and $2\alpha_0$ represent amounts of prior information about the mean and variance
respectively, while $\mu_0$ and $\beta_0/\alpha_0$ are prior estimates of the mean and variance respectively. The corresponding
posterior estimates are then weighted averages of the prior estimates and data-based estimates with weights
equal to the amounts of information. The posterior estimate of variance, namely $\beta_1/\alpha_1$, is the weighted
average of $\beta_0/\alpha_0$ (with weight $2\alpha_0$), $\sigma'^2$ (with weight $n - 1$), and $n\lambda_0(\bar{x}_n - \mu_0)^2/(\lambda_0 + n)$ (with weight 1).
This last term results from the fact that the prior distribution of the mean depends on variance (precision),
hence how far $\bar{x}_n$ is from $\mu_0$ tells us something about the variance also.
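The weighted-average reading of $\beta_1/\alpha_1$ can be written out explicitly. The identity below is a sketch that assumes the update formulas used in the solutions of this section, namely $\alpha_1 = \alpha_0 + n/2$ and $\beta_1 = \beta_0 + \frac{1}{2}s_n^2 + \frac{n\lambda_0(\bar{x}_n - \mu_0)^2}{2(\lambda_0 + n)}$, together with $s_n^2 = (n-1)\sigma'^2$.

```latex
\frac{\beta_1}{\alpha_1}
  = \frac{\beta_0 + \tfrac{1}{2} s_n^2 + \dfrac{n\lambda_0(\bar{x}_n - \mu_0)^2}{2(\lambda_0 + n)}}
         {\alpha_0 + n/2}
  = \frac{2\alpha_0 \cdot \dfrac{\beta_0}{\alpha_0}
          + (n - 1) \cdot \sigma'^2
          + 1 \cdot \dfrac{n\lambda_0(\bar{x}_n - \mu_0)^2}{\lambda_0 + n}}
         {2\alpha_0 + (n - 1) + 1}.
```

The three weights $2\alpha_0$, $n - 1$, and 1 sum to $2\alpha_0 + n = 2\alpha_1$, which is the number of degrees of freedom of the posterior t distribution for the mean.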

If one is using the software R, the functions qt and pt respectively compute the quantile function and
c.d.f. of a t distribution. These functions can replace the use of tables for some of the calculations done in
this section and in the exercises.


Solutions to Exercises


1. Since $X$ has the normal distribution with mean $\mu$ and variance $1/\tau$, we know that $Y$ has the normal
distribution with mean $a\mu + b$ and variance $a^2/\tau$. Therefore, the precision of $Y$ is $\tau/a^2$.


2. This exercise is merely a restatement of Theorem 7.3.3 with $\theta$ replaced by $\mu$, $\sigma^2$ replaced by $1/\tau$,
$\mu$ replaced by $\mu_0$, and $v^2$ replaced by $1/\lambda_0$. The precision of the posterior distribution is the reciprocal
of the variance of the posterior distribution given in that theorem.


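3. The joint p.d.f. $f_n(x \mid \tau)$ of $X_1, \ldots, X_n$ is given shortly after Definition 8.6.1 in the text, and the prior
p.d.f. $\xi(\tau)$ is proportional to the expression (7) in the proof of Theorem 8.6.1. Therefore, the posterior
p.d.f. of $\tau$ satisfies the following relation:


\[ \xi(\tau \mid x) \propto f_n(x \mid \tau)\,\xi(\tau)
 \propto \tau^{n/2} \exp\left\{-\frac{\tau}{2} \sum_{i=1}^n (x_i - \mu)^2\right\} \tau^{\alpha_0 - 1} \exp(-\beta_0 \tau)
 = \tau^{\alpha_0 + (n/2) - 1} \exp\left\{-\left[\beta_0 + \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2\right] \tau\right\}. \]


It can now be seen that this posterior p.d.f. is, except for a constant factor, the p.d.f. of the gamma
distribution specified in the exercise.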


4. The posterior distribution of $\tau$, after using the usual improper prior, is the gamma distribution with
parameters $(n - 1)/2$ and $s_n^2/2$. Now, $V$ is a constant $(n - 1)\sigma'^2$ times $\tau$, so $V$ has the gamma distribution
with parameters $(n - 1)/2$ and $(s_n^2/2)/[(n - 1)\sigma'^2] = 1/2$. This last gamma distribution is also known
as the $\chi^2$ distribution with $n - 1$ degrees of freedom.


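5. Since $E(\tau) = \alpha_0/\beta_0 = 1/2$ and $\mathrm{Var}(\tau) = \alpha_0/\beta_0^2 = 1/8$, then $\alpha_0 = 2$ and $\beta_0 = 4$. Also, $\mu_0 = E(\mu) = -5$.
Finally, $\mathrm{Var}(\mu) = \beta_0/[\lambda_0(\alpha_0 - 1)] = 1$. Therefore, $\lambda_0 = 4$.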


6. Since $E(\tau) = \alpha_0/\beta_0 = 1/2$ and $\mathrm{Var}(\tau) = \alpha_0/\beta_0^2 = 1/4$, then $\alpha_0 = 1$ and $\beta_0 = 2$. But $\mathrm{Var}(\mu)$ is finite
only if $\alpha_0 > 1$.


7. Since $E(\tau) = \alpha_0/\beta_0 = 1$ and $\mathrm{Var}(\tau) = \alpha_0/\beta_0^2 = 4$, then $\alpha_0 = \beta_0 = 1/4$. But $E(\mu)$ exists only if
$\alpha_0 > 1/2$.


8. It follows from Theorem 8.6.2 that the random variable $U = (\mu - 4)/4$ has the t distribution with
$2\alpha_0 = 2$ degrees of freedom.

(a) $\Pr(\mu > 0) = \Pr(U > -1) = \Pr(U < 1) = 0.79$.

(b)
\[ \Pr(0.736 < \mu < 15.680) = \Pr(-0.816 < U < 2.920) \]
\[ = \Pr(U < 2.920) - \Pr(U < -0.816) \]
\[ = \Pr(U < 2.920) - [1 - \Pr(U < 0.816)] \]
\[ = 0.95 - (1 - 0.75) = 0.70. \]

9. (a) The posterior hyperparameters are computed in the example. The degrees of freedom are $2\alpha_1 = 22$,
so the quantile from the t distribution is $T_{22}^{-1}([1 + 0.9]/2) = 1.717$, and the interval is


\[ \mu_1 \pm 1.717 \left(\frac{\beta_1}{\lambda_1 \alpha_1}\right)^{1/2} = 183.95 \pm 1.717 \left(\frac{50925.37}{20 \times 11}\right)^{1/2} = (157.83, 210.07). \]


(b) This interval has endpoints $182.17 \pm (88678.5/[17 \times 18])^{1/2}\,T_{17}^{-1}(0.95)$. With $T_{17}^{-1}(0.95) = 1.740$, we
get the interval (152.55, 211.79).


10. Since $E(\tau) = \alpha_0/\beta_0 = 2$ and


\[ \mathrm{Var}(\tau) = \frac{\alpha_0}{\beta_0^2} = E(\tau^2) - [E(\tau)]^2 = 1, \]


then $\alpha_0 = 4$ and $\beta_0 = 2$. Also $\mu_0 = E(\mu) = 0$. Therefore, by Eq. (14), $Y = (2\lambda_0)^{1/2}\mu$ has a t distribution
with $2\alpha_0 = 8$ degrees of freedom. It is found from a table of the t distribution that $\Pr(|Y| < 0.706) = 0.5$.
Therefore, $\Pr\left(|\mu| < 0.706/(2\lambda_0)^{1/2}\right) = 0.5$. It now follows from the condition given in the exercise that
$0.706/(2\lambda_0)^{1/2} = 1.412$. Hence, $\lambda_0 = 1/8$.


11. It follows from Theorem 8.6.1 that $\mu_1 = 80/81$, $\lambda_1 = 81/8$, $\alpha_1 = 9$, and $\beta_1 = 491/81$. Therefore, if Eq.
(8.6.9) is applied to this posterior distribution, it is seen that the random variable $U = (3.877)(\mu - 0.988)$
has the t distribution with 18 degrees of freedom. Therefore, it is found from a table that $\Pr(-2.101 < U <
2.101) = 0.95$. An equivalent statement is $\Pr(0.446 < \mu < 1.530) = 0.95$. This interval will be the
shortest one having probability 0.95 because the center of the interval is $\mu_1$, the point where the p.d.f.
of $\mu$ is a maximum. Since the p.d.f. of $\mu$ decreases as we move away from $\mu_1$ in either direction, it
follows that an interval having given length will have the maximum probability when it is centered at
$\mu_1$.


12. Since $E(\tau) = \alpha_0/\beta_0 = 1$ and $\mathrm{Var}(\tau) = \alpha_0/\beta_0^2 = 1/3$, it follows that $\alpha_0 = \beta_0 = 3$. Also, since
the distribution of $\mu$ is symmetric with respect to $\mu_0$ and we are given that $\Pr(\mu > 3) = 0.5$, then
$\mu_0 = 3$. Now, by Theorem 8.6.2, $U = \lambda_0^{1/2}(\mu - 3)$ has the t distribution with $2\alpha_0 = 6$ degrees of
freedom. It is found from a table that $\Pr(U < 1.440) = 0.90$. Therefore, $\Pr(U > -1.440) = 0.90$ and
it follows that $\Pr\left(\mu > 3 - 1.440/\lambda_0^{1/2}\right) = 0.90$. It now follows from the condition given in the exercise that
$3 - 1.440/\lambda_0^{1/2} = 0.12$. Hence, $\lambda_0 = 1/4$.


13. It follows from Theorem 8.6.1 that $\mu_1 = 67/33$, $\lambda_1 = 33/4$, $\alpha_1 = 7$, and $\beta_1 = 367/33$. In calculating
the value of $\beta_1$, we have used the relation


\[ \sum_{i=1}^n (x_i - \bar{x}_n)^2 = \sum_{i=1}^n x_i^2 - n\bar{x}_n^2. \]


If Theorem 8.6.2 is now applied to this posterior distribution, it is seen that the random variable
$U = (2.279)(\mu - 2.030)$ has the t distribution with 14 degrees of freedom. Therefore, it is found from a
table that $\Pr(-2.977 < U < 2.977) = 0.99$. An equivalent statement is $\Pr(0.724 < \mu < 3.336) = 0.99$.


14. The interval should run between the values $\mu_1 \pm (\beta_1/[\lambda_1\alpha_1])^{1/2}\,T_{2\alpha_1}^{-1}(0.95)$. The values we need are
available from the example or the table of the t distribution: $\mu_1 = 1.345$, $\beta_1 = 1.0484$, $\lambda_1 = 11$,
$\alpha_1 = 5.5$, and $T_{11}^{-1}(0.95) = 1.796$. The resulting interval is (1.109, 1.581). This interval is a bit wider
than the confidence interval in Example 8.5.4. This is due mostly to the fact that $(\beta_1/\alpha_1)^{1/2}$ is somewhat
larger than $\sigma'$.


15. (a) The posterior hyperparameters are

\[ \mu_1 = \frac{2 \times 3.5 + 11 \times 7.2}{2 + 11} = 6.63, \]
\[ \lambda_1 = 2 + 11 = 13, \]
\[ \alpha_1 = 2 + \frac{11}{2} = 7.5, \]
\[ \beta_1 = 1 + \frac{1}{2}\left(20.3 + \frac{2 \times 11}{2 + 11}(7.2 - 3.5)^2\right) = 22.73. \]

(b) The interval should run between the values $\mu_1 \pm (\beta_1/[\lambda_1\alpha_1])^{1/2}\,T_{15}^{-1}(0.975)$. From the table of the
t distribution in the book, we obtain $T_{15}^{-1}(0.975) = 2.131$. The interval is then (5.601, 7.659).


16. (a) The average of all 30 observations is $\bar{x}_{30} = 1.442$ and $s_{30}^2 = 2.671$. Using the prior from
Example 8.6.2, we obtain

\[ \mu_1 = \frac{1 \times 1 + 30 \times 1.442}{1 + 30} = 1.428, \]
\[ \lambda_1 = 1 + 30 = 31, \]
\[ \alpha_1 = 0.5 + \frac{30}{2} = 15.5, \]
\[ \beta_1 = 0.5 + \frac{1}{2}\left(2.671 + \frac{1 \times 30}{1 + 30}(1.442 - 1)^2\right) = 1.930. \]

The posterior distribution of $\mu$ and $\tau$ is a joint normal-gamma distribution with the above
hyperparameters.


(b) The average of the 20 new observations is $\bar{x}_{20} = 1.474$ and $s_{20}^2 = 1.645$. Using the posterior in
Example 8.6.2 as the prior, we obtain the hyperparameters

\[ \mu_1 = \frac{11 \times 1.345 + 20 \times 1.474}{11 + 20} = 1.428, \]
\[ \lambda_1 = 11 + 20 = 31, \]
\[ \alpha_1 = 5.5 + \frac{20}{2} = 15.5, \]
\[ \beta_1 = 1.0484 + \frac{1}{2}\left(1.645 + \frac{11 \times 20}{11 + 20}(1.474 - 1.345)^2\right) = 1.930. \]

The posterior hyperparameters are the same as those found in part (a). Indeed, one can prove
that they must be the same when one updates sequentially or all at once.


17. Using just the first ten observations, we have $\bar{x}_n = 1.379$ and $s_n^2 = 0.9663$. This makes $\mu_1 = 1.379$,
$\lambda_1 = 10$, $\alpha_1 = 4.5$, and $\beta_1 = 0.4831$. The posterior distribution of $\mu$ and $\tau$ is the normal-gamma
distribution with these hyperparameters.


18. Now, we use the hyperparameters found in Exercise 17 as prior hyperparameters and combine these
with the last 20 observations. The average of the 20 new observations is $\bar{x}_{20} = 1.474$ and $s_{20}^2 = 1.645$.
We then obtain
\[
\begin{aligned}
\mu_1 &= \frac{10 \times 1.379 + 20 \times 1.474}{10 + 20} = 1.442, \\
\lambda_1 &= 10 + 20 = 30, \\
\alpha_1 &= 4.5 + \frac{20}{2} = 14.5, \\
\beta_1 &= 0.4831 + \frac{1}{2}\left(1.645 + \frac{10 \times 20}{10 + 20}(1.474 - 1.379)^2\right) = 1.336.
\end{aligned}
\]
Comparing two sets of hyperparameters is not as informative as comparing inferences. For example, a
posterior probability interval will be centered at $\mu_1$ and have half-width proportional to $(\beta_1/[\alpha_1\lambda_1])^{1/2}$.
Since $\mu_1$ is nearly the same in this case and in Exercise 16 part (b), the two intervals will be centered
in about the same place. The values of $(\beta_1/[\alpha_1\lambda_1])^{1/2}$ for this exercise and for Exercise 16 part (b) are
respectively 0.05542 and 0.06338. So we expect the intervals to be slightly shorter in this exercise than
in Exercise 16. (However, the quantiles of the $t$ distribution with $2\alpha_1 = 29$ degrees of freedom in this
exercise are a bit larger than the quantiles of the $t$ distribution with 31 degrees of freedom in Exercise 16.)
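The hyperparameter updates in Exercises 16 to 18 all apply the same normal-gamma rule, so they can be checked numerically. The following is a minimal Python sketch, not part of the original solutions: the names `normal_gamma_update` and `half_width_scale` are hypothetical, and `s2` is assumed to be the sum of squared deviations about the sample mean, as in the data summaries quoted above.

```python
from math import sqrt

def normal_gamma_update(mu0, lam0, alpha0, beta0, n, xbar, s2):
    """Return posterior (mu1, lam1, alpha1, beta1) for a normal-gamma prior.

    s2 is assumed to be the sum of squared deviations about xbar.
    """
    lam1 = lam0 + n
    mu1 = (lam0 * mu0 + n * xbar) / lam1
    alpha1 = alpha0 + n / 2
    beta1 = beta0 + 0.5 * (s2 + lam0 * n * (xbar - mu0) ** 2 / lam1)
    return mu1, lam1, alpha1, beta1

def half_width_scale(mu1, lam1, alpha1, beta1):
    """Factor (beta1 / [alpha1 * lam1])^(1/2) that scales the interval half-width."""
    return sqrt(beta1 / (alpha1 * lam1))

# Exercise 16(a): all 30 observations combined with the prior of Example 8.6.2.
ex16a = normal_gamma_update(1, 1, 0.5, 0.5, 30, 1.442, 2.671)
# Exercise 16(b): posterior after 10 observations used as prior for the last 20.
ex16b = normal_gamma_update(1.345, 11, 5.5, 1.0484, 20, 1.474, 1.645)
# Exercise 18: posterior from Exercise 17 used as prior for the last 20.
ex18 = normal_gamma_update(1.379, 10, 4.5, 0.4831, 20, 1.474, 1.645)

print(ex16a, ex16b, ex18)
print(half_width_scale(*ex18), half_width_scale(*ex16b))
```

Up to rounding of the inputs, the batch and sequential updates of Exercise 16 agree, which is the point made in part (b).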
19. (a) For the 20 observations given in Exercise 7 of Sec. 8.5, the data summaries are $\bar{x}_n = 156.85$ and
$s_n^2 = 9740.55$. So, the posterior hyperparameters are
\[
\begin{aligned}
\mu_1 &= \frac{0.5 \times 150 + 20 \times 156.85}{0.5 + 20} = 156.68, \\
\lambda_1 &= 0.5 + 20 = 20.5, \\
\alpha_1 &= 1 + \frac{20}{2} = 11, \\
\beta_1 &= 4 + \frac{1}{2}\left(9740.55 + \frac{0.5 \times 20}{0.5 + 20}(156.85 - 150)^2\right) = 4885.7.
\end{aligned}
\]
The joint posterior of $\mu$ and $\tau$ is the normal-gamma distribution with these hyperparameters.
(b) The interval we want has endpoints $\mu_1 \pm (\beta_1/[\alpha_1\lambda_1])^{1/2}\,T_{22}^{-1}(0.95)$. The quantile we want is
$T_{22}^{-1}(0.95) = 1.717$. Substituting the posterior hyperparameters gives the endpoints to be a =
148.69 and b = 164.7.
20. The data summaries in Example 7.3.10 are $n = 20$, $\bar{x}_{20} = 0.125$. Combine these with $s_{20}^2 = 2102.9$ to
get the posterior hyperparameters:
\[
\begin{aligned}
\mu_1 &= \frac{1 \times 0 + 20 \times 0.125}{1 + 20} = 0.1190, \\
\lambda_1 &= 1 + 20 = 21, \\
\alpha_1 &= 1 + \frac{20}{2} = 11, \\
\beta_1 &= 60 + \frac{2102.9}{2} + \frac{20 \times 1 \times (0.125 - 0)^2}{2(1 + 20)} = 1111.5.
\end{aligned}
\]
(a) The posterior distribution of $(\mu, \tau)$ is the normal-gamma distribution with the posterior
hyperparameters given above.
(b) The posterior distribution of
\[
\left(\frac{\lambda_1\alpha_1}{\beta_1}\right)^{1/2}(\mu - 0.1190) = 0.4559(\mu - 0.1190) = T
\]
is the $t$ distribution with 22 degrees of freedom. So,
\[
\Pr(\mu > 1 \mid x) = \Pr[0.4559(\mu - 0.1190) > 0.4559(1 - 0.1190)] = \Pr(T > 0.4016) = 0.3459,
\]
where the final probability can be found by using statistical software or interpolating in the table
of the $t$ distribution.

8.7 Unbiased Estimators 
Commentary 


The subsection on limitations of unbiased estimators at the end of this section should be used selectively by 
instructors after gauging the ability of their students to understand examples with nonstandard structure. 
Solutions to Exercises


1. (a) The variance of a Poisson random variable with mean $\theta$ is also $\theta$. So the variance is $\sigma^2 = g(\theta) = \theta$.

(b) The M.L.E. of $g(\theta) = \theta$ was found in Exercise 5 of Sec. 7.5, and it equals $\bar{X}_n$. The mean of $\bar{X}_n$ is
the same as the mean of each $X_i$, namely $\theta$, hence the M.L.E. is unbiased.

2. Let $E(X^k) = \beta_k$. Then
\[
E\left(\frac{1}{n}\sum_{i=1}^n X_i^k\right) = \frac{1}{n}\sum_{i=1}^n E(X_i^k) = \frac{1}{n} \cdot n\beta_k = \beta_k.
\]

3. By Exercise 2, $\delta_1 = \frac{1}{n}\sum_{i=1}^n X_i^2$ is an unbiased estimator of $E(X^2)$. Also, we know that
$\delta_2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2$ is an unbiased estimator of Var(X). Therefore, it follows from the hint
for this exercise that $\delta_1 - \delta_2$ will be an unbiased estimator of $[E(X)]^2$.


4. If X has the geometric distribution with parameter p, then it follows from Eq. (5.5.7) that $E(X) =
(1 - p)/p = 1/p - 1$. Therefore, $E(X + 1) = 1/p$, which implies that $X + 1$ is an unbiased estimator of $1/p$.


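5. We shall follow the hint for this exercise. If $E[\delta(X)] = \exp(\lambda)$, then
\[
\exp(\lambda) = E[\delta(X)] = \sum_{x=0}^{\infty} \delta(x) f(x \mid \lambda) = \sum_{x=0}^{\infty} \frac{\delta(x)\exp(-\lambda)\lambda^x}{x!}.
\]
Therefore,
\[
\sum_{x=0}^{\infty} \frac{\delta(x)\lambda^x}{x!} = \exp(2\lambda) = \sum_{x=0}^{\infty} \frac{(2\lambda)^x}{x!} = \sum_{x=0}^{\infty} \frac{2^x\lambda^x}{x!}.
\]
Since two power series in $\lambda$ can be equal only if the coefficients of $\lambda^x$ are equal for x = 0, 1, 2, ..., it
follows that $\delta(x) = 2^x$ for x = 0, 1, 2, .... This argument also shows that this estimator $\delta(X)$ is the
unique unbiased estimator of $\exp(\lambda)$ in this problem.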
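As a quick numerical illustration of Exercise 5, the expectation of $2^X$ under a Poisson distribution can be summed directly. This is only a sketch: the function name `expected_two_to_x` and the truncation at 80 terms are our own choices, not part of the text.

```python
from math import exp, factorial

def expected_two_to_x(lam, terms=80):
    """Approximate E[2^X] for X ~ Poisson(lam) by truncating the series."""
    return sum(2**x * exp(-lam) * lam**x / factorial(x) for x in range(terms))

lam = 1.5  # hypothetical value of the Poisson mean
print(expected_two_to_x(lam), exp(lam))  # the two numbers should agree closely
```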


6. The M.S.E. of $\hat\sigma_0^2$ is given by Eq. (8.7.8) with c = 1/n and it is, therefore, equal to $(2n - 1)\sigma^4/n^2$. The
M.S.E. of $\hat\sigma_1^2$ is given by Eq. (8.7.8) with c = 1/(n - 1) and it is, therefore, equal to $2\sigma^4/(n - 1)$. Since
$(2n - 1)/n^2 < 2/(n - 1)$ for every integer $n \ge 2$, it follows that the M.S.E. of $\hat\sigma_0^2$ is smaller than
the M.S.E. of $\hat\sigma_1^2$ for all values of $\mu$ and $\sigma^2$.


7. For any possible values $x_1, \ldots, x_n$ of $X_1, \ldots, X_n$, let $y = \sum_{i=1}^n x_i$. Then
\[
E[\delta(X_1, \ldots, X_n)] = \sum \delta(x_1, \ldots, x_n)\, p^y (1 - p)^{n-y},
\]
where the summation extends over all possible values of $x_1, \ldots, x_n$. Since
$p^y(1 - p)^{n-y}$ is a polynomial in p of degree n, it follows that $E[\delta(X_1, \ldots, X_n)]$ is the sum of a finite
number of terms, each of which is equal to a constant $\delta(x_1, \ldots, x_n)$ times a polynomial in p of degree
n. Therefore, $E[\delta(X_1, \ldots, X_n)]$ must itself be a polynomial in p of degree n or less. The degree would
actually be less than n if the sum of the terms of order $p^n$ is 0.

8. If $E[\delta(X)] = p$, then
\[
p = E[\delta(X)] = \sum_{x=0}^{\infty} \delta(x)\, p(1 - p)^x.
\]
Therefore, $\sum_{x=0}^{\infty} \delta(x)(1 - p)^x = 1$. Since this relation must be satisfied for all values of $1 - p$, it follows
that the constant term $\delta(0)$ in the power series must be equal to 1, and the coefficient $\delta(x)$ of $(1 - p)^x$
must be equal to 0 for x = 1, 2, ....


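9. If $E[\delta(X)] = \exp(-2\lambda)$, then
\[
\sum_{x=0}^{\infty} \frac{\delta(x)\exp(-\lambda)\lambda^x}{x!} = \exp(-2\lambda).
\]
Therefore,
\[
\sum_{x=0}^{\infty} \frac{\delta(x)\lambda^x}{x!} = \exp(-\lambda) = \sum_{x=0}^{\infty} \frac{(-1)^x\lambda^x}{x!}.
\]
Therefore, $\delta(x) = (-1)^x$ or, in other words, $\delta(x) = 1$ if x is even and $\delta(x) = -1$ if x is odd.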


10. Let X denote the number of failures that are obtained before k successes have been obtained. Then X
has the negative binomial distribution with parameters k and p, and N = X + k. Therefore, by Eq.
(5.5.1),
\[
\begin{aligned}
E\left(\frac{k-1}{N-1}\right) = E\left(\frac{k-1}{X+k-1}\right)
&= \sum_{x=0}^{\infty} \frac{k-1}{x+k-1}\binom{x+k-1}{x} p^k (1-p)^x \\
&= \sum_{x=0}^{\infty} \frac{(x+k-2)!}{x!\,(k-2)!}\, p^k (1-p)^x \\
&= p \sum_{x=0}^{\infty} \binom{x+k-2}{x} p^{k-1} (1-p)^x.
\end{aligned}
\]
But the final summation is the sum of the probabilities for a negative binomial distribution with
parameters k - 1 and p. Therefore, the value of this summation is 1, and $E([k - 1]/[N - 1]) = p$.
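11. (a) $E(\hat\theta) = \alpha E(\bar{X}_m) + (1 - \alpha)E(\bar{Y}_n) = \alpha\theta + (1 - \alpha)\theta = \theta$. Hence, $\hat\theta$ is an unbiased estimator of $\theta$ for
all values of $\alpha$, m and n.

(b) Since the two samples are taken independently, $\bar{X}_m$ and $\bar{Y}_n$ are independent. Hence,
\[
\operatorname{Var}(\hat\theta) = \alpha^2 \operatorname{Var}(\bar{X}_m) + (1 - \alpha)^2 \operatorname{Var}(\bar{Y}_n) = \alpha^2\left(\frac{\sigma_A^2}{m}\right) + (1 - \alpha)^2\left(\frac{\sigma_B^2}{n}\right).
\]
Since $\sigma_A^2 = 4\sigma_B^2$, it follows that
\[
\operatorname{Var}(\hat\theta) = \left[\frac{4\alpha^2}{m} + \frac{(1 - \alpha)^2}{n}\right]\sigma_B^2.
\]
By differentiating the coefficient of $\sigma_B^2$, it is found that $\operatorname{Var}(\hat\theta)$ is a minimum when $\alpha = m/(m + 4n)$.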
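The minimizing weight in Exercise 11(b) can be confirmed with a simple grid search over $\alpha$. The sketch below is illustrative only; the sample sizes m = 12 and n = 5 are hypothetical, and the function names are ours.

```python
def variance_coefficient(alpha, m, n):
    """Coefficient of sigma_B^2 in Var(theta_hat): 4*alpha^2/m + (1-alpha)^2/n."""
    return 4 * alpha**2 / m + (1 - alpha) ** 2 / n

def best_alpha(m, n):
    """Closed-form minimizer alpha = m / (m + 4n) from the solution."""
    return m / (m + 4 * n)

m, n = 12, 5  # hypothetical sample sizes
a_star = best_alpha(m, n)
grid = [i / 1000 for i in range(1001)]
a_grid = min(grid, key=lambda a: variance_coefficient(a, m, n))
print(a_star, a_grid)
```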


12. (a) Let X denote the value of the characteristic for a person chosen at random from the total
population, and let $A_i$ denote the event that the person belongs to stratum i (i = 1, ..., k).
Then
\[
\mu = E(X) = \sum_{i=1}^k E(X \mid A_i)\Pr(A_i) = \sum_{i=1}^k \mu_i p_i.
\]
Also,
\[
E(\hat\mu) = \sum_{i=1}^k p_i E(\bar{X}_i) = \sum_{i=1}^k p_i\mu_i = \mu.
\]

(b) Since the samples are taken independently of each other, the variables $\bar{X}_1, \ldots, \bar{X}_k$ are independent.
Therefore,
\[
\operatorname{Var}(\hat\mu) = \sum_{i=1}^k p_i^2 \operatorname{Var}(\bar{X}_i) = \sum_{i=1}^k \frac{p_i^2\sigma_i^2}{n_i}.
\]
Hence, the values of $n_1, \ldots, n_k$ must be chosen to minimize $v = \sum_{i=1}^k (p_i\sigma_i)^2/n_i$, subject to the
constraint that $\sum_{i=1}^k n_i = n$. If we let $n_k = n - \sum_{i=1}^{k-1} n_i$, then
\[
\frac{\partial v}{\partial n_i} = -\frac{(p_i\sigma_i)^2}{n_i^2} + \frac{(p_k\sigma_k)^2}{n_k^2} \quad \text{for } i = 1, \ldots, k-1.
\]
When each of these partial derivatives is set equal to 0, it is found that $n_i/(p_i\sigma_i)$ has the same value
for i = 1, ..., k. Therefore, $n_i = c\,p_i\sigma_i$ for some constant c. It follows that $n = \sum_{j=1}^k n_j = c\sum_{j=1}^k p_j\sigma_j$.
Hence, $c = n/\sum_{j=1}^k p_j\sigma_j$ and, in turn,
\[
n_i = \frac{n\,p_i\sigma_i}{\sum_{j=1}^k p_j\sigma_j}.
\]
This analysis ignores the fact that the values of $n_1, \ldots, n_k$ must be integers. The integers $n_1, \ldots, n_k$
for which v is a minimum would presumably be near the minimizing values of $n_1, \ldots, n_k$ which
have just been found.

13. (a) By Theorem 4.7.1,
\[
E(\delta_0) = E[E(\delta \mid T)] = E(\delta).
\]
Therefore, $\delta$ and $\delta_0$ have the same expectation. Since $\delta$ is unbiased, $E(\delta) = \theta$. Hence, $E(\delta_0) = \theta$
also. In other words, $\delta_0$ is also unbiased.

(b) Let $Y = \delta(X)$ and $X = T$ in Theorem 4.7.4. The result there implies that
\[
\operatorname{Var}_\theta(\delta(X)) = \operatorname{Var}_\theta(\delta_0(X)) + E_\theta \operatorname{Var}(\delta(X) \mid T).
\]
Since $\operatorname{Var}(\delta(X) \mid T) \ge 0$, so too is $E_\theta \operatorname{Var}(\delta(X) \mid T)$, so $\operatorname{Var}_\theta(\delta(X)) \ge \operatorname{Var}_\theta(\delta_0(X))$.
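The allocation rule derived in Exercise 12, with stratum sample sizes proportional to $p_i\sigma_i$, can be sketched as follows. The stratum proportions and standard deviations below are hypothetical numbers chosen only for illustration, and the function names are ours. As in the solution, the sketch ignores the integer constraint on the sample sizes.

```python
def optimal_allocation(n, p, sigma):
    """Real-valued sample sizes n_i = n * p_i * sigma_i / sum_j(p_j * sigma_j)."""
    weights = [pi * si for pi, si in zip(p, sigma)]
    total = sum(weights)
    return [n * w / total for w in weights]

def estimator_variance(ns, p, sigma):
    """Var(mu_hat) = sum_i (p_i * sigma_i)^2 / n_i for independent strata."""
    return sum((pi * si) ** 2 / ni for pi, si, ni in zip(p, sigma, ns))

p = [0.5, 0.3, 0.2]      # hypothetical stratum proportions
sigma = [1.0, 2.0, 4.0]  # hypothetical stratum standard deviations
n = 100

ns = optimal_allocation(n, p, sigma)
proportional = [n * pi for pi in p]  # comparison: allocate in proportion to p_i only
print(ns)
print(estimator_variance(ns, p, sigma), estimator_variance(proportional, p, sigma))
```

In this example the optimal allocation gives a smaller variance than allocation proportional to the stratum sizes alone, because it also shifts observations toward the more variable strata.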


262 Chapter 8. Sampling Distributions of Estimators 
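14. For $0 < y < \theta$, the c.d.f. of $Y_n$ is
\[
F(y \mid \theta) = \Pr(Y_n \le y \mid \theta) = \Pr(X_1 \le y, \ldots, X_n \le y \mid \theta) = \left(\frac{y}{\theta}\right)^n.
\]
Therefore, for $0 < y < \theta$, the p.d.f. of $Y_n$ is
\[
f(y \mid \theta) = \frac{d}{dy} F(y \mid \theta) = \frac{n y^{n-1}}{\theta^n}.
\]
It now follows that
\[
E_\theta(Y_n) = \int_0^\theta y\,\frac{n y^{n-1}}{\theta^n}\,dy = \frac{n}{n+1}\,\theta.
\]
Hence, $E_\theta([n + 1]Y_n/n) = \theta$, which means that $(n + 1)Y_n/n$ is an unbiased estimator of $\theta$.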


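15. (a) $f(1 \mid \theta) + f(2 \mid \theta) = \theta^2[\theta + (1 - \theta)] = \theta^2$,
$f(4 \mid \theta) + f(5 \mid \theta) = (1 - \theta)^2[\theta + (1 - \theta)] = (1 - \theta)^2$,
$f(3 \mid \theta) = 2\theta(1 - \theta)$.
The sum of the five probabilities on the left sides of these equations is equal to the sum of the right
sides, which is
\[
\theta^2 + (1 - \theta)^2 + 2\theta(1 - \theta) = [\theta + (1 - \theta)]^2 = 1.
\]
(b) \[
E_\theta[\delta_c(X)] = \sum_{x=1}^5 \delta_c(x) f(x \mid \theta) = 1 \cdot \theta^3 + (2 - 2c)\theta^2(1 - \theta) + (c)2\theta(1 - \theta) + (1 - 2c)\theta(1 - \theta)^2 + 0.
\]
It will be found that the sum of the coefficients of $\theta^3$ is 0, the sum of the coefficients of $\theta^2$ is 0,
the sum of the coefficients of $\theta$ is 1, and the constant term is 0. Hence, $E_\theta[\delta_c(X)] = \theta$.
(c) For every value of c,
\[
\operatorname{Var}_{\theta_0}(\delta_c) = E_{\theta_0}(\delta_c^2) - [E_{\theta_0}(\delta_c)]^2 = E_{\theta_0}(\delta_c^2) - \theta_0^2.
\]
Hence, the value of c for which $\operatorname{Var}_{\theta_0}(\delta_c)$ is a minimum will be the value of c for which $E_{\theta_0}(\delta_c^2)$ is
a minimum. Now
\[
\begin{aligned}
E_{\theta_0}(\delta_c^2) &= (1)^2\theta_0^3 + (2 - 2c)^2\theta_0^2(1 - \theta_0) + (c)^2 2\theta_0(1 - \theta_0) + (1 - 2c)^2\theta_0(1 - \theta_0)^2 + 0 \\
&= 2c^2[2\theta_0^2(1 - \theta_0) + \theta_0(1 - \theta_0) + 2\theta_0(1 - \theta_0)^2] \\
&\quad - 4c[2\theta_0^2(1 - \theta_0) + \theta_0(1 - \theta_0)^2] + \text{terms not involving } c.
\end{aligned}
\]
After further simplification of the coefficients of $c^2$ and c, we obtain the relation
\[
E_{\theta_0}(\delta_c^2) = 6\theta_0(1 - \theta_0)c^2 - 4\theta_0(1 - \theta_0^2)c + \text{terms not involving } c.
\]
By differentiating with respect to c and setting the derivative equal to 0, it is found that the value
of c for which $E_{\theta_0}(\delta_c^2)$ is a minimum is $c = (1 + \theta_0)/3$.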
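The claims in Exercise 15 can be verified with exact rational arithmetic. In the sketch below, the five probabilities and the values of $\delta_c$ are read off from the terms of the expectation in part (b); the value $\theta_0 = 1/4$ and the function names are hypothetical choices made for illustration.

```python
from fractions import Fraction

def pmf(theta):
    """Probabilities f(x | theta) for x = 1..5, as implied by parts (a) and (b)."""
    return {1: theta**3, 2: theta**2 * (1 - theta), 3: 2 * theta * (1 - theta),
            4: theta * (1 - theta) ** 2, 5: (1 - theta) ** 3}

def delta(c):
    """Values of the estimator delta_c(x) for x = 1..5."""
    return {1: 1, 2: 2 - 2 * c, 3: c, 4: 1 - 2 * c, 5: 0}

def mean_and_variance(c, theta):
    f, d = pmf(theta), delta(c)
    mean = sum(d[x] * f[x] for x in f)
    second_moment = sum(d[x] ** 2 * f[x] for x in f)
    return mean, second_moment - mean**2

theta0 = Fraction(1, 4)   # hypothetical parameter value
c_star = (1 + theta0) / 3  # minimizer from part (c)
print(mean_and_variance(c_star, theta0))
```

Because the variance is a quadratic in c with positive leading coefficient, comparing `c_star` against nearby values of c is enough to see the minimum.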


16. The unbiased estimator in Exercise 3 is
\[
\frac{1}{n}\sum_{i=1}^n X_i^2 - \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2.
\]
For the observed values $X_1 = 2$ and $X_2 = -1$, we obtain the value $5/2 - 9/2 = -2$ for the estimate. This is
unacceptable. Because $[E(X)]^2 \ge 0$, we should demand an estimate that is also nonnegative.
8.8 Fisher Information


Commentary 


Although this section is optional, it does contain the interesting theoretical result on asymptotic normality 
of maximum likelihood estimators. It also contains the Cramér-Rao inequality, which can be useful for 
finding minimum variance unbiased estimators. However, the material is really only suitable for a fairly 
mathematically oriented course. 


Solutions to Exercises 


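1.
\[
\begin{aligned}
f(x \mid \mu) &= \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}, \\
f'(x \mid \mu) &= \frac{1}{\sqrt{2\pi}\,\sigma}\cdot\frac{x - \mu}{\sigma^2}\exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\} = \frac{x - \mu}{\sigma^2}\,f(x \mid \mu), \\
f''(x \mid \mu) &= \left[\frac{(x - \mu)^2}{\sigma^4} - \frac{1}{\sigma^2}\right] f(x \mid \mu).
\end{aligned}
\]
Therefore,
\[
\int_{-\infty}^{\infty} f'(x \mid \mu)\,dx = \frac{1}{\sigma^2}\int_{-\infty}^{\infty} (x - \mu) f(x \mid \mu)\,dx = \frac{1}{\sigma^2}E(X - \mu) = 0,
\]
and
\[
\int_{-\infty}^{\infty} f''(x \mid \mu)\,dx = \frac{E[(X - \mu)^2]}{\sigma^4} - \frac{1}{\sigma^2} = \frac{1}{\sigma^2} - \frac{1}{\sigma^2} = 0.
\]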
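2. The p.f. is
\[
f(x \mid p) = p(1 - p)^x, \quad \text{for } x = 0, 1, 2, \ldots.
\]
The logarithm of this is $\log(p) + x\log(1 - p)$, and the derivative is
\[
\frac{1}{p} - \frac{x}{1 - p}.
\]
According to Eq. (5.5.7) in the text, the variance of X is $(1 - p)/p^2$, hence the Fisher information is
\[
I(p) = \operatorname{Var}\left(\frac{1}{p} - \frac{X}{1 - p}\right) = \frac{\operatorname{Var}(X)}{(1 - p)^2} = \frac{1}{p^2(1 - p)}.
\]

3.
\[
\begin{aligned}
f(x \mid \theta) &= \frac{\exp(-\theta)\theta^x}{x!}, \\
\lambda(x \mid \theta) &= -\theta + x\log\theta - \log(x!), \\
\lambda'(x \mid \theta) &= -1 + \frac{x}{\theta}, \\
\lambda''(x \mid \theta) &= -\frac{x}{\theta^2}.
\end{aligned}
\]
Therefore, by Eq. (8.8.3),
\[
I(\theta) = -E_\theta[\lambda''(X \mid \theta)] = \frac{E(X)}{\theta^2} = \frac{1}{\theta}.
\]

4.
\[
\begin{aligned}
f(x \mid \sigma) &= \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{x^2}{2\sigma^2}\right\}, \\
\lambda(x \mid \sigma) &= -\log\sigma - \frac{x^2}{2\sigma^2} + \text{const.}, \\
\lambda'(x \mid \sigma) &= -\frac{1}{\sigma} + \frac{x^2}{\sigma^3}, \\
\lambda''(x \mid \sigma) &= \frac{1}{\sigma^2} - \frac{3x^2}{\sigma^4}.
\end{aligned}
\]
Therefore,
\[
I(\sigma) = -E_\sigma[\lambda''(X \mid \sigma)] = -\frac{1}{\sigma^2} + \frac{3E(X^2)}{\sigma^4} = -\frac{1}{\sigma^2} + \frac{3}{\sigma^2} = \frac{2}{\sigma^2}.
\]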


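5. Let $\nu = \sigma^2$. Then
\[
\begin{aligned}
f(x \mid \nu) &= \frac{1}{\sqrt{2\pi\nu}}\exp\left\{-\frac{x^2}{2\nu}\right\}, \\
\lambda(x \mid \nu) &= -\frac{1}{2}\log\nu - \frac{x^2}{2\nu} + \text{const.}, \\
\lambda'(x \mid \nu) &= -\frac{1}{2\nu} + \frac{x^2}{2\nu^2}, \\
\lambda''(x \mid \nu) &= \frac{1}{2\nu^2} - \frac{x^2}{\nu^3}.
\end{aligned}
\]
Therefore,
\[
I(\sigma^2) = I(\nu) = -E_\nu[\lambda''(X \mid \nu)] = -\frac{1}{2\nu^2} + \frac{\nu}{\nu^3} = \frac{1}{2\nu^2} = \frac{1}{2\sigma^4}.
\]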


6. Let $g(x \mid \mu)$ denote the p.d.f. or the p.f. of X when $\mu$ is regarded as the parameter. Then $g(x \mid \mu) =
f[x \mid \psi(\mu)]$. Therefore,
\[
\log g(x \mid \mu) = \log f[x \mid \psi(\mu)] = \lambda[x \mid \psi(\mu)],
\]
and
\[
\frac{\partial}{\partial\mu}\log g(x \mid \mu) = \lambda'[x \mid \psi(\mu)]\,\psi'(\mu).
\]
It now follows that
\[
I_1(\mu) = E_\mu\left\{\left[\frac{\partial}{\partial\mu}\log g(X \mid \mu)\right]^2\right\} = [\psi'(\mu)]^2 E_\mu\left(\{\lambda'[X \mid \psi(\mu)]\}^2\right) = [\psi'(\mu)]^2 I_0[\psi(\mu)].
\]

7. We know that $E(\bar{X}_n) = p$ and $\operatorname{Var}(\bar{X}_n) = p(1 - p)/n$. It was shown in Example 8.8.2 that $I(p) =
1/[p(1 - p)]$. Therefore, $\operatorname{Var}(\bar{X}_n)$ is equal to the lower bound $1/[nI(p)]$ provided by the information
inequality.

8. We know that $E(\bar{X}_n) = \mu$ and $\operatorname{Var}(\bar{X}_n) = \sigma^2/n$. It was shown in Example 8.8.3 that $I(\mu) = 1/\sigma^2$.
Therefore, $\operatorname{Var}(\bar{X}_n)$ is equal to the lower bound $1/[nI(\mu)]$ provided by the information inequality.

9. We shall attack this exercise by trying to find an estimator of the form $c|X|$ that is unbiased. One
approach is as follows: We know that $X^2/\sigma^2$ has the $\chi^2$ distribution with one degree of freedom.
Therefore, by Exercise 11 of Sec. 8.2, $|X|/\sigma$ has the $\chi$ distribution with one degree of freedom, and it
was shown in that exercise that
\[
E\left(\frac{|X|}{\sigma}\right) = \frac{\sqrt{2}\,\Gamma(1)}{\Gamma(1/2)} = \sqrt{\frac{2}{\pi}}.
\]
Hence, $E(|X|) = \sigma\sqrt{2/\pi}$. It follows that $E(|X|\sqrt{\pi/2}) = \sigma$. Let $\delta = |X|\sqrt{\pi/2}$. Then
\[
E(\delta^2) = \frac{\pi}{2}E(|X|^2) = \frac{\pi}{2}\sigma^2.
\]
Hence,
\[
\operatorname{Var}\delta = E(\delta^2) - [E(\delta)]^2 = \frac{\pi}{2}\sigma^2 - \sigma^2 = \left(\frac{\pi}{2} - 1\right)\sigma^2.
\]
Since $1/I(\sigma) = \sigma^2/2$, it follows that $\operatorname{Var}(\delta) > 1/I(\sigma)$.
Another unbiased estimator is $\delta_1(X) = \sqrt{2\pi}\,X$ if $X \ge 0$ and $\delta_1(X) = 0$ if $X < 0$. However, it can
be shown, using advanced methods, that the estimator $\delta$ found in this exercise is the only unbiased
estimator of $\sigma$ that depends on X only through $|X|$.

10. If $m(\sigma) = \log\sigma$, then $m'(\sigma) = 1/\sigma$ and $[m'(\sigma)]^2 = 1/\sigma^2$. Also, it was shown in Exercise 4 that
$I(\sigma) = 2/\sigma^2$. Therefore, if T is an unbiased estimator of $\log\sigma$, it follows from the relation (8.8.14) that
\[
\operatorname{Var}(T) \ge \frac{1}{\sigma^2}\cdot\frac{\sigma^2}{2n} = \frac{1}{2n}.
\]

11. If $f(x \mid \theta) = a(\theta)b(x)\exp[c(\theta)d(x)]$, then
\[
\lambda(x \mid \theta) = \log a(\theta) + \log b(x) + c(\theta)d(x)
\]
and
\[
\lambda'(x \mid \theta) = \frac{a'(\theta)}{a(\theta)} + c'(\theta)d(x).
\]
Therefore,
\[
\lambda_n'(X \mid \theta) = \sum_{i=1}^n \lambda'(X_i \mid \theta) = n\frac{a'(\theta)}{a(\theta)} + c'(\theta)\sum_{i=1}^n d(X_i).
\]
If we choose
\[
u(\theta) = \frac{1}{c'(\theta)} \quad \text{and} \quad v(\theta) = -\frac{n\,a'(\theta)}{a(\theta)\,c'(\theta)},
\]
then Eq. (8.8.14) will be satisfied with $T = \sum_{i=1}^n d(X_i)$. Hence, this statistic is an efficient estimator of
its expectation.


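12. Let $\theta = \sigma^2$ denote the unknown variance. Then
\[
f(x \mid \theta) = \frac{1}{\sqrt{2\pi\theta}}\exp\left\{-\frac{1}{2\theta}(x - \mu)^2\right\}.
\]
This p.d.f. $f(x \mid \theta)$ has the form of an exponential family, as given in Exercise 11, with $d(x) = (x - \mu)^2$.
Therefore, $T = \sum_{i=1}^n (X_i - \mu)^2$ will be an efficient estimator. Since $E[(X_i - \mu)^2] = \sigma^2$ for i = 1, ..., n,
then $E(T) = n\sigma^2$. Also, by Exercise 17 of Sec. 5.7, $E[(X_i - \mu)^4] = 3\sigma^4$ for i = 1, ..., n. Therefore,
$\operatorname{Var}[(X_i - \mu)^2] = 3\sigma^4 - \sigma^4 = 2\sigma^4$, and it follows that $\operatorname{Var}(T) = 2n\sigma^4$.

It should be emphasized that any linear function of T will also be an efficient estimator. In particular,
$T/n$ will be an efficient estimator of $\theta$.

13. The incorrect part of the argument is at the beginning, because the information inequality cannot be
applied to the uniform distribution. For each different value of $\theta$, there is a different set of values of x
for which $f(x \mid \theta) > 0$.

14.
\[
\begin{aligned}
f(x \mid \alpha) &= \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}\exp(-\beta x), \\
\lambda(x \mid \alpha) &= \alpha\log\beta - \log\Gamma(\alpha) + (\alpha - 1)\log x - \beta x, \\
\lambda'(x \mid \alpha) &= \log\beta - \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} + \log x, \\
\lambda''(x \mid \alpha) &= -\frac{\Gamma(\alpha)\Gamma''(\alpha) - [\Gamma'(\alpha)]^2}{[\Gamma(\alpha)]^2}.
\end{aligned}
\]
Therefore,
\[
I(\alpha) = \frac{\Gamma(\alpha)\Gamma''(\alpha) - [\Gamma'(\alpha)]^2}{[\Gamma(\alpha)]^2}.
\]
The distribution of the M.L.E. of $\alpha$ will be approximately the normal distribution with mean $\alpha$ and
variance $1/[nI(\alpha)]$.

It should be noted that we have determined this distribution without actually determining the M.L.E.
itself.

15. We know that the M.L.E. of $\mu$ is $\hat\mu = \bar{x}_n$ and, from Example 8.8.3, that $I(\mu) = 1/\sigma^2$. The posterior
distribution of $\mu$ will be approximately a normal distribution with mean $\hat\mu$ and variance $1/[nI(\hat\mu)] = \sigma^2/n$.

16. We know that the M.L.E. of p is $\hat{p} = \bar{x}_n$ and, from Example 8.8.2, that $I(p) = 1/[p(1 - p)]$. The
posterior distribution of p will be approximately a normal distribution with mean $\hat{p}$ and variance
$1/[nI(\hat{p})] = \bar{x}_n(1 - \bar{x}_n)/n$.

17. The derivative of the log-likelihood with respect to p is
\[
\lambda'(x \mid p) = \frac{\partial}{\partial p}\left[\log\binom{n}{x} + x\log(p) + (n - x)\log(1 - p)\right] = \frac{x}{p} - \frac{n - x}{1 - p} = \frac{x - np}{p(1 - p)}.
\]
The mean of $\lambda'(X \mid p)$ is clearly 0, so its variance is
\[
I(p) = \frac{\operatorname{Var}(X)}{p^2(1 - p)^2} = \frac{n}{p(1 - p)}.
\]

18. The derivative of the log-likelihood with respect to p is
\[
\lambda'(x \mid p) = \frac{\partial}{\partial p}\left[\log\binom{r + x - 1}{x} + r\log(p) + x\log(1 - p)\right] = \frac{r}{p} - \frac{x}{1 - p} = \frac{r - rp - xp}{p(1 - p)}.
\]
The mean of $\lambda'(X \mid p)$ is clearly 0, so its variance is
\[
I(p) = \frac{p^2\operatorname{Var}(X)}{p^2(1 - p)^2} = \frac{r}{p^2(1 - p)}.
\]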
Solutions to Exercises


1. According to Exercise 5 in Sec. 8.8, the Fisher information $I(\sigma^2)$ based on a sample of size 1 is $1/[2\sigma^4]$.
According to the information inequality, the variance of an unbiased estimator of $\sigma^2$ must be at least
$2\sigma^4/n$. The variance of $V = \sum_{i=1}^n X_i^2/n$ is $\operatorname{Var}(X_1^2)/n$. Since $X_1^2/\sigma^2$ has a $\chi^2$ distribution with 1 degree
of freedom, its variance is 2. Hence $\operatorname{Var}(X_1^2) = 2\sigma^4$ and Var(V) equals the lower bound from the
information inequality. $E(V) = E(X_1^2) = \sigma^2$, so V is unbiased.


2. The t distribution with one degree of freedom is the Cauchy distribution. Therefore, by Exercise 18 of
Sec. 5.6, we can represent the random variable X in the form X = U/V, where U and V are independent
and each has a standard normal distribution. But 1/X can then be represented as 1/X = V/U. Since
V/U is again the ratio of independent, standard normal variables, it follows that 1/X again has the
Cauchy distribution.

3. It is known from Exercise 18 of Sec. 5.6 that U/V has a Cauchy distribution, which is the t distribution
with one degree of freedom. Next, since $|V| = (V^2)^{1/2}$, it follows from Definition 8.4.1 that U/|V| has
the required t distribution. Hence, by the previous exercise in this section, |V|/U will also have this t
distribution. Since U and V are i.i.d., it now follows that |U|/V must have the same distribution as
|V|/U.


4. It is known from Exercise 5 of Sec. 8.3 that $X_1 + X_2$ and $X_1 - X_2$ are independent. Further, if we let
\[
Y_1 = \frac{1}{\sqrt{2}\,\sigma}(X_1 + X_2) \quad \text{and} \quad Y_2 = \frac{1}{\sqrt{2}\,\sigma}(X_1 - X_2),
\]
then $Y_1$ and $Y_2$ have standard normal distributions. It follows, therefore, from Exercise 18 of Sec. 5.6
that $Y_1/Y_2$ has a Cauchy distribution, which is the same as the t distribution with one degree of freedom.
But
\[
\frac{Y_1}{Y_2} = \frac{X_1 + X_2}{X_1 - X_2},
\]
so the desired result has been established. This result could also have been established by a direct
calculation of the required p.d.f.


5. Since $X_i$ has the exponential distribution with parameter $\beta$, it follows that $2\beta X_i$ has the exponential
distribution with parameter 1/2. But this exponential distribution is the $\chi^2$ distribution with 2 degrees
of freedom. Therefore, the sum of the i.i.d. random variables $2\beta X_i$ (i = 1, ..., n) will have a $\chi^2$
distribution with 2n degrees of freedom.

6. Let $\hat\theta_n$ be the proportion of the n observations that lie in the set A. Since each observation has
probability $\theta$ of lying in A, the observations can be thought of as forming n Bernoulli trials, each with
probability $\theta$ of success. Hence, $E(\hat\theta_n) = \theta$ and $\operatorname{Var}(\hat\theta_n) = \theta(1 - \theta)/n$.


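7. (a) $E(\alpha S_X^2 + \beta S_Y^2) = \alpha(m - 1)\sigma^2 + \beta(n - 1)2\sigma^2$.
Hence, this estimator will be unbiased if $\alpha(m - 1) + 2\beta(n - 1) = 1$.
(b) Since $S_X^2$ and $S_Y^2$ are independent,
\[
\begin{aligned}
\operatorname{Var}(\alpha S_X^2 + \beta S_Y^2) &= \alpha^2\operatorname{Var}(S_X^2) + \beta^2\operatorname{Var}(S_Y^2) \\
&= \alpha^2[2(m - 1)\sigma^4] + \beta^2[2(n - 1) \cdot 4\sigma^4] \\
&= 2\sigma^4[(m - 1)\alpha^2 + 4(n - 1)\beta^2].
\end{aligned}
\]
Therefore, we must minimize
\[
A = (m - 1)\alpha^2 + 4(n - 1)\beta^2
\]
subject to the constraint $(m - 1)\alpha + 2(n - 1)\beta = 1$. If we solve this constraint for $\beta$ in terms of $\alpha$,
and make this substitution for $\beta$ in A, we can then minimize A over all values of $\alpha$. The result is
\[
\alpha = \frac{1}{m + n - 2} \quad \text{and, hence,} \quad \beta = \frac{1}{2(m + n - 2)}.
\]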


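8. $X_{n+1} - \bar{X}_n$ has the normal distribution with mean 0 and variance $(1 + 1/n)\sigma^2$. Hence, the distribution
of $(n/[n + 1])^{1/2}(X_{n+1} - \bar{X}_n)/\sigma$ is a standard normal distribution. Also, $nT_n^2/\sigma^2$ has an independent
$\chi^2$ distribution with n - 1 degrees of freedom. Thus, the following ratio will have the t distribution
with n - 1 degrees of freedom:
\[
\frac{\left(\dfrac{n}{n+1}\right)^{1/2}(X_{n+1} - \bar{X}_n)/\sigma}{\left[\dfrac{nT_n^2}{(n-1)\sigma^2}\right]^{1/2}} = \left(\frac{n-1}{n+1}\right)^{1/2}\frac{X_{n+1} - \bar{X}_n}{T_n}.
\]
It can now be seen that $k = ([n - 1]/[n + 1])^{1/2}$.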


9. Under the given conditions, $Y/(2\sigma)$ has a standard normal distribution and $S_n^2/\sigma^2$ has an independent
$\chi^2$ distribution with n - 1 degrees of freedom. Thus, the following random variable will have a t
distribution with n - 1 degrees of freedom:
\[
\frac{Y/(2\sigma)}{\{S_n^2/[\sigma^2(n - 1)]\}^{1/2}} = \frac{Y/2}{\sigma'},
\]
where $\sigma' = [S_n^2/(n - 1)]^{1/2}$.


10. As found in Exercise 3 of Sec. 8.5, the expected squared length of the confidence interval is $E(L^2) =
4c^2\sigma^2/n$, where c is found from the table of the t distribution with n - 1 degrees of freedom in the back
of the book under the .95 column (to give probability .90 between -c and c). We must compute the
value of $4c^2/n$ for various values of n and see when it is less than 1/2. For n = 23, it is found that
$c_{22} = 1.717$ and the coefficient of $\sigma^2$ in $E(L^2)$ is $4(1.717)^2/23 = .512$. For n = 24, $c_{23} = 1.714$ and the
coefficient of $\sigma^2$ is $4(1.714)^2/24 = .490$. Hence, n = 24 is the required value.

11. Let c denote the .99 quantile of the t distribution with n - 1 degrees of freedom; i.e., Pr(U < c) = .99
if U has the specified t distribution. Therefore,
\[
\Pr\left(\frac{n^{1/2}(\bar{X}_n - \mu)}{\sigma'} < c\right) = .99
\]
or, equivalently,
\[
\Pr\left(\mu > \bar{X}_n - \frac{c\sigma'}{n^{1/2}}\right) = .99.
\]
Hence, $L = \bar{X}_n - c\sigma'/n^{1/2}$.

12. Let c denote the .01 quantile of the $\chi^2$ distribution with n - 1 degrees of freedom; i.e., Pr(V < c) = .01
if V has the specified $\chi^2$ distribution. Therefore,
\[
\Pr\left(\frac{S_n^2}{\sigma^2} > c\right) = .99
\]
or, equivalently,
\[
\Pr(\sigma^2 < S_n^2/c) = .99.
\]
Hence, $U = S_n^2/c$.

13. (a) The posterior distribution of $\theta$ is the normal distribution with mean $\mu_1$ and variance $v_1^2$, as given
by (7.3.1) and (7.3.2). Therefore, under this distribution,
\[
\Pr(\mu_1 - 1.96v_1 < \theta < \mu_1 + 1.96v_1) = .95.
\]
This interval I is the shortest one that has the required probability because it is symmetrically
placed around the mean $\mu_1$ of the normal distribution.

(b) It follows from (7.3.1) that $\mu_1 \to \bar{x}_n$ as $v_0^2 \to \infty$ and from (7.3.2) that $v_1^2 \to \sigma^2/n$. Hence, the
interval I converges to the interval
\[
\bar{x}_n - \frac{1.96\sigma}{n^{1/2}} < \theta < \bar{x}_n + \frac{1.96\sigma}{n^{1/2}}.
\]
It was shown in Exercise 4 of Sec. 8.5 that this interval is a confidence interval for $\theta$ with confidence
coefficient .95.

14. (a) Since Y has a Poisson distribution with mean $n\theta$, it follows that
\[
\begin{aligned}
E[\exp(-cY)] &= \sum_{y=0}^{\infty} \frac{\exp(-cy)\exp(-n\theta)(n\theta)^y}{y!} = \exp(-n\theta)\sum_{y=0}^{\infty} \frac{[n\theta\exp(-c)]^y}{y!} \\
&= \exp(-n\theta)\exp[n\theta\exp(-c)] = \exp(n\theta[\exp(-c) - 1]).
\end{aligned}
\]
Since this expectation must be $\exp(-\theta)$, it follows that $n(\exp(-c) - 1) = -1$ or $c = \log[n/(n - 1)]$.
(b) It was shown in Exercise 3 of Sec. 8.8 that $I(\theta) = 1/\theta$ in this problem. Since $m(\theta) = \exp(-\theta)$,
$[m'(\theta)]^2 = \exp(-2\theta)$. Hence, from Eq. (8.8.14),
\[
\operatorname{Var}(\exp(-cY)) \ge \frac{\theta\exp(-2\theta)}{n}.
\]
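The choice $c = \log[n/(n - 1)]$ in Exercise 14 can be checked by summing the Poisson series numerically. The sketch below is illustrative only: the values n = 10 and $\theta = 0.7$ are hypothetical, the function names are ours, and the series is truncated at 120 terms.

```python
from math import exp, log, factorial

def poisson_expectation(g, mean, terms=120):
    """Approximate E[g(Y)] for Y ~ Poisson(mean) by truncating the series."""
    return sum(g(y) * exp(-mean) * mean**y / factorial(y) for y in range(terms))

def estimator_mean_and_variance(n, theta):
    """Mean and variance of exp(-cY) with c = log(n/(n-1)) and Y ~ Poisson(n*theta)."""
    c = log(n / (n - 1))
    first = poisson_expectation(lambda y: exp(-c * y), n * theta)
    second = poisson_expectation(lambda y: exp(-2 * c * y), n * theta)
    return first, second - first**2

n, theta = 10, 0.7  # hypothetical sample size and parameter value
mean, var = estimator_mean_and_variance(n, theta)
bound = theta * exp(-2 * theta) / n  # lower bound from part (b)
print(mean, exp(-theta))
print(var, bound)
```

In this example the variance exceeds the bound from part (b), so the unbiased estimator does not attain the information bound.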
15. In the notation of Sec. 8.8,
\[
\begin{aligned}
\lambda(x \mid \theta) &= \log\theta + (\theta - 1)\log x, \\
\lambda'(x \mid \theta) &= \frac{1}{\theta} + \log x, \\
\lambda''(x \mid \theta) &= -\frac{1}{\theta^2}.
\end{aligned}
\]
Hence, by Eq. (8.8.3), $I(\theta) = 1/\theta^2$ and it follows that the asymptotic distribution of
\[
\frac{n^{1/2}}{\theta}(\hat\theta_n - \theta)
\]
is standard normal.

16.
\[
\begin{aligned}
f(x \mid \theta) &= \theta^{-1}\exp(-x/\theta), \\
\lambda(x \mid \theta) &= -\log\theta - x/\theta, \\
\lambda'(x \mid \theta) &= -\frac{1}{\theta} + \frac{x}{\theta^2}, \\
\lambda''(x \mid \theta) &= \frac{1}{\theta^2} - \frac{2x}{\theta^3}.
\end{aligned}
\]
Therefore,
\[
I(\theta) = -E_\theta[\lambda''(X \mid \theta)] = \frac{1}{\theta^2}.
\]

17. If $m(p) = (1 - p)^2$, then $m'(p) = -2(1 - p)$ and $[m'(p)]^2 = 4(1 - p)^2$. It was shown in Example 8.8.2
that $I(p) = 1/[p(1 - p)]$. Therefore, if T is an unbiased estimator of m(p), it follows from the relation
(8.8.14) that
\[
\operatorname{Var}(T) \ge \frac{4(1 - p)^2\,p(1 - p)}{n} = \frac{4p(1 - p)^3}{n}.
\]

18. $f(x \mid \beta) = \beta\exp(-\beta x)$. This p.d.f. has the form of an exponential family, as given in Exercise 11 of
Sec. 8.8, with d(x) = x. Therefore, $T = \sum_{i=1}^n X_i$ will be an efficient estimator. We know that $E(X_i) = 1/\beta$
and $\operatorname{Var}(X_i) = 1/\beta^2$. Hence, $E(T) = n/\beta$ and $\operatorname{Var}(T) = n/\beta^2$.

Since any linear function of T will also be an efficient estimator, it follows that $\bar{X}_n = T/n$ will be an
efficient estimator of $1/\beta$. As a check of this result, it can be verified directly that $\operatorname{Var}(\bar{X}_n) = 1/[n\beta^2] =
[m'(\beta)]^2/[nI(\beta)]$, where $m(\beta) = 1/\beta$ and $I(\beta)$ was obtained in Example 8.8.6.

19. It was shown in Example 8.8.6 that $I(\beta) = 1/\beta^2$. The distribution of the M.L.E. of $\beta$ will be
approximately the normal distribution with mean $\beta$ and variance $1/[nI(\beta)]$.

20. (a) Let $\alpha(\beta) = 1/\beta$. Then $\alpha'(\beta) = -1/\beta^2$. By Exercise 19, it is known that $\hat\beta_n$ is approximately
normal with mean $\beta$ and variance $\beta^2/n$. Therefore, $1/\hat\beta_n$ will be approximately normal with mean
$1/\beta$ and variance $[\alpha'(\beta)]^2(\beta^2/n) = 1/(n\beta^2)$. Equivalently, the asymptotic distribution of
\[
(n\beta^2)^{1/2}(1/\hat\beta_n - 1/\beta)
\]
is standard normal.
(b) Since the mean of the exponential distribution is $1/\beta$ and the variance is $1/\beta^2$, it follows directly
from the central limit theorem that the asymptotic distribution of $\bar{X}_n = 1/\hat\beta_n$ is exactly that found
in part (a).


21. (a) The distribution of Y is the Poisson distribution with mean $n\theta$. In order for r(Y) to be an unbiased
estimator of $1/\theta$, we need
\[
\frac{1}{\theta} = E_\theta(r(Y)) = \sum_{y=0}^{\infty} r(y)\exp(-n\theta)\frac{(n\theta)^y}{y!}.
\]
This equation can be rewritten as
\[
\exp(n\theta) = \sum_{y=0}^{\infty} \frac{r(y)\,n^y}{y!}\,\theta^{y+1}. \tag{S.8.8}
\]
The function on the left side of (S.8.8) has a unique power series representation, hence the right
side of (S.8.8) must equal that power series. However, the limit as $\theta \to 0$ of the left side of (S.8.8)
is 1, while the limit of the right side is 0, hence the power series on the right cannot represent the
function on the left.

(b) $E(n/[Y + 1]) = \sum_{y=0}^{\infty} n\exp(-n\theta)[n\theta]^y/(y + 1)!$. By letting u = y + 1 in this sum, we get
$n[1 - \exp(-n\theta)]/[n\theta] = 1/\theta - \exp(-n\theta)/\theta$. So the bias is $-\exp(-n\theta)/\theta$. Clearly $\exp(-n\theta)$ goes to 0 as
$n \to \infty$.

(c) $n/(1 + Y) = 1/(\bar{X}_n + 1/n)$. We know that $\bar{X}_n + 1/n$ has approximately the normal distribution
with mean $\theta + 1/n$ and variance $\theta/n$. We can ignore the 1/n added to $\theta$ in the mean since this
will eventually be small relative to $\theta$. Using the delta method, we find that $1/(\bar{X}_n + 1/n)$ has
approximately the normal distribution with mean $1/\theta$ and variance $(1/\theta^2)^2\theta/n = (n\theta^3)^{-1}$.


22. (a) The p.d.f. of $Y_n$ is
\[
f(y \mid \theta) = \begin{cases} n y^{n-1}/\theta^n & \text{if } 0 \le y \le \theta, \\ 0 & \text{otherwise.} \end{cases}
\]
This can be found using the method of Example 3.9.6. If $X = Y_n/\theta$, then the p.d.f. of X is
\[
g(x \mid \theta) = f(x\theta \mid \theta)\,\theta = \begin{cases} n x^{n-1} & \text{if } 0 \le x \le 1, \\ 0 & \text{otherwise.} \end{cases}
\]
Notice that this does not depend on $\theta$. The c.d.f. is then $G(x) = x^n$ for 0 < x < 1. The quantile
function is $G^{-1}(p) = p^{1/n}$.

(b) The bias of $Y_n$ as an estimator of $\theta$ is
\[
E_\theta(Y_n) - \theta = \int_0^\theta y\,\frac{n y^{n-1}}{\theta^n}\,dy - \theta = -\frac{\theta}{n + 1}.
\]

(c) The distribution of $Z = Y_n/\theta$ has p.d.f.
\[
g(z) = \theta f(z\theta \mid \theta) = \begin{cases} n z^{n-1} & \text{for } 0 < z < 1, \\ 0 & \text{otherwise,} \end{cases}
\]
where $f(\cdot \mid \theta)$ comes from part (a). One can see that g(z) does not depend on $\theta$, hence the
distribution of Z is the same for all $\theta$.

(d) We would like to find two random variables $A(Y_n)$ and $B(Y_n)$ such that
\[
\Pr(A(Y_n) \le \theta \le B(Y_n)) = \gamma, \quad \text{for all } \theta. \tag{S.8.9}
\]
This can be arranged by using the fact that $Y_n/\theta$ has the c.d.f. $G(x) = x^n$ for all $\theta$. This means
that
\[
\Pr\left(a \le \frac{Y_n}{\theta} \le b\right) = b^n - a^n,
\]
for all $\theta$. Let a and b be constants such that $b^n - a^n = \gamma$ (e.g., $b = ([1 + \gamma]/2)^{1/n}$ and $a =
([1 - \gamma]/2)^{1/n}$). Then set $A(Y_n) = Y_n/b$ and $B(Y_n) = Y_n/a$. It follows that (S.8.9) holds.
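The construction in part (d) is easy to turn into a small routine. The sketch below uses the equal-tail choice of a and b suggested in the solution; the function names and the sample values (n = 5, $\gamma$ = 0.9, observed maximum 3.0) are hypothetical.

```python
def interval_constants(n, gamma):
    """Equal-tail constants a < b with b**n - a**n == gamma."""
    b = ((1 + gamma) / 2) ** (1 / n)
    a = ((1 - gamma) / 2) ** (1 / n)
    return a, b

def confidence_interval(y_max, n, gamma):
    """Interval (Y_n / b, Y_n / a) for theta, given the sample maximum y_max."""
    a, b = interval_constants(n, gamma)
    return y_max / b, y_max / a

a, b = interval_constants(5, 0.9)
lower, upper = confidence_interval(3.0, 5, 0.9)
print(a, b, lower, upper)
```

Since b is at most 1, the lower endpoint is never below the observed maximum, as it should be for a parameter that bounds the data from above.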


Chapter 9 


Testing Hypotheses 


9.1 Problems of Testing Hypotheses 


Commentary 


This section was augmented in the fourth edition. It now includes a general introduction to likelihood ratio
tests and some foundational discussion of the terminology of hypothesis testing. After covering this section,
one could skip directly to Sec. 9.5 and discuss the t test without using any of the material in Secs. 9.2-9.4.
Indeed, unless your course is a rigorous mathematical statistics course, it might be highly advisable to skip
ahead.


Solutions to Exercises 


1. (a) Let $\delta$ be the test that rejects $H_0$ when $X \ge 1$. The power function of $\delta$ is
\[
\pi(\beta \mid \delta) = \Pr(X \ge 1 \mid \beta) = \exp(-\beta),
\]
for $\beta > 0$.
(b) The size of the test $\delta$ is $\sup_{\beta \ge 1}\pi(\beta \mid \delta)$. Using the answer to part (a), we see that $\pi(\beta \mid \delta)$ is a
decreasing function of $\beta$, hence the size of the test is $\pi(1 \mid \delta) = \exp(-1)$.


2. (a) We know that if $0 < y < \theta$, then $\Pr(Y_n \le y) = (y/\theta)^n$. Also, if $y \ge \theta$, then $\Pr(Y_n \le y) = 1$.
Therefore, if $\theta \le 1.5$, then $\pi(\theta) = \Pr(Y_n \le 1.5) = 1$. If $\theta > 1.5$, then $\pi(\theta) = \Pr(Y_n \le 1.5) =
(1.5/\theta)^n$.

(b) The size of the test is
\[
\alpha = \sup_{\theta \ge 2}\pi(\theta) = \sup_{\theta \ge 2}\left(\frac{1.5}{\theta}\right)^n = \left(\frac{1.5}{2}\right)^n = \left(\frac{3}{4}\right)^n.
\]

3. (a) For any given value of p, $\pi(p) = \Pr(Y \ge 7) + \Pr(Y \le 1)$, where Y has a binomial distribution with
parameters n = 20 and p. For p = 0, $\Pr(Y \ge 7) = 0$ and $\Pr(Y \le 1) = 1$. Therefore, $\pi(0) = 1$. For
p = 0.1, it is found from the table of the binomial distribution that
\[
\Pr(Y \ge 7) = .0020 + .0003 + .0001 + .0000 = .0024
\]
and $\Pr(Y \le 1) = .1216 + .2701 = .3917$. Hence, $\pi(0.1) = 0.3941$. Similarly, for p = 0.2, it is found
that
\[
\Pr(Y \ge 7) = .0545 + .0222 + .0074 + .0020 + .0005 + .0001 = .0867
\]
and $\Pr(Y \le 1) = .0115 + .0576 = .0691$. Hence, $\pi(0.2) = 0.1558$. By continuing to use the tables
in this way, we can find the values of $\pi(0.3)$, $\pi(0.4)$, and $\pi(0.5)$. For p = 0.6, we must use the
fact that if Y has a binomial distribution with parameters 20 and 0.6, then Z = 20 - Y has a
binomial distribution with parameters 20 and 0.4. Also, $\Pr(Y \ge 7) = \Pr(Z \le 13)$ and $\Pr(Y \le
1) = \Pr(Z \ge 19)$. It is found from the tables that $\Pr(Z \le 13) = .9935$ and $\Pr(Z \ge 19) = .0000$.
Hence, $\pi(0.6) = .9935$. Similarly, if p = 0.7, then Z = 20 - Y will have a binomial distribution with
parameters 20 and 0.3. In this case it is found that $\Pr(Z \le 13) = .9998$ and $\Pr(Z \ge 19) = .0000$.
Hence, $\pi(0.7) = 0.9998$. By continuing in this way, the values of $\pi(0.8)$, $\pi(0.9)$, and $\pi(1.0) = 1$
can be obtained.

(b) Since $H_0$ is a simple hypothesis, the size $\alpha$ of the test is just the value of the power function at
the point specified by $H_0$. Thus, $\alpha = \pi(0.2) = 0.1558$.
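Instead of reading the binomial tables, the power function in Exercise 3 can be evaluated directly. The following is a minimal sketch using only the standard library; the function names are ours, and the computed values may differ from the tabulated ones in the fourth decimal place because the table entries are rounded.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability Pr(Y = k) for parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def power(p, n=20):
    """Power function of the test that rejects H0 when Y >= 7 or Y <= 1."""
    return sum(binom_pmf(k, n, p) for k in range(n + 1) if k >= 7 or k <= 1)

for p in [0.0, 0.1, 0.2, 0.6, 0.7, 1.0]:
    print(p, round(power(p), 4))
```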


4. The null hypothesis $H_0$ is simple. Therefore, the size $\alpha$ of the test is $\alpha = \Pr(\text{Rejecting } H_0 \mid \mu = \mu_0)$.
When $\mu = \mu_0$, the random variable $Z = n^{1/2}(\bar{X}_n - \mu_0)$ will have the standard normal distribution.
Hence, since n = 25,
\[
\alpha = \Pr(|\bar{X}_n - \mu_0| \ge c) = \Pr(|Z| \ge 5c) = 2[1 - \Phi(5c)].
\]
Thus, $\alpha = 0.05$ if and only if $\Phi(5c) = 0.975$. It is found from a table of the standard normal distribution
that 5c = 1.96 and c = 0.392.


5. A hypothesis is simple if and only if it specifies a single value of both $\mu$ and $\sigma$. Therefore, only the
hypothesis in (a) is simple. All the others are composite. In particular, although the hypothesis in (d)
specifies the value of $\mu$, it leaves the value of $\sigma$ arbitrary.

6. If $H_0$ is true, then X will surely be smaller than 3.5. If $H_1$ is true, then X will surely be greater than
3.5. Therefore, the test procedure which rejects $H_0$ if and only if X > 3.5 will have probability 0 of
leading to a wrong decision, no matter what the true value of $\theta$ is.


7. Let C be the critical region of $Y_n$ values for the test $\delta$, and let $C^*$ be the critical region for $\delta^*$. It is
easy to see that $C^* \subset C$. Hence
\[
\pi(\theta \mid \delta) - \pi(\theta \mid \delta^*) = \Pr(Y_n \in C \cap (C^*)^c \mid \theta).
\]
Here $C \cap (C^*)^c = [4, 4.5]$, so
\[
\pi(\theta \mid \delta) - \pi(\theta \mid \delta^*) = \Pr(4 \le Y_n \le 4.5 \mid \theta). \tag{S.9.1}
\]
(a) For $\theta \le 4$, $\Pr(4 \le Y_n \mid \theta) = 0$, so the two power functions must be equal by (S.9.1).
(b) For $\theta > 4$,
\[
\Pr(4 \le Y_n \le 4.5 \mid \theta) = \frac{(\min\{\theta, 4.5\})^n - 4^n}{\theta^n} > 0.
\]
Hence, $\pi(\theta \mid \delta) > \pi(\theta \mid \delta^*)$ by (S.9.1).

(c) The only places where the power functions differ are for $\theta > 4$. Since these values are all in $\Omega_1$,
it is better for a test to have higher power function for these values. Since $\delta$ has higher power
function than $\delta^*$ for all of these values, $\delta$ is the better test.

8. (a) The distribution of Z given $\mu$ is the normal distribution with mean $n^{1/2}(\mu - \mu_0)$ and variance 1.
We can write
\[
\Pr(Z \ge c \mid \mu) = 1 - \Phi(c - n^{1/2}[\mu - \mu_0]) = \Phi(n^{1/2}\mu - n^{1/2}\mu_0 - c).
\]
Since $\Phi$ is an increasing function and $n^{1/2}\mu - n^{1/2}\mu_0 - c$ is an increasing function of $\mu$, the power
function is an increasing function of $\mu$.

(b) The size of the test will be the power function at $\mu = \mu_0$, since $\mu_0$ is the largest value in $\Omega_0$ and
the power function is increasing. Hence, the size is $\Phi(-c)$. If we set this equal to $\alpha_0$, we can solve
for $c = -\Phi^{-1}(\alpha_0)$.


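9. A sensible test would be to reject $H_0$ if $\bar{X}_n < c'$. So, let $T = \mu_0 - \bar{X}_n$. Then the power function of the
test $\delta$ that rejects $H_0$ when $T \ge c$ is
\[
\begin{aligned}
\pi(\mu \mid \delta) &= \Pr(T \ge c \mid \mu) \\
&= \Pr(\bar{X}_n \le \mu_0 - c \mid \mu) \\
&= \Phi(\sqrt{n}[\mu_0 - c - \mu]).
\end{aligned}
\]
Since $\Phi$ is an increasing function and $\sqrt{n}[\mu_0 - c - \mu]$ is a decreasing function of $\mu$, it follows that
$\Phi(\sqrt{n}[\mu_0 - c - \mu])$ is a decreasing function of $\mu$.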


10. When $Z = z$ is observed, the p-value is $\Pr(Z \ge z \mid \mu_0) = \Phi(n^{1/2}[\mu_0 - z])$.


11. (a) For $c_1 \ge 2$, $\Pr(Y \le c_1 \mid p = 0.4) \ge 0.23$, hence $c_1 \le 1$. Also, for $c_2 \le 5$, $\Pr(Y \ge c_2 \mid p = 0.4) \ge 0.26$, hence $c_2 \ge 6$. Here are some values of the desired probability for various $(c_1, c_2)$ pairs.

  c1   c2   Pr(Y <= c1 | p = 0.4) + Pr(Y >= c2 | p = 0.4)
   1    6   0.1699
   1    7   0.0956
   0    6   0.1094
  -1    6   0.0994

So, the closest we can get to 0.1 without going over is 0.0994, which is achieved when $c_1 < 0$ and $c_2 = 6$.

(b) The size of the test is 0.0994, as we calculated in part (a).

(c) The power function is plotted in Fig. S.9.1. Notice that the power function is too low for values of $p < 0.4$. This is due to the fact that the test only rejects $H_0$ when $Y \ge 6$. A better test might be one with $c_1 = 1$ and $c_2 = 7$. Even though the size is slightly smaller (as is the power for $p > 0.4$), its power is much greater for $p < 0.4$.
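The table in part (a) can be checked numerically. The sketch below assumes that $Y$ has the binomial distribution with parameters 9 and $p$; the value $n = 9$ is not stated in this solution and is inferred from the quoted probabilities, and the function names are illustrative only.

```python
from math import comb

N_TRIALS = 9  # assumption: inferred from the quoted probabilities, not stated here


def binom_pmf(k, n, p):
    """Pr(Y = k) for Y ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def rejection_prob(c1, c2, p, n=N_TRIALS):
    """Pr(Y <= c1 | p) + Pr(Y >= c2 | p); a negative c1 gives an empty lower tail."""
    lower = sum(binom_pmf(k, n, p) for k in range(0, c1 + 1))
    upper = sum(binom_pmf(k, n, p) for k in range(c2, n + 1))
    return lower + upper


for c1, c2 in [(1, 6), (1, 7), (0, 6), (-1, 6)]:
    print(c1, c2, round(rejection_prob(c1, c2, 0.4), 4))
```

Evaluating `rejection_prob(-1, 6, p)` and `rejection_prob(1, 7, p)` over a grid of `p` values gives the two power functions compared in part (c).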


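12. (a) The power function of $\delta_c$ is
$$\pi(\theta \mid \delta_c) = \Pr(X \ge c \mid \theta) = \int_c^\infty \frac{dx}{\pi[1 + (x - \theta)^2]} = \frac{1}{\pi}\left[\frac{\pi}{2} - \arctan(c - \theta)\right].$$
Since arctan is an increasing function and $c - \theta$ is a decreasing function of $\theta$, the power function is increasing in $\theta$.

(b) To make the size of the test 0.05, we need to solve
$$0.05 = \frac{1}{\pi}\left[\frac{\pi}{2} - \arctan(c - \theta_0)\right]$$
for $c$. We get
$$c = \theta_0 + \tan(0.45\pi) = \theta_0 + 6.314.$$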
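As a quick numerical check of part (b), the following sketch evaluates the Cauchy tail probability and the critical value. The function names are illustrative, and $\theta_0 = 0$ is used only to display the offset $\tan(0.45\pi)$.

```python
from math import atan, pi, tan


def cauchy_upper_tail(c, theta):
    """Pr(X >= c | theta) for the Cauchy density centered at theta."""
    return (pi / 2.0 - atan(c - theta)) / pi


def critical_value(theta0, size=0.05):
    """Solve size = Pr(X >= c | theta0) for c."""
    return theta0 + tan((0.5 - size) * pi)


offset = critical_value(0.0)  # theta0 = 0 is an arbitrary illustrative choice
print(round(offset, 3), cauchy_upper_tail(offset, 0.0))
```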
Figure S.9.1: Power function of test in Exercise 11c of Sec. 9.1.


(c) The p-value when $X = x$ is observed is
$$\Pr(X \ge x \mid \theta = \theta_0) = \frac{1}{\pi}\left[\frac{\pi}{2} - \arctan(x - \theta_0)\right].$$


13. For $c = 3$, $\Pr(X \ge c \mid \theta = 1) = 0.0803$, while for $c = 2$, the probability is 0.2642. Hence, we must use $c = 3$.


14. (a) The distribution of $X$ is a gamma distribution with parameters $n$ and $\theta$, and $Y = X\theta$ has a gamma distribution with parameters $n$ and 1. Let $G_n$ be the c.d.f. of the gamma distribution with parameters $n$ and 1. The power function of $\delta_c$ is then
$$\pi(\theta \mid \delta_c) = \Pr(X \ge c \mid \theta) = \Pr(Y \ge c\theta \mid \theta) = 1 - G_n(c\theta).$$
Since $1 - G_n$ is a decreasing function and $c\theta$ is an increasing function of $\theta$, $1 - G_n(c\theta)$ is a decreasing function of $\theta$.

(b) We need $1 - G_n(c\theta_0) = \alpha_0$. This means that $c = G_n^{-1}(1 - \alpha_0)/\theta_0$.

(c) With $\alpha_0 = 0.1$, $n = 1$ and $\theta_0 = 2$, we find that $G_1(y) = 1 - \exp(-y)$ and $G_1^{-1}(p) = -\log(1 - p)$. So, $c = -\log(0.1)/2 = 1.151$. The power function is plotted in Fig. S.9.2.


15. The p-value when $X = x$ is observed is the size of the test that rejects $H_0$ when $X \ge x$, namely
$$\Pr(X \ge x \mid \theta = 1) = \begin{cases} 0 & \text{if } x \ge 1, \\ 1 - x & \text{if } 0 < x < 1. \end{cases}$$


16. The confidence interval is $(s^2/c_2, s^2/c_1)$, where $s^2 = \sum_{i=1}^n (x_i - \bar{x}_n)^2$ and $c_1, c_2$ are the $(1 - \gamma)/2$ and $(1 + \gamma)/2$ quantiles of the $\chi^2$ distribution with $n - 1$ degrees of freedom. We create the test $\delta_c$ of $H_0 : \sigma^2 = c$ by rejecting $H_0$ if $c$ is not in the interval. Let $T(x) = s^2$ and notice that $c$ is outside of the interval if and only if $T(x)$ is not in the interval $(c_1 c, c_2 c)$.


Figure S.9.2: Power function of test in Exercise 14c of Sec. 9.1.


17. We need $q(y)$ to have the property that $\Pr(q(Y) < p \mid p) \ge \gamma$ for all $p$. We shall prove that $q(y)$ equal to the smallest $p_0$ such that $\Pr(Y \ge y \mid p = p_0) \ge 1 - \gamma$ satisfies this property. For each $p$, let $A_p = \{y : q(y) < p\}$. We need to show that $\Pr(Y \in A_p \mid p) \ge \gamma$. First, notice that $q(y)$ is an increasing function of $y$. This means that for each $p$ there is $y_p$ such that $A_p = \{0, \ldots, y_p\}$. So, we need to show that $\Pr(Y \le y_p \mid p) \ge \gamma$ for all $p$. Equivalently, we need to show that $\Pr(Y > y_p \mid p) \le 1 - \gamma$. Notice that $y_p$ is the largest value of $y$ such that $q(y) < p$. That is, $y_p$ is the largest value of $y$ such that there exists $p_0 < p$ with $\Pr(Y \ge y \mid p_0) \ge 1 - \gamma$. For each $y$, $\Pr(Y > y \mid p)$ is a continuous nondecreasing function of $p$. If $\Pr(Y > y_p \mid p) > 1 - \gamma$, then there exists $p_0 < p$ such that
$$1 - \gamma < \Pr(Y > y_p \mid p_0) = \Pr(Y \ge y_p + 1 \mid p_0).$$
This contradicts the fact that $y_p$ is the largest $y$ such that there is $p_0 < p$ with $\Pr(Y \ge y \mid p_0) \ge 1 - \gamma$. Hence $\Pr(Y > y_p \mid p) \le 1 - \gamma$ and the proof is complete.


18. Our tests are all of the form "Reject $H_0$ if $T \ge c$." Let $\delta_c$ be this test, and define
$$\alpha(c) = \sup_{\theta \in \Omega_0} \Pr(T \ge c \mid \theta),$$
the size of the test $\delta_c$. Then $\delta_c$ has level of significance $\alpha_0$ if and only if $\alpha(c) \le \alpha_0$. Notice that $\alpha(c)$ is a decreasing function of $c$. When $T = t$ is observed, we reject $H_0$ at level of significance $\alpha_0$ using $\delta_c$ if and only if $t \ge c$, which is equivalent to $\alpha(t) \le \alpha_0$. Hence $\alpha(t)$ is the smallest level of significance at which we can reject $H_0$ if $T = t$ is observed. Notice that $\alpha(t)$ is the expression in Eq. (9.1.12).


19. We want our test to reject $H_0$ if $\bar{X}_n \le Y$, where $Y$ might be a random variable. We can write this as not rejecting $H_0$ if $\bar{X}_n > Y$. We want $\bar{X}_n > Y$ to be equivalent to $\mu_0$ being inside of our interval. We need the test to have level $\alpha_0$, so
$$\Pr(\bar{X}_n \le Y \mid \mu = \mu_0, \sigma^2) = \alpha_0 \tag{S.9.2}$$
is necessary. We know that $n^{1/2}(\bar{X}_n - \mu_0)/\sigma'$ has the $t$ distribution with $n - 1$ degrees of freedom if $\mu = \mu_0$, hence Eq. (S.9.2) will hold if $Y = \mu_0 - n^{-1/2}\sigma' T_{n-1}^{-1}(1 - \alpha_0)$. Now, $\bar{X}_n > Y$ if and only if $\mu_0 < \bar{X}_n + n^{-1/2}\sigma' T_{n-1}^{-1}(1 - \alpha_0)$. This is equivalent to $\mu_0$ in our interval if our interval is
$$\left(-\infty,\ \bar{X}_n + n^{-1/2}\sigma' T_{n-1}^{-1}(1 - \alpha_0)\right).$$
20. Let $\theta_0 \in \Omega$, and let $g_0 = g(\theta_0)$. By construction, $g(\theta_0) \in \omega(X)$ if and only if $\delta_{g_0}$ does not reject $H_{0,g_0} : g(\theta) \le g_0$. Given $\theta = \theta_0$, the probability that $\delta_{g_0}$ does not reject $H_{0,g_0}$ is at least $\gamma$ because the null hypothesis is true and the level of the test is $\alpha_0 = 1 - \gamma$. Hence, (9.1.15) holds.


21. Let $U = n^{1/2}(\bar{X}_n - \mu_0)/\sigma'$.

(a) We reject the null hypothesis in (9.1.22) if and only if
$$U \ge T_{n-1}^{-1}(1 - \alpha_0). \tag{S.9.3}$$
We reject the null hypothesis in (9.1.27) if and only if
$$U \le -T_{n-1}^{-1}(1 - \alpha_0). \tag{S.9.4}$$
With $\alpha_0 < 0.5$, $T_{n-1}^{-1}(1 - \alpha_0) > 0$. So, (S.9.3) requires $U > 0$ while (S.9.4) requires $U < 0$. These cannot both occur.

(b) Both (S.9.3) and (S.9.4) fail if and only if $U$ is strictly between $-T_{n-1}^{-1}(1 - \alpha_0)$ and $T_{n-1}^{-1}(1 - \alpha_0)$. This can happen if $\bar{X}_n$ is sufficiently close to $\mu_0$. This has probability $1 - 2\alpha_0 > 0$.

(c) If $\alpha_0 > 0.5$, then $T_{n-1}^{-1}(1 - \alpha_0) < 0$, and both null hypotheses would be rejected if $U$ is between the numbers $T_{n-1}^{-1}(1 - \alpha_0) < 0$ and $-T_{n-1}^{-1}(1 - \alpha_0) > 0$. This has probability $2\alpha_0 - 1 > 0$.


9.2 Testing Simple Hypotheses 


Commentary 


This section, and the two following, contain some traditional optimality results concerning tests of hypotheses 
about one-dimensional parameters. In this section, we present the Neyman-Pearson lemma which gives 
optimal tests for simple null hypotheses against simple alternative hypotheses. It is recommended that one 
skip this section, and the two that follow, unless one is teaching a rigorous mathematical statistics course. 
This section ends with a brief discussion of randomized tests. Randomized tests are mainly of theoretical 
interest. They only show up in one additional place in the text, namely the proof of Theorem 9.3.1. 


Solutions to Exercises 


1. According to Theorem 9.2.1, we should reject $H_0$ if $f_1(x) > f_0(x)$, not reject $H_0$ if $f_1(x) < f_0(x)$ and do whatever we wish if $f_1(x) = f_0(x)$. Here
$$f_0(x) = \begin{cases} 0.3 & \text{if } x = 1, \\ 0.7 & \text{if } x = 0, \end{cases} \qquad f_1(x) = \begin{cases} 0.6 & \text{if } x = 1, \\ 0.4 & \text{if } x = 0. \end{cases}$$
We have $f_1(x) > f_0(x)$ if $x = 1$ and $f_1(x) < f_0(x)$ if $x = 0$. We never have $f_1(x) = f_0(x)$. So, the test is to reject $H_0$ if $X = 1$ and not reject $H_0$ if $X = 0$.


2. (a) Theorem 9.2.1 can be applied with $a = 1$ and $b = 2$. Therefore, $H_0$ should not be rejected if $f_1(x)/f_0(x) < 1/2$. Since $f_1(x)/f_0(x) = 2x$, the procedure is to not reject $H_0$ if $x < 1/4$ and to reject $H_0$ if $x > 1/4$.

(b) For this procedure,
$$\alpha(\delta) = \Pr(\text{Rej. } H_0 \mid f_0) = \int_{1/4}^{1} f_0(x)\,dx = \frac{3}{4}$$
and
$$\beta(\delta) = \Pr(\text{Acc. } H_0 \mid f_1) = \int_0^{1/4} 2x\,dx = \frac{1}{16}.$$
Therefore, $\alpha(\delta) + 2\beta(\delta) = 7/8$.


(a) Theorem 9.2.1 can be applied with $a = 3$ and $b = 1$. Therefore, $H_0$ should not be rejected if $f_1(x)/f_0(x) = 2x < 3$. Since all possible values of $X$ lie in the interval (0, 1), and since $2x < 3$ for all values in this interval, the optimal procedure is to not reject $H_0$ for every possible observed value.

(b) Since $H_0$ is never rejected, $\alpha(\delta) = 0$ and $\beta(\delta) = 1$. Therefore, $3\alpha(\delta) + \beta(\delta) = 1$.


(a) By the Neyman-Pearson lemma, $H_0$ should be rejected if $f_1(x)/f_0(x) = 2x > k$, where $k$ is chosen so that $\Pr(2X > k \mid f_0) = 0.1$. For $0 < k < 2$,
$$\Pr(2X > k \mid f_0) = \Pr\left(X > \frac{k}{2} \,\Big|\, f_0\right) = 1 - \frac{k}{2}.$$
If this value is to be equal to 0.1, then $k = 1.8$. Therefore, the optimal procedure is to reject $H_0$ if $2x > 1.8$ or, equivalently, if $x > 0.9$.

(b) For this procedure,
$$\beta(\delta) = \Pr(\text{Acc. } H_0 \mid f_1) = \int_0^{0.9} f_1(x)\,dx = 0.81.$$


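(a) The conditions here are different from those of the Neyman-Pearson lemma. Rather than fixing the value of $\alpha(\delta)$ and minimizing $\beta(\delta)$, we must here fix the value of $\beta(\delta)$ and minimize $\alpha(\delta)$. Nevertheless, the same proof as that given for the Neyman-Pearson lemma shows that the optimal procedure is again to reject $H_0$ if $f_1(X)/f_0(X) > k$, where $k$ is now chosen so that
$$\beta(\delta) = \Pr(\text{Acc. } H_0 \mid H_1) = \Pr\left[\frac{f_1(X)}{f_0(X)} < k \,\Big|\, H_1\right] = 0.05.$$
In this exercise,
$$f_0(X) = \frac{1}{(2\pi)^{n/2}} \exp\left[-\frac{1}{2}\sum_{i=1}^n (x_i - 3.5)^2\right]$$
and
$$f_1(X) = \frac{1}{(2\pi)^{n/2}} \exp\left[-\frac{1}{2}\sum_{i=1}^n (x_i - 5.0)^2\right].$$
Therefore
$$\log\frac{f_1(X)}{f_0(X)} = \frac{1}{2}\left[\sum_{i=1}^n (x_i - 3.5)^2 - \sum_{i=1}^n (x_i - 5.0)^2\right]
= \frac{1}{2}\left[\left(\sum_{i=1}^n x_i^2 - 7\sum_{i=1}^n x_i + 12.25n\right) - \left(\sum_{i=1}^n x_i^2 - 10\sum_{i=1}^n x_i + 25n\right)\right]
= \frac{3}{2} n \bar{x}_n - (\text{const.}).$$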
It follows that the likelihood ratio $f_1(X)/f_0(X)$ will be greater than some specified constant $k$ if and only if $\bar{x}_n$ is greater than some other constant $k'$. Therefore, the optimal procedure is to reject $H_0$ if $\bar{x}_n > k'$, where $k'$ is chosen so that
$$\Pr(\bar{X}_n < k' \mid H_1) = 0.05.$$
We shall now determine the value of $k'$. If $H_1$ is true, then $\bar{X}_n$ will have a normal distribution with mean 5.0 and variance $1/n$. Therefore, $Z = \sqrt{n}(\bar{X}_n - 5.0)$ will have the standard normal distribution, and it follows that
$$\Pr(\bar{X}_n < k' \mid H_1) = \Pr[Z < \sqrt{n}(k' - 5.0)] = \Phi[\sqrt{n}(k' - 5.0)].$$
If this probability is to be equal to 0.05, then it can be found from a table of values of $\Phi$ that $\sqrt{n}(k' - 5.0) = -1.645$. Hence, $k' = 5.0 - 1.645n^{-1/2}$.

(b) For $n = 4$, the test procedure is to reject $H_0$ if $\bar{X}_n > 5.0 - 1.645/2 = 4.1775$. Therefore,
$$\alpha(\delta) = \Pr(\text{Rej. } H_0 \mid H_0) = \Pr(\bar{X}_n > 4.1775 \mid H_0).$$
When $H_0$ is true, $\bar{X}_n$ has a normal distribution with mean 3.5 and variance $1/n = 1/4$. Therefore, $Z = 2(\bar{X}_n - 3.5)$ will have the standard normal distribution, and
$$\alpha(\delta) = \Pr[Z > 2(4.1775 - 3.5)] = \Pr(Z > 1.355) = 1 - \Phi(1.355) = 0.0877.$$
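The two numerical steps above (the cutoff $k'$ and the resulting $\alpha(\delta)$) can be reproduced with a short script. This is a minimal sketch using the error function from the standard library for $\Phi$; the function names are illustrative, and the table value 1.645 is taken from the solution rather than recomputed.

```python
from math import erf, sqrt


def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def critical_value(n, mu1=5.0, z=1.645):
    """k' = mu1 - z / sqrt(n), so that Pr(Xbar_n < k' | H1) is about 0.05."""
    return mu1 - z / sqrt(n)


def type_one_error(n, mu0=3.5):
    """alpha(delta) = Pr(Xbar_n > k' | H0) when the variance is 1."""
    k_prime = critical_value(n)
    return 1.0 - Phi(sqrt(n) * (k_prime - mu0))


print(critical_value(4), round(type_one_error(4), 4))
```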
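6. Theorem 9.2.1 can be applied with $a = b = 1$. Therefore, $H_0$ should be rejected if $f_1(X)/f_0(X) > 1$. If we let $y = \sum_{i=1}^n x_i$, then
$$f_1(X) = p_1^y (1 - p_1)^{n-y}$$
and
$$f_0(X) = p_0^y (1 - p_0)^{n-y}.$$
Hence,
$$\frac{f_1(X)}{f_0(X)} = \left[\frac{p_1(1 - p_0)}{p_0(1 - p_1)}\right]^y \left(\frac{1 - p_1}{1 - p_0}\right)^n.$$
But $f_1(X)/f_0(X) > 1$ if and only if $\log[f_1(X)/f_0(X)] > 0$, and this inequality will be satisfied if and only if
$$y \log\left[\frac{p_1(1 - p_0)}{p_0(1 - p_1)}\right] + n \log\left(\frac{1 - p_1}{1 - p_0}\right) > 0.$$
Since $p_1 < p_0$ and $1 - p_0 < 1 - p_1$, the first logarithm on the left side of this relation is negative. Finally, if we let $\bar{x}_n = y/n$, then this relation can be rewritten as follows:
$$\bar{x}_n < \frac{\log\left(\dfrac{1 - p_1}{1 - p_0}\right)}{\log\left[\dfrac{p_0(1 - p_1)}{p_1(1 - p_0)}\right]}.$$
The optimal procedure is to reject $H_0$ when this inequality is satisfied.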
7. (a) By the Neyman-Pearson lemma, $H_0$ should be rejected if $f_1(X)/f_0(X) > k$. Here,
$$f_0(X) = \frac{1}{(2\pi)^{n/2} 2^{n/2}} \exp\left[-\frac{1}{4}\sum_{i=1}^n (x_i - \mu)^2\right]$$
and
$$f_1(X) = \frac{1}{(2\pi)^{n/2} 3^{n/2}} \exp\left[-\frac{1}{6}\sum_{i=1}^n (x_i - \mu)^2\right].$$
Therefore,
$$\log\frac{f_1(X)}{f_0(X)} = \frac{1}{12}\sum_{i=1}^n (x_i - \mu)^2 + (\text{const.}).$$
It follows that the likelihood ratio will be greater than a specified constant $k$ if and only if $\sum_{i=1}^n (x_i - \mu)^2$ is greater than some other constant $c$. The constant $c$ is to be chosen so that
$$\Pr\left[\sum_{i=1}^n (X_i - \mu)^2 > c \,\Big|\, H_0\right] = 0.05.$$
The value of $c$ can be determined as follows. When $H_0$ is true, $W = \sum_{i=1}^n (X_i - \mu)^2/2$ will have the $\chi^2$ distribution with $n$ degrees of freedom. Therefore,
$$\Pr\left[\sum_{i=1}^n (X_i - \mu)^2 > c \,\Big|\, H_0\right] = \Pr\left(W > \frac{c}{2}\right).$$
If this probability is to be equal to 0.05, then the value of $c/2$ can be determined from a table of the $\chi^2$ distribution.

(b) For $n = 8$, it is found from a table of the $\chi^2$ distribution with 8 degrees of freedom that $c/2 = 15.51$ and $c = 31.02$.
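The table lookup in part (b) can be replaced by a small computation. The sketch below is not the method used in the text: it relies on the closed-form upper tail of a $\chi^2$ distribution with an even number of degrees of freedom and finds the cutoff by bisection. All function names are illustrative.

```python
from math import exp


def chi2_sf_even(x, df):
    """Pr(W > x) for chi-square W with an even number of degrees of freedom.

    Uses Pr(W > x) = exp(-x/2) * sum_{j < df/2} (x/2)^j / j!.
    """
    if df % 2 != 0:
        raise ValueError("this closed form needs an even number of degrees of freedom")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return exp(-half) * total


def chi2_upper_quantile(alpha, df, lo=0.0, hi=200.0):
    """Bisection for the x that satisfies Pr(W > x) = alpha."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_sf_even(mid, df) > alpha:
            lo = mid  # tail still too large, move right
        else:
            hi = mid
    return (lo + hi) / 2.0


half_c = chi2_upper_quantile(0.05, 8)
print(round(half_c, 2))
```

The computed value agrees with the table entry 15.51 to two decimals; doubling the unrounded value gives a $c$ that can differ from 31.02 in the last digit because the text doubles the rounded table entry.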


8. (a) The p.d.f.'s $f_0(x)$ and $f_1(x)$ are as sketched in Fig. S.9.3. Under $H_0$ it is impossible to obtain a value of $X$ greater than 1, but such values are possible under $H_1$. Therefore, if a test procedure rejects $H_0$ only if $x > 1$, then it is impossible to make an error of type 1, and $\alpha(\delta) = 0$. Also,
$$\beta(\delta) = \Pr(X \le 1 \mid H_1) = \frac{1}{2}.$$

Figure S.9.3: Figure for Exercise 8a of Sec. 9.2.

(b) To have $\alpha(\delta) = 0$, we can include in the critical region only a set of points having probability 0 under $H_0$. Therefore, only points $x > 1$ can be considered. To minimize $\beta(\delta)$ we should choose this set to have maximum probability under $H_1$. Therefore, all points $x > 1$ should be used in the critical region.


9. As in Exercise 8, we should reject $H_0$ if at least one of the $n$ observations is greater than 1. For this test, $\alpha(\delta) = 0$ and
$$\beta(\delta) = \Pr(\text{Acc. } H_0 \mid H_1) = \Pr(X_1 \le 1, \ldots, X_n \le 1 \mid H_1) = \left(\frac{1}{2}\right)^n.$$


10. (a) and (b). Theorem 9.2.1 can be applied with $a = b = 1$. The optimal procedure is to reject $H_0$ if $f_1(X)/f_0(X) > 1$. If we let $y = \sum_{i=1}^n x_i$, then for $i = 0, 1$,
$$f_i(X) = \frac{\exp(-n\lambda_i)\lambda_i^y}{\prod_{j=1}^n (x_j!)}.$$
Therefore,
$$\log\frac{f_1(X)}{f_0(X)} = y \log\left(\frac{\lambda_1}{\lambda_0}\right) - n(\lambda_1 - \lambda_0).$$
Since $\lambda_1 > \lambda_0$, it follows that $f_1(X)/f_0(X) > 1$ if and only if $\bar{x}_n = y/n > (\lambda_1 - \lambda_0)/(\log\lambda_1 - \log\lambda_0)$.

(c) If $H_i$ is true, then $Y$ will have a Poisson distribution with mean $n\lambda_i$. For $\lambda_0 = 1/4$, $\lambda_1 = 1/2$, and $n = 20$,
$$\frac{n(\lambda_1 - \lambda_0)}{\log\lambda_1 - \log\lambda_0} = \frac{20(0.25)}{0.69314} = 7.214.$$
Therefore, it is found from a table of the Poisson distribution with mean $20(1/4) = 5$ that
$$\alpha(\delta) = \Pr(Y > 7.214 \mid H_0) = \Pr(Y \ge 8 \mid H_0) = 0.1333.$$
Also, it is found from a table with mean $20(1/2) = 10$ that
$$\beta(\delta) = \Pr(Y \le 7.214 \mid H_1) = \Pr(Y \le 7 \mid H_1) = 0.2203.$$
Therefore, $\alpha(\delta) + \beta(\delta) = 0.3536$.
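The threshold and the two error probabilities in part (c) can be recomputed directly instead of read from tables. The following is a minimal sketch with illustrative names; because the text sums rounded table entries, the computed values may differ from the quoted ones in the fourth decimal place.

```python
from math import exp, floor, log


def poisson_cdf(k, mean):
    """Pr(Y <= k) for Y ~ Poisson(mean); returns 0 for negative k."""
    if k < 0:
        return 0.0
    term = exp(-mean)
    total = term
    for j in range(1, k + 1):
        term *= mean / j
        total += term
    return total


lam0, lam1, n = 0.25, 0.5, 20
threshold = n * (lam1 - lam0) / (log(lam1) - log(lam0))
k = floor(threshold)  # largest count that does not lead to rejection

alpha = 1.0 - poisson_cdf(k, n * lam0)  # Pr(Y >= k + 1 | H0)
beta = poisson_cdf(k, n * lam1)         # Pr(Y <= k | H1)
print(round(threshold, 3), round(alpha, 4), round(beta, 4))
```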


11. Theorem 9.2.1 can be applied with $a = b = 1$. The optimal procedure is to reject $H_0$ if $f_1(X)/f_0(X) > 1$. Here,
$$f_0(X) = \frac{1}{(2\pi)^{n/2} 2^n} \exp\left[-\frac{1}{8}\sum_{i=1}^n (x_i + 1)^2\right]$$
and
$$f_1(X) = \frac{1}{(2\pi)^{n/2} 2^n} \exp\left[-\frac{1}{8}\sum_{i=1}^n (x_i - 1)^2\right].$$
After some algebraic reduction, it can be shown that $f_1(X)/f_0(X) > 1$ if and only if $\bar{x}_n > 0$. If $H_0$ is true, $\bar{X}_n$ will have the normal distribution with mean $-1$ and variance $4/n$. Therefore, $Z = \sqrt{n}(\bar{X}_n + 1)/2$ will have the standard normal distribution, and
$$\alpha(\delta) = \Pr(\bar{X}_n > 0 \mid H_0) = \Pr\left(Z > \tfrac{1}{2}\sqrt{n}\right) = 1 - \Phi\left(\tfrac{1}{2}\sqrt{n}\right).$$
Similarly, if $H_1$ is true, $\bar{X}_n$ will have the normal distribution with mean 1 and variance $4/n$. Therefore, $Z' = \sqrt{n}(\bar{X}_n - 1)/2$ will have the standard normal distribution, and
$$\beta(\delta) = \Pr(\bar{X}_n < 0 \mid H_1) = \Pr\left(Z' < -\tfrac{1}{2}\sqrt{n}\right) = 1 - \Phi\left(\tfrac{1}{2}\sqrt{n}\right).$$
Hence, $\alpha(\delta) + \beta(\delta) = 2[1 - \Phi(\sqrt{n}/2)]$. We can now use a program that computes $\Phi$ to obtain the following results:

(a) If $n = 1$, $\alpha(\delta) + \beta(\delta) = 2(0.3085) = 0.6170$.
(b) If $n = 4$, $\alpha(\delta) + \beta(\delta) = 2(0.1587) = 0.3173$.
(c) If $n = 16$, $\alpha(\delta) + \beta(\delta) = 2(0.0228) = 0.0455$.
(d) If $n = 36$, $\alpha(\delta) + \beta(\delta) = 2(0.0013) = 0.0027$.

Slight discrepancies appear above due to rounding after multiplying by 2 rather than before.


12. In the notation of this section, $f_i(x) = \theta_i^n \exp\left(-\theta_i \sum_{j=1}^n x_j\right)$ for $i = 0, 1$. The desired test has the following form: reject $H_0$ if $f_1(x)/f_0(x) > k$, where $k$ is chosen so that the probability of rejecting $H_0$ is $\alpha_0$ given $\theta = \theta_0$. The ratio of $f_1$ to $f_0$ is
$$\frac{f_1(x)}{f_0(x)} = \frac{\theta_1^n}{\theta_0^n} \exp\left([\theta_0 - \theta_1] \sum_{i=1}^n x_i\right).$$
Since $\theta_0 < \theta_1$, the above ratio will be greater than $k$ if and only if $\sum_{i=1}^n x_i$ is less than some other constant, $c$. That $c$ is chosen so that $\Pr\left(\sum_{i=1}^n X_i < c \mid \theta = \theta_0\right) = \alpha_0$. The distribution of $\sum_{i=1}^n X_i$ given $\theta = \theta_0$ is the gamma distribution with parameters $n$ and $\theta_0$. Hence, $c$ must be the $\alpha_0$ quantile of that distribution.


13. (a) The test rejects $H_0$ if $f_0(X) < f_1(X)$. In this case, $f_0(x) = \exp(-[x_1 + x_2]/2)/4$, and $f_1(x) = 4/(2 + x_1 + x_2)^3$ for both $x_1 > 0$ and $x_2 > 0$. Let $T = X_1 + X_2$. Then we reject $H_0$ if
$$\exp(-T/2)/4 < 4/(2 + T)^3. \tag{S.9.5}$$

(b) If $X_1 = 4$ and $X_2 = 3$ are observed, then $T = 7$. The inequality in (S.9.5) is $\exp(-7/2)/4 < 4/9^3$ or $0.007549 < 0.00549$, which is false, so we do not reject $H_0$.

(c) If $H_0$ is true, then $T$ is the sum of two independent exponential random variables with parameter 1/2. Hence, it has the gamma distribution with parameters 2 and 1/2 by Theorem 5.7.7.

(d) The test is to reject $H_0$ if $f_1(X)/f_0(X) > c$, where $c$ is chosen so that the probability is 0.01 that we reject $H_0$ given $\theta = \theta_0$. We can write
$$\frac{f_1(X)}{f_0(X)} = \frac{16\exp(T/2)}{(2 + T)^3}. \tag{S.9.6}$$
The function on the right side of (S.9.6) takes the value 2 at $T = 0$, decreases to the value 0.5473 at $T = 4$, and increases for $T > 4$. Let $G$ be the c.d.f. of the gamma distribution with parameters 2 and 1/2 (also the $\chi^2$ distribution with 4 degrees of freedom). The level 0.01 test will reject $H_0$ if $T < c_1$ or $T > c_2$, where $c_1$ and $c_2$ satisfy $G(c_1) + 1 - G(c_2) = 0.01$, and either $16\exp(c_1/2)/(2 + c_1)^3 = 16\exp(c_2/2)/(2 + c_2)^3$ or $c_1 = 0$ and $16\exp(c_2/2)/(2 + c_2)^3 > 2$. It follows that $1 - G(c_2) \le 0.01$, that is, $c_2 \ge G^{-1}(0.99) = 13.28$. But
$$16\exp(13.28/2)/(2 + 13.28)^3 = 3.4 > 2.$$
It follows that $c_1 = 0$ and the test is to reject $H_0$ if $T > 13.28$.

(e) If $X_1 = 4$ and $X_2 = 3$, then $T = 7$ and we do not reject $H_0$.
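The numbers quoted in the solution to Exercise 13 of this section (the likelihood ratio $16\exp(T/2)/(2+T)^3$ at $T = 0$, 4, 7, and 13.28, and the tail probability of the gamma distribution with parameters 2 and 1/2) can be checked with a few lines. This is a sketch with illustrative names; it uses the closed form $G(t) = 1 - \exp(-t/2)(1 + t/2)$ for that c.d.f.

```python
from math import exp


def likelihood_ratio(t):
    """f1/f0 as a function of T = X1 + X2: 16 exp(t/2) / (2 + t)^3."""
    return 16.0 * exp(t / 2.0) / (2.0 + t) ** 3


def gamma_2_half_cdf(t):
    """C.d.f. of the gamma distribution with parameters 2 and 1/2 (chi-square, 4 d.f.)."""
    return 1.0 - exp(-t / 2.0) * (1.0 + t / 2.0)


for t in (0.0, 4.0, 7.0, 13.28):
    print(t, round(likelihood_ratio(t), 4), round(1.0 - gamma_2_half_cdf(t), 4))
```

At $T = 7$ the ratio is below 1, which matches part (b), and $7 < 13.28$ matches part (e).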




9.3 Uniformly Most Powerful Tests


Commentary 


This section introduces the concept of monotone likelihood ratio, which is used to provide conditions under 
which uniformly most powerful tests exist for one-sided hypotheses. One may safely skip this section if one 
is not teaching a rigorous mathematical statistics course. One step in the proof of Theorem 9.3.1 relies on 
randomized tests (Sec. 9.2), which the instructor might have skipped earlier. 


Solutions to Exercises 


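1. Let $y = \sum_{i=1}^n x_i$. Then the joint p.f. is
$$f_n(X \mid \lambda) = \frac{\exp(-n\lambda)\lambda^y}{\prod_{i=1}^n (x_i!)}.$$
Therefore, for $0 < \lambda_1 < \lambda_2$,
$$\frac{f_n(X \mid \lambda_2)}{f_n(X \mid \lambda_1)} = \exp(-n(\lambda_2 - \lambda_1)) \left(\frac{\lambda_2}{\lambda_1}\right)^y,$$
which is an increasing function of $y$.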


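2. Let $y = \sum_{i=1}^n (x_i - \mu)^2$. Then the joint p.d.f. is
$$f_n(X \mid \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{y}{2\sigma^2}\right).$$
Therefore, for $0 < \sigma_1^2 < \sigma_2^2$,
$$\frac{f_n(X \mid \sigma_2^2)}{f_n(X \mid \sigma_1^2)} = \frac{\sigma_1^n}{\sigma_2^n} \exp\left\{\frac{1}{2}\left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\right) y\right\},$$
which is an increasing function of $y$.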


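3. Let $y = \prod_{i=1}^n x_i$ and let $z = \sum_{i=1}^n x_i$. Then the joint p.d.f. is
$$f_n(X \mid \alpha) = \frac{\beta^{n\alpha}}{[\Gamma(\alpha)]^n}\, y^{\alpha - 1} \exp(-\beta z).$$
Therefore, for $0 < \alpha_1 < \alpha_2$,
$$\frac{f_n(X \mid \alpha_2)}{f_n(X \mid \alpha_1)} = (\text{const.})\, y^{\alpha_2 - \alpha_1},$$
which is an increasing function of $y$.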
4. The joint p.d.f. $f_n(X \mid \beta)$ in this exercise is the same as the joint p.d.f. $f_n(X \mid \alpha)$ given in Exercise 3, except that the value of $\beta$ is now unknown and the value of $\alpha$ is known. Since $z = n\bar{x}_n$, it follows that for $0 < \beta_1 < \beta_2$,
$$\frac{f_n(X \mid \beta_2)}{f_n(X \mid \beta_1)} = (\text{const.}) \exp([\beta_1 - \beta_2] n \bar{x}_n).$$
The expression on the right side of this relation is a decreasing function of $\bar{x}_n$, because $\beta_1 - \beta_2 < 0$. Therefore, this expression is an increasing function of $-\bar{x}_n$.


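5. Let $y = \sum_{i=1}^n d(x_i)$. Then the joint p.d.f. or the joint p.f. is
$$f_n(X \mid \theta) = [a(\theta)]^n \left[\prod_{i=1}^n b(x_i)\right] \exp[c(\theta) y].$$
Therefore, for $\theta_1 < \theta_2$,
$$\frac{f_n(X \mid \theta_2)}{f_n(X \mid \theta_1)} = \left[\frac{a(\theta_2)}{a(\theta_1)}\right]^n \exp\{[c(\theta_2) - c(\theta_1)] y\}.$$
Since $c(\theta_2) - c(\theta_1) > 0$, this expression is an increasing function of $y$.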


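6. Let $\theta_1 < \theta_2$. The range of possible values of $r(X) = \max\{X_1, \ldots, X_n\}$ is the interval $[0, \theta_2]$ when comparing $\theta_1$ and $\theta_2$. The likelihood ratio for values of $r(x)$ in this interval is
$$\begin{cases} \theta_1^n/\theta_2^n & \text{if } 0 \le r(x) \le \theta_1, \\ \infty & \text{if } \theta_1 < r(x) \le \theta_2. \end{cases}$$
This is monotone increasing, even though it takes only two values. It does take the larger value $\infty$ when $r(x)$ is large and it takes the smaller value $\theta_1^n/\theta_2^n$ when $r(x)$ is small.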


7. No matter what the true value of $\theta$ is, the probability that $H_0$ will be rejected is 0.05. Therefore, the value of the power function at every value of $\theta$ is 0.05.


8. We know from Exercise 2 that the joint p.d.f. of $X_1, \ldots, X_n$ has a monotone likelihood ratio in the statistic $\sum_{i=1}^n X_i^2$. Therefore, by Theorem 9.3.1, a test which rejects $H_0$ when $\sum_{i=1}^n X_i^2 \ge c$ will be a UMP test. To achieve a specified level of significance $\alpha_0$, the constant $c$ should be chosen so that $\Pr\left(\sum_{i=1}^n X_i^2 \ge c \mid \sigma^2 = 2\right) = \alpha_0$. Since $\sum_{i=1}^n X_i^2$ has a continuous distribution and not a discrete distribution, there will be a value of $c$ which satisfies this equation for any specified value of $\alpha_0$ $(0 < \alpha_0 < 1)$.

9. The first part of this exercise was answered in Exercise 8. When $n = 10$ and $\sigma^2 = 2$, the distribution of $Y = \sum_{i=1}^n X_i^2/2$ will be the $\chi^2$ distribution with 10 degrees of freedom, and it is found from a table of this distribution that $\Pr(Y \ge 18.31) = 0.05$. Also,
$$\Pr\left(\sum_{i=1}^n X_i^2 \ge c \,\Big|\, \sigma^2 = 2\right) = \Pr\left(Y \ge \frac{c}{2}\right).$$
Therefore, if this probability is to be equal to 0.05, then $c/2 = 18.31$ or $c = 36.62$.
Let $Y = \sum_{i=1}^n X_i$. As in Example 9.3.7, a test which specifies rejecting $H_0$ if $Y \ge c$ is a UMP test. When $n = 20$ and $p = 1/2$, it is found from the table of the binomial distribution given at the end of the book that
$$\Pr(Y \ge 14) = .0370 + .0148 + .0046 + .0011 + .0002 = .0577.$$
Therefore, the level of significance of the UMP test which rejects $H_0$ when $Y \ge 14$ will be 0.0577. Similarly, the UMP test that rejects $H_0$ when $Y \ge 15$ has level of significance 0.0207.
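The two levels of significance quoted above are upper-tail probabilities of the binomial distribution with parameters 20 and 1/2, so they can be confirmed without a table. A minimal sketch, with an illustrative function name:

```python
from math import comb


def binom_upper_tail(c, n=20, p=0.5):
    """Pr(Y >= c) for Y ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


print(round(binom_upper_tail(14), 4), round(binom_upper_tail(15), 4))
```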


It is known from Exercise 1 that the joint p.f. of $X_1, \ldots, X_n$ has a monotone likelihood ratio in the statistic $Y = \sum_{i=1}^n X_i$. Therefore, by Theorem 9.3.1, a test which rejects $H_0$ when $Y \ge c$ will be a UMP test. When $\lambda = 1$ and $n = 10$, $Y$ will have a Poisson distribution with mean 10, and it is found from the table of the Poisson distribution given at the end of this book that
$$\Pr(Y \ge 18) = .0071 + .0037 + .0019 + .0009 + .0004 + .0002 + .0001 = .0143.$$
Therefore, the level of significance of the UMP test which rejects $H_0$ when $Y \ge 18$ will be 0.0143.


Change the parameter from $\theta$ to $\zeta = -\theta$. In terms of the new parameter $\zeta$, the hypotheses to be tested are:
$$H_0 : \zeta \le -\theta_0, \qquad H_1 : \zeta > -\theta_0.$$
Let $g_n(X \mid \zeta) = f_n(X \mid -\zeta)$ denote the joint p.d.f. or the joint p.f. of $X_1, \ldots, X_n$ when $\zeta$ is regarded as the parameter. If $\zeta_1 < \zeta_2$, then $\theta_1 = -\zeta_1 > -\zeta_2 = \theta_2$. Therefore, the ratio $g_n(X \mid \zeta_2)/g_n(X \mid \zeta_1)$ will be a decreasing function of $r(X)$. It follows that this ratio will be an increasing function of the statistic $s(X) = -r(X)$.

Thus, in terms of $\zeta$, the hypotheses have the same form as the hypotheses (9.3.8) and $g_n(X \mid \zeta)$ has a monotone likelihood ratio in the statistic $s(X)$. Therefore, by Theorem 9.3.1, a test which rejects $H_0$ when $s(X) \ge c'$, for some constant $c'$, will be a UMP test. But $s(X) \ge c'$ if and only if $T = r(X) \le c$, where $c = -c'$. Therefore, the test which rejects $H_0$ when $T \le c$ will be a UMP test. If $c$ is chosen to satisfy the relation given in the exercise, then it follows from Theorem 9.3.1 that the level of significance of this test will be $\alpha_0$.


(a) By Exercise 12, the test which rejects $H_0$ when $\bar{X}_n \le c$ will be a UMP test. For the level of significance to be 0.1, $c$ should be chosen so that $\Pr(\bar{X}_n \le c \mid \mu = 10) = 0.1$. In this exercise, $n = 4$. When $\mu = 10$, the random variable $Z = 2(\bar{X}_n - 10)$ has a standard normal distribution and $\Pr(\bar{X}_n \le c \mid \mu = 10) = \Pr[Z \le 2(c - 10)]$. It is found from a table of the standard normal distribution that $\Pr(Z \le -1.282) = 0.1$. Therefore, $2(c - 10) = -1.282$ or $c = 9.359$.

(b) When $\mu = 9$, the random variable $2(\bar{X}_n - 9)$ has the standard normal distribution. Therefore, the power of the test is
$$\Pr(\bar{X}_n \le 9.359 \mid \mu = 9) = \Pr(Z \le 0.718) = \Phi(0.718) = 0.7636,$$
where we have interpolated in the table of the normal distribution between 0.71 and 0.72.

(c) When $\mu = 11$, the random variable $Z = 2(\bar{X}_n - 11)$ has the standard normal distribution. Therefore, the probability of not rejecting $H_0$ is
$$\Pr(\bar{X}_n \ge 9.359 \mid \mu = 11) = \Pr(Z \ge -3.282) = \Pr(Z \le 3.282) = \Phi(3.282) = 0.9995.$$


14. By Exercise 12, a test which rejects $H_0$ when $\sum_{i=1}^n X_i \le c$ will be a UMP test. When $n = 10$ and $\lambda = 1$, $\sum_{i=1}^n X_i$ will have a Poisson distribution with mean 10 and $\alpha_0 = \Pr\left(\sum_{i=1}^n X_i \le c \mid \lambda = 1\right)$. From a table of the Poisson distribution, the following values of $\alpha_0$ are obtained.

  c = 0, $\alpha_0$ = .0000;
  c = 1, $\alpha_0$ = .0000 + .0005 = .0005;
  c = 2, $\alpha_0$ = .0000 + .0005 + .0023 = .0028;
  c = 3, $\alpha_0$ = .0000 + .0005 + .0023 + .0076 = .0104;
  c = 4, $\alpha_0$ = .0000 + .0005 + .0023 + .0076 + .0189 = .0293.

For larger values of $c$, $\alpha_0 > 0.03$.


15. By Exercise 4, the joint p.d.f. of $X_1, \ldots, X_n$ has a monotone likelihood ratio in the statistic $-\bar{X}_n$. Therefore, by Exercise 12, a test which rejects $H_0$ when $-\bar{X}_n \le c'$, for some constant $c'$, will be a UMP test. But this test is equivalent to a test which rejects $H_0$ when $\bar{X}_n \ge c$, where $c = -c'$. Since $\bar{X}_n$ has a continuous distribution, for any specified value of $\alpha_0$ $(0 < \alpha_0 < 1)$ there exists a value of $c$ such that $\Pr(\bar{X}_n \ge c \mid \beta = 1/2) = \alpha_0$.


16. We must find a constant $c$ such that when $n = 10$, $\Pr(\bar{X}_n \ge c \mid \beta = 1/2) = 0.05$. When $\beta = 1/2$, each observation $X_i$ has an exponential distribution with $\beta = 1/2$, which is a gamma distribution with parameters $\alpha = 1$ and $\beta = 1/2$. Therefore, $\sum_{i=1}^n X_i$ has a gamma distribution with parameters $\alpha = n = 10$ and $\beta = 1/2$, which is a $\chi^2$ distribution with $2n = 20$ degrees of freedom. But
$$\Pr\left(\bar{X}_n \ge c \,\Big|\, \beta = \frac{1}{2}\right) = \Pr\left(\sum_{i=1}^n X_i \ge 10c \,\Big|\, \beta = \frac{1}{2}\right).$$
It is found from a table of the $\chi^2$ distribution with 20 degrees of freedom that $\Pr\left(\sum_{i=1}^n X_i \ge 31.41\right) = 0.05$. Therefore, $10c = 31.41$ and $c = 3.141$.


17. In this exercise, $H_0$ is a simple hypothesis. By the Neyman-Pearson lemma, the test which has maximum power at a particular alternative value $\theta_1 > 0$ will reject $H_0$ if $f(x \mid \theta = \theta_1)/f(x \mid \theta = 0) > c$, where $c$ is chosen so that the probability that this inequality will be satisfied when $\theta = 0$ is $\alpha_0$. Here,
$$\frac{f(x \mid \theta = \theta_1)}{f(x \mid \theta = 0)} > c$$
if and only if $(1 - c)x^2 + 2c\theta_1 x > c\theta_1^2 - (1 - c)$. For each value of $\theta_1$, the value of $c$ is to be chosen so that the set of points satisfying this inequality has probability $\alpha_0$ when $\theta = 0$. For two different values of $\theta_1$, these two sets will be different. Therefore, different test procedures will maximize the power at the two different values of $\theta_1$. Hence, no single test is a UMP test.
18. The UMP test will reject $H_0$ when $\bar{X}_n \ge c$, where $\Pr(\bar{X}_n \ge c \mid \mu = 0) = \Pr(\sqrt{n}\bar{X}_n \ge \sqrt{n}c \mid \mu = 0) = 0.025$. However, when $\mu = 0$, $\sqrt{n}\bar{X}_n$ has the standard normal distribution. Therefore, $\Pr(\sqrt{n}\bar{X}_n \ge 1.96 \mid \mu = 0) = 0.025$. It follows that $\sqrt{n}c = 1.96$ and $c = 1.96n^{-1/2}$.

(a) When $\mu = 0.5$, the random variable $Z = \sqrt{n}(\bar{X}_n - 0.5)$ has the standard normal distribution. Therefore,
$$\pi(0.5 \mid \delta^*) = \Pr(\bar{X}_n \ge 1.96n^{-1/2} \mid \mu = 0.5) = \Pr(Z \ge 1.96 - 0.5n^{1/2}) = \Pr(Z \le 0.5n^{1/2} - 1.96) = \Phi(0.5n^{1/2} - 1.96).$$
But $\Phi(1.282) = 0.9$. Therefore, $\pi(0.5 \mid \delta^*) \ge 0.9$ if and only if $0.5n^{1/2} - 1.96 \ge 1.282$, or, equivalently, if and only if $n \ge 42.042$. Thus, a sample of size $n = 43$ is required in order to have $\pi(0.5 \mid \delta^*) \ge 0.9$. Since the power function is a strictly increasing function of $\mu$, it will then also be true that $\pi(\mu \mid \delta^*) \ge 0.9$ for $\mu > 0.5$.

(b) When $\mu = -0.1$, the random variable $Z = \sqrt{n}(\bar{X}_n + 0.1)$ has the standard normal distribution. Therefore,
$$\pi(-0.1 \mid \delta^*) = \Pr(\bar{X}_n \ge 1.96n^{-1/2} \mid \mu = -0.1) = \Pr(Z \ge 1.96 + 0.1n^{1/2}) = 1 - \Phi(1.96 + 0.1n^{1/2}).$$
But $\Phi(3.10) = 0.999$. Therefore, $\pi(-0.1 \mid \delta^*) \le 0.001$ if and only if $1.96 + 0.1n^{1/2} \ge 3.10$ or, equivalently, if and only if $n \ge 129.96$. Thus, a sample of size $n = 130$ is required in order to have $\pi(-0.1 \mid \delta^*) \le 0.001$. Since the power function is a strictly increasing function of $\mu$, it will then also be true that $\pi(\mu \mid \delta^*) \le 0.001$ for $\mu < -0.1$.
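The sample-size calculation in part (a) can also be done by direct search over $n$ rather than by inverting the normal table. The sketch below is an illustration with hypothetical function names; it evaluates the power $1 - \Phi(1.96 - \mu n^{1/2})$ exactly, so for other targets it need not agree with answers obtained from rounded table values.

```python
from math import erf, sqrt


def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def power_at(mu, n, z_crit=1.96):
    """Power at mu of the test that rejects when Xbar_n >= z_crit / sqrt(n) (variance 1)."""
    return 1.0 - Phi(z_crit - mu * sqrt(n))


def smallest_n(mu, target, n_max=10_000):
    """Smallest n with power_at(mu, n) >= target, for mu > 0."""
    for n in range(1, n_max + 1):
        if power_at(mu, n) >= target:
            return n
    raise ValueError("target power not reached within n_max")


n_required = smallest_n(0.5, 0.9)
print(n_required)
```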


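19. (a) Let $f(x \mid \mu)$ be the joint p.d.f. of $X$ given $\mu$. For each set $A$ and $i = 0, 1$,
$$\Pr(X \in A \mid \mu = \mu_i) = \int \cdots \int_A f(x \mid \mu_i)\,dx. \tag{S.9.7}$$
It is clear that $f(x \mid \mu_0) > 0$ for all $x$ and so is $f(x \mid \mu_1) > 0$ for all $x$. Hence (S.9.7) is strictly positive for $i = 0$ if and only if it is strictly positive for $i = 1$.

(b) Both $\delta$ and $\delta_1$ are size $\alpha_0$ tests of $H_0' : \mu = \mu_0$ versus $H_1' : \mu > \mu_0$. Let
$$A = \{x : \delta \text{ rejects but } \delta_1 \text{ does not reject}\},$$
$$B = \{x : \delta \text{ does not reject but } \delta_1 \text{ rejects}\},$$
$$C = \{x : \text{both tests reject}\}.$$
Because both tests have the same size, it must be the case that
$$\Pr(X \in A \mid \mu = \mu_0) + \Pr(X \in C \mid \mu = \mu_0) = \alpha_0 = \Pr(X \in B \mid \mu = \mu_0) + \Pr(X \in C \mid \mu = \mu_0).$$
Hence,
$$\Pr(X \in A \mid \mu = \mu_0) = \Pr(X \in B \mid \mu = \mu_0). \tag{S.9.8}$$
Because of the MLR and the form of the test $\delta_1$, we know that there is a constant $c$ such that for every $\mu > \mu_0$ and every $x \in B$ and every $y \in A$,
$$\frac{f(x \mid \mu)}{f(x \mid \mu_0)} > c > \frac{f(y \mid \mu)}{f(y \mid \mu_0)}. \tag{S.9.9}$$
Now,
$$\pi(\mu \mid \delta) = \int \cdots \int_A f(x \mid \mu)\,dx + \int \cdots \int_C f(x \mid \mu)\,dx.$$
Also,
$$\pi(\mu \mid \delta_1) = \int \cdots \int_B f(x \mid \mu)\,dx + \int \cdots \int_C f(x \mid \mu)\,dx.$$
It follows that, for $\mu > \mu_0$,
$$\pi(\mu \mid \delta_1) - \pi(\mu \mid \delta) = \int \cdots \int_B f(x \mid \mu)\,dx - \int \cdots \int_A f(x \mid \mu)\,dx$$
$$= \int \cdots \int_B \frac{f(x \mid \mu)}{f(x \mid \mu_0)} f(x \mid \mu_0)\,dx - \int \cdots \int_A \frac{f(x \mid \mu)}{f(x \mid \mu_0)} f(x \mid \mu_0)\,dx$$
$$> c \int \cdots \int_B f(x \mid \mu_0)\,dx - c \int \cdots \int_A f(x \mid \mu_0)\,dx = 0,$$
where the inequality follows from (S.9.9), and the final equality follows from (S.9.8).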


9.4 Two-Sided Alternatives 


Commentary 


This section considers tests for simple (and interval) null hypotheses against two-sided alternative hypotheses. 
The concept of unbiased tests is introduced in a subsection at the end. Even students in a mathematical 
statistics course may have trouble with the concept of unbiased test. 


Solutions to Exercises 


1. If $\pi(\mu \mid \delta)$ is to be symmetric with respect to the point $\mu = \mu_0$, then the constants $c_1$ and $c_2$ must be chosen to be symmetric with respect to the value $\mu_0$. Let $c_1 = \mu_0 - k$ and $c_2 = \mu_0 + k$. When $\mu = \mu_0$, the random variable $Z = n^{1/2}(\bar{X}_n - \mu_0)$ has the standard normal distribution. Therefore,
$$\pi(\mu_0 \mid \delta) = \Pr(\bar{X}_n \le \mu_0 - k \mid \mu_0) + \Pr(\bar{X}_n \ge \mu_0 + k \mid \mu_0) = \Pr(Z \le -n^{1/2}k) + \Pr(Z \ge n^{1/2}k) = 2\Pr(Z \ge n^{1/2}k) = 2[1 - \Phi(n^{1/2}k)].$$
Since $k$ must be chosen so that $\pi(\mu_0 \mid \delta) = 0.10$, it follows that $\Phi(n^{1/2}k) = 0.95$. Therefore, $n^{1/2}k = 1.645$ and $k = 1.645n^{-1/2}$.


2. When $\mu = \mu_0$, the random variable $Z = n^{1/2}(\bar{X}_n - \mu_0)$ has the standard normal distribution. Therefore,
$$\pi(\mu_0 \mid \delta) = \Pr(\bar{X}_n \le c_1 \mid \mu_0) + \Pr(\bar{X}_n \ge c_2 \mid \mu_0) = \Pr(Z \le -1.96) + \Pr[Z \ge n^{1/2}(c_2 - \mu_0)]$$
$$= \Phi(-1.96) + 1 - \Phi[n^{1/2}(c_2 - \mu_0)] = 1.025 - \Phi[n^{1/2}(c_2 - \mu_0)].$$
If we are to have $\pi(\mu_0 \mid \delta) = 0.10$, then we must have $\Phi[n^{1/2}(c_2 - \mu_0)] = 0.925$. Therefore, $n^{1/2}(c_2 - \mu_0) = 1.439$ and $c_2 = \mu_0 + 1.439n^{-1/2}$.


3. From Exercise 1, we know that if $c_1 = \mu_0 - 1.645n^{-1/2}$ and $c_2 = \mu_0 + 1.645n^{-1/2}$, then $\pi(\mu_0 \mid \delta) = 0.10$ and, by symmetry, $\pi(\mu_0 + 1 \mid \delta) = \pi(\mu_0 - 1 \mid \delta)$. Also, when $\mu = \mu_0 + 1$, the random variable $Z = n^{1/2}(\bar{X}_n - \mu_0 - 1)$ has the standard normal distribution. Therefore,
$$\pi(\mu_0 + 1 \mid \delta) = \Pr(\bar{X}_n \le c_1 \mid \mu_0 + 1) + \Pr(\bar{X}_n \ge c_2 \mid \mu_0 + 1) = \Pr(Z \le -1.645 - n^{1/2}) + \Pr(Z \ge 1.645 - n^{1/2})$$
$$= \Phi(-1.645 - n^{1/2}) + \Phi(n^{1/2} - 1.645).$$
For $n = 9$, $\pi(\mu_0 + 1 \mid \delta) = \Phi(-4.645) + \Phi(1.355) < 0.95$.
For $n = 10$, $\pi(\mu_0 + 1 \mid \delta) = \Phi(-4.807) + \Phi(1.517) < 0.95$.
For $n = 11$, $\pi(\mu_0 + 1 \mid \delta) = \Phi(-4.962) + \Phi(1.672) > 0.95$.
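The comparison of $n = 9$, 10, and 11 above amounts to searching for the smallest $n$ at which $\Phi(-1.645 - n^{1/2}) + \Phi(n^{1/2} - 1.645)$ reaches 0.95. A minimal sketch of that search, with illustrative function names:

```python
from math import erf, sqrt


def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def power_one_unit_away(n, z=1.645):
    """pi(mu0 + 1 | delta) for the symmetric two-sided test with cutoffs mu0 -/+ z / sqrt(n)."""
    root = sqrt(n)
    return Phi(-z - root) + Phi(root - z)


def smallest_n(target=0.95, n_max=1000):
    """Smallest n for which the power one unit from mu0 reaches the target."""
    for n in range(1, n_max + 1):
        if power_one_unit_away(n) >= target:
            return n
    raise ValueError("target power not reached within n_max")


print(smallest_n(), [round(power_one_unit_away(n), 4) for n in (9, 10, 11)])
```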


4. If we choose $c_1$ and $c_2$ to be symmetric with respect to the value 0.15, then it will be true that $\pi(0.1 \mid \delta) = \pi(0.2 \mid \delta)$. Accordingly, let $c_1 = 0.15 - k$ and $c_2 = 0.15 + k$. When $\mu = 0.1$, the random variable $Z = 5(\bar{X}_n - 0.1)$ has a standard normal distribution. Therefore,
$$\pi(0.1 \mid \delta) = \Pr(\bar{X}_n \le c_1 \mid 0.1) + \Pr(\bar{X}_n \ge c_2 \mid 0.1) = \Pr(Z \le 0.25 - 5k) + \Pr(Z \ge 0.25 + 5k) = \Phi(0.25 - 5k) + \Phi(-0.25 - 5k).$$
We must choose $k$ so that $\pi(0.1 \mid \delta) = 0.07$. By trial and error, using the table of the standard normal distribution, we find that when $5k = 1.867$,
$$\pi(0.1 \mid \delta) = \Phi(-1.617) + \Phi(-2.117) = 0.0529 + 0.0171 = 0.07.$$
Hence, $k = 0.3734$.
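The trial-and-error step can be replaced by bisection, since $\Phi(0.25 - 5k) + \Phi(-0.25 - 5k)$ is decreasing in $k$. The sketch below is one way to do this, not the method used in the text; the names are illustrative and the result may differ from the table-based value in the last digit.

```python
from math import erf, sqrt


def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def power_at_0_1(h):
    """pi(0.1 | delta) written in terms of h = 5k."""
    return Phi(0.25 - h) + Phi(-0.25 - h)


def solve_h(target=0.07, lo=0.0, hi=10.0):
    """Bisection on h; power_at_0_1 is decreasing in h."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if power_at_0_1(mid) > target:
            lo = mid  # power still too large, increase h
        else:
            hi = mid
    return (lo + hi) / 2.0


h = solve_h()
k = h / 5.0
print(round(h, 3), round(k, 4))
```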


5. As in Exercise 4,
$$\pi(0.1 \mid \delta) = \Pr[Z \le 5(c_1 - 0.1)] + \Pr[Z \ge 5(c_2 - 0.1)] = \Phi(5c_1 - 0.5) + \Phi(0.5 - 5c_2).$$
Similarly,
$$\pi(0.2 \mid \delta) = \Pr[Z \le 5(c_1 - 0.2)] + \Pr[Z \ge 5(c_2 - 0.2)] = \Phi(5c_1 - 1) + \Phi(1 - 5c_2).$$
Hence, the following two equations must be solved simultaneously:
$$\Phi(5c_1 - 0.5) + \Phi(0.5 - 5c_2) = 0.02,$$
$$\Phi(5c_1 - 1) + \Phi(1 - 5c_2) = 0.05.$$
By trial and error, using the table of the standard normal distribution, it is found ultimately that if $5c_1 = -2.12$ and $5c_2 = 2.655$, then
$$\Phi(5c_1 - 0.5) + \Phi(0.5 - 5c_2) = \Phi(-2.62) + \Phi(-2.155) = 0.0044 + 0.0155 = 0.02$$
and
$$\Phi(5c_1 - 1) + \Phi(1 - 5c_2) = \Phi(-3.12) + \Phi(-1.655) = 0.0009 + 0.0490 = 0.05.$$
6. Let $T = \max(X_1, \ldots, X_n)$. Then

\[
f_n(X \mid \theta) =
\begin{cases}
1/\theta^n & \text{for } 0 \le t \le \theta, \\
0 & \text{otherwise.}
\end{cases}
\]

Therefore, for $\theta_1 < \theta_2$,

\[
\frac{f_n(X \mid \theta_2)}{f_n(X \mid \theta_1)} =
\begin{cases}
(\theta_1/\theta_2)^n & \text{for } 0 \le t \le \theta_1, \\
\infty & \text{for } \theta_1 < t \le \theta_2.
\end{cases}
\]

It can be seen from this relationship that $f_n(X \mid \theta)$ has a monotone likelihood ratio in the statistic T (although we are being somewhat nonrigorous by treating $\infty$ as a number).

For any constant c ($0 < c \le 3$), $\Pr(T \ge c \mid \theta = 3) = 1 - (c/3)^n$. Therefore, to achieve a given level of significance $\alpha_0$, we should choose $c = 3(1 - \alpha_0)^{1/n}$. It follows from Theorem 9.3.1 that the corresponding test will be a UMP test.
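As a quick numerical check of the choice $c = 3(1 - \alpha_0)^{1/n}$, the sketch below computes the critical value and the probability $\Pr(T \ge c \mid \theta) = 1 - (c/\theta)^n$ for $\theta > c$. The sample size and level used here are hypothetical values chosen only for illustration.

```python
def critical_value(alpha0, n, theta0=3.0):
    """c = theta0 * (1 - alpha0)**(1/n), so that Pr(T >= c | theta0) = alpha0."""
    return theta0 * (1.0 - alpha0) ** (1.0 / n)


def power(theta, c, n):
    """Pr(T >= c | theta) for T = max of n uniform(0, theta) observations."""
    if theta <= c:
        return 0.0
    return 1.0 - (c / theta) ** n


n, alpha0 = 5, 0.05          # hypothetical values, not from the exercise
c = critical_value(alpha0, n)
print(c, power(3.0, c, n))   # the second number is the size of the test
```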


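7. For $\theta > 0$, the power function is $\pi(\theta \mid \delta) = \Pr(T \ge c \mid \theta)$. Hence,
\[
\pi(\theta \mid \delta) =
\begin{cases}
0 & \text{for } \theta \le c, \\
1 - (c/\theta)^n & \text{for } \theta > c.
\end{cases}
\]
The plot is in Fig. S.9.4.

[Figure: plot of $\pi(\theta \mid \delta)$ against $\theta$.]

Figure S.9.4: Figure for Exercise 7 of Sec. 9.4.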


8. (a) It follows from Exercise 8 of Sec. 9.3 that the specified test will be a UMP test. 
(b) For any given value of c ($0 < c < 3$), $\Pr(T \le c \mid \theta = 3) = (c/3)^n$. Therefore, to achieve a given level of significance $\alpha_0$, we should choose $c = 3\alpha_0^{1/n}$.


9. A sketch is given in Fig. S.9.5. 


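10. (a) Let $\alpha_0 = 0.05$ and let $c_1 = 3\alpha_0^{1/n}$ as in Exercise 8. Also, let $c_2 = 3$. Then
\[
\pi(\theta \mid \delta) = \Pr(T \le 3\alpha_0^{1/n} \mid \theta) + \Pr(T \ge 3 \mid \theta).
\]

Since $\Pr(T \ge 3 \mid \theta) = 0$ for $\theta \le 3$, the function $\pi(\theta \mid \delta)$ is as sketched in Exercise 9 for $\theta \le 3$. For $\theta > 3$,

\[
\pi(\theta \mid \delta) = \left[\frac{3\alpha_0^{1/n}}{\theta}\right]^n + 1 - \left(\frac{3}{\theta}\right)^n > \alpha_0. \tag{S.9.10}
\]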
[Figure: sketch of the power function, with $c$ and 3 marked on the $\theta$ axis.]

Figure S.9.5: Figure for Exercise 9 of Sec. 9.4.


(b) In order for a test $\delta$ to be UMP level $\alpha_0$ for (9.4.15), it is necessary and sufficient that the following three things happen:
• $\delta$ has the same power function as the test in Exercise 6 for $\theta > 3$.
• $\delta$ has the same power function as the test in Exercise 8 for $\theta < 3$.
• $\pi(3 \mid \delta) \le \alpha_0$.
Because $c_2 = 3$, we saw in part (a) that $\pi(\theta \mid \delta)$ is the same as the power function of the test in Exercise 8 for $\theta < 3$. We also saw in part (a) that $\pi(3 \mid \delta) = 0.05 = \alpha_0$. For $\theta > 3$, the power function of the test in Exercise 6 is

\[
\Pr(T \ge 3(1 - \alpha_0)^{1/n} \mid \theta) = 1 - \left[\frac{3(1 - \alpha_0)^{1/n}}{\theta}\right]^n = 1 - \left(\frac{3}{\theta}\right)^n (1 - \alpha_0).
\]

It is straightforward to see that this is the same as (S.9.10).
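The claimed agreement for $\theta > 3$ can be spot-checked numerically. The sketch below compares $[3\alpha_0^{1/n}/\theta]^n + 1 - (3/\theta)^n$ with $1 - (3/\theta)^n(1 - \alpha_0)$ over a small grid; the grid values of $\theta$ and n are arbitrary choices for illustration.

```python
def power_two_sided(theta, n, alpha0=0.05):
    """Power for theta > 3 of the test with c1 = 3*alpha0**(1/n) and c2 = 3."""
    c1 = 3.0 * alpha0 ** (1.0 / n)
    return (c1 / theta) ** n + 1.0 - (3.0 / theta) ** n


def power_upper_test(theta, n, alpha0=0.05):
    """Power for theta > 3 of the test that rejects when T >= 3*(1-alpha0)**(1/n)."""
    c = 3.0 * (1.0 - alpha0) ** (1.0 / n)
    return 1.0 - (c / theta) ** n


max_gap = max(
    abs(power_two_sided(t, n) - power_upper_test(t, n))
    for t in (3.1, 4.0, 10.0)
    for n in (1, 5, 20)
)
print(max_gap)  # should be at the level of floating-point rounding
```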


11. It can be verified that if $c_1$ and $c_2$ are chosen to be symmetric with respect to the value $\mu_0$, then the power function $\pi(\mu \mid \delta)$ will be symmetric with respect to the point $\mu = \mu_0$ and will attain its minimum value at $\mu = \mu_0$. Therefore, if $c_1$ and $c_2$ are chosen as in Exercise 1, the required conditions will be satisfied.


12. The power function of the test $\delta$ described in this exercise is

\[
\pi(\beta \mid \delta) = 1 - \exp(-c_1 \beta) + \exp(-c_2 \beta).
\]

(a) In order for $\delta$ to have level of significance $\alpha_0$, we must have $\pi(1 \mid \delta) \le \alpha_0$. Indeed, the test will have size $\alpha_0$ exactly if

\[
\alpha_0 = 1 - \exp(-c_1) + \exp(-c_2).
\]
(b) We can let $c_1 = -\log(1 - \alpha_0/2)$ and $c_2 = -\log(\alpha_0/2)$ to solve this equation.
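A short check that the proposed cutoffs give size exactly $\alpha_0$: the sketch below evaluates $1 - \exp(-c_1) + \exp(-c_2)$ at the suggested $c_1$ and $c_2$. The argument name `rate` stands for the parameter of the power function and is a naming choice of this sketch.

```python
import math


def cutoffs(alpha0):
    """Equal-tailed choice: c1 = -log(1 - alpha0/2), c2 = -log(alpha0/2)."""
    c1 = -math.log(1.0 - alpha0 / 2.0)
    c2 = -math.log(alpha0 / 2.0)
    return c1, c2


def power(rate, c1, c2):
    """1 - exp(-c1*rate) + exp(-c2*rate)."""
    return 1.0 - math.exp(-c1 * rate) + math.exp(-c2 * rate)


alpha0 = 0.05
c1, c2 = cutoffs(alpha0)
print(c1, c2, power(1.0, c1, c2))
```

Each tail contributes $\alpha_0/2$, which is why this solution is only one of many pairs $(c_1, c_2)$ satisfying the size equation.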


13. The first term on the right of (9.4.13) is 


Cor nk ae n 
5 fe Tin) exp(—t0)dt g alain, ) 


The second term on the right of (9.4.13) is the negative of 


n pe grt - n 
5 if Tmt: exp(—t0)dt = GG(x;n + 1,8). 
9.5 The t Test


Commentary 


This section provides a natural continuation to Sec. 9.1 in a modern statistics course. We introduce the t 
test and its power function, defined in terms of the noncentral t distribution. The theoretical derivation of 
the t test as a likelihood ratio test is isolated at the end of the section and could easily be skipped without 
interrupting the flow of material. Indeed, that derivation should only be of interest in a fairly mathematical 
statistics course. 

As with confidence intervals, computer software can replace tables for obtaining quantiles of the t distributions that are used in tests. The R function qt can compute these. For computing p-values, one can use pt. The precise use of pt depends on whether the alternative hypothesis is one-sided or two-sided. For testing $H_0: \mu \le \mu_0$ versus $H_1: \mu > \mu_0$ using the statistic U in Eq. (9.5.2), the p-value would be 1-pt(u,n-1), where u is the observed value of U. For the opposite one-sided hypotheses, the p-value would be pt(u,n-1). For testing $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$, the p-value is 2*(1-pt(abs(u),n-1)). The power function of a t test can be computed using the optional third parameter with pt, which is the noncentrality parameter (whose default value is 0). Similar considerations apply to the comparison of two means in Sec. 9.6.
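For readers without R at hand, the three p-value recipes can be mirrored in Python. The sketch below is only illustrative: it obtains the central t c.d.f. by Simpson's rule on the density (an assumption of this sketch, not the algorithm behind pt), and it does not cover the noncentral case. The numbers u = 1.727 with 9 degrees of freedom are those of Exercise 1 of this section.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """Central t c.d.f.: Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


u, df = 1.727, 9
p_upper = 1 - t_cdf(u, df)                  # H1: mu > mu0
p_lower = t_cdf(u, df)                      # H1: mu < mu0
p_two_sided = 2 * (1 - t_cdf(abs(u), df))   # H1: mu != mu0
print(p_upper, p_lower, p_two_sided)
```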


Solutions to Exercises 


1. We computed the summary statistics $\bar{x}_n = 1.379$ and $\sigma' = 0.3277$ in Example 8.5.4.

(a) The test statistic is U from (9.5.2),
\[
U = 10^{1/2}\,\frac{1.379 - 1.2}{0.3277} = 1.727.
\]
We reject $H_0$ at level $\alpha_0 = 0.05$ if $U \ge 1.833$, the 0.95 quantile of the t distribution with 9 degrees of freedom. Since $1.727 < 1.833$, we do not reject $H_0$ at level 0.05.

(b) We need to compute the probability that a t random variable with 9 degrees of freedom exceeds 1.727. This probability can be computed by most statistical software, and it equals 0.0591. Without a computer, one could interpolate in the table of the t distribution in the back of the book. That would yield 0.0618.


2. When $\mu_0 = 20$, the statistic U given by Eq. (9.5.2) has a t distribution with 8 degrees of freedom. The value of U in this exercise is 2.

(a) We would reject $H_0$ if $U \ge 1.860$. Therefore, we reject $H_0$.

(b) We would reject $H_0$ if $U \le -2.306$ or $U \ge 2.306$. Therefore, we don't reject $H_0$.

(c) We should include in the confidence interval all values of $\mu_0$ for which the value of U given by Eq. (9.5.2) will lie between $-2.306$ and $2.306$. These values form the interval $19.694 < \mu_0 < 24.306$.


3. It must be assumed that the miles per gallon obtained from the different tankfuls are independent and identically distributed, and that each has a normal distribution. When $\mu_0 = 20$, the statistic U given by Eq. (9.5.2) has the t distribution with 8 degrees of freedom. Here, we are testing the following hypotheses:

\[
H_0: \mu \ge 20, \qquad H_1: \mu < 20.
\]

We would reject $H_0$ if $U \le -1.860$. From the given value, it is found that $\bar{X}_n = 19$ and $S_n^2 = 22$. Hence, $U = -1.809$ and we do not reject $H_0$.
4. When $\mu_0 = 0$, the statistic U given by Eq. (9.5.2) has the t distribution with 7 degrees of freedom. Here

\[
\bar{X}_n = \frac{-11.2}{8} = -1.4
\]

and

\[
\sum_{i=1}^{n} (X_i - \bar{X}_n)^2 = 43.7 - 8(1.4)^2 = 28.02.
\]

The value of U can now be found to be $-1.979$. We should reject $H_0$ if $U \le -1.895$ or $U \ge 1.895$. Therefore, we reject $H_0$.


5. It is found from the table of the t distribution with 7 degrees of freedom that $c_1 = -2.998$ and $c_2$ lies between 1.415 and 1.895. Since $U = -1.979$, we do not reject $H_0$.


6. Let U be given by Eq. (9.5.2) and suppose that c is chosen so that the level of significance of the test is $\alpha_0$. Then
\[
\pi(\mu, \sigma^2 \mid \delta) = \Pr(U \ge c \mid \mu, \sigma^2).
\]

If we let $Y = n^{1/2}(\bar{X}_n - \mu)/\sigma$ and $Z = \sum_{i=1}^{n} (X_i - \bar{X}_n)^2/\sigma^2$, then Y will have a standard normal distribution, Z will have a $\chi^2$ distribution with $n - 1$ degrees of freedom, and Y and Z will be independent. Also,

\[
U = \frac{Y + n^{1/2}(\mu - \mu_0)/\sigma}{[Z/(n - 1)]^{1/2}}.
\]

It follows that all pairs $(\mu, \sigma^2)$ which yield the same value of $(\mu - \mu_0)/\sigma$ will yield the same value of $\pi(\mu, \sigma^2 \mid \delta)$.


7. The random variable $T = (X - \mu)/\sigma$ will have the standard normal distribution, the random variable $Z = \sum_{i=1}^{n} Y_i^2/\sigma^2$ will have a $\chi^2$ distribution with n degrees of freedom, and T and Z will be independent. Therefore, when $\mu = \mu_0$, the following random variable U will have the t distribution with n degrees of freedom:

\[
U = \frac{T}{[Z/n]^{1/2}} = \frac{n^{1/2}(X - \mu_0)}{\left[\sum_{i=1}^{n} Y_i^2\right]^{1/2}}.
\]

The hypothesis $H_0$ would be rejected if $U \ge c$.


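8. When $\sigma^2 = \sigma_0^2$, $S_n^2/\sigma_0^2$ has a $\chi^2$ distribution with $n - 1$ degrees of freedom. Choose c so that, when $\sigma^2 = \sigma_0^2$, $\Pr(S_n^2/\sigma_0^2 \ge c) = \alpha_0$, and reject $H_0$ if $S_n^2/\sigma_0^2 \ge c$. Then $\pi(\mu, \sigma^2 \mid \delta) = \alpha_0$ if $\sigma^2 = \sigma_0^2$. If $\sigma^2 \ne \sigma_0^2$, then $Z = S_n^2/\sigma^2$ has the $\chi^2$ distribution with $n - 1$ degrees of freedom, and $S_n^2/\sigma_0^2 = (\sigma^2/\sigma_0^2) Z$. Therefore,

\[
\pi(\mu, \sigma^2 \mid \delta) = \Pr(S_n^2/\sigma_0^2 \ge c \mid \mu, \sigma^2) = \Pr(Z \ge c\sigma_0^2/\sigma^2).
\]

If $\sigma_0^2/\sigma^2 > 1$, then $\pi(\mu, \sigma^2 \mid \delta) < \Pr(Z \ge c) = \alpha_0$. If $\sigma_0^2/\sigma^2 < 1$, then $\pi(\mu, \sigma^2 \mid \delta) > \Pr(Z \ge c) = \alpha_0$.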


9. When $\sigma^2 = 4$, $S_n^2/4$ has the $\chi^2$ distribution with 9 degrees of freedom. We would reject $H_0$ if $S_n^2/4 \ge 16.92$. Since $S_n^2/4 = 60/4 = 15$, we do not reject $H_0$.

10. When $\sigma^2 = 4$, $S_n^2/4$ has the $\chi^2$ distribution with 9 degrees of freedom. Therefore, $\Pr(S_n^2/4 < 2.700) = \Pr(S_n^2/4 > 19.02) = 0.025$. It follows that $c_1 = 4(2.700) = 10.80$ and $c_2 = 4(19.02) = 76.08$.

11. $U_1$ has the distribution of $X/Y$ where X has a normal distribution with mean $\psi$ and variance 1, and Y is independent of X such that $mY^2$ has the $\chi^2$ distribution with m degrees of freedom. Notice that $-X$ has a normal distribution with mean $-\psi$ and variance 1 and is independent of Y. So $U_2$ has the distribution of $-X/Y = -U_1$. So

\[
\Pr(U_2 \le -c) = \Pr(-U_1 \le -c) = \Pr(U_1 \ge c).
\]

12. The statistic U has the t distribution with 16 degrees of freedom. The calculated value is

\[
U = \frac{\sqrt{17}\,(\bar{X}_n - 3)}{(S_n^2/16)^{1/2}} = \frac{0.2}{(0.09/16)^{1/2}} = \frac{8}{3},
\]
and the corresponding tail area is $\Pr(U > 8/3)$.

13. The test statistic is $U = 169^{1/2}(3.2 - 3)/(0.09)^{1/2} = 8.667$. The p-value can be calculated using statistical software as $1 - T_{169}(8.667) = 1.776 \times 10^{-15}$.

14. The statistic U has the t distribution with 16 degrees of freedom. The calculated value of U is

\[
U = \frac{0.1}{(0.09/16)^{1/2}} = \frac{4}{3}.
\]

Because the alternative hypothesis is two-sided, the corresponding tail area is

\[
\Pr\left(U \ge \tfrac{4}{3}\right) + \Pr\left(U \le -\tfrac{4}{3}\right) = 2\Pr\left(U \ge \tfrac{4}{3}\right).
\]

15. The test statistic is $U = 169^{1/2}(3.2 - 3.1)/(0.09)^{1/2} = 4.333$. The p-value can be calculated using statistical software as $2[1 - T_{169}(4.333)] = 2.512 \times 10^{-5}$.

16. The calculated value of U is

\[
U = \frac{-0.1}{(0.09/16)^{1/2}} = -\frac{4}{3}.
\]

Since this value is the negative of the value found in Exercise 14, the corresponding tail area will be the same as in Exercise 14.

17. The denominator of $\Lambda(x)$ is still (9.5.11). The M.L.E. $(\hat{\mu}_0, \hat{\sigma}_0^2)$ is easier to calculate in this exercise, namely $\hat{\mu}_0 = \mu_0$ (the only possible value) and

\[
\hat{\sigma}_0^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_0)^2.
\]
These are the same values that lead to Eq. (9.5.12) in the text. Hence, $\Lambda(x)$ has the value given in Eq. (9.5.14). For $k < 1$, $\Lambda(x) \le k$ if and only if

\[
|U| \ge \left((n - 1)\left[k^{-2/n} - 1\right]\right)^{1/2} = c.
\]
18. In this case $\Omega_0 = \{(\mu, \sigma^2) : \mu \ge \mu_0\}$, and $\Lambda(x) = 1$ if $\bar{x}_n \ge \mu_0$. If $\bar{x}_n < \mu_0$, then the numerator of $\Lambda(x)$ is (9.5.12), and the formula for $\Lambda(x)$ is the same as (9.5.13) with the branch labels switched. This time, $\Lambda(x)$ is a non-decreasing function of u, the observed value of U. So for $k < 1$, $\Lambda(x) \le k$ if and only if $U \le -c$, for the same c as in Example 9.5.12.


9.6 Comparing the Means of Two Normal Distributions 


Commentary 


The two-sample t test is introduced for the case of equal variances. There is some material near the end of the section about the case of unequal variances. This is useful material, but is not traditionally covered and can be skipped. Also, the derivation of the two-sample t test as a likelihood ratio test is provided for mathematical interest at the end of the section.


Solutions to Exercises 


1. In this example, $n = 5$, $m = 5$, $\bar{X}_m = 18.18$, $\bar{Y}_n = 17.32$, $S_X^2 = 12.61$, and $S_Y^2 = 11.01$. Then

\[
U = \frac{(5 + 5 - 2)^{1/2}(18.18 - 17.32)}{\left(\frac{1}{5} + \frac{1}{5}\right)^{1/2}(11.01 + 12.61)^{1/2}} = 0.7913.
\]

We see that $|U| = 0.7913$ is much smaller than the 0.975 quantile of the t distribution with 8 degrees of freedom.


2. In this exercise, $m = 8$, $n = 6$, $\bar{x}_m = 1.5125$, $\bar{y}_n = 1.6683$, $S_X^2 = 0.18075$, and $S_Y^2 = 0.16768$. When $\mu_1 = \mu_2$, the statistic U defined by Eq. (9.6.3) will have the t distribution with 12 degrees of freedom. The hypotheses are as follows:

\[
H_0: \mu_1 \ge \mu_2, \qquad H_1: \mu_1 < \mu_2.
\]

Since the inequalities are reversed from those in (9.6.1), the hypothesis $H_0$ should be rejected if $U < c$. It is found from a table that $c = -1.356$. The calculated value of U is $-1.692$. Therefore, $H_0$ is rejected.


3. The value $c = 1.782$ can be found from a table of the t distribution with 12 degrees of freedom. Since $U = -1.692$, $H_0$ is not rejected.


4. The random variable $\bar{X}_m - \bar{Y}_n$ has a normal distribution with mean 0 and variance $(\sigma^2/m) + (k\sigma^2/n)$. Therefore, the following random variable has the standard normal distribution:
\[
Z_1 = \frac{\bar{X}_m - \bar{Y}_n}{\left(\frac{1}{m} + \frac{k}{n}\right)^{1/2} \sigma}.
\]
The random variable $S_X^2/\sigma^2$ has a $\chi^2$ distribution with $m - 1$ degrees of freedom. The random variable $S_Y^2/(k\sigma^2)$ has a $\chi^2$ distribution with $n - 1$ degrees of freedom. These two random variables are independent. Therefore, $Z_2 = (1/\sigma^2)(S_X^2 + S_Y^2/k)$ has a $\chi^2$ distribution with $m + n - 2$ degrees of freedom. Since $Z_1$ and $Z_2$ are independent, it follows that $U = (m + n - 2)^{1/2} Z_1/Z_2^{1/2}$ has the t distribution with $m + n - 2$ degrees of freedom.


5. Again, $H_0$ should be rejected if $U < -1.356$. Since $U = -1.672$, $H_0$ is rejected.

6. If $\mu_1 - \mu_2 = \lambda$, the following statistic U will have the t distribution with $m + n - 2$ degrees of freedom:

\[
U = \frac{(m + n - 2)^{1/2}(\bar{X}_m - \bar{Y}_n - \lambda)}{\left(\frac{1}{m} + \frac{1}{n}\right)^{1/2}(S_X^2 + S_Y^2)^{1/2}}.
\]

The hypothesis $H_0$ should be rejected if either $U < c_1$ or $U > c_2$.

7. To test the hypotheses in Exercise 6, $H_0$ would not be rejected if $-1.782 < U < 1.782$. The set of all values of $\lambda$ for which $H_0$ would not be rejected will form a confidence interval for $\mu_1 - \mu_2$ with confidence coefficient 0.90. The value of U, for an arbitrary value of $\lambda$, is found to be

\[
U = \frac{\sqrt{12}\,(-0.1558 - \lambda)}{0.3188}.
\]

It is found that $-1.782 < U < 1.782$ if and only if $-0.320 < \lambda < 0.008$.

8. The noncentrality parameter when $|\mu_1 - \mu_2| = \sigma$ is

\[
\psi = \frac{1}{\left(\frac{1}{8} + \frac{1}{10}\right)^{1/2}} = 2.108.
\]
The degrees of freedom are 16. Figure 9.14 in the text makes it look like the power is about 0.23. Using computer software, we can compute the noncentral t probability to be 0.248.

9. The p-value can be computed as the size of the test that rejects $H_0$ when $|U| \ge |u|$, where u is the observed value of the test statistic. Since U has the t distribution with $m + n - 2$ degrees of freedom when $H_0$ is true, the size of the test that rejects $H_0$ when $|U| \ge |u|$ is the probability that a t random variable with $m + n - 2$ degrees of freedom is either less than $-|u|$ or greater than $|u|$. This probability is

\[
T_{m+n-2}(-|u|) + 1 - T_{m+n-2}(|u|) = 2[1 - T_{m+n-2}(|u|)],
\]

by the symmetry of t distributions.

10. Let $X_i$ stand for an observation in the calcium supplement group and let $Y_j$ stand for an observation in the placebo group. The summary statistics are

\[
m = 10, \quad n = 11, \quad \bar{x}_m = 109.9, \quad \bar{y}_n = 113.9, \quad s_x^2 = 546.9, \quad s_y^2 = 1282.8.
\]

We would reject the null hypothesis if $U > T_{19}^{-1}(0.9) = 1.328$. The test statistic has the observed value $u = -0.9350$. Since $u < 1.328$, we do not reject the null hypothesis.
(a) The observed value of the test statistic U is

\[
u = \frac{(43 + 35 - 2)^{1/2}(8.560 - 5.551)}{\left(\frac{1}{43} + \frac{1}{35}\right)^{1/2}(2745.7 + 783.9)^{1/2}} = 1.939.
\]

We would reject the null hypothesis at level $\alpha_0 = 0.01$ if $U > 2.376$, the 0.99 quantile of the t distribution with 76 degrees of freedom. Since $u < 2.376$, we do not reject $H_0$ at level 0.01. (The answer in the back of the book is incorrect in early printings.)

(b) For Welch's test, the approximate degrees of freedom is
\[
\nu = \frac{\left(\frac{2745.7}{43 \times 42} + \frac{783.9}{35 \times 34}\right)^2}{\frac{1}{42}\left(\frac{2745.7}{43 \times 42}\right)^2 + \frac{1}{34}\left(\frac{783.9}{35 \times 34}\right)^2} = 70.04.
\]
The corresponding t quantile is 2.381. The test statistic is
\[
\frac{8.560 - 5.551}{\left(\frac{2745.7}{43 \times 42} + \frac{783.9}{35 \times 34}\right)^{1/2}} = 2.038.
\]

Once again, we do not reject $H_0$. (The answer in the back of the book is incorrect in early printings.)

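The W in (9.6.15) is the sum of two independent random variables, one having a gamma distribution with parameters $(m - 1)/2$ and $m(m - 1)/(2\sigma_1^2)$ and the other having a gamma distribution with parameters $(n - 1)/2$ and $n(n - 1)/(2\sigma_2^2)$. So, the mean and variance of W are

\[
\begin{aligned}
E(W) &= \frac{(m - 1)/2}{m(m - 1)/(2\sigma_1^2)} + \frac{(n - 1)/2}{n(n - 1)/(2\sigma_2^2)} = \frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}, \\
\operatorname{Var}(W) &= \frac{(m - 1)/2}{m^2(m - 1)^2/(4\sigma_1^4)} + \frac{(n - 1)/2}{n^2(n - 1)^2/(4\sigma_2^4)} = \frac{2\sigma_1^4}{m^2(m - 1)} + \frac{2\sigma_2^4}{n^2(n - 1)}.
\end{aligned}
\]

The gamma distribution with parameters $\alpha$ and $\beta$ has the above mean and variance if $\alpha/\beta = E(W)$ and $\alpha/\beta^2 = \operatorname{Var}(W)$. In particular, $\alpha = E(W)^2/\operatorname{Var}(W)$, so

\[
2\alpha = \frac{\left(\frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}\right)^2}{\frac{\sigma_1^4}{m^2(m - 1)} + \frac{\sigma_2^4}{n^2(n - 1)}}.
\]
This is easily seen to be the same as the expression in (9.6.16).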
The likelihood ratio statistic for this case is

\[
\Lambda(x, y) = \frac{\sup_{\{(\mu_1, \mu_2, \sigma^2) : \mu_1 = \mu_2\}} g(x, y \mid \mu_1, \mu_2, \sigma^2)}{\sup_{\{(\mu_1, \mu_2, \sigma^2) : \mu_1 \ne \mu_2\}} g(x, y \mid \mu_1, \mu_2, \sigma^2)}. \tag{S.9.11}
\]

Maximizing the numerator of (S.9.11) is identical to maximizing the numerator of (9.6.10) when $\bar{x}_m > \bar{y}_n$ because we need $\mu_1 = \mu_2$ in both cases. So the M.L.E.'s are

\[
\hat{\mu}_1 = \hat{\mu}_2 = \frac{m\bar{x}_m + n\bar{y}_n}{m + n}, \qquad
\hat{\sigma}^2 = \frac{1}{m + n}\left[\frac{mn(\bar{x}_m - \bar{y}_n)^2}{m + n} + s_x^2 + s_y^2\right].
\]

Maximizing the denominator of (S.9.11) is identical to the maximization of the denominator of (9.6.10) when $\bar{x}_m > \bar{y}_n$. We use the overall M.L.E.'s

\[
\hat{\mu}_1 = \bar{x}_m, \qquad \hat{\mu}_2 = \bar{y}_n, \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{1}{m + n}(s_x^2 + s_y^2).
\]

This makes $\Lambda(x, y)$ equal to $(1 + v^2)^{-(m+n)/2}$ where v is defined in (9.6.12). So $\Lambda(x, y) \le k$ if and only if $v^2 \ge k'$ for some other constant $k'$. This translates easily to $|U| \ge c$.


9.7 The F Distributions 


Commentary 


The F distributions are introduced along with the F test for equality of variances from normal samples. The power function of the F test is derived also. The derivation of the F test as a likelihood ratio test is provided for mathematical interest.

Those using the software R can make use of the functions df, pf, and qf which compute respectively the p.d.f., c.d.f., and quantile function of an arbitrary F distribution. The first argument is the argument of the function and the next two are the degrees of freedom. The function rf produces a random sample of F random variables.


Solutions to Exercises 


1. The test statistic is $V = [2745.7/42]/[783.9/34] = 2.835$. We reject the null hypothesis if V is greater than the 0.95 quantile of the F distribution with 42 and 34 degrees of freedom, which is 1.737. So, we reject the null hypothesis at level 0.05.


2. Let $Y = 1/X$. Then Y has the F distribution with 8 and 3 degrees of freedom. Also
\[
\Pr(X > c) = \Pr\left(Y < \frac{1}{c}\right) = 0.975.
\]

It can be found from the table given at the end of the book that $\Pr(Y < 14.54) = 0.975$. Therefore, $1/c = 14.54$ and $c = 0.069$.


3. If Y has the t distribution with 8 degrees of freedom, then $X = Y^2$ will have the F distribution with 1 and 8 degrees of freedom. Also,

\[
0.3 = \Pr(X > c) = \Pr(Y > \sqrt{c}) + \Pr(Y < -\sqrt{c}) = 2\Pr(Y > \sqrt{c}).
\]

Therefore, $\Pr(Y > \sqrt{c}) = 0.15$. It can be found from the table given at the end of the book that $\Pr(Y > 1.108) = 0.15$. Hence, $\sqrt{c} = 1.108$ and $c = 1.228$.


4. Suppose that X is represented as in Eq. (9.7.1). Since Y and Z are independent,

\[
E(X) = \frac{n}{m} E\left(\frac{Y}{Z}\right) = \frac{n}{m} E(Y) E\left(\frac{1}{Z}\right).
\]

Since Y has the $\chi^2$ distribution with m degrees of freedom, $E(Y) = m$. Since Z has the $\chi^2$ distribution with n degrees of freedom,

\[
E\left(\frac{1}{Z}\right) = \int_0^\infty \frac{1}{z} f(z)\,dz = \frac{1}{2^{n/2}\Gamma(n/2)} \int_0^\infty z^{(n/2)-2} \exp(-z/2)\,dz
= \frac{2^{(n/2)-1}\,\Gamma[(n/2) - 1]}{2^{n/2}\,\Gamma(n/2)} = \frac{1}{2[(n/2) - 1]} = \frac{1}{n - 2}.
\]
Hence, $E(X) = n/(n - 2)$.


5. By Eq. (9.7.1), X can be represented in the form $X = Y/Z$, where Y and Z are independent and have identical $\chi^2$ distributions. Therefore, $\Pr(Y > Z) = \Pr(Y < Z) = 1/2$. Equivalently, $\Pr(X > 1) = \Pr(X < 1) = 1/2$. Therefore, the median of the distribution of X is 1.


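6. Let $f(x)$ denote the p.d.f. of X, let $W = mX/(mX + n)$, and let $g(w)$ denote the p.d.f. of W. Then $X = nW/[m(1 - W)]$ and $\dfrac{dx}{dw} = \dfrac{n}{m} \cdot \dfrac{1}{(1 - w)^2}$. For $0 < w < 1$,

\[
\begin{aligned}
g(w) &= f\left(\frac{nw}{m(1 - w)}\right) \frac{dx}{dw} \\
&= k \left(\frac{n}{m}\right)^{(m/2)-1} \left(\frac{w}{1 - w}\right)^{(m/2)-1} \frac{(1 - w)^{(m+n)/2}}{n^{(m+n)/2}} \cdot \frac{n}{m} \cdot \frac{1}{(1 - w)^2} \\
&= \frac{k}{m^{m/2} n^{n/2}}\, w^{(m/2)-1} (1 - w)^{(n/2)-1},
\end{aligned}
\]
where
\[
k = \frac{\Gamma[(m + n)/2]\, m^{m/2} n^{n/2}}{\Gamma(m/2)\Gamma(n/2)}.
\]
It can be seen that $g(w)$ is the p.d.f. of the required beta distribution.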
7. (a) Here, $\bar{X}_m = 84/16 = 5.25$ and $\bar{Y}_n = 18/10 = 1.8$. Therefore, $S_1^2 = \sum_{i=1}^{16} X_i^2 - 16(\bar{X}_m^2) = 122$ and $S_2^2 = \sum_{i=1}^{10} Y_i^2 - 10(\bar{Y}_n^2) = 39.6$. It follows that

\[
\hat{\sigma}_1^2 = \frac{1}{16} S_1^2 = 7.625 \qquad \text{and} \qquad \hat{\sigma}_2^2 = \frac{1}{10} S_2^2 = 3.96.
\]
If $\sigma_1^2 = \sigma_2^2$, the following statistic V will have the F distribution with 15 and 9 degrees of freedom:
\[
V = \frac{S_1^2/15}{S_2^2/9}.
\]
(b) If the test is to be carried out at the level of significance 0.05, then $H_0$ should be rejected if $V > 3.01$. It is found that $V = 1.848$. Therefore, we do not reject $H_0$.
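The numbers in this exercise follow directly from the two sums of squared deviations. As a small illustration (variable names are choices of this sketch), the following Python lines compute the M.L.E.'s of the variances, the statistic V, and the decision at the tabled critical value 3.01.

```python
m, n = 16, 10
s1_sq, s2_sq = 122.0, 39.6      # sums of squared deviations from the solution

sigma1_sq_hat = s1_sq / m       # M.L.E. of the first variance
sigma2_sq_hat = s2_sq / n       # M.L.E. of the second variance

v = (s1_sq / (m - 1)) / (s2_sq / (n - 1))   # F statistic with 15 and 9 d.f.
reject = v > 3.01               # 0.95 quantile taken from the table

print(sigma1_sq_hat, sigma2_sq_hat, v, reject)
```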


8. For any values of $\sigma_1^2$ and $\sigma_2^2$, the random variable

\[
\frac{S_1^2/(15\sigma_1^2)}{S_2^2/(9\sigma_2^2)}
\]

has the F distribution with 15 and 9 degrees of freedom. Therefore, if $\sigma_1^2 = 3\sigma_2^2$, the following statistic V will have that F distribution:

\[
V = \frac{S_1^2/45}{S_2^2/9}.
\]

As before, $H_0$ should be rejected if $V > c$, where $c = 3.01$ if the desired level of significance is 0.05.

9. When $\sigma_1^2 = \sigma_2^2$, V has an F distribution with 15 and 9 degrees of freedom. Therefore, $\Pr(V > 3.77) = 0.025$, which implies that $c_2 = 3.77$. Also, $1/V$ has an F distribution with 9 and 15 degrees of freedom. Therefore, $\Pr(1/V > 3.12) = 0.025$. It follows that $\Pr(V < 1/(3.12)) = 0.025$, which means that $c_1 = 1/(3.12) = 0.321$.

10. Let V be as defined in Exercise 9. If $\sigma_1^2 = r\sigma_2^2$, then $V/r$ has the F distribution with 15 and 9 degrees of freedom. Therefore, $H_0$ would be rejected if $V/r < c_1$ or $V/r > c_2$, where $c_1$ and $c_2$ have the values found in Exercise 9.

11. For any positive number r, the hypothesis $H_0$ in Exercise 9 will not be rejected if $c_1 < V/r < c_2$. The set of all values of r for which $H_0$ would not be rejected will form a confidence interval with confidence coefficient 0.95. But $c_1 < V/r < c_2$ if and only if $V/c_2 < r < V/c_1$. Therefore, the confidence interval will contain all values of r between $V/3.77 = 0.265V$ and $V/0.321 = 3.12V$.

12. If a random variable Z has the $\chi^2$ distribution with n degrees of freedom, Z can be represented as the sum of n independent and identically distributed random variables $Z_1, \ldots, Z_n$, each of which has a $\chi^2$ distribution with 1 degree of freedom. Therefore, $Z/n = \sum_{i=1}^{n} Z_i/n = \bar{Z}_n$. As $n \to \infty$, it follows from the law of large numbers that $\bar{Z}_n$ will converge in probability to the mean of each $Z_i$, which is 1. Therefore $Z/n \to 1$ in probability. It follows from Eq. (9.7.1) that if X has the F distribution with $m_0$ and n degrees of freedom, then as $n \to \infty$, the distribution of X will become the same as the distribution of $Y/m_0$.

13. Suppose that X has the F distribution with m and n degrees of freedom, and consider the representation of X in Eq. (9.7.1). Then $Y/m \to 1$ in probability. Therefore, as $m \to \infty$, the distribution of X will become the same as the distribution of $n/Z$, where Z has a $\chi^2$ distribution with n degrees of freedom. Suppose that c is the 0.05 quantile of the $\chi^2$ distribution with n degrees of freedom. Then $\Pr(n/Z < n/c) = 0.95$. Hence, $\Pr(X < n/c) = 0.95$, and the value $n/c$ should be entered in the column of the F distribution with $m = \infty$.

14. The test rejects the null hypothesis if the F statistic is greater than the 0.95 quantile of the F distribution with 15 and 9 degrees of freedom, which is 3.01. The power of the test when $\sigma_1^2 = 2\sigma_2^2$ is $1 - G_{15,9}(3.01/2) = 0.2724$. This can be computed using a computer program that evaluates the c.d.f. of an arbitrary F distribution.

15. The p-value will be the value of $\alpha_0$ such that the observed v is exactly equal to either $c_1$ or $c_2$. The problem is deciding whether $v = c_1$ or $v = c_2$, since we haven't constructed a specific test. Since $c_1$ and $c_2$ are assumed to be the $\alpha_0/2$ and $1 - \alpha_0/2$ quantiles of the F distribution with $m - 1$ and $n - 1$ degrees of freedom, we must have $c_1 < c_2$ and $G_{m-1,n-1}(c_1) < 1/2$ and $G_{m-1,n-1}(c_2) > 1/2$. These inequalities allow us to choose whether $v = c_1$ or $v = c_2$. Every $v > 0$ is some quantile of each F distribution, indeed the $G_{m-1,n-1}(v)$ quantile. If $G_{m-1,n-1}(v) < 1/2$, then $v = c_1$ and $\alpha_0 = 2G_{m-1,n-1}(v)$. If $G_{m-1,n-1}(v) > 1/2$, then $v = c_2$, and $\alpha_0 = 2[1 - G_{m-1,n-1}(v)]$. (There is 0 probability that $G_{m-1,n-1}(v) = 1/2$.) Hence, $\alpha_0$ is the smaller of the two numbers $2G_{m-1,n-1}(v)$ and $2[1 - G_{m-1,n-1}(v)]$.
In Example 9.7.4, $v = 0.9491$ was the observed value and $2G_{25,25}(0.9491) = 0.8971$, so this would be the p-value.

16. The denominator of the likelihood ratio is maximized when all parameters equal their M.L.E.'s. The numerator is maximized when $\sigma_1^2 = \sigma_2^2$. As in the text, the likelihood ratio then equals

\[
\Lambda(x, y) = d\,w^{m/2}(1 - w)^{n/2},
\]

where w and d are defined in the text. In particular, w is a strictly increasing function of the observed value of V. Notice that $\Lambda(x, y) \le k$ when $w \le k_1$ or $w \ge k_2$. This corresponds to $V \le c_1$ or $V \ge c_2$. In order for the test to have level $\alpha_0$, the values $c_1$ and $c_2$ have to satisfy $\Pr(V \le c_1) + \Pr(V \ge c_2) = \alpha_0$ when $\sigma_1^2 = \sigma_2^2$.

17. The test found in Exercise 9 uses the values $c_1 = 0.321$ and $c_2 = 3.77$. The likelihood ratio test rejects $H_0$ when $d\,w^8(1 - w)^5 \le k$, which is equivalent to $w^8(1 - w)^5 \le k/d$. If $V = v$, then $w = 15v/(15v + 9)$. In order for a test to be a likelihood ratio test, the two values $c_1$ and $c_2$ must lead to the same value of the likelihood ratio. In particular, we must have

\[
\left(\frac{15c_1}{15c_1 + 9}\right)^8 \left(1 - \frac{15c_1}{15c_1 + 9}\right)^5 = \left(\frac{15c_2}{15c_2 + 9}\right)^8 \left(1 - \frac{15c_2}{15c_2 + 9}\right)^5.
\]

Plugging the values of $c_1$ and $c_2$ from Exercise 9 into this formula we get $2.555 \times 10^{-5}$ on the left and $1.497 \times 10^{-5}$ on the right, so the test in Exercise 9 is not a likelihood ratio test.

18. Let $V^*$ be defined as in (9.7.5) so that $V^*$ has the F distribution with $m - 1$ and $n - 1$ degrees of freedom and $V = (\sigma_1^2/\sigma_2^2)V^*$. It is straightforward to compute

\[
\Pr(V \le c_1) = \Pr\left(\frac{\sigma_1^2}{\sigma_2^2} V^* \le c_1\right) = \Pr\left(V^* \le \frac{\sigma_2^2}{\sigma_1^2} c_1\right) = G_{m-1,n-1}\left(\frac{\sigma_2^2}{\sigma_1^2} c_1\right),
\]

and similarly,

\[
\Pr(V \ge c_2) = 1 - G_{m-1,n-1}\left(\frac{\sigma_2^2}{\sigma_1^2} c_2\right).
\]

19. (a) Apply the result of Exercise 18 with $c_1 = G_{10,20}^{-1}(0.025) = 0.2925$ and $c_2 = G_{10,20}^{-1}(0.975) = 2.774$ and $\sigma_2^2/\sigma_1^2 = 1/1.01$. The result is
\[
G_{10,20}(c_1/1.01) + 1 - G_{10,20}(c_2/1.01) = G_{10,20}(0.289625) + 1 - G_{10,20}(2.746209) = 0.0503.
\]

(b) Apply the result of Exercise 18 with $c_1 = G_{10,20}^{-1}(0.025) = 0.2925$ and $c_2 = G_{10,20}^{-1}(0.975) = 2.774$ and $\sigma_2^2/\sigma_1^2 = 1.01$. The result is

\[
G_{10,20}(1.01 \times c_1) + 1 - G_{10,20}(1.01 \times c_2) = G_{10,20}(0.2954475) + 1 - G_{10,20}(2.80148) = 0.0498.
\]

(c) Since the answer to part (b) is less than 0.05 (the value of the power function for all parameters in the null hypothesis set), the test is not unbiased.
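The F c.d.f. evaluations quoted in Exercises 14 and 19 of this section require software. As a sketch under stated assumptions, the Python code below evaluates the c.d.f. by Simpson's rule on the F density, which is adequate only when the numerator degrees of freedom exceed 2 so that the density is bounded near 0; it is not the method used by R's pf. The value $c_1 = 0.2925$ is the one implied by the products 0.289625 and 0.2954475 quoted in Exercise 19.

```python
import math


def f_pdf(x, d1, d2):
    """Density of the F distribution with d1 and d2 degrees of freedom."""
    if x <= 0.0:
        return 0.0
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_pdf = ((d1 / 2) * math.log(d1 / d2) + (d1 / 2 - 1) * math.log(x)
               - ((d1 + d2) / 2) * math.log1p(d1 * x / d2) - log_beta)
    return math.exp(log_pdf)


def f_cdf(x, d1, d2, steps=4000):
    """Simpson's rule on [0, x]; assumes d1 > 2 (bounded density at 0)."""
    if x <= 0.0:
        return 0.0
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3


# Exercise 14: power when the first variance is twice the second
power_ex14 = 1 - f_cdf(3.01 / 2, 15, 9)

# Exercise 19: power of the equal-tailed test just off the null hypothesis
c1, c2 = 0.2925, 2.774
part_a = f_cdf(c1 / 1.01, 10, 20) + 1 - f_cdf(c2 / 1.01, 10, 20)
part_b = f_cdf(1.01 * c1, 10, 20) + 1 - f_cdf(1.01 * c2, 10, 20)
print(power_ex14, part_a, part_b)
```

Part (b) falling below 0.05 while part (a) lies above it is the numerical face of the conclusion in part (c).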
9.8 Bayes Test Procedures


Commentary 


This section introduces Bayes tests for the situations described in the earlier sections of the chapter. It 
derives Bayes tests as solutions to decision problems in which the loss function takes only three values: 0 for a correct decision, and one positive value for each type of error. The cases of simple and one-sided hypotheses
are covered as are the various situations involving samples from normal distributions. The calculations are 
done with improper prior distributions so that attention can focus on the methodology and similarity with 
the non-Bayesian results. 


Solutions to Exercises 


1. In this exercise, $\xi_0 = 0.9$, $\xi_1 = 0.1$, $w_0 = 1000$, and $w_1 = 18{,}000$. Also,


and 


By the results of this section, it should be decided that the process is out of control if

\[
\frac{f_1(x)}{f_0(x)} > \frac{\xi_0 w_0}{\xi_1 w_1} = \frac{1}{2}.
\]
This inequality can be reduced to the inequality $2x - 102 > -\log 2$ or, equivalently, $x > 50.653$.
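The two numerical steps here, the cutoff $\xi_0 w_0/(\xi_1 w_1)$ and the solution of $2x - 102 > -\log 2$, can be checked in a few lines. This sketch takes the reduced inequality as given in the solution; the helper name is illustrative.

```python
import math


def likelihood_ratio_cutoff(xi0, xi1, w0, w1):
    """Bayes test cutoff for f1(x)/f0(x): xi0*w0 / (xi1*w1)."""
    return (xi0 * w0) / (xi1 * w1)


cutoff = likelihood_ratio_cutoff(0.9, 0.1, 1000, 18000)

# Solve 2x - 102 > log(cutoff) for x, where log(1/2) = -log 2
x_threshold = (102 + math.log(cutoff)) / 2
print(cutoff, x_threshold)
```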


2. In this exercise, $\xi_0 = 2/3$, $\xi_1 = 1/3$, $w_0 = 1$, and $w_1 = 4$. Therefore, by the results of this section, it should be decided that $f_0$ is correct if

\[
\frac{f_1(x)}{f_0(x)} < \frac{\xi_0 w_0}{\xi_1 w_1} = \frac{1}{2}.
\]

Since $f_1(x)/f_0(x) = 4x^3$, it should be decided that $f_0$ is correct if $4x^3 < 1/2$ or, equivalently, if $x < 1/2$.


3. In this exercise, $\xi_0 = 0.8$, $\xi_1 = 0.2$, $w_0 = 400$, and $w_1 = 2500$. Also, if we let $y = \sum_{i=1}^{n} x_i$, then

\[
f_0(X) = \frac{\exp(-3n)\,3^y}{\prod_{i=1}^{n} (x_i!)}
\]
and
\[
f_1(X) = \frac{\exp(-7n)\,7^y}{\prod_{i=1}^{n} (x_i!)}.
\]
By the results of this section, it should be decided that the failure was caused by a major defect if

\[
\frac{f_1(X)}{f_0(X)} = \exp(-4n)\left(\frac{7}{3}\right)^y > \frac{\xi_0 w_0}{\xi_1 w_1} = 0.64
\]
or, equivalently, if
\[
y > \frac{4n + \log(0.64)}{\log(7/3)}.
\]

4. In this exercise, &) = 1/4, &; = 3/4, and wo = w; = 1. Let 21,...,2,, denote the observed values in the 
nm 


sample, and let y = > x;. Then 


i=1 
fi A) = 0.3)" 0.7)" 
and 
AMX) = 04.6)". 
By the results of this section, Hp should be rejected if 


fi(X) — Sowo 1 


fo(X)” Gu 3 


But 


wey (5°) GG) 25 
if and only if 


i Lae ] Pscig z 
—+tn log= = 
Bee Ba 83 


or, equivalently, if and only if 


5. (a) In the notation of this section $\xi_0 = \Pr(\theta = \theta_0)$ and $f_i$ is the p.f. or p.d.f. of X given $\theta = \theta_i$. By the law of total probability, the marginal p.f. or p.d.f. of X is $\xi_0 f_0(x) + \xi_1 f_1(x)$. Applying Bayes' theorem for random variables gives us that

\[
\Pr(\theta = \theta_0 \mid x) = \frac{\xi_0 f_0(x)}{\xi_0 f_0(x) + \xi_1 f_1(x)}.
\]
(b) The posterior expected value of the loss given $X = x$ is
\[
\begin{cases}
\dfrac{w_0 \xi_0 f_0(x)}{\xi_0 f_0(x) + \xi_1 f_1(x)} & \text{if reject } H_0, \\[2ex]
\dfrac{w_1 \xi_1 f_1(x)}{\xi_0 f_0(x) + \xi_1 f_1(x)} & \text{if don't reject } H_0.
\end{cases}
\]

The tests $\delta$ that minimize $r(\delta)$ have the form

Don't Reject $H_0$ if $\xi_0 w_0 f_0(x) > \xi_1 w_1 f_1(x)$,
Reject $H_0$ if $\xi_0 w_0 f_0(x) < \xi_1 w_1 f_1(x)$,
Do either if $\xi_0 w_0 f_0(x) = \xi_1 w_1 f_1(x)$.

Notice that these tests choose the action that has smaller posterior expected loss. If neither action has smaller posterior expected loss, these tests can do either, but either would then minimize the posterior expected loss.

(c) The "reject $H_0$" condition in part (b) is $\xi_0 w_0 f_0(x) < \xi_1 w_1 f_1(x)$. This is equivalent to $w_0 \Pr(\theta = \theta_0 \mid x) < w_1[1 - \Pr(\theta = \theta_0 \mid x)]$. Simplifying this inequality yields $\Pr(\theta = \theta_0 \mid x) < w_1/(w_0 + w_1)$. Since we can do whatever we want when equality holds, and since "$H_0$ true" means $\theta = \theta_0$, we see that the test described in part (c) is one of the tests from part (b).

6. The proof is just as described in the hint. For example, (9.8.12) becomes


oo 680 
[F720 C) nes | 8) fea | 8) ~ Ine | 0) Snlers | W)]eoa. 


The steps after (9.8.12) are unchanged. 


7. We shall argue indirectly. Suppose that there is an x such that the p-value is not equal to the posterior probability that $H_0$ is true. First, suppose that the p-value is greater. Let $\alpha_0$ be greater than the posterior probability and less than the p-value. Then the test that rejects $H_0$ when $\Pr(H_0 \text{ true} \mid x) \le \alpha_0$ will reject $H_0$, but the level $\alpha_0$ test will not reject $H_0$ because the p-value is greater than $\alpha_0$. This contradicts the fact that the two tests are the same. The case in which the p-value is smaller is very similar.


8. (a) The joint p.d.f. of the data given the parameters is

\[
(2\pi)^{-(m+n)/2}\,\tau^{(m+n)/2} \exp\left(-\frac{\tau}{2}\left[\sum_{i=1}^{m} (x_i - \mu_1)^2 + \sum_{j=1}^{n} (y_j - \mu_2)^2\right]\right).
\]
Use the following two identities to complete the proof of this part:
\[
\begin{aligned}
\sum_{i=1}^{m} (x_i - \mu_1)^2 &= \sum_{i=1}^{m} (x_i - \bar{x}_m)^2 + m(\bar{x}_m - \mu_1)^2, \\
\sum_{j=1}^{n} (y_j - \mu_2)^2 &= \sum_{j=1}^{n} (y_j - \bar{y}_n)^2 + n(\bar{y}_n - \mu_2)^2.
\end{aligned}
\]

(b) The prior p.d.f. is just $1/\tau$.
i. As a function of $\mu_1$, the posterior p.d.f. is a constant times $\exp(-m\tau(\bar{x}_m - \mu_1)^2/2)$, which is just like the p.d.f. of a normal distribution with mean $\bar{x}_m$ and variance $1/(m\tau)$.
ii. The result for $\mu_2$ is similar to that for $\mu_1$.
iii. As a function of $(\mu_1, \mu_2)$, the posterior p.d.f. looks like a constant times
\[
\exp(-m\tau(\bar{x}_m - \mu_1)^2/2)\,\exp(-n\tau(\bar{y}_n - \mu_2)^2/2),
\]
which is like the product of the two normal p.d.f.'s from parts (i) and (ii). Hence, the conditional posterior distribution of $(\mu_1, \mu_2)$ given $\tau$ is that of two independent normal random variables with the two distributions from parts (i) and (ii).
iv. We can integrate $\mu_1$ and $\mu_2$ out of the joint p.d.f. Integrating $\exp(-m\tau(\bar{x}_m - \mu_1)^2/2)$ yields $(2\pi)^{1/2}(m\tau)^{-1/2}$. Integrating $\exp(-n\tau(\bar{y}_n - \mu_2)^2/2)$ yields $(2\pi)^{1/2}(n\tau)^{-1/2}$ also. So, the marginal posterior p.d.f. of $\tau$ is a constant times $\tau^{(m+n-4)/2} \exp(-0.5\tau(s_x^2 + s_y^2))$. This is the p.d.f. of a gamma distribution with parameters $(m + n - 2)/2$ and $(s_x^2 + s_y^2)/2$, except for a constant factor.

Since $\mu_1$ and $\mu_2$ are independent, conditional on $\tau$, we have that $\mu_1 - \mu_2$ has a normal distribution conditional on $\tau$ with mean equal to the difference of the means and variance equal to the sum of the variances. That is, $\mu_1 - \mu_2$ has a normal distribution with mean $\bar{x}_m - \bar{y}_n$ and variance $\tau^{-1}(1/m + 1/n)$ given $\tau$. If we subtract the mean and divide by the square-root of the variance, we get a standard normal distribution for the result, which is the Z stated in this part of the problem. Since the standard normal distribution is independent of $\tau$, then Z is independent of $\tau$ and has the standard normal distribution marginally.

Recall that $\tau$ has a gamma distribution with parameters $(m + n - 2)/2$ and $(s_x^2 + s_y^2)/2$. If we multiply $\tau$ by $s_x^2 + s_y^2$, the result W has a gamma distribution with the same first parameter but with the second parameter divided by $s_x^2 + s_y^2$, namely 1/2.

Since Z and W are independent with Z having the standard normal distribution and W having the $\chi^2$ distribution with $m + n - 2$ degrees of freedom, it follows from the definition of the t distribution that $Z/(W/[m + n - 2])^{1/2}$ has the t distribution with $m + n - 2$ degrees of freedom. It is easy to check that $Z/(W/[m + n - 2])^{1/2}$ is the same as (9.8.17).


The null hypothesis can be rewritten as $\tau_1\ge\tau_2$, where $\tau_i=1/\sigma_i^2$. This can be further rewritten as
$\tau_1/\tau_2\ge 1$. Using the usual improper prior for all parameters yields the posterior distribution of $\tau_1$
and $\tau_2$ to be that of independent gamma random variables with $\tau_1$ having parameters $(m-1)/2$
and $s_x^2/2$ while $\tau_2$ has parameters $(n-1)/2$ and $s_y^2/2$. Put another way, $\tau_1 s_x^2$ has the $\chi^2$ distribution
with $m-1$ degrees of freedom independent of $\tau_2 s_y^2$, which has the $\chi^2$ distribution with $n-1$ degrees
of freedom. This makes the distribution of
$$W=\frac{\tau_1 s_x^2/(m-1)}{\tau_2 s_y^2/(n-1)}$$
the $F$ distribution with $m-1$ and $n-1$ degrees of freedom. The posterior probability that $H_0$ is
true is
$$\Pr(\tau_1/\tau_2\ge 1)=\Pr\left(W\ge\frac{s_x^2/(m-1)}{s_y^2/(n-1)}\right)=1-G_{m-1,n-1}\left(\frac{s_x^2/(m-1)}{s_y^2/(n-1)}\right),$$
where $G_{m-1,n-1}$ is the c.d.f. of the $F$ distribution with $m-1$ and $n-1$ degrees of freedom.
The posterior probability is at most $\alpha_0$ if and only if
$$\frac{s_x^2/(m-1)}{s_y^2/(n-1)}\ge G^{-1}_{m-1,n-1}(1-\alpha_0).$$
This is exactly the form of the rejection region for the level $\alpha_0$ $F$ test of $H_0$.


This is a special case of Exercise 7. 


Using Theorem 9.8.2, the posterior distribution of
$$(26+26-2)^{1/2}\,\frac{\mu_1-\mu_2-[5.134-3.990]}{(1/26+1/26)^{1/2}(63.96+67.39)^{1/2}}=\frac{\mu_1-\mu_2-1.144}{0.4495}$$
is the $t$ distribution with 50 degrees of freedom.


We can compute 


$$\Pr(|\mu_1-\mu_2|\le d)=T_{50}\left(\frac{d-1.144}{0.4495}\right)-T_{50}\left(\frac{-d-1.144}{0.4495}\right).$$


Figure S.9.6: Figure for Exercise 10b of Sec. 9.8. (Vertical axis: Probability.)


A plot of this function is in Fig. S.9.6.
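The probability above can be evaluated numerically. The following is a minimal sketch, assuming that $T_{50}$ denotes the c.d.f. of the $t$ distribution with 50 degrees of freedom; the helper names `t_pdf`, `t_cdf`, and `posterior_prob_within` are hypothetical, and the c.d.f. is approximated by Simpson's rule rather than taken from a statistics library.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """C.d.f. by composite Simpson's rule on [0, |x|], using symmetry about 0."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area

def posterior_prob_within(d, center=1.144, scale=0.4495, df=50):
    """Pr(|mu1 - mu2| <= d) under the posterior described in the text."""
    return t_cdf((d - center) / scale, df) - t_cdf((-d - center) / scale, df)

for d in (0.5, 1.0, 1.5, 2.0):
    print(d, round(posterior_prob_within(d), 4))
```

Evaluating `posterior_prob_within` on a grid of `d` values gives the points of the curve that the figure plots.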


First, let $H_0: \theta\in\Omega'$ and $H_1: \theta\in\Omega''$. Then $\Omega_0=\Omega'$ and $\Omega_1=\Omega''$. Since $d_0$ is the decision that
$H_0$ is true, we have $d_0=d'$ and $d_1=d''$. Since $w_0$ is the cost of type I error, and type I error is
to choose $\theta\in\Omega''$ when $\theta\in\Omega'$, $w_0=w'$ and $w_1=w''$. It is straightforward to see that everything
switches for the other case.


The test procedure is to
$$\text{choose } d_1 \text{ if } \Pr(\theta\in\Omega_0|x)<\frac{w_1}{w_0+w_1}, \qquad (S.9.12)$$
$$\text{and choose either action if the two sides are equal.} \qquad (S.9.13)$$


In the first case, this translates to "choose $d''$ if $\Pr(\theta\in\Omega'|x)<w''/(w'+w'')$, and choose either
action if the two sides are equal." This is equivalent to "choose $d'$ if $\Pr(\theta\in\Omega'|x)>w''/(w'+w'')$,
and choose either action if the two sides are equal." This, in turn, is equivalent to "choose $d'$
if $\Pr(\theta\in\Omega''|x)<w'/(w'+w'')$, and choose either action if the two sides are equal." This
last statement, in the second case, translates to (S.9.12). Hence, the Bayes test produces the
same action ($d'$ or $d''$) regardless of which hypothesis you choose to call the null and which the
alternative.


9.9 Foundational Issues 


Commentary 


This section discusses some subtle issues that arise when the foundations of hypothesis testing are examined
closely. These issues are the relationship between sample size and the level of a test and the distinction
between statistical significance and practical significance. The term “statistical significance” is not introduced 
in the text until this section, hence instructors who do not wish to discuss this issue can avoid it altogether. 


Solutions to Exercises 


1. (a) When $\mu=0$, $X$ has the standard normal distribution. Therefore, $c=1.96$. Since $H_0$ should be
rejected if $|X|>c$, $H_0$ will be rejected when $X=2$.

(b) The first likelihood ratio is
$$\frac{f(X|\mu=0)}{f(X|\mu=5)}=\frac{\exp\left(-\frac{1}{2}X^2\right)}{\exp\left[-\frac{1}{2}(X-5)^2\right]}=\exp\left[\frac{1}{2}(25-10X)\right].$$
When $X=2$, this likelihood ratio has the value $\exp(5/2)=12.2$. Also,
$$\frac{f(X|\mu=0)}{f(X|\mu=-5)}=\frac{\exp\left(-\frac{1}{2}X^2\right)}{\exp\left[-\frac{1}{2}(X+5)^2\right]}=\exp\left[\frac{1}{2}(25+10X)\right].$$
When $X=2$, this likelihood ratio has the value $\exp(45/2)=5.9\times 10^{9}$.
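The two likelihood ratios can be checked directly. This is a small sketch, assuming a single normal observation with variance 1 as in the solution; the function name `likelihood_ratio` is hypothetical.

```python
import math

def likelihood_ratio(x, mu1):
    """f(x | mu = 0) / f(x | mu = mu1) for one normal observation with variance 1."""
    return math.exp(-0.5 * x ** 2 + 0.5 * (x - mu1) ** 2)

# The same observation X = 2 gives very different ratios against mu = 5 and mu = -5.
print(likelihood_ratio(2.0, 5.0))
print(likelihood_ratio(2.0, -5.0))
```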


2. When $\mu=0$, $100\bar{X}_n$ has the standard normal distribution. Therefore, $\Pr(100|\bar{X}_n|>1.96\,|\,\mu=0)=0.05$.
It follows that $c=1.96/100=0.0196$.


(a) When $\mu=0.01$, the random variable $Z=100(\bar{X}_n-0.01)$ has the standard normal distribution.
Therefore,
$$\begin{aligned}\Pr(|\bar{X}_n|<c\,|\,\mu=0.01)&=\Pr(-1.96<100\bar{X}_n<1.96\,|\,\mu=0.01)\\&=\Pr(-2.96<Z<0.96)\\&=0.8315-0.0015=0.8300.\end{aligned}$$
It follows that $\Pr(|\bar{X}_n|\ge c\,|\,\mu=0.01)=1-0.8300=0.1700$.


(b) When $\mu=0.02$, the random variable $Z=100(\bar{X}_n-0.02)$ has the standard normal distribution.
Therefore,
$$\begin{aligned}\Pr(|\bar{X}_n|<c\,|\,\mu=0.02)&=\Pr(-3.96<Z<-0.04)\\&=\Pr(0.04<Z<3.96)=1-0.5160.\end{aligned}$$
It follows that $\Pr(|\bar{X}_n|\ge c\,|\,\mu=0.02)=0.5160$.
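The two rejection probabilities can be reproduced with the standard normal c.d.f. The sketch below assumes, as in the solution, that $100(\bar{X}_n-\mu)$ is standard normal; the function name `rejection_probability` and its parameters are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def rejection_probability(mu, c=0.0196, scale=100.0):
    """Pr(|Xbar_n| >= c | mu), where scale * (Xbar_n - mu) is standard normal."""
    upper = scale * (c - mu)
    lower = scale * (-c - mu)
    return 1.0 - (Z.cdf(upper) - Z.cdf(lower))

for mu in (0.0, 0.01, 0.02):
    print(mu, round(rejection_probability(mu), 4))
```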


3. When $\mu=0$, $100\bar{X}_n$ has the standard normal distribution. The calculated value of $100\bar{X}_n$ is $100(0.03)=3$.
The corresponding tail area is $\Pr(100\bar{X}_n>3)=0.0013$.


4. (a) According to Theorem 9.2.1, we reject $H_0$ if
$$\frac{19}{(2\pi)^{n/2}}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}x_i^2\right)<\frac{1}{(2\pi)^{n/2}}\exp\left(-\frac{1}{2}\sum_{i=1}^{n}(x_i-0.5)^2\right).$$
This inequality is equivalent to
$$\frac{2\log(19)}{n}+\frac{1}{4}<\bar{X}_n.$$
That is, $c_n=2\log(19)/n+1/4$. For $n=1$, 100, 10000, the values of $c_n$ are 6.139, 0.3089, and
0.2506.


(b) The size of the test is
$$\Pr(\bar{X}_n\ge c_n\,|\,\theta=0)=1-\Phi(c_n\times n^{1/2}).$$
For $n=1$, 100, 10000, the sizes are $4.152\times 10^{-10}$, 0.001, and 0.
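The critical values and sizes in this exercise follow from the two formulas above. A minimal sketch, with hypothetical helper names `critical_value`, `upper_tail`, and `size`; the tail probability uses `math.erfc` so that very small values are not lost to rounding:

```python
import math

def critical_value(n):
    """c_n = 2 log(19) / n + 1/4."""
    return 2.0 * math.log(19.0) / n + 0.25

def upper_tail(z):
    """1 - Phi(z) for the standard normal c.d.f. Phi."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def size(n):
    """Size of the test: 1 - Phi(c_n * n^(1/2))."""
    return upper_tail(critical_value(n) * math.sqrt(n))

for n in (1, 100, 10000):
    print(n, critical_value(n), size(n))
```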


5. (a) We want to choose $c_n$ so that
$$19[1-\Phi(\sqrt{n}\,c_n)]=\Phi(\sqrt{n}[c_n-0.5]).$$
Solving this equation must be done numerically. For $n=1$, the equation is solved for $c_n=1.681$.
For $n=100$, we need $c_n=0.3021$. For $n=10000$, we need $c_n=0.25$ (both sides are essentially
0).
(b) The size of the test is $1-\Phi(c_n n^{1/2})$, which is 0.0464 for $n=1$, 0.00126 for $n=100$ and essentially
0 for $n=10000$.
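One way to carry out the numerical solution in part (a) is bisection, since the left side decreases and the right side increases in $c_n$. The sketch below is only an illustration of that idea; the function names and the search interval $[0, 5]$ are assumptions, not taken from the text.

```python
import math

def upper_tail(z):
    """1 - Phi(z), accurate far into the tail."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def phi(z):
    """Standard normal c.d.f. Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def gap(c, n):
    """Left side minus right side of the equation; decreasing in c."""
    root_n = math.sqrt(n)
    return 19.0 * upper_tail(root_n * c) - phi(root_n * (c - 0.5))

def solve_critical_value(n, lo=0.0, hi=5.0, iterations=200):
    """Bisection on [lo, hi], where gap is positive at lo and negative at hi."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if gap(mid, n) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for n in (1, 100, 10000):
    c = solve_critical_value(n)
    print(n, c, upper_tail(c * math.sqrt(n)))  # critical value and size
```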
9.10 Supplementary Exercises


Solutions to Exercises 
1. According to Theorem 9.2.1, we want to reject $H_0$ when
$$(1/2)^3<(3/4)^x(1/4)^{3-x}.$$

We don't reject $H_0$ when the reverse inequality holds, and we can do either if equality holds. The
inequality above can be simplified to $x>\log(8)/\log(3)=1.892$. That is, we reject $H_0$ if $X$ is 2 or 3,
and we don't reject $H_0$ if $X$ is 0 or 1. The probability of type I error is $3(1/2)^3+(1/2)^3=1/2$ and the
probability of type II error is $(1/4)^3+3(1/4)^2(3/4)=5/32$.
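These error probabilities can be verified by enumeration. The sketch assumes, as the formulas above suggest, that $X$ is binomial with 3 trials, with success probability 1/2 under $H_0$ and 3/4 under $H_1$; the variable names are hypothetical.

```python
from fractions import Fraction
from math import comb

def binom_pmf(x, n, p):
    """Binomial probability Pr(X = x) with n trials and success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n = 3
p0, p1 = Fraction(1, 2), Fraction(3, 4)

# Reject H0 at the points where the H0 likelihood is smaller than the H1 likelihood.
reject = [x for x in range(n + 1) if binom_pmf(x, n, p0) < binom_pmf(x, n, p1)]
type_I = sum(binom_pmf(x, n, p0) for x in reject)
type_II = sum(binom_pmf(x, n, p1) for x in range(n + 1) if x not in reject)
print(reject, type_I, type_II)
```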

2. The probability of an error of type 1 is
$$\alpha=\Pr(\text{Rej. } H_0\,|\,H_0)=\Pr(X\le 5\,|\,\theta=0.1)=1-(.9)^5=.41.$$
The probability of an error of type 2 is
$$\beta=\Pr(\text{Acc. } H_0\,|\,H_1)=\Pr(X\ge 6\,|\,\theta=0.2)=(.8)^5=.33.$$
3. It follows from Sec. 9.2 that the Bayes test procedure rejects $H_0$ when $f_1(x)/f_0(x)>1$. In this problem,
$$f_1(x)=(.8)^{x-1}(.2)\quad\text{for } x=1,2,\ldots,$$
and
$$f_0(x)=(.9)^{x-1}(.1)\quad\text{for } x=1,2,\ldots.$$
Hence, $H_0$ should be rejected when $2(8/9)^{x-1}>1$ or $x-1<5.885$. Thus, $H_0$ should be rejected for
$X\le 6$.


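4. It follows from Theorem 9.2.1 that the desired test will reject $H_0$ if
$$\frac{f_1(x)}{f_0(x)}=\frac{f(x|\theta=0)}{f(x|\theta=2)}>\frac{1}{2}.$$
In this exercise, the ratio on the left side reduces to $x/(1-x)$. Hence, the test specifies rejecting
$H_0$ if $x>1/3$. For this test,
$$\alpha(\delta)=\Pr(X>1/3\,|\,\theta=2)=4/9\quad\text{and}\quad\beta(\delta)=\Pr(X<1/3\,|\,\theta=0)=1/9.$$
Hence, $\alpha(\delta)+2\beta(\delta)=2/3$.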
5. It follows from the previous exercise and the Neyman-Pearson lemma that the optimal procedure $\delta$
specifies rejecting $H_0$ when $x/(1-x)>k'$ or, equivalently, when $x>k$. The constant $k$ must be chosen
so that
$$\alpha=\Pr(X>k\,|\,\theta=2)=\int_k^1 f(x|\theta=2)\,dx=(1-k)^2.$$
Hence, $k=1-\alpha^{1/2}$ and
$$\beta(\delta)=\Pr(X<k\,|\,\theta=0)=k^2=(1-\alpha^{1/2})^2.$$

6. (a) The power function is given by
$$\pi(\theta|\delta)=\Pr(X\ge 0.9\,|\,\theta)=\int_{0.9}^{1}f(x|\theta)\,dx=.19-.09\theta.$$

(b) The size of $\delta$ is
$$\sup_{\theta\ge 1}\pi(\theta|\delta)=.10.$$


7. A direct calculation shows that for $\theta_1<\theta_2$,
$$\frac{d}{dx}\left[\frac{f(x|\theta_2)}{f(x|\theta_1)}\right]=\frac{2(\theta_1-\theta_2)}{[2(1-\theta_1)x+\theta_1]^2}<0.$$
Hence, the ratio $f(x|\theta_2)/f(x|\theta_1)$ is a decreasing function of $x$ or, equivalently, an increasing function
of $r(x)=-x$. It follows from Theorem 9.3.1 that a UMP test of the given hypotheses will reject $H_0$
when $r(X)\ge c$ or, equivalently, when $X\le k$. Hence, $k$ must be chosen so that
$$.05=\Pr\left(X\le k\,\Big|\,\theta=\frac{1}{2}\right)=\int_0^k f\left(x\,\Big|\,\theta=\frac{1}{2}\right)dx=\frac{1}{2}(k^2+k),\quad\text{or}\quad k=\frac{1}{2}\left(\sqrt{1.4}-1\right).$$
0 


8. Suppose that the proportions of red, brown, and blue chips are $p_1$, $p_2$, and $p_3$, respectively. It follows
from the multinomial distribution that the probability of obtaining exactly one chip of each color is
$$\frac{3!}{1!\,1!\,1!}\,p_1p_2p_3=6p_1p_2p_3.$$
Hence, $\Pr(\text{Rej. } H_0\,|\,p_1,p_2,p_3)=1-6p_1p_2p_3$.
(a) The size of the test is
$$\alpha=\Pr\left(\text{Rej. } H_0\,\Big|\,\frac{1}{3},\frac{1}{3},\frac{1}{3}\right)=\frac{7}{9}.$$

(b) The power under the given alternative distribution is $\Pr(\text{Rej. } H_0\,|\,1/7,2/7,4/7)=295/343=.860$.
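Both values follow from the rejection probability $1-6p_1p_2p_3$ and can be checked with exact rational arithmetic. A short sketch; the function name `rejection_probability` is hypothetical.

```python
from fractions import Fraction

def rejection_probability(p1, p2, p3):
    """1 - Pr(exactly one chip of each color among three chips)."""
    return 1 - 6 * p1 * p2 * p3

third = Fraction(1, 3)
size = rejection_probability(third, third, third)
power = rejection_probability(Fraction(1, 7), Fraction(2, 7), Fraction(4, 7))
print(size, power, float(power))
```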


9. Let $f_i(x)$ denote the p.d.f. of $X$ under the hypothesis $H_i$ ($i=0,1$). Then
$$\frac{f_1(x)}{f_0(x)}=\begin{cases}\infty & \text{for } x\le 0 \text{ or } x\ge 1,\\ \phi(x) & \text{for } 0<x<1,\end{cases}$$
where $\phi(x)$ is the standard normal p.d.f. The most powerful test $\delta$ of size 0.01 rejects $H_0$ when
$f_1(x)/f_0(x)>k$. Since $\phi(x)$ is strictly decreasing for $0<x<1$, it follows that $\delta$ will reject $H_0$ if
$X\le 0$, $X\ge 1$, or $0<X<c$, where $c$ is chosen so that $\Pr(0<X<c\,|\,H_0)=.01$. Since $X$ has a
uniform distribution under $H_0$, $c=.01$. Thus, $\delta$ specifies rejecting $H_0$ if $X\le .01$ or $X\ge 1$. The power
of $\delta$ under $H_1$ is
$$\Pr(X\le .01\,|\,H_1)+\Pr(X\ge 1\,|\,H_1)=\Phi(.01)+[1-\Phi(1)]=.5040+.1587=.6627.$$


10. The usual $t$ statistic $U$ is defined by Eq. (9.5.2) with $n=12$ and $\mu_0=3$. Because the one-sided
hypotheses $H_0$ and $H_1$ are reversed from those in (9.5.1), we now want to reject $H_0$ if $U\le c$. If
$\mu=3$, then $U$ has the $t$ distribution with 11 degrees of freedom. Under these conditions, we want
$\Pr(U\le c)=0.005$ or equivalently, by the symmetry of the $t$ distribution, $\Pr(U\le -c)=0.995$. It is
found from the table of the $t$ distribution that $-c=3.106$. Hence, $H_0$ should be rejected if $U\le -3.106$.


11. It is known from Example 4 that the UMP test rejects $H_0$ if $\bar{X}_n\ge c$. Hence, $c$ must be chosen so that
$$0.95=\Pr(\bar{X}_n\ge c\,|\,\theta=1)=\Pr[Z\ge\sqrt{n}(c-1)],$$
where $Z$ has the standard normal distribution. Hence, $\sqrt{n}(c-1)=-1.645$, and $c=1-(1.645)/n^{1/2}$.
Since the power function of this test will be a strictly increasing function of $\theta$, the size of the test will
be
$$\alpha=\sup_{\theta\le 0}\Pr(\text{Rej. } H_0\,|\,\theta)=\Pr(\text{Rej. } H_0\,|\,\theta=0)=\Pr\left(\bar{X}_n\ge 1-\frac{1.645}{n^{1/2}}\,\Big|\,\theta=0\right)=\Pr(Z\ge n^{1/2}-1.645),$$
where $Z$ again has the standard normal distribution. When $n=16$,
$$\alpha=\Pr(Z\ge 2.355)=.0093.$$


12. For $\theta_1<\theta_2$,
$$\frac{f_n(x|\theta_2)}{f_n(x|\theta_1)}=\left(\frac{\theta_2}{\theta_1}\right)^{8}\left(\prod_{i=1}^{8}x_i\right)^{\theta_2-\theta_1},$$
which is an increasing function of $T=\prod_{i=1}^{8}x_i$. Hence, the UMP test specifies rejecting $H_0$ when $T\ge c$
or, equivalently, when $-2\sum_{i=1}^{8}\log X_i\le k$. The reason for expressing the test in this final form is that
when $\theta=1$, the observations $X_1,\ldots,X_8$ are i.i.d. and each has a uniform distribution on the interval
(0,1). Under these conditions, $-2\sum_{i=1}^{8}\log X_i$ has a $\chi^2$ distribution with $2n=16$ degrees of freedom (see
Exercise 7 of Sec. 8.2 and Exercise 5 of Sec. 8.9). Hence, in accordance with Theorem 9.3.1, $H_0$ should
be rejected if $-2\sum_{i=1}^{8}\log X_i\le 7.962$, which is the 0.05 quantile of the $\chi^2$ distribution with 16 degrees of
freedom, or equivalently if $\sum_{i=1}^{8}\log x_i\ge -3.981$.

i=1 


13. The $\chi^2$ distribution with $\theta$ degrees of freedom is a gamma distribution with parameters $\alpha=\theta/2$ and $\beta=1/2$.
Hence, it follows from Exercise 3 of Sec. 9.3 that the joint p.d.f. of $X_1,\ldots,X_n$ has a monotone
likelihood ratio in the statistic $T=\prod_{i=1}^{n}X_i$. Hence, there is a UMP test of the given hypotheses, and it
specifies rejecting $H_0$ when $T\ge c$ or, equivalently, when $\log T=\sum_{i=1}^{n}\log X_i\ge k$.


14. Let $\bar{X}_1$ be the average of the four observations $X_1,\ldots,X_4$ and let $\bar{X}_2$ be the average of the six
observations $X_5,\ldots,X_{10}$. Let $S_1^2=\sum_{i=1}^{4}(X_i-\bar{X}_1)^2$ and $S_2^2=\sum_{i=5}^{10}(X_i-\bar{X}_2)^2$. Then $S_1^2/\sigma^2$ and $S_2^2/\sigma^2$ have
independent $\chi^2$ distributions with 3 and 5 degrees of freedom, respectively. Hence, $(5S_1^2)/(3S_2^2)$ has the
desired $F$ distribution.


15. It was shown in Sec. 9.7 that the $F$ test rejects $H_0$ if $V\ge 2.20$, where $V$ is given by (9.7.4) and 2.20 is
the 0.95 quantile of the $F$ distribution with 15 and 20 degrees of freedom. For any values of $\sigma_1^2$ and $\sigma_2^2$,
the random variable $V^*$ given by (9.7.5) has the $F$ distribution with 15 and 20 degrees of freedom.
When $\sigma_1^2=2\sigma_2^2$, $V^*=V/2$. Hence, the power when $\sigma_1^2=2\sigma_2^2$ is
$$P^*(\text{Rej. } H_0)=P^*(V\ge 2.20)=P^*\left(\frac{1}{2}V\ge 1.10\right)=\Pr(V^*\ge 1.10),$$
where $P^*$ denotes a probability calculated under the assumption that $\sigma_1^2=2\sigma_2^2$.


16. The ratio $V=S_X^2/S_Y^2$ has the $F$ distribution with 8 and 8 degrees of freedom, and so does $1/V=S_Y^2/S_X^2$.
Thus,
$$.05=\Pr(T>c)=\Pr(V>c)+\Pr(1/V>c)=2\Pr(V>c).$$
It follows that $c$ must be the .975 quantile of the distribution of $V$, which is found from the tables to
be 4.43.


17. (a) Carrying out a test of size $\alpha$ on repeated independent samples is like performing a sequence of
Bernoulli trials on each of which the probability of success is $\alpha$. With probability 1, a success
will ultimately be obtained. Thus, sooner or later, $H_0$ will ultimately be rejected. Therefore, the
overall size of the test is 1.

(b) As we know from the geometric distribution, the expected number of samples, or trials, until a success
is obtained is $1/\alpha$.


18. If $U$ is defined as in Eq. (8.6.9), then the prior distribution of $U$ is the $t$ distribution with $2\alpha_0=2$
degrees of freedom. Since the $t$ distribution is symmetric with respect to the origin, it follows that
under the prior distribution, $\Pr(H_0)=\Pr(\mu\le 3)=\Pr(U\le 0)=1/2$. It follows from (8.6.1) and
(8.6.2) that under the posterior distribution,
$$\mu_1=\frac{3+(17)(3.2)}{1+17}=3.189,\qquad\lambda_1=18,$$
$$\alpha_1=1+\frac{17}{2}=9.5,$$
$$\beta_1=1+\frac{1}{2}(17)+\frac{(17)(.04)}{2(18)}=9.519.$$
If we now define $Y$ to be the random variable in Eq. (8.6.12), then $Y=(4.24)(\mu-3.19)$ and $Y$ has the
$t$ distribution with $2\alpha_1=19$ degrees of freedom. Thus, under the posterior distribution,
$$\Pr(H_0)=\Pr(\mu\le 3)=\Pr[Y\le(4.24)(3-3.19)]=\Pr(Y\le -.81)=\Pr(Y\ge .81).$$
It is found from the table of the $t$ distribution with 19 degrees of freedom that this probability is
approximately 0.21.


19. At each point $\theta\in\Omega_1$, $\pi(\theta|\delta)$ must be at least as large as it is at any point in $\Omega_0$, because $\delta$ is unbiased.
But $\sup_{\theta\in\Omega_0}\pi(\theta|\delta)=\alpha$, so $\pi(\theta|\delta)\ge\alpha$ at every point $\theta\in\Omega_1$.


20. Since $\delta$ is unbiased and has size $\alpha$, it follows from the previous exercise that $\pi(\theta|\delta)\le\alpha$ for all $\theta$ inside
the circle $A$ and $\pi(\theta|\delta)\ge\alpha$ for all $\theta$ outside $A$. Since $\pi(\theta|\delta)$ is continuous, it must therefore be equal
to $\alpha$ everywhere on the boundary of $A$. Note that this result is true regardless of whether all or any
part of the boundary belongs to $H_0$ or $H_1$.


21. Since $H_0$ is simple and $\delta$ has size $\alpha$, then $\pi(\theta_0|\delta)=\alpha$. Since $\delta$ is unbiased, $\pi(\theta|\delta)\ge\alpha$ for all other
values of $\theta$. Therefore, $\pi(\theta|\delta)$ is a minimum at $\theta=\theta_0$. Since $\pi$ is assumed to be differentiable, it
follows that $\pi'(\theta_0|\delta)=0$.


22. (a) We want $\Pr(X\ge c_1\,|\,H_0)=\Pr(Y\ge c_2\,|\,H_0)=.05$. Under $H_0$, both $X$ and $Y/10$ are standard
normal. Therefore, $c_1=1.645$ and $c_2=16.45$.

(b) The most powerful test of size $\alpha_0$, conditional on observing $X$ with a variance of $\sigma^2$, is to reject
$H_0$ if $X\ge\sigma\Phi^{-1}(1-\alpha_0)$. In this problem we are asked to find two such tests: one with $\sigma=1$ and
$\alpha_0=2.0\times 10^{-7}$ and the other with $\sigma=10$ and $\alpha_0=0.0999998$. The resulting critical values are
$$\Phi^{-1}(1-2.0\times 10^{-7})=5.069,$$
$$10\,\Phi^{-1}(1-0.0999998)=12.8155.$$

(c) The overall size of a test in this problem is the average of the two conditional sizes, since the
two types of meteorological conditions have probability 1/2 each. In part (a), the two conditional
sizes are both 0.05, so that is the average as well. In part (b), the average of the two sizes is
$(2.0\times 10^{-7}+0.0999998)/2=0.05$ also. The powers are also the averages of the two conditional
powers. The power of the conditional size $\alpha_0$ test with variance $\sigma^2$ is
$$1-\Phi(\sigma\Phi^{-1}(1-\alpha_0)-10).$$
The results are tabulated below:

Part   Good        Poor       Average
(a)    1           0          0.5
(b)    0.9999996   0.002435   0.5012


23. (a) The data consist of both $X$ and $Y$, where $X$ is defined in Exercise 22 and $Y=1$ if meteorological
conditions are poor and $Y=0$ if not. The joint p.f./p.d.f. of $(X,Y)$ given $\Theta=\theta$ is
$$\frac{1}{2(2\pi)^{1/2}10^{y}}\exp\left(-\frac{1-y}{2}[x-\theta]^2-\frac{y}{200}[x-\theta]^2\right).$$
The Bayes test will choose $H_0$ when
$$w_0\xi_0\,\frac{1}{2(2\pi)^{1/2}10^{y}}\exp\left(-\frac{1-y}{2}x^2-\frac{y}{200}x^2\right)>w_1\xi_1\,\frac{1}{2(2\pi)^{1/2}10^{y}}\exp\left(-\frac{1-y}{2}[x-10]^2-\frac{y}{200}[x-10]^2\right).$$
It will choose $H_1$ when the reverse inequality holds, and it can do either when equality holds. This
inequality can be rewritten by splitting according to the value of $y$. That is, choose $H_0$ if
$$x<5+\log(w_0\xi_0/(w_1\xi_1))/10\quad\text{if } y=0,$$
$$x<5+10\log(w_0\xi_0/(w_1\xi_1))\quad\text{if } y=1.$$

(b) In order for a test to be of the form of part (a), the two critical values $c_0$ and $c_1$ used for $y=0$ and
$y=1$ respectively must satisfy $c_1-5=100(c_0-5)$. In part (a) of Exercise 22, the two critical
values are $c_0=1.645$ and $c_1=16.45$. These do not even approximately satisfy $c_1-5=100(c_0-5)$.
In part (b) of Exercise 22, the two critical values are $c_0=5.069$ and $c_1=12.8155$. These
approximately satisfy $c_1-5=100(c_0-5)$.


24. The Poisson distribution has M.L.R. in $Y$, so rejecting $H_0$ when $Y\le c$ is a UMP test of its size.
With $c=0$, the size is $\Pr(Y=0\,|\,\theta=1)=\exp(-n)$.


The power function of the test is $\Pr(Y=0\,|\,\theta)=\exp(-n\theta)$.


25. Let $I$ be the random interval that corresponds to the UMP test, and let $J$ be a random interval that
corresponds to some other level $\alpha_0$ test. Translating UMP into what it says about the random interval
$I$ compared to $J$, we have for all $\theta>c$
$$\Pr(c\in I\,|\,\theta)\le\Pr(c\in J\,|\,\theta).$$
In other words, the observed value of $I$ is a uniformly most accurate coefficient $1-\alpha_0$ confidence interval
if, for every random interval $J$ such that the observed value of $J$ is a coefficient $1-\alpha_0$ confidence interval
and for all $\theta_2>\theta_1$,
$$\Pr(\theta_1\in I\,|\,\theta=\theta_2)\le\Pr(\theta_1\in J\,|\,\theta=\theta_2).$$


Chapter 10 


Categorical Data and Nonparametric 
Methods 


10.1 Tests of Goodness-of-Fit 


Commentary 


This section ends with a discussion of some issues related to the meaning of the $\chi^2$ goodness-of-fit test for
readers who want a deeper understanding of the procedure.


Solutions to Exercises 


1. Let $Y=N_1$, the number of defective items, and let $\theta=p_1$, the probability that each item is defective.
The level $\alpha_0$ test requires us to choose $c_1$ and $c_2$ such that $\Pr(Y\le c_1\,|\,\theta=0.1)+\Pr(Y\ge c_2\,|\,\theta=0.1)$
is close to $\alpha_0$. We can compute the probability that $Y=y$ for each $y=0,\ldots,100$ and arrange the
numbers from smallest to largest. The smallest values correspond to large values of $y$ down to $y=25$,
then some values corresponding to small values of $y$ start to appear in the list. The sum of the values
reaches 0.0636 when $c_1=4$ and $c_2=16$. So $\alpha_0=0.0636$ is the smallest $\alpha_0$ for which we would reject
$H_0:\theta=0.1$ using such a test.
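The sum of the two tail probabilities can be computed directly. The sketch below assumes, as the solution indicates, that $Y$ has the binomial distribution with parameters 100 and 0.1 under $H_0$; the function names are hypothetical.

```python
from math import comb

def binom_pmf(y, n=100, p=0.1):
    """Pr(Y = y) for Y binomial with parameters n and p."""
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

def two_tail_probability(c1, c2, n=100, p=0.1):
    """Pr(Y <= c1) + Pr(Y >= c2)."""
    lower = sum(binom_pmf(y, n, p) for y in range(0, c1 + 1))
    upper = sum(binom_pmf(y, n, p) for y in range(c2, n + 1))
    return lower + upper

print(two_tail_probability(4, 16))
```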


l| 
3|[x 
Fs 
M= 
x 
| 
bo 
x| = 
+ 
oe 
See 
l| 
a, 
S| 
2 
n—_” 
| 
3. We obtain the following frequencies:


$i$      0   1   2   3   4   5   6   7   8   9
$N_i$   25  16  19  20  20  22  24  15  14  25


Since $p_i^0=1/10$ for every value of $i$, and $n=200$, we find from Eq. (10.1.2) that $Q=7.4$. If $Q$ has
the $\chi^2$ distribution with 9 degrees of freedom, $\Pr(Q\ge 7.4)=0.6$.
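The value of $Q$ can be reproduced from the frequency table. This sketch assumes that Eq. (10.1.2) is the usual statistic $\sum_i (N_i-np_i^0)^2/(np_i^0)$; the function name `chi_square_statistic` is hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Sum over cells of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

counts = [25, 16, 19, 20, 20, 22, 24, 15, 14, 25]  # N_i for the digits 0..9
n = sum(counts)
expected = [n / 10] * 10                            # n * p_i^0 with p_i^0 = 1/10
Q = chi_square_statistic(counts, expected)
print(n, Q)
```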
4. We obtain the following table:


$np_i^0$   6   12   6


It is found from Eq. (10.1.2) that $Q=11/3$. If $Q$ has a $\chi^2$ distribution with 2 degrees of freedom, then
the value of $\Pr(Q\ge 11/3)$ is between 0.1 and 0.2.


5. (a) The number of successes is $n\bar{X}_n$ and the number of failures is $n(1-\bar{X}_n)$. Therefore,
$$\begin{aligned}Q&=\frac{(n\bar{X}_n-np_0)^2}{np_0}+\frac{[n(1-\bar{X}_n)-n(1-p_0)]^2}{n(1-p_0)}\\&=n(\bar{X}_n-p_0)^2\left(\frac{1}{p_0}+\frac{1}{1-p_0}\right)\\&=\frac{n(\bar{X}_n-p_0)^2}{p_0(1-p_0)}.\end{aligned}$$
(b) If $p=p_0$, then $E(\bar{X}_n)=p_0$ and $\mathrm{Var}(\bar{X}_n)=p_0(1-p_0)/n$. Therefore, by the central limit theorem,
the c.d.f. of
$$Z=\frac{\bar{X}_n-p_0}{[p_0(1-p_0)/n]^{1/2}}$$
converges to the c.d.f. of the standard normal distribution. Since $Q=Z^2$, the c.d.f. of $Q$ will
converge to the c.d.f. of the $\chi^2$ distribution with 1 degree of freedom.


6. Here, $p_0=0.3$, $n=50$, and $\bar{X}_n=21/50$. By Exercise 5, $Q=3.44$. If $Q$ has a $\chi^2$ distribution with 1
degree of freedom, then $\Pr(Q\ge 3.4)$ is slightly greater than 0.05.


7. We obtain the following table: 


           0 < x < 0.2   0.2 < x < 0.5   0.5 < x < 0.8   0.8 < x < 1
$N_i$          391            490             580            339
$np_i^0$       360            540             540            360


If $Q$ has a $\chi^2$ distribution with 3 degrees of freedom, then $\Pr(Q\ge 11.34)=0.01$. Therefore, we should
reject $H_0$ if $Q\ge 11.34$. It is found from Eq. (10.1.2) that $Q=11.5$.


8. If $Z$ denotes a random variable having a standard normal distribution and $X$ denotes the height of a
man selected at random from the city, then
$$\begin{aligned}\Pr(X<66)&=\Pr(Z<-2)=0.0227,\\\Pr(66<X<67.5)&=\Pr(-2<Z<-0.5)=0.2858,\\\Pr(67.5<X<68.5)&=\Pr(-0.5<Z<0.5)=0.3830,\\\Pr(68.5<X<70)&=\Pr(0.5<Z<2)=0.2858,\\\Pr(X>70)&=\Pr(Z>2)=0.0227.\end{aligned}$$
Therefore, we obtain the following table:

                      $N_i$   $np_i^0$
x < 66                 18     11.35
66 < x < 67.5         177     142.9
67.5 < x < 68.5       198     191.5
68.5 < x < 70         102     142.9
x > 70                  5     11.35


It is found from Eq. (10.1.2) that $Q=27.5$. If $Q$ has a $\chi^2$ distribution with 4 degrees of freedom, then
$\Pr(Q\ge 27.5)$ is much less than 0.005.
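The expected counts and $Q$ can be recomputed from the normal c.d.f. The sketch assumes a hypothesized height distribution with mean 68 and standard deviation 1, which is inferred from the standardized cut points above rather than stated in this solution; the variable names are hypothetical. Because the c.d.f. is not rounded to four places here, the expected counts differ slightly from the table.

```python
from statistics import NormalDist

heights = NormalDist(mu=68.0, sigma=1.0)  # assumption inferred from the z-scores
cuts = [66.0, 67.5, 68.5, 70.0]
observed = [18, 177, 198, 102, 5]
n = sum(observed)

# Cell probabilities are differences of the c.d.f. at consecutive cut points.
cdf_values = [0.0] + [heights.cdf(c) for c in cuts] + [1.0]
probs = [b - a for a, b in zip(cdf_values, cdf_values[1:])]
expected = [n * p for p in probs]
Q = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print([round(e, 2) for e in expected], round(Q, 2))
```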


9. (a) The five intervals, each of which has probability 0.2, are as follows:
$(-\infty,-0.842)$, $(-0.842,-0.253)$, $(-0.253,0.253)$, $(0.253,0.842)$, $(0.842,\infty)$.


We obtain the following table:

                          $N_i$   $np_i^0$
−∞ < x < −0.842            15      10
−0.842 < x < −0.253        10      10
−0.253 < x < 0.253          7      10
0.253 < x < 0.842          12      10
0.842 < x < ∞               6      10


The calculated value of $Q$ is 5.4. If $Q$ has a $\chi^2$ distribution with 4 degrees of freedom, then
$\Pr(Q\ge 5.4)=0.25$.


(b) The ten intervals, each of which has probability 0.1 and expected count $np_i^0=5$, are as given in the following table:

−∞ < x < −1.282
−1.282 < x < −0.842
−0.842 < x < −0.524
−0.524 < x < −0.253
−0.253 < x < 0
0 < x < 0.253
0.253 < x < 0.524
0.524 < x < 0.842
0.842 < x < 1.282
1.282 < x < ∞


The calculated value of $Q$ is 8.8. If $Q$ has the $\chi^2$ distribution with 9 degrees of freedom, then the
value of $\Pr(Q\ge 8.8)$ is between 0.4 and 0.5.


10.2 Goodness-of-Fit for Composite Hypotheses 


Commentary 


The maximization of the log-likelihood in Eq. (10.2.5) could be performed numerically if one had appropriate 
software. The R functions optim and nlm can be used as described in the Commentary to Sec. 7.6 in this 
manual. 
Solutions to Exercises.


1. There are many ways to perform a $\chi^2$ test. For example, we could divide the real numbers into the
intervals $(-\infty,15]$, $(15,30]$, $(30,45]$, $(45,60]$, $(60,75]$, $(75,90]$, $(90,\infty)$. The numbers of observations in
these intervals are 14, 14, 4, 4, 3, 0, 2.


(a) The M.L.E.'s of the parameters of a normal distribution are $\hat{\mu}=30.05$ and $\hat{\sigma}^2=537.51$. Using
the method of Chernoff and Lehmann, we compute two different p-values with 6 and 4 degrees
of freedom. The probabilities for the seven intervals are 0.2581, 0.2410, 0.2413, 0.1613, 0.0719,
0.0214, 0.0049. The expected counts are 41 times each of these numbers. This makes $Q=24.53$.
The two p-values are both smaller than 0.0005.

(b) The M.L.E.'s of the parameters of a lognormal distribution are $\hat{\mu}=3.153$ and $\hat{\sigma}^2=0.48111$. Using
the method of Chernoff and Lehmann, we compute two different p-values with 6 and 4 degrees
of freedom. The probabilities for the seven intervals are 0.2606, 0.3791, 0.1872, 0.0856, 0.0407,
0.0205, 0.0261. The expected counts are 41 times each of these numbers. This makes $Q=5.714$.
The two p-values are both larger than 0.2.


2. First, we must find the M.L.E. of $\theta$. From Eq. (10.2.5), ignoring the multinomial coefficient,
$$L(\theta)=\prod_{i=0}^{4}\left[\binom{4}{i}\theta^{i}(1-\theta)^{4-i}\right]^{N_i}=C\,\theta^{N_1+2N_2+3N_3+4N_4}(1-\theta)^{4N_0+3N_1+2N_2+N_3},\quad\text{where } C=4^{N_1}6^{N_2}4^{N_3}.$$
Therefore,
$$\log L(\theta)=\log C+(N_1+2N_2+3N_3+4N_4)\log\theta+(4N_0+3N_1+2N_2+N_3)\log(1-\theta).$$
By solving the equation $\partial\log L(\theta)/\partial\theta=0$, we obtain the result
$$\hat{\theta}=\frac{N_1+2N_2+3N_3+4N_4}{4(N_0+N_1+N_2+N_3+N_4)}=\frac{N_1+2N_2+3N_3+4N_4}{4n}.$$
It is found that $\hat{\theta}=0.4$. Therefore, we obtain the following table:

No. of
Games    $N_i$   $n\pi_i(\hat{\theta})$
0         33     25.92
1         67     69.12
2         66     69.12
3         15     30.72
4         19      5.12


It is found from Eq. (10.2.4) that $Q=47.81$. If $Q$ has a $\chi^2$ distribution with $5-1-1=3$ degrees of
freedom, then $\Pr(Q\ge 47.81)$ is less than 0.005.
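The M.L.E., the expected counts, and $Q$ can all be recomputed from the observed counts in the table. This sketch assumes that Eq. (10.2.4) is the usual statistic $\sum_i (N_i-n\pi_i(\hat{\theta}))^2/(n\pi_i(\hat{\theta}))$ with binomial cell probabilities for 4 games; the variable names are hypothetical.

```python
from math import comb

observed = [33, 67, 66, 15, 19]  # N_0, ..., N_4 from the table
n = sum(observed)

# M.L.E. of theta: total number of successes divided by 4n.
theta_hat = sum(i * N for i, N in enumerate(observed)) / (4 * n)

# Expected counts n * pi_i(theta_hat) under the binomial model with 4 trials.
expected = [n * comb(4, i) * theta_hat ** i * (1 - theta_hat) ** (4 - i)
            for i in range(5)]
Q = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(theta_hat, [round(e, 2) for e in expected], round(Q, 2))
```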
3. (a) It follows from Eqs. (10.2.2) and (10.2.6) that (aside from the multinomial coefficient)
$$\log L(\theta)=(N_4+N_5+N_6)\log 2+(2N_1+N_4+N_5)\log\theta_1+(2N_2+N_4+N_6)\log\theta_2+(2N_3+N_5+N_6)\log(1-\theta_1-\theta_2).$$
By solving the equations
$$\frac{\partial\log L(\theta)}{\partial\theta_1}=0\quad\text{and}\quad\frac{\partial\log L(\theta)}{\partial\theta_2}=0,$$
we obtain the results
$$\hat{\theta}_1=\frac{2N_1+N_4+N_5}{2n}\quad\text{and}\quad\hat{\theta}_2=\frac{2N_2+N_4+N_6}{2n},$$
where $n=\sum_{i=1}^{6}N_i$.


(b) For the given values, $n=150$, $\hat{\theta}_1=0.2$, and $\hat{\theta}_2=0.5$. Therefore, we obtain the following table:

$i$   $N_i$   $n\pi_i(\hat{\theta})$
1       2      6
2      36     37.5
3      14     13.5
4      36     30
5      20     18
6      42     45


It is found from Eq. (10.2.4) that $Q=4.37$. If $Q$ has the $\chi^2$ distribution with $6-1-2=3$ degrees
of freedom, then the value of $\Pr(Q\ge 4.37)$ is approximately 0.226.


4. Suppose that $X$ has the normal distribution with mean 67.6 and variance 1, and that $Z$ has the standard
normal distribution. Then:
$$\begin{aligned}\pi_1(\hat{\theta})&=\Pr(X<66)=\Pr(Z<-1.6)=0.0548,\\\pi_2(\hat{\theta})&=\Pr(66<X<67.5)=\Pr(-1.6<Z<-0.1)=0.4054,\\\pi_3(\hat{\theta})&=\Pr(67.5<X<68.5)=\Pr(-0.1<Z<0.9)=0.3557,\\\pi_4(\hat{\theta})&=\Pr(68.5<X<70)=\Pr(0.9<Z<2.4)=0.1759,\\\pi_5(\hat{\theta})&=\Pr(X>70)=\Pr(Z>2.4)=0.0082.\end{aligned}$$
Therefore, we obtain the following table:

$i$   $N_i$   $n\pi_i(\hat{\theta})$
1      18      27.4
2     177     202.7
3     198     177.85
4     102      87.95
5       5       4.1


The value of $Q$ is found from Eq. (10.2.4) to be 11.2. Since $\mu$ and $\sigma^2$ are estimated from the original
observations rather than from the grouped data, the approximate distribution of $Q$ when $H_0$ is true lies
between the $\chi^2$ distribution with 2 degrees of freedom and a $\chi^2$ distribution with 4 degrees of freedom.


5. From the given observations, it is found that the M.L.E. of the mean $\theta$ of the Poisson distribution is
$\hat{\theta}=\bar{X}_n=1.5$. From the table of the Poisson distribution with $\theta=1.5$, we can obtain the values of
$\pi_i(\hat{\theta})$. In turn, we can then obtain the following table:


No. of tickets   $N_i$   $n\pi_i(\hat{\theta})$
0                 52     44.62
1                 60     66.94
2                 55     50.20
3                 18     25.10
4                  8      9.42

It is found from Eq. (10.2.4) that $Q=7.56$. Since $\hat{\theta}$ is calculated from the original observations rather
than from the grouped data, the approximate distribution of $Q$ when $H_0$ is true lies between the $\chi^2$
distribution with 4 degrees of freedom and the $\chi^2$ distribution with 5 degrees of freedom. The two
p-values for 4 and 5 degrees of freedom are 0.1091 and 0.1822.


6. The value of $\hat{\theta}=\bar{X}_n$ can be found explicitly from the given data, and it equals 3.872. However, before
carrying out the $\chi^2$ test, the observations in the bottom few rows of the table should be grouped together
to obtain a single cell in which the expected number of observations is not too small. Reasonable choices
would be to consider a single cell for the periods in which 11 or more particles were emitted (there
would be 6 observations in that cell) or to consider a single cell for the periods in which 10 or more
particles were emitted (there would be 16 observations in that cell). If the total number of cells after
this grouping has been made is $k$, then under $H_0$ the statistic $Q$ will have a distribution which lies
between the $\chi^2$ distribution with $k-2$ degrees of freedom and the $\chi^2$ distribution with $k-1$ degrees
of freedom. For example, with $k=12$, the expected cell counts are


54.3, 210.3, 407.1, 525.3, 508.4, 393.7, 254.0, 140.5, 68.0, 29.2, 11.3, 5.8


The statistic $Q$ is then 12.96. The two p-values for 10 and 11 degrees of freedom are 0.2258 and 0.2959.


7. There is no single correct answer to this problem. The M.L.E.'s $\hat{\mu}=\bar{X}_n$ and $\hat{\sigma}^2=S_n^2/n$ should be
calculated from the given observations. These observations should then be grouped into intervals and
the observed number in each interval compared with the expected number in that interval if each of
the 50 observations had the normal distribution with mean $\bar{X}_n$ and variance $S_n^2/n$. If the number of
intervals is $k$, then when $H_0$ is true, the approximate distribution of the statistic $Q$ will lie between the
$\chi^2$ distribution with $k-3$ degrees of freedom and the $\chi^2$ distribution with $k-1$ degrees of freedom.


8. There is no single correct answer to this problem. The M.L.E. $\hat{\beta}=1/\bar{X}_n$ of the parameter of the
exponential distribution should be calculated from the given observations. These observations should
then be grouped into intervals and the observed number in each interval compared with the expected
number in that interval if each of the 50 observations had an exponential distribution with parameter
$1/\bar{X}_n$. If the number of intervals is $k$, then when $H_0$ is true, the approximate distribution of the statistic
$Q$ will lie between a $\chi^2$ distribution with $k-2$ degrees of freedom and the $\chi^2$ distribution with $k-1$
degrees of freedom.


10.3. Contingency Tables 


Solutions to Exercises. 


1. Table S.10.1 contains the expected counts for this example. The value of the $\chi^2$ statistic $Q$ calculated
from these data is $Q=21.5$. This should be compared to the $\chi^2$ distribution with two degrees of
freedom. The tail area can be calculated using statistical software as $2.2\times 10^{-5}$.


Table S.10.1: Expected cell counts for Exercise 1 of Sec. 10.3.


Good grades | Athletic ability
73
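2. $$Q=\sum_{i=1}^{R}\sum_{j=1}^{C}\frac{(N_{ij}-\hat{E}_{ij})^2}{\hat{E}_{ij}}=\sum_{i=1}^{R}\sum_{j=1}^{C}\left(\frac{N_{ij}^2}{\hat{E}_{ij}}-2N_{ij}+\hat{E}_{ij}\right)=\left(\sum_{i=1}^{R}\sum_{j=1}^{C}\frac{N_{ij}^2}{\hat{E}_{ij}}\right)-2n+n=\left(\sum_{i=1}^{R}\sum_{j=1}^{C}\frac{N_{ij}^2}{\hat{E}_{ij}}\right)-n.$$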


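3. By Exercise 2,
$$Q=\sum_{i=1}^{R}\frac{N_{i1}^2}{\hat{E}_{i1}}+\sum_{i=1}^{R}\frac{N_{i2}^2}{\hat{E}_{i2}}-n.$$
But
$$\sum_{i=1}^{R}\frac{N_{i2}^2}{\hat{E}_{i2}}=\sum_{i=1}^{R}\frac{(N_{i+}-N_{i1})^2}{\hat{E}_{i2}}=\sum_{i=1}^{R}\frac{N_{i+}^2}{\hat{E}_{i2}}-2\sum_{i=1}^{R}\frac{N_{i+}N_{i1}}{\hat{E}_{i2}}+\sum_{i=1}^{R}\frac{N_{i1}^2}{\hat{E}_{i2}}.$$
In the first two sums on the right, we let $\hat{E}_{i2}=N_{i+}N_{+2}/n$, and in the third sum we let $\hat{E}_{i2}=N_{+2}\hat{E}_{i1}/N_{+1}$.
We then obtain
$$\sum_{i=1}^{R}\frac{N_{i2}^2}{\hat{E}_{i2}}=\frac{n}{N_{+2}}\sum_{i=1}^{R}N_{i+}-\frac{2n}{N_{+2}}\sum_{i=1}^{R}N_{i1}+\frac{N_{+1}}{N_{+2}}\sum_{i=1}^{R}\frac{N_{i1}^2}{\hat{E}_{i1}}=\frac{n^2}{N_{+2}}-\frac{2nN_{+1}}{N_{+2}}+\frac{N_{+1}}{N_{+2}}\sum_{i=1}^{R}\frac{N_{i1}^2}{\hat{E}_{i1}}.$$
It follows that
$$Q=\left(1+\frac{N_{+1}}{N_{+2}}\right)\sum_{i=1}^{R}\frac{N_{i1}^2}{\hat{E}_{i1}}+\frac{n}{N_{+2}}(n-2N_{+1}-N_{+2}).$$
Since $n=N_{+1}+N_{+2}$,
$$Q=\frac{n}{N_{+2}}\sum_{i=1}^{R}\frac{N_{i1}^2}{\hat{E}_{i1}}-\frac{n}{N_{+2}}N_{+1}.$$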

4. The values of $\hat{E}_{ij}$ are as given in the following table:


8 32
12 48


The value of $Q$ is found from Eq. (10.3.4) to be 25/6. If $Q$ has a $\chi^2$ distribution with 1 degree of
freedom, then $\Pr(Q\ge 25/6)$ lies between 0.025 and 0.05.


5. The values of $\hat{E}_{ij}$ are as given in the following table.


77.27 94.35 49.61 22.77
17.73 21.65 11.39 5.23


The value of $Q$ is found from Eq. (10.3.4) to be 8.6. If $Q$ has the $\chi^2$ distribution with $(2-1)(4-1)=3$
degrees of freedom, then $\Pr(Q\ge 8.6)$ lies between 0.025 and 0.05.


6. The values of $\hat{E}_{ij}$ are as given in the following table:
7.0 7.9
14.5 14.5


The value of $Q$ is found from Eq. (10.3.4) to be 0.91. If $Q$ has the $\chi^2$ distribution with 1 degree of
freedom, then $\Pr(Q\ge 0.91)$ lies between 0.3 and 0.4.


7. (a) The values of $p_{i+}$ and $p_{+j}$ are the marginal totals given in the following table:


0.3 


0.5 03 0.2 1.0 


It can be verified that $p_{ij}=p_{i+}p_{+j}$ for each of the 9 entries in the table. It can be seen in advance
that this relation will be satisfied for every entry in the table because it can be seen that the 
three rows of the table are proportional to each other or, equivalently, that the three columns are 
proportional to each other. 


(b) Here is one example of a simulated data set 


152, 90 58 300 


(c) The statistic $Q$ calculated by any student from Eq. (10.3.4) will have the $\chi^2$ distribution with
$(3-1)(3-1)=4$ degrees of freedom. For the data in part (b), the table of $\hat{E}_{ij}$ values is


The value of $Q$ is then 2.105. The p-value is 0.7165.


8. To test whether the values obtained by $n$ different students form a random sample of size $n$ from a $\chi^2$
distribution with 4 degrees of freedom, follow these steps: (1) Partition the positive part of the real line
into $k$ intervals; (2) Determine the probabilities $p_1^0,\ldots,p_k^0$ of these intervals for the $\chi^2$ distribution with
4 degrees of freedom; (3) Calculate the value of the statistic $Q$ given by Eq. (10.1.2). If the hypothesis
$H_0$ is true, this statistic $Q$ will have approximately the $\chi^2$ distribution with $k-1$ degrees of freedom.


9. Let $N_{ijk}$ denote the number of observations in the random sample that fall into the $(i, j, k)$ cell, and let

$$N_{i++} = \sum_{j=1}^{C} \sum_{k=1}^{T} N_{ijk}, \qquad N_{+j+} = \sum_{i=1}^{R} \sum_{k=1}^{T} N_{ijk}, \qquad N_{++k} = \sum_{i=1}^{R} \sum_{j=1}^{C} N_{ijk}.$$

Then the M.L.E.'s are

$$\hat{p}_{i++} = \frac{N_{i++}}{n}, \qquad \hat{p}_{+j+} = \frac{N_{+j+}}{n}, \qquad \hat{p}_{++k} = \frac{N_{++k}}{n}.$$

Therefore, when $H_0$ is true,

$$\hat{E}_{ijk} = n\,\hat{p}_{i++}\,\hat{p}_{+j+}\,\hat{p}_{++k} = \frac{N_{i++} N_{+j+} N_{++k}}{n^2}.$$

Since $\sum_{i=1}^{R} \hat{p}_{i++} = \sum_{j=1}^{C} \hat{p}_{+j+} = \sum_{k=1}^{T} \hat{p}_{++k} = 1$, the number of parameters that have been estimated
is (R - 1) + (C - 1) + (T - 1) = R + C + T - 3. Therefore, when $H_0$ is true, the approximate distribution
of

$$Q = \sum_{i=1}^{R} \sum_{j=1}^{C} \sum_{k=1}^{T} \frac{(N_{ijk} - \hat{E}_{ijk})^2}{\hat{E}_{ijk}}$$

will be the $\chi^2$ distribution for which the number of degrees of freedom is RCT - 1 - (R + C + T - 3) =
RCT - R - C - T + 2.


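10. The M.L.E.'s are

$$\hat{p}_{ij+} = \frac{N_{ij+}}{n} \quad \text{and} \quad \hat{p}_{++k} = \frac{N_{++k}}{n}.$$

Therefore, when $H_0$ is true,

$$\hat{E}_{ijk} = n\,\hat{p}_{ij+}\,\hat{p}_{++k} = \frac{N_{ij+} N_{++k}}{n}.$$

Since $\sum_{i=1}^{R} \sum_{j=1}^{C} \hat{p}_{ij+} = \sum_{k=1}^{T} \hat{p}_{++k} = 1$, the number of parameters that have been estimated is (RC - 1) +
(T - 1) = RC + T - 2. Therefore, when $H_0$ is true, the approximate distribution of

$$Q = \sum_{i=1}^{R} \sum_{j=1}^{C} \sum_{k=1}^{T} \frac{(N_{ijk} - \hat{E}_{ijk})^2}{\hat{E}_{ijk}}$$

will be the $\chi^2$ distribution for which the number of degrees of freedom is RCT - 1 - (RC + T - 2) =
RCT - RC - T + 1.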


10.4 Tests of Homogeneity 


Solutions to Exercises. 


1. Table S.10.2 contains the expected cell counts. The value of the $\chi^2$ statistic is Q = 18.8, which should
be compared to the $\chi^2$ distribution with four degrees of freedom. The tail area is $8.5 \times 10^{-4}$.


Table S.10.2: Expected cell counts for Exercise 1 of Sec. 10.4.


Popularity 
Rural 0 
Suburban H5 
Urban 52.5 
2. The value of the statistic Q given by Eqs. (10.4.3) and (10.4.4) is 7.57. If Q has a $\chi^2$ distribution with
(2 - 1)(3 - 1) = 2 degrees of freedom, then Pr(Q > 7.57) < 0.025.


3. The value of the statistic Q given by Eqs. (10.4.3) and (10.4.4) is 18.9. If Q has the $\chi^2$ distribution
with (4 - 1)(5 - 1) = 12 degrees of freedom, then the value of Pr(Q > 18.9) lies between 0.05 and 0.1.


4. The table to be analyzed is as follows:


Person Hits Misses 


1 8 9 
2 4 12 
3 7 3 
4 13 ale 
5 10 6 


The value of the statistic Q given by Eqs. (10.4.3) and (10.4.4) is 6.8. If Q has the $\chi^2$ distribution with
(5 - 1)(2 - 1) = 4 degrees of freedom, then the value of Pr(Q > 6.8) lies between 0.1 and 0.2.


5. The correct table to be analyzed is as follows:


Supplier Defectives Nondefectives 


1 1 14 
2 7 8 
3 7 8 


The value of Q found from this table is 7.2. If Q has the $\chi^2$ distribution with (3 - 1)(2 - 1) = 2 degrees
of freedom, then Pr(Q > 7.2) < 0.05.


6. The proper table to be analyzed is as follows:


                            After demonstration
                            Hit     Miss    Total
Before          Hit                          27
demonstration   Miss                         73
                Total       35      65      100


Although we are given the marginal totals, we are not given the entries in the table. If we were told
the value in just a single cell, such as the number of students who hit the target both before and after
the demonstration, we could fill in the rest of the table.


7. The proper table to be analyzed is as follows:


After meeting 
Favors A Favors B No preference 
Favors A 
Before Favors B 
meeting No preference 


Each person who attended the meeting can be classified in one of the nine cells of this table. If a speech 
was made on behalf of A at the meeting, we could evaluate the effectiveness of the speech by comparing 
the numbers of persons who switched from favoring B or having no preference before the meeting to 
favoring A after the meeting with the number who switched from favoring A before the meeting to one 
of the other positions after the meeting. 
10.5 Simpson's Paradox


Solutions to Exercises 


1. If population II has a relatively high proportion of men and population I has a relatively high proportion
of women, then the indicated result will occur. For example, if 90 percent of population II are men and
10 percent are women, then the proportion of population II with the characteristic will be (.9)(.6) +
(.1)(.1) = .55. If 10 percent of population I are men and 90 percent are women, then the proportion of
population I with the characteristic will be only (.1)(.8) + (.9)(.3) = .35.


2. Each of these equalities holds if and only if A and B are independent events. 


3. Assume that $\Pr(B|A) = \Pr(B|A^c)$. This means that A and B are independent. According to the law
of total probability, we can write

$$\Pr(I|B) = \Pr(I|A \cap B)\Pr(A|B) + \Pr(I|A^c \cap B)\Pr(A^c|B) = \Pr(I|A \cap B)\Pr(A) + \Pr(I|A^c \cap B)\Pr(A^c),$$

where the last equality follows from the fact that A and B are independent. Similarly,

$$\Pr(I|B^c) = \Pr(I|A \cap B^c)\Pr(A) + \Pr(I|A^c \cap B^c)\Pr(A^c).$$

If the first two inequalities in (10.5.1) hold, then the weighted average of the left sides of the inequalities
must be larger than the same weighted average of the right sides. In particular,

$$\Pr(I|A \cap B)\Pr(A) + \Pr(I|A^c \cap B)\Pr(A^c) > \Pr(I|A \cap B^c)\Pr(A) + \Pr(I|A^c \cap B^c)\Pr(A^c).$$

But we have just shown that this last inequality is equivalent to $\Pr(I|B) > \Pr(I|B^c)$, which means that
the third inequality cannot hold if the first two hold.


4. Define A to be the event that a subject is a man, $A^c$ the event that a subject is a woman, B the
event that a subject receives treatment I, and $B^c$ the event that a subject receives treatment II. Then
the relation to be proved here is precisely the same as the relation that was proved in Exercise 2 in
symbols.


5. Suppose that the first two inequalities in (10.5.1) hold, and that $\Pr(A|B) = \Pr(A|B^c)$. Then

$$\begin{aligned}
\Pr(I|B) &= \Pr(I|A \cap B)\Pr(A|B) + \Pr(I|A^c \cap B)\Pr(A^c|B) \\
&> \Pr(I|A \cap B^c)\Pr(A|B) + \Pr(I|A^c \cap B^c)\Pr(A^c|B) \\
&= \Pr(I|A \cap B^c)\Pr(A|B^c) + \Pr(I|A^c \cap B^c)\Pr(A^c|B^c) \\
&= \Pr(I|B^c).
\end{aligned}$$

Hence, the final inequality in (10.5.1) must be reversed.


6. This result can be obtained if the colleges that admit a relatively small proportion of their applicants 
receive a relatively large proportion of female applicants and the colleges that admit a relatively large 
proportion of their applicants receive a relatively small proportion of female applicants. As a specific 
example, suppose that the data are as given in the following table: 
         Proportion of total    Proportion   Proportion   Proportion of     Proportion of
College  university applicants  male         female       males admitted    females admitted

1        .1                     .9           .1           .32               .56
2        .1                     .9           .1           .32               .56
3        .2                     .8           .2           .32               .56
4        .2                     .8           .2           .32               .56
5        .4                     .1           .9           .05               .10


This table indicates, for example, that College 1 receives 10 percent of all the applications submitted
to the university, that 90 percent of the applicants to College 1 are male and 10 percent are female,
that 32 percent of the male applicants to College 1 are admitted, and that 56 percent of the female
applicants are admitted. It can be seen from the last two columns of this table that in each college the
proportion of females admitted is larger than the proportion of males admitted. However, in the whole
university, the proportion of males admitted is

$$\frac{(.1)(.9)(.32) + (.1)(.9)(.32) + (.2)(.8)(.32) + (.2)(.8)(.32) + (.4)(.1)(.05)}{(.1)(.9) + (.1)(.9) + (.2)(.8) + (.2)(.8) + (.4)(.1)} = .3$$

and the proportion of females admitted is

$$\frac{(.1)(.1)(.56) + (.1)(.1)(.56) + (.2)(.2)(.56) + (.2)(.2)(.56) + (.4)(.9)(.10)}{(.1)(.1) + (.1)(.1) + (.2)(.2) + (.2)(.2) + (.4)(.9)} = .2$$
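The aggregation in Exercise 6 can be reproduced with a short computation. This Python sketch is only an illustration (the tuple layout and the name `overall_rates` are assumptions made here); it uses the five college rows described in the solution.

```python
# Each row: (share of all applicants, proportion male,
#            male admission rate, female admission rate)
colleges = [
    (0.1, 0.9, 0.32, 0.56),
    (0.1, 0.9, 0.32, 0.56),
    (0.2, 0.8, 0.32, 0.56),
    (0.2, 0.8, 0.32, 0.56),
    (0.4, 0.1, 0.05, 0.10),
]

def overall_rates(rows):
    """Return university-wide admission rates (male, female)."""
    male_applicants = sum(share * pm for share, pm, _, _ in rows)
    male_admitted = sum(share * pm * am for share, pm, am, _ in rows)
    female_applicants = sum(share * (1 - pm) for share, pm, _, _ in rows)
    female_admitted = sum(share * (1 - pm) * af for share, pm, _, af in rows)
    return male_admitted / male_applicants, female_admitted / female_applicants

male_rate, female_rate = overall_rates(colleges)
print(round(male_rate, 3), round(female_rate, 3))
```

Within every college the female rate exceeds the male rate, yet the pooled male rate is the larger one, which is the reversal the exercise asks for.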


7. (a) Table S.10.3 shows the proportions helped by each treatment in the four categories of subjects.
The proportion helped by Treatment II is higher in each category.


Table S.10.3: Table for Exercise 7a in Sec. 10.5.

                     Proportion helped
Category             Treatment I    Treatment II
Older males          .200           .667
Younger males        .750           .800
Older females        .167           .286
Younger females      .500           .640


(b) Table S.10.4 shows the proportions helped by each treatment in the two aggregated categories.
Treatment I helps a larger proportion in each of the two categories.


Table S.10.4: Table for Exercise 7b in Sec. 10.5.

                     Proportion helped
Category             Treatment I    Treatment II
Older subjects       .433           .400
Younger subjects     .700           .667


(c) When all subjects are grouped together, the proportion helped by Treatment I is 200/400 = 0.5,
while the proportion helped by Treatment II is 240/400 = 0.6.
10.6 Kolmogorov-Smirnov Tests


Commentary 


This section is optional. However, some of the topics discussed here are useful in Chapter 12. In particular, the 
bootstrap in Sec. 12.6 makes much use of the sample c.d.f. Some of the plots done after Markov chain Monte 
Carlo also make use of the sample c.d.f. The crucial material is at the start of Sec. 10.6. The Glivenko-Cantelli 
lemma, together with the asymptotic distribution of the Kolmogorov-Smirnov test statistic in Table 10.32 
are useful if one simulates sample c.d.f.’s and wishes to compute simulation standard errors for the entire 
sample c.d.f. 

Empirical c.d.f.’s can be computed by the R function ecdf. The argument is a vector of data values. The 
result is an R function that computes the empirical c.d.f. at its argument. For example, if x has a sample 
of observations, then empd.x=ecdf(x) will create a function empd.x which can be used to compute values
of the empirical c.d.f. For example, empd.x(3) will be the proportion of the sample with values at most 3. 
Kolmogorov-Smirnov tests can be performed using the R function ks.test. The first argument is a vector of 
data values. The second argument depends on whether one is doing a one-sample or two-sample test. In the 
two-sample case, the second argument is the second sample. In the one-sample case, the second argument 
is the name of a function that will compute the hypothesized c.d.f. If that function has any additional 
arguments, they can be provided next or named explicitly later in the argument list. 
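For readers not using R, the behavior described for ecdf can be imitated in a few lines. The following Python sketch is an illustrative analogue, not the R implementation; the function name `ecdf` is simply borrowed for familiarity.

```python
import bisect

def ecdf(sample):
    """Return a function computing the empirical c.d.f. of the sample."""
    ordered = sorted(sample)
    n = len(ordered)

    def empirical_cdf(t):
        # Proportion of observations with values at most t.
        return bisect.bisect_right(ordered, t) / n

    return empirical_cdf

empd_x = ecdf([1, 2, 3, 4, 5])
print(empd_x(3))  # proportion of the sample with values at most 3
```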


Solutions to Exercises. 


1. $F_n(x) = 0$ for $x < y_1$, and $F_n(y_1) = 0.2$. Suppose first that $F(y_1) \ge 0.1$. Since F is continuous, the values
of $F(x)$ will be arbitrarily close to $F(y_1)$ for x arbitrarily close to $y_1$. Therefore, $\sup_{x < y_1} |F_n(x) - F(x)| =
F(y_1) \ge 0.1$, and it follows that $D_n \ge 0.1$. Suppose next that $F(y_1) \le 0.1$. Since $F_n(y_1) = 0.2$, it
follows that $|F_n(y_1) - F(y_1)| \ge 0.1$. Therefore, it is again true that $D_n \ge 0.1$. We can now conclude
that it must always be true that $D_n \ge 0.1$. If the values of $F(y_i)$ are as specified in the second part of
the exercise, for i = 1, ..., 5, then:

$$|F_n(x) - F(x)| = \begin{cases}
F(x) \le 0.1 & \text{for } x < y_1, \\
0.2 - 0.1 = 0.1 & \text{for } x = y_1, \\
|F(x) - 0.2| \le 0.1 & \text{for } y_1 < x < y_2, \\
0.4 - 0.3 = 0.1 & \text{for } x = y_2, \\
|F(x) - 0.4| \le 0.1 & \text{for } y_2 < x < y_3, \\
\text{etc.}
\end{cases}$$

Hence, $D_n = \sup_{-\infty < x < \infty} |F_n(x) - F(x)| = 0.1$.


2. $$F_n(x) = \begin{cases}
0 & \text{for } x < y_1, \\
0.2 & \text{for } y_1 \le x < y_2, \\
0.4 & \text{for } y_2 \le x < y_3, \\
0.6 & \text{for } y_3 \le x < y_4, \\
0.8 & \text{for } y_4 \le x < y_5, \\
1 & \text{for } x \ge y_5.
\end{cases}$$

If F satisfies the inequalities given in the exercise, then $|F_n(x) - F(x)| \le 0.2$ for every value of x.
Hence, $D_n \le 0.2$. Conversely, if $F(y_i) > 0.2i$ for some value of i, then $F(x) - F_n(x) > 0.2$ for values
of x approaching $y_i$ from below. Hence, $D_n > 0.2$. Also, if $F(y_i) < 0.2(i - 1)$ for some value of i, then
$F_n(y_i) - F(y_i) > 0.2$. Hence, again $D_n > 0.2$.
3. The largest value of the difference between the sample c.d.f. and the c.d.f. of the normal distribution
with mean 3.912 and variance 0.25 occurs right before x = 4.22, the 12th observation. For x just below
4.22, the sample c.d.f. is $F_n(x) = 0.48$, while the normal c.d.f. is $\Phi([4.22 - 3.912]/0.5) = 0.73$. The
difference is $D_n^* = 0.25$. The Kolmogorov-Smirnov test statistic is $23^{1/2} \times 0.25 = 1.2$. The tail area can
be found from Table 10.32 as 0.11.


4. When the observations are ordered, we obtain Table S.10.5. The maximum value of $|F_n(x) - F(x)|$
occurs at $x = y_{15}$, where its value is 0.60 - 0.42 = 0.18. Since n = 25, $n^{1/2} D_n^* = 0.90$. From
Table 10.32, H(0.90) = 0.6073. Therefore, the tail area corresponding to the observed value of $D_n^*$ is
1 - 0.6073 = 0.3927.


Table S.10.5: Table for Exercise 4 in Sec. 10.6.

 i   y_i = F(y_i)   F_n(y_i)     i   y_i = F(y_i)   F_n(y_i)
 1   .01            .04         14   .41            .56
 2   .06            .08         15   .42            .60
 3   .08            .12         16   .48            .64
 4   .09            .16         17   .57            .68
 5   .11            .20         18   .66            .72
 6   .16            .24         19   .71            .76
 7   .22            .28         20   .75            .80
 8   .23            .32         21   .78            .84
 9   .29            .36         22   .79            .88
10   .30            .40         23   .82            .92
11   .35            .44         24   .88            .96
12   .38            .48         25   .90           1.00
13   .40            .52


5. Here,

$$F(x) = \begin{cases}
\frac{3}{2}x & \text{for } 0 < x \le 1/2, \\
\frac{1}{2}(1 + x) & \text{for } 1/2 < x < 1.
\end{cases}$$

Therefore, we obtain Table S.10.6. The supremum of $|F_n(x) - F(x)|$ occurs as $x \to y_{18}$ from below.
Here, $F(x) \to 0.83$ while $F_n(x)$ remains at 0.68. Therefore, $D_n^* = 0.83 - 0.68 = 0.15$. It follows that
$n^{1/2} D_n^* = 0.75$ and, from Table 10.32, H(0.75) = 0.3728. Therefore, the tail area corresponding to the
observed value of $D_n^*$ is 1 - 0.3728 = 0.6272.


Table S.10.6: Table for Exercise 5 in Sec. 10.6.

 i   y_i    F(y_i)   F_n(y_i)     i   y_i    F(y_i)   F_n(y_i)
 1   .01    .015     .04         14   .41    .615     .56
 2   .06    .09      .08         15   .42    .63      .60
 3   .08    .12      .12         16   .48    .72      .64
 4   .09    .135     .16         17   .57    .785     .68
 5   .11    .165     .20         18   .66    .83      .72
 6   .16    .24      .24         19   .71    .855     .76
 7   .22    .33      .28         20   .75    .875     .80
 8   .23    .345     .32         21   .78    .89      .84
 9   .29    .435     .36         22   .79    .895     .88
10   .30    .45      .40         23   .82    .91      .92
11   .35    .525     .44         24   .88    .94      .96
12   .38    .57      .48         25   .90    .95     1.00
13   .40    .60      .52
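The values of $D_n^*$ in Exercises 4 and 5 can be recomputed from the 25 ordered observations. The Python sketch below is illustrative only (the helper `ks_statistic` is a name introduced here); it assumes a continuous hypothesized c.d.f., so the supremum is attained at, or just below, an observed value.

```python
import math

# The 25 ordered observations y_1, ..., y_25 from Table S.10.5.
sample = [.01, .06, .08, .09, .11, .16, .22, .23, .29, .30, .35, .38, .40,
          .41, .42, .48, .57, .66, .71, .75, .78, .79, .82, .88, .90]

def ks_statistic(data, cdf):
    """sup |F_n(x) - F(x)| for a continuous hypothesized c.d.f."""
    ordered = sorted(data)
    n = len(ordered)
    d = 0.0
    for i, y in enumerate(ordered, start=1):
        f = cdf(y)
        # Compare F with F_n at y (i/n) and just below y ((i - 1)/n).
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def uniform_cdf(x):            # hypothesized c.d.f. in Exercise 4
    return x

def piecewise_cdf(x):          # hypothesized c.d.f. in Exercise 5
    return 1.5 * x if x <= 0.5 else 0.5 * (1 + x)

for cdf in (uniform_cdf, piecewise_cdf):
    d = ks_statistic(sample, cdf)
    print(round(d, 2), round(math.sqrt(len(sample)) * d, 2))
```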


6. Since the p.d.f. of the uniform distribution is identically equal to 1, the value of the joint p.d.f. of the
25 observations under the uniform distribution has the value $L_1 = 1$. Also, sixteen of the observations
are less than 1/2 and nine are greater than 1/2. Therefore, the value of the joint p.d.f. of the observations
under the other distribution is $L_2 = (3/2)^{16}(1/2)^{9} = 1.2829$. The posterior probability that the
observations came from a uniform distribution is

$$\frac{\frac{1}{2} L_1}{\frac{1}{2} L_1 + \frac{1}{2} L_2} = 0.438.$$


7. We first replace each observed value $x_i$ by the value $(x_i - 26)/2$. Then, under the null hypothesis,
the transformed values will form a random sample from a standard normal distribution. When these
transformed values are ordered, we obtain Table S.10.7. The maximum value of $|F_n(x) - \Phi(x)|$
is attained at $x = y_{23}$ and its value is 0.0649. Since n = 50, $n^{1/2} D_n^* = 0.453$. It follows from
Table 10.32 that H(0.453) = 0.02. Therefore, the tail area corresponding to the observed value of $D_n^*$
is 1 - 0.02 = 0.98.


Table S.10.7: Table for Exercise 7 in Sec. 10.6.

 i   y_i       Phi(y_i)  F_n(y_i)     i   y_i       Phi(y_i)  F_n(y_i)
 1   -2.2105   .0136     .02         26   -0.010    .4960     .52
 2   -1.9265   .0270     .04         27   -0.002    .4992     .54
 3   -1.492    .0675     .06         28    0.010    .5040     .56
 4   -1.3295   .0919     .08         29    0.1515   .5602     .58
 5   -1.309    .0953     .10         30    0.258    .6018     .60
 6   -1.2085   .1134     .12         31    0.280    .6103     .62
 7   -1.1995   .1152     .14         32    0.3075   .6208     .64
 8   -1.125    .1307     .16         33    0.398    .6547     .66
 9   -1.0775   .1417     .18         34    0.4005   .6556     .68
10   -1.052    .1464     .20         35    0.4245   .6645     .70
11   -0.961    .1682     .22         36    0.482    .6851     .72
12   -0.8415   .2001     .24         37    0.614    .7304     .74
13   -0.784    .2165     .26         38    0.689    .7546     .76
14   -0.767    .2215     .28         39    0.7165   .7631     .78
15   -0.678    .2482     .30         40    0.7265   .7662     .80
16   -0.6285   .2648     .32         41    0.962    .8320     .82
17   -0.548    .2919     .34         42    1.0645   .8564     .84
18   -0.456    .3242     .36         43    1.120    .8686     .86
19   -0.4235   .3359     .38         44    1.176    .8802     .88
20   -0.340    .3669     .40         45    1.239    .8923     .90
21   -0.3245   .3728     .42         46    1.4615   .9281     .92
22   -0.309    .3787     .44         47    1.6315   .9487     .94
23   -0.266    .3951     .46         48    1.7925   .9635     .96
24   -0.078    .4689     .48         49    1.889    .9705     .98
25   -0.0535   .4787     .50         50    2.216    .9866    1.00


8. We first replace each observed value $x_i$ by the value $(x_i - 24)/2$. Then, under the null hypothesis,
the transformed values will form a random sample from a standard normal distribution. Each of the
transformed values will be one unit larger than the corresponding transformed value in Exercise 7.
The ordered values are therefore omitted from the tabulation in Table S.10.8. The supremum of
$|F_n(x) - \Phi(x)|$ occurs as $x \to y_{18}$ from below. Here, $\Phi(x) \to 0.7068$ while $F_n(x)$ remains at 0.34.
Table S.10.8: Table for Exercise 8 in Sec. 10.6.

 i   Phi(y_i)  F_n(y_i)     i   Phi(y_i)  F_n(y_i)
 1   .1130     .02         26   .8389     .52
 2   .1779     .04         27   .8408     .54
 3   .3114     .06         28   .8437     .56
 4   .3710     .08         29   .8752     .58
 5   .3787     .10         30   .8958     .60
 6   .4174     .12         31   .8997     .62
 7   .4209     .14         32   .9045     .64
 8   .4502     .16         33   .9189     .66
 9   .4691     .18         34   .9193     .68
10   .4793     .20         35   .9229     .70
11   .5136     .22         36   .9309     .72
12   .5630     .24         37   .9467     .74
13   .5856     .26         38   .9544     .76
14   .5921     .28         39   .9570     .78
15   .6263     .30         40   .9579     .80
16   .6449     .32         41   .9751     .82
17   .6743     .34         42   .9805     .84
18   .7068     .36         43   .9830     .86
19   .7178     .38         44   .9852     .88
20   .7454     .40         45   .9875     .90
21   .7503     .42         46   .9931     .92
22   .7552     .44         47   .9958     .94
23   .7685     .46         48   .9974     .96
24   .8217     .48         49   .9980     .98
25   .8280     .50         50   .9993    1.00


Therefore, $D_n^* = 0.7068 - 0.34 = 0.3668$. It follows that $n^{1/2} D_n^* = 2.593$ and, from Table 10.32,
H(2.593) = 1.0000. Therefore, the tail area corresponding to the observed value of $D_n^*$ is 0.0000.


9. We shall denote the 25 ordered observations in the first sample by $x_1 < \cdots < x_{25}$ and shall denote
the 20 ordered observations in the second sample by $y_1 < \cdots < y_{20}$. We obtain Table S.10.9. The
maximum value of $|F_m(x) - G_n(x)|$ is attained at x = -0.39, where its value is 0.32 - 0.05 = 0.27.
Therefore, $D_{mn} = 0.27$ and, since m = 25 and n = 20, $(mn/[m + n])^{1/2} D_{mn} = 0.9$. From Table 10.32,
H(0.9) = 0.6073. Hence, the tail area corresponding to the observed value of $D_{mn}$ is 1 - 0.6073 = 0.3927.


10. We shall add 2 units to each of the values in the first sample and then carry out the same procedure
as in Exercise 9. We now obtain Table S.10.10. The maximum value of $|F_m(x) - G_n(x)|$ is attained
at x = 1.56, where its value is 0.80 - 0.24 = 0.56. Therefore, $D_{mn} = 0.56$ and $(mn/[m + n])^{1/2} D_{mn} =
1.8667$. From Table 10.32, H(1.8667) = 0.998. Therefore, the tail area corresponding to the observed
value of $D_{mn}$ is 1 - 0.998 = 0.002.


11. We shall multiply each of the observations in the second sample by 3 and then carry out the same
procedure as in Exercise 9. We now obtain Table S.10.11. The maximum value of $|F_m(x) - G_n(x)|$ is attained
at x = 1.06, where its value is 0.80 - 0.30 = 0.50. Therefore, $D_{mn} = 0.50$ and $(mn/[m + n])^{1/2} D_{mn} =
1.667$. From Table 10.32, H(1.667) = 0.992. Therefore, the tail area corresponding to the observed
value of $D_{mn}$ is 1 - 0.992 = 0.008.


12. The maximum difference between the c.d.f. of the normal distribution with mean 3.912 and variance
0.25 and the empirical c.d.f. of the observed data is $D_n^* = 0.2528$, which occurs at the observation 4.22,
where the empirical c.d.f. jumps from 11/23 = 0.4783 to 12/23 = 0.5217 and the normal c.d.f. equals
$\Phi([4.22 - 3.912]/0.5) = 0.7311$. We now compare $(23)^{1/2} D_n^* = 1.2123$ to Table 10.32, where we find
that $H(1.2123) \approx 0.89$. The tail area (p-value) is then about 0.11.
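The arithmetic behind this last statistic is easy to verify. The sketch below is an illustrative Python check (not part of the original solution) that uses only the figures quoted above: 23 observations, 11 of them below 4.22, and the hypothesized normal distribution with mean 3.912 and standard deviation 0.5.

```python
import math
from statistics import NormalDist

n = 23
hypothesized = NormalDist(mu=3.912, sigma=0.5)

normal_cdf = hypothesized.cdf(4.22)   # Phi([4.22 - 3.912]/0.5)
ecdf_just_below = 11 / n              # empirical c.d.f. just before the jump at 4.22

d_star = normal_cdf - ecdf_just_below
statistic = math.sqrt(n) * d_star

print(round(normal_cdf, 4), round(d_star, 4), round(statistic, 4))
```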


Table S.10.9: Table for Exercise 9 in Sec. 10.6. [Table entries not recoverable.]

Table S.10.10: Table for Exercise 10 in Sec. 10.6. [Table entries not recoverable.]

Table S.10.11: Table for Exercise 11 in Sec. 10.6. [Table entries not recoverable.]




10.7 Robust Estimation 


Commentary 


In recent years, interest has grown in the use of robust statistical methods. Although many robust methods 
are more suitable for advanced courses, this section introduces some robust methods that can be understood 
at the level of the rest of this text. This includes M-estimators of a location parameter. 

The software R contains some functions that can be used for robust estimation. The function quantile
computes sample quantiles. The first argument is a vector of observed values. The second argument is a vector
of probabilities for the desired quantiles. For example, quantile(x,c(0.25,0.75)) computes the sample
quartiles of the data x. The function median computes the sample median. The function mad computes the
median absolute deviation of a sample. If you issue the command library(MASS), some additional functions
become available. One such function is huber, which computes M-estimators as on page 673 with $\sigma$ equal
to the median absolute deviation. The first argument is the vector of data values, and the second argument
is k, in the notation of the text. To find the M-estimator with a general $\sigma$, replace the second argument by
$k\sigma$ divided by the median absolute deviation of the data.


Solutions to Exercises. 


1. The observed values ordered from smallest to largest are 2.1, 2.2, 21.3, 21.5, 21.7, 21.7, 21.8, 22.1, 22.1, 
22.2, 22.4, 22.5, 22.9, 23.0, 63.0. 


(a) The sample mean is the average of the numbers, 22.17. 
(b) The trimmed mean for a given value of k is found by dropping k values from each end of this
ordered sequence and averaging the remaining values. In this problem we get

k                        | 1      2      3    4
kth level trimmed mean   | 20.57  22.02  22   22

(c) The sample median is the middle observation, 22.1.


(d) The median absolute deviation is 0.4. Suppose that we start iterating with the sample average
22.17. The 7th and 8th iterations are both 22.
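Parts (a) through (d) of Exercise 1 can be recomputed directly from the 15 ordered values. The following Python sketch is illustrative; the helper names `trimmed_mean` and `median_abs_dev` are introduced here and do not come from the text or from R.

```python
import statistics

data = [2.1, 2.2, 21.3, 21.5, 21.7, 21.7, 21.8, 22.1, 22.1,
        22.2, 22.4, 22.5, 22.9, 23.0, 63.0]

def trimmed_mean(values, k):
    """Average after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def median_abs_dev(values):
    """Median of the absolute deviations from the sample median."""
    m = statistics.median(values)
    return statistics.median([abs(v - m) for v in values])

print(round(statistics.mean(data), 2))                      # sample mean
print([round(trimmed_mean(data, k), 2) for k in (1, 2, 3, 4)])
print(statistics.median(data))                              # sample median
print(round(median_abs_dev(data), 2))                       # unscaled MAD
```

Note that this computes the unscaled median absolute deviation, matching the value 0.4 in part (d).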


2. The observed values ordered from smallest to largest are —2.40,—2.00, —0.11, 0.00, 0.03, 0.10, 0.12, 
0.23, 0.24, 0.24, 0.36, 0.69, 1.24, 1.78. 


(a) The sample mean is the average of these values, 0.0371. 
(b) The trimmed mean for a given value of k is found by dropping k values from each end of this
ordered sequence and averaging the remaining values. In this problem we get

k                        | 1      2     3      4
kth level trimmed mean   | 0.095  0.19  0.165  0.16

(c) Since the number of observed values is even, the sample median is the average of the two middle
values 0.12 and 0.23, which equals 0.175.


(d) The median absolute deviation is 0.18. Suppose that we start iterating with the sample average
0.0371. The 9th and 10th iterations are both 0.165.


3. The distribution of $\tilde{\theta}_{.5,n}$ will be approximately normal with mean $\theta$ and standard deviation $1/[2n^{1/2} f(\theta)]$.
In this exercise,

$$f(x) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x - \theta)^2\right].$$

Hence, $f(\theta) = 1/\sqrt{2\pi}$. Since n = 100, the standard deviation of the approximate distribution of $\tilde{\theta}_{.5,n}$
is $\sqrt{2\pi}/20 = 0.1253$. It follows that the distribution of $Z = (\tilde{\theta}_{.5,n} - \theta)/0.1253$ will be approximately
standard normal. Thus,

$$\Pr(|\tilde{\theta}_{.5,n} - \theta| \le 0.1) = \Pr\left(|Z| \le \frac{0.1}{0.1253}\right) = \Pr(|Z| \le 0.798) = 2\Phi(0.798) - 1 = 0.575.$$

4. Here,

$$f(x) = \frac{1}{\pi[1 + (x - \theta)^2]}.$$

Therefore, $f(\theta) = 1/\pi$ and, since n = 100, it follows that the distribution of $\tilde{\theta}_{.5,n}$ will be approximately
normal with mean $\theta$ and standard deviation $\pi/20 = 0.1571$. Thus, the distribution of $Z = (\tilde{\theta}_{.5,n} -
\theta)/0.1571$ will be approximately standard normal. Hence,

$$\Pr(|\tilde{\theta}_{.5,n} - \theta| \le 0.1) = \Pr\left(|Z| \le \frac{0.1}{0.1571}\right) = \Pr(|Z| \le 0.637) = 2\Phi(0.637) - 1 = 0.476.$$
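Both probabilities follow the same recipe: plug the density at the median into the large-sample standard deviation $1/[2n^{1/2} f(\theta)]$ and evaluate a normal probability. A minimal Python sketch of that recipe (the function name `prob_median_within` is hypothetical) is:

```python
import math
from statistics import NormalDist

def prob_median_within(density_at_median, n, delta):
    """Approximate Pr(|sample median - theta| <= delta) for large n."""
    sd = 1 / (2 * math.sqrt(n) * density_at_median)
    z = delta / sd
    return 2 * NormalDist().cdf(z) - 1

normal_case = prob_median_within(1 / math.sqrt(2 * math.pi), 100, 0.1)
cauchy_case = prob_median_within(1 / math.pi, 100, 0.1)
print(round(normal_case, 3), round(cauchy_case, 3))
```

The first value corresponds to the normal density of Exercise 3 and the second to the Cauchy density of Exercise 4.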


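5. Let the first density on the right side of Eq. (10.7.1) be called h. Since both h and g are symmetric
with respect to $\mu$, so also is f(x). Therefore, both the sample mean $\bar{X}_n$ and the sample median $\tilde{X}_n$ are
unbiased estimators of $\mu$. It follows that the M.S.E. of $\bar{X}_n$ is equal to $\mathrm{Var}(\bar{X}_n)$ and that the M.S.E. of
$\tilde{X}_n$ is equal to $\mathrm{Var}(\tilde{X}_n)$. The variance of a single observation X is

$$\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \frac{1}{2}\int_{-\infty}^{\infty} (x - \mu)^2 h(x)\,dx + \frac{1}{2}\int_{-\infty}^{\infty} (x - \mu)^2 g(x)\,dx = \frac{1}{2}(1) + \frac{1}{2}(4) = \frac{5}{2}.$$

Since n = 100, $\mathrm{Var}(\bar{X}_n) = (1/100)(5/2) = 0.025$.
The variance of $\tilde{X}_n$ will be approximately $1/[4n f^2(\mu)]$. Since

$$h(x) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x - \mu)^2\right] \quad \text{and} \quad g(x) = \frac{1}{2\sqrt{2\pi}} \exp\left[-\frac{1}{2 \cdot 4}(x - \mu)^2\right],$$

it follows that

$$f(\mu) = \frac{1}{2}h(\mu) + \frac{1}{2}g(\mu) = \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}} + \frac{1}{2} \cdot \frac{1}{2\sqrt{2\pi}} = \frac{3}{4\sqrt{2\pi}}.$$

Therefore, $\mathrm{Var}(\tilde{X}_n)$ is approximately $2\pi/225 = 0.028$.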
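The comparison of the two variances can be checked numerically. This Python sketch is an illustration under the setup described above: an equal mixture of normal densities with common mean and variances 1 and 4, and n = 100. Setting the mean to 0 is a convenience made here, since the variances do not depend on it.

```python
import math
from statistics import NormalDist

n = 100
h = NormalDist(mu=0.0, sigma=1.0)   # variance 1
g = NormalDist(mu=0.0, sigma=2.0)   # variance 4

# Variance of the sample mean: mixture variance divided by n.
var_single = 0.5 * h.variance + 0.5 * g.variance
var_mean = var_single / n

# Large-sample variance of the sample median: 1 / [4 n f(mu)^2].
f_at_mu = 0.5 * h.pdf(0.0) + 0.5 * g.pdf(0.0)
var_median = 1 / (4 * n * f_at_mu ** 2)

print(round(var_mean, 4), round(var_median, 4))
```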


6. Let $g_n(\mathbf{x})$ be the joint p.d.f. of the data given that they came from the uniform distribution, and
let $f_n(\mathbf{x})$ be the joint p.d.f. given that they come from the p.d.f. in Exercise 5. According to Bayes'
theorem, the posterior probability that they came from the uniform distribution is

$$\frac{\frac{1}{2} g_n(\mathbf{x})}{\frac{1}{2} g_n(\mathbf{x}) + \frac{1}{2} f_n(\mathbf{x})}.$$

It is easy to see that $g_n(\mathbf{x}) = 1$ for these data, while $f_n(\mathbf{x}) = (3/2)^{16}(1/2)^{9} = 1.283$. This makes the
posterior probability of the uniform distribution 1/2.283 = 0.4380.


7. (a) The mean of $\bar{X}_n$ is the mean of each $X_i$. Since f(x) is a weighted average of two other p.d.f.'s,
$\int x f(x)\,dx$ is the same mixture of the means of the other two distributions. Since each of the
distributions in the mixture has mean $\mu$, so does the distribution with p.d.f. f.


(b) The variance of $\bar{X}_n$ is 1/n times the variance of $X_i$. The variance of $X_i$ is $E(X_i^2) - \mu^2$. Since the
p.d.f. of $X_i$ is a weighted average of two other p.d.f.'s, the mean of $X_i^2$ is the same weighted average
of the two means of $X_i^2$ from the two p.d.f.'s. The mean of $X_i^2$ from the first p.d.f. (the normal
distribution with mean $\mu$ and variance $\sigma^2$) is $\mu^2 + \sigma^2$. The mean of $X_i^2$ from the second p.d.f. (the
normal distribution with mean $\mu$ and variance $100\sigma^2$) is $\mu^2 + 100\sigma^2$. The weighted average is

$$(1 - \epsilon)(\mu^2 + \sigma^2) + \epsilon(\mu^2 + 100\sigma^2) = \mu^2 + \sigma^2(1 + 99\epsilon).$$

The variance of $X_i$ is then $(1 + 99\epsilon)\sigma^2$, and the variance of $\bar{X}_n$ is $(1 + 99\epsilon)\sigma^2/n$.


8. When $\epsilon = 1$, the distribution whose p.d.f. is in Eq. (10.7.2) is the normal distribution with mean $\mu$
and variance $100\sigma^2$. When $\epsilon = 0$, the distribution is the normal distribution with mean $\mu$ and variance
$\sigma^2$. The ratio of the variances of the sample mean and sample median from a normal distribution
does not depend on the variance of the normal distribution, hence the ratio will be the same whether
the variance is $\sigma^2$ or $100\sigma^2$. The reason that the ratio doesn't depend on the variance of the specific
normal distribution is that both the sample mean and the sample median have variances that equal
the variance of the original distribution times constants that depend only on the sample size.


9. The likelihood function is

$$\frac{1}{(2\sigma)^n} \exp\left(-\frac{1}{\sigma} \sum_{i=1}^{n} |x_i - \theta|\right).$$

It is easy to see that, no matter what $\sigma$ equals, the M.L.E. of $\theta$ is the number that minimizes $\sum_{i=1}^{n} |x_i - \theta|$.
This is the same as the number that minimizes $\sum_{i=1}^{n} |x_i - \theta|/n$. The value $\sum_{i=1}^{n} |x_i - \theta|/n$ is the mean of
$|X - \theta|$ when the c.d.f. of X is the sample c.d.f. of $X_1, \ldots, X_n$. The mean of $|X - \theta|$ is minimized by
$\theta$ equal to a median of the distribution of X according to Theorem 4.5.3. The median of the sample
distribution is the sample median.
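The minimizing property used in this argument can be illustrated numerically. The Python sketch below is an informal check rather than a proof; the sample reuses the 15 sorted values from Exercise 16 of this section, and the grid of candidate values for $\theta$ is arbitrary.

```python
import statistics

data = [-67, -48, 6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75]

def sum_abs_dev(values, theta):
    """Sum of absolute deviations of the values from theta."""
    return sum(abs(x - theta) for x in values)

median = statistics.median(data)
at_median = sum_abs_dev(data, median)

# Compare against a grid of other candidate values of theta.
grid = [t / 4 for t in range(-400, 401)]
best_on_grid = min(grid, key=lambda t: sum_abs_dev(data, t))

print(median, at_median, best_on_grid)
```

No grid point gives a smaller sum than the sample median does, in line with Theorem 4.5.3.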


10. The likelihood was given in Exercise 9. The logarithm of the likelihood equals

$$-n \log(2\sigma) - \frac{1}{\sigma} \sum_{i=1}^{n} |x_i - \theta|.$$

For convenience, assume that $x_1 < x_2 < \cdots < x_n$. Let $\theta$ be a given number between two consecutive
$x_i$ values. In particular, let $x_k < \theta < x_{k+1}$. For known $\sigma$, the log-likelihood can be written as a constant
plus a constant times

$$\sum_{i=k+1}^{n} x_i - (n - k)\theta - \sum_{i=1}^{k} x_i + k\theta.$$

For $\theta$ between $x_k$ and $x_{k+1}$, the derivative of this is k - (n - k), the difference between the number of
observations below $\theta$ and the number above $\theta$.
11. Let $x_q$ be the q quantile of X. The result will follow if we can prove that the q quantile of aX + b is
$ax_q + b$. Since

$$\Pr(aX + b \le ax_q + b) = \Pr(X \le x_q),$$

for all a > 0 and b and q, it follows that $ax_q + b$ is the q quantile of aX + b.


12. According to the solution to Exercise 11, the median of aX + b is am + b, where m is the median of X.
The median absolute deviation of X is the median of |X - m|, which equals $\sigma$. The median absolute
deviation of aX + b is the median of |aX + b - (am + b)| = a|X - m|. According to the solution to
Exercise 11, the median of a|X - m| is a times the median of |X - m|, that is, $a\sigma$.


13. The Cauchy distribution is symmetric around 0, so the median is 0, and the median absolute deviation
is the median of Y = |X|. If F is the c.d.f. of X, then the c.d.f. of Y is

$$G(y) = \Pr(Y \le y) = \Pr(|X| \le y) = \Pr(-y \le X \le y) = F(y) - F(-y),$$

because X has a continuous distribution. Because X has a symmetric distribution around 0, F(-y) =
1 - F(y), and G(y) = 2F(y) - 1. The median of Y is where G(y) = 0.5, that is, 2F(y) - 1 = 0.5 or
F(y) = 0.75. So, the median of Y is the 0.75 quantile of X, namely y = 1.


14. (a) The c.d.f. of X is $F(x) = 1 - \exp(-x\lambda)$, so the quantile function is $F^{-1}(p) = -\log(1 - p)/\lambda$. The
IQR is

$$F^{-1}(0.75) - F^{-1}(0.25) = -\frac{\log(0.25)}{\lambda} + \frac{\log(0.75)}{\lambda} = \frac{\log(3)}{\lambda}.$$

(b) The median of X is $\log(2)/\lambda$, and the median absolute deviation is the median of $|X - \log(2)/\lambda|$. It
is the value x such that $\Pr(\log(2)/\lambda - x < X < \log(2)/\lambda + x) = 0.5$. If we try letting $x = \log(3)/[2\lambda]$
(half of the IQR), then

$$\Pr(\log(2)/\lambda - x < X < \log(2)/\lambda + x) = [1 - \exp(-\log(2\sqrt{3}))] - [1 - \exp(-\log(2/\sqrt{3}))] = \frac{1}{2}\left[\sqrt{3} - 1/\sqrt{3}\right] = 0.5773.$$

This is greater than 0.5, so the median absolute deviation is smaller than 1/2 of the IQR.
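To see how much smaller the median absolute deviation is, the equation in part (b) can be solved numerically. The Python sketch below is an illustration only: the rate $\lambda = 1$ is an arbitrary choice made here, and bisection is just one convenient way to find the root.

```python
import math

lam = 1.0                      # hypothetical rate; any positive value works
median = math.log(2) / lam
half_iqr = math.log(3) / (2 * lam)

def cdf(x):
    """Exponential c.d.f. with rate lam (zero for negative x)."""
    return 1 - math.exp(-lam * x) if x > 0 else 0.0

def prob_within(x):
    """Pr(median - x < X < median + x)."""
    return cdf(median + x) - cdf(median - x)

# Bisection for the x with prob_within(x) = 0.5 on [0, half_iqr].
lo, hi = 0.0, half_iqr
for _ in range(60):
    mid = (lo + hi) / 2
    if prob_within(mid) < 0.5:
        lo = mid
    else:
        hi = mid
mad = (lo + hi) / 2

print(round(prob_within(half_iqr), 4), round(mad, 4), round(half_iqr, 4))
```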
15. (a) The quantile function of the normal distribution with mean $\mu$ and variance $\sigma^2$ is the inverse of the
c.d.f., $F(x) = \Phi([x - \mu]/\sigma)$. So,

$$F^{-1}(p) = \mu + \sigma\Phi^{-1}(p). \qquad \text{(S.10.1)}$$

The IQR is

$$F^{-1}(0.75) - F^{-1}(0.25) = \sigma[\Phi^{-1}(0.75) - \Phi^{-1}(0.25)].$$

Since the standard normal distribution is symmetric around 0, $\Phi^{-1}(0.25) = -\Phi^{-1}(0.75)$, so the
IQR is $2\sigma\Phi^{-1}(0.75)$.


(b) Let F be the c.d.f. of a distribution that is symmetric around its median $\mu$. The median absolute
deviation is then the value x such that $F(\mu + x) - F(\mu - x) = 0.5$. By symmetry around the
median, we know that $F(\mu - x) = 1 - F(\mu + x)$, so x solves $2F(\mu + x) - 1 = 0.5$ or $F(\mu + x) = 0.75$.
That is, $x = F^{-1}(0.75) - \mu$. For the case of normal random variables, use Eq. (S.10.1) to conclude
that $x = \sigma\Phi^{-1}(0.75)$.
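The constant $\Phi^{-1}(0.75)$ that appears in both parts is the familiar 0.6745. A short Python check (illustrative; the value of `sigma` is an arbitrary assumption) shows the two relations side by side:

```python
from statistics import NormalDist

q75 = NormalDist().inv_cdf(0.75)      # Phi^{-1}(0.75)

sigma = 2.0                           # hypothetical standard deviation
dist = NormalDist(mu=10.0, sigma=sigma)
iqr = dist.inv_cdf(0.75) - dist.inv_cdf(0.25)
mad = dist.inv_cdf(0.75) - dist.mean  # x = F^{-1}(0.75) - mu

print(round(q75, 4), round(iqr / sigma, 4), round(mad / sigma, 4))
```

Dividing a median absolute deviation by 0.6745 therefore rescales it to estimate $\sigma$ for normal data.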
16. Here are the sorted values from smallest to largest:
-67, -48, 6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75.


(a) The average is 20.93.

(b) The trimmed means are

k              | 1      2      3      4
Trimmed mean   | 23.54  26.73  25.78  25


(c) The sample median is 24.


(d) The median absolute deviation divided by 0.6745 is 25.20385. Starting at $\theta = 0$ and iterating the
procedure described on page 673 of the text, we get the following sequence of values for $\theta$:

20.805, 24.017, 26.278, 24.342, 24.373, 24.376, 24.377, 24.377, ...

After 9 iterations, the value stays the same to 7 significant digits.
17. Let $\mu$ stand for the median of the distribution, and let $\mu + c$ be the 0.75 quantile. By symmetry, the
0.25 quantile is $\mu - c$. Also, $f(\mu + c) = f(\mu - c)$. The large sample joint distribution of the 0.25 and
0.75 sample quantiles is a bivariate normal distribution with means $\mu - c$ and $\mu + c$, variances both
equal to $3/[16n f(\mu + c)^2]$, and covariance $1/[16n f(\mu + c)^2]$. The IQR is the difference between these
two sample quantiles, so its large sample distribution is normal with mean 2c and variance

$$\frac{3}{16n f(\mu + c)^2} + \frac{3}{16n f(\mu + c)^2} - 2 \cdot \frac{1}{16n f(\mu + c)^2} = \frac{1}{4n f(\mu + c)^2}.$$


10.8 Sign and Rank Tests 


Commentary 


This section ends with a derivation of the power function of the Wilcoxon-Mann-Whitney ranks test. This 
derivation is a bit more technical than the rest of the section and is perhaps suitable only for the more 
mathematically inclined reader. 

If one is using the software R, the function wilcox.test performs the Wilcoxon-Mann-Whitney ranks 
test. The two arguments are the two samples whose distributions are being compared. 


Solutions to Exercises. 


1. Let W be the number of $(X_i, Y_i)$ pairs with $X_i \le Y_i$. Then W has a binomial distribution with
parameters n and p. To test $H_0$, we reject $H_0$ if W is too large. In particular, if c is chosen so that

$$\sum_{w=c+1}^{n} \binom{n}{w} \left(\frac{1}{2}\right)^n \le \alpha_0 < \sum_{w=c}^{n} \binom{n}{w} \left(\frac{1}{2}\right)^n,$$

then we can reject $H_0$ if W > c for a level $\alpha_0$ test.
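Choosing c amounts to scanning binomial tail probabilities with p = 1/2. The Python sketch below illustrates that search; the function names and the example values n = 10 and $\alpha_0 = 0.05$ are assumptions made for illustration, not values from the exercise.

```python
from math import comb

def upper_tail(n, c):
    """Pr(W > c) when W is binomial with parameters n and 1/2."""
    return sum(comb(n, w) for w in range(c + 1, n + 1)) / 2 ** n

def sign_test_cutoff(n, alpha0):
    """Smallest c such that rejecting when W > c has level at most alpha0."""
    for c in range(n + 1):
        if upper_tail(n, c) <= alpha0:
            return c
    return n

c = sign_test_cutoff(10, 0.05)
print(c, upper_tail(10, c), upper_tail(10, c - 1))
```

The two printed tail probabilities bracket $\alpha_0$, as in the displayed inequality.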


2. The largest difference between the two sample c.d.f.'s occurs between 2.336 and 2.431 and equals
|0.8 - 0.125| = 0.675. The test statistic is then

$$\left(\frac{8 \times 10}{8 + 10}\right)^{1/2} 0.675 = 1.423.$$

The tail area is between 0.0397 and 0.0298.


3. This test was performed in Example 9.6.5, and the tail area is 0.003.


4. By ordering all the observations, we obtain Table S.10.12.


Table S.10.12: Table for Exercise 4 of Sec. 10.8.


       Observed                  Observed
Rank   value     Sample | Rank   value     Sample

  1    0.04      x      |  21    1.01      y
  2    0.13      x      |  22    1.07      y
  3    0.16      x      |  23    1.12      x
  4    0.28      x      |  24    1.15      x
  5    0.35      x      |  25    1.20      x
  6    0.39      x      |  26    1.25      y
  7    0.40      x      |  27    1.26      y
  8    0.44      x      |  28    1.31      y
  9    0.49      x      |  29    1.38      x
 10    0.58      x      |  30    1.48      y
 11    0.68      y      |  31    1.50      x
 12    0.71      x      |  32    1.54      x
 13    0.72      x      |  33    1.59      y
 14    0.75      x      |  34    1.63      y
 15    0.77      x      |  35    1.64      x
 16    0.83      x      |  36    1.73      x
 17    0.86      y      |  37    1.78      y
 18    0.89      y      |  38    1.81      y
 19    0.90      x      |  39    1.82      y
 20    0.91      x      |  40    1.95      y


The sum of the ranks of $x_1, \ldots, x_{25}$ is $S = 399$. Since $m = 25$ and $n = 15$, it follows from
Eqs. (10.8.3) and (10.8.4) that $E(S) = 512.5$, $\mathrm{Var}(S) = 1281.25$, and $\sigma = (1281.25)^{1/2} = 35.7946$.
Hence, $Z = (399 - 512.5)/35.7946 = -3.17$. It can be found from a table of the standard normal
distribution that the corresponding two-sided tail area is 0.0015.
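The arithmetic in Exercise 4 can be rechecked with a short Python sketch. The label string below is transcribed from the Sample column of Table S.10.12 in rank order; the variable names are illustrative and do not come from the text, and the normal tail area is computed with `math.erfc` rather than read from a table.

```python
import math

# Sample membership of the observations in rank order 1..40 (from Table S.10.12).
labels = ("xxxxxxxxxx" "y" "xxxxx" "yy" "xx" "yy" "xxx"
          "yyy" "x" "y" "xx" "yy" "xx" "yyyy")

m = labels.count("x")   # size of the first sample
n = labels.count("y")   # size of the second sample

# Sum of the ranks of the first sample.
S = sum(rank for rank, lab in enumerate(labels, start=1) if lab == "x")

# Null mean and variance of S, as in Eqs. (10.8.3) and (10.8.4).
mean_S = m * (m + n + 1) / 2
var_S = m * n * (m + n + 1) / 12

Z = (S - mean_S) / math.sqrt(var_S)
# Two-sided standard normal tail area: 2 * (1 - Phi(|Z|)).
tail = math.erfc(abs(Z) / math.sqrt(2))

print(m, n, S, mean_S, var_S, round(Z, 2), round(tail, 4))
```

Under these assumptions the sketch reproduces S = 399, E(S) = 512.5, and Var(S) = 1281.25.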


5. Since there are 25 observations in the first sample, $F_m(x)$ will jump by the amount 0.04 at each observed
value. Since there are 15 observations in the second sample, $G_n(x)$ will jump by the amount 0.0667 at
each observed value. From the table given in the solution to Exercise 4, we obtain Table S.10.13. It
can be seen from this table that the maximum value of $|F_m(x) - G_n(x)|$ occurs when $x$ is equal to the
observed value of rank 16, and its value at this point is $.60 - .0667 = .5333$. Hence, $D_{mn} = 0.5333$ and

\[ \left(\frac{mn}{m+n}\right)^{1/2} D_{mn} = \left(\frac{375}{40}\right)^{1/2} (0.5333) = 1.633. \]

It is found from Table 10.32 that the corresponding tail area is almost exactly 0.01.


6. It is found from the values given in Tables 10.44 and 10.45 that $\bar{x} = \sum_{i=1}^{25} x_i/25 = 0.8044$,
$\bar{y} = \sum_{j=1}^{15} y_j/15 = 1.3593$, $S_x^2 = \sum_{i=1}^{25} (x_i - \bar{x})^2 = 5.8810$, and
$S_y^2 = \sum_{j=1}^{15} (y_j - \bar{y})^2 = 2.2447$. Since $m = 25$ and $n = 15$, it follows from Eq. (9.6.3) that
$U = -3.674$. It can be found from a table of the $t$ distribution with $m + n - 2 = 38$ degrees of freedom
that the corresponding two-sided tail area is less than 0.01.


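7. We need to show that $F(\theta + G^{-1}(p)) = p$. Compute

\[ F(\theta + G^{-1}(p)) = \int_{-\infty}^{\theta + G^{-1}(p)} f(x)\,dx = \int_{-\infty}^{\theta + G^{-1}(p)} g(x - \theta)\,dx = \int_{-\infty}^{G^{-1}(p)} g(y)\,dy = G(G^{-1}(p)) = p, \]

where the third equality follows by making the change of variables $y = x - \theta$.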


Table S.10.13: Table for Exercise 5 of Sec. 10.8.


Rank of                          | Rank of
observations  F_m(x)   G_n(x)    | observations  F_m(x)   G_n(x)
      1        .04      0        |      21        .68      .2667
      2        .08      0        |      22        .68      .3333
      3        .12      0        |      23        .72      .3333
      4        .16      0        |      24        .76      .3333
      5        .20      0        |      25        .80      .3333
      6        .24      0        |      26        .80      .4000
      7        .28      0        |      27        .80      .4667
      8        .32      0        |      28        .80      .5333
      9        .36      0        |      29        .84      .5333
     10        .40      0        |      30        .84      .6000
     11        .40      .0667    |      31        .88      .6000
     12        .44      .0667    |      32        .92      .6000
     13        .48      .0667    |      33        .92      .6667
     14        .52      .0667    |      34        .92      .7333
     15        .56      .0667    |      35        .96      .7333
     16        .60      .0667    |      36       1.00      .7333
     17        .60      .1333    |      37       1.00      .8000
     18        .60      .2000    |      38       1.00      .8667
     19        .64      .2000    |      39       1.00      .9333
     20        .68      .2000    |      40       1.00     1.0000
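The maximum difference read off Table S.10.13 can be recomputed directly from the sample labels. The following Python sketch is only a check under the assumption that the label string (transcribed from Table S.10.12 in rank order) is correct; the names `labels`, `D`, and `stat` are illustrative.

```python
import math

# Sample membership of the observations in rank order 1..40 (from Table S.10.12).
labels = ("xxxxxxxxxx" "y" "xxxxx" "yy" "xx" "yy" "xxx"
          "yyy" "x" "y" "xx" "yy" "xx" "yyyy")
m = labels.count("x")
n = labels.count("y")

count_x = count_y = 0   # observations seen so far in each sample
D = 0.0                 # running maximum of |F_m(x) - G_n(x)|
for lab in labels:
    if lab == "x":
        count_x += 1
    else:
        count_y += 1
    # Both sample c.d.f.'s are evaluated just after the current jump.
    D = max(D, abs(count_x / m - count_y / n))

# Scaled two-sample statistic (mn/(m+n))^{1/2} * D.
stat = math.sqrt(m * n / (m + n)) * D
print(round(D, 4), round(stat, 3))
```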


8. Since $Y + \theta$ and $X$ have the same distribution, it follows that if $\theta > 0$ then the values in the first sample
will tend to be larger than the values in the second sample. In other words, when $\theta > 0$, the sum $S$ of
the ranks in the first sample will tend to be larger than it would be if $\theta = 0$ or $\theta < 0$. Therefore, we
will reject $H_0$ if $Z > c$, where $Z$ is as defined in this section and $c$ is an appropriate constant. If we
want the test to have a specified level of significance $\alpha_0$ ($0 < \alpha_0 < 1$), then $c$ should be chosen so that
when $Z$ has a standard normal distribution, $\Pr(Z > c) = \alpha_0$. It should be kept in mind that the level
of significance of this test will only be approximately $\alpha_0$ because for finite sample sizes, the distribution
of $Z$ will only be approximately a standard normal distribution when $\theta = 0$.


9. To test these hypotheses, add $\theta_0$ to each observed value $y_j$ in the second sample and then carry out the
Wilcoxon-Mann-Whitney procedure on the original values in the first sample and the new values in the
second sample.


10. For each value of $\theta_0$, carry out a test of the hypotheses given in Exercise 7 at the level of significance
$1 - \alpha$. The confidence interval for $\theta$ will contain all values of $\theta_0$ for which the null hypothesis $H_0$ would
be accepted.


11. Let $r_1 < r_2 < \cdots < r_m$ denote the ranks of the observed values in the first sample, and let
$X_{i_1} < X_{i_2} < \cdots < X_{i_m}$ denote the corresponding observed values. Then there are $r_1 - 1$ values of $Y$ in the
second sample that are smaller than $X_{i_1}$. Hence, there are $r_1 - 1$ pairs $(X_{i_1}, Y_j)$ with $X_{i_1} > Y_j$. Similarly,
there are $r_2 - 2$ values of $Y$ in the second sample that are smaller than $X_{i_2}$. Hence, there are $r_2 - 2$
pairs $(X_{i_2}, Y_j)$ with $X_{i_2} > Y_j$. By continuing in this way, we see that the number $U$ is equal to

\[ (r_1 - 1) + (r_2 - 2) + \cdots + (r_m - m) = \sum_{i=1}^{m} r_i - \sum_{i=1}^{m} i = S - \frac{1}{2} m(m + 1). \]
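The identity $U = S - m(m+1)/2$ can be illustrated on the data of Exercise 4. The Python sketch below counts the pairs directly; the label string is transcribed from Table S.10.12 and the variable names are illustrative.

```python
# Sample membership of the observations in rank order 1..40 (from Table S.10.12).
labels = ("xxxxxxxxxx" "y" "xxxxx" "yy" "xx" "yy" "xxx"
          "yyy" "x" "y" "xx" "yy" "xx" "yyyy")
m = labels.count("x")

# S: sum of the ranks of the first sample.
S = sum(rank for rank, lab in enumerate(labels, start=1) if lab == "x")

# U: number of (x, y) pairs in which the x value exceeds the y value,
# i.e. for each x, the number of y's that appear earlier in rank order.
U = sum(labels[:pos].count("y") for pos, lab in enumerate(labels) if lab == "x")

print(S, U, S - m * (m + 1) // 2)
```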


12. Using the result in Exercise 11, we find that $E(S) = E(U) + m(m+1)/2$, where $U$ is defined in Exercise 11
to be the number of $(X_i, Y_j)$ pairs for which $X_i > Y_j$. So, we need to show that $E(U) = nm \Pr(X_1 > Y_1)$.
We can let $Z_{i,j} = 1$ if $X_i > Y_j$ and $Z_{i,j} = 0$ otherwise. Then

\[ U = \sum_{i} \sum_{j} Z_{i,j} \qquad \text{(S.10.2)} \]

and

\[ E(U) = \sum_{i=1}^{m} \sum_{j=1}^{n} E(Z_{i,j}). \]

Since all of the $X_i$ are i.i.d. and all of the $Y_j$ are i.i.d., it follows that $E(Z_{i,j}) = E(Z_{1,1})$ for all $i$ and $j$.
Of course $E(Z_{1,1}) = \Pr(X_1 > Y_1)$, so $E(U) = mn \Pr(X_1 > Y_1)$.


13. Since $S$ and $U$ differ by a constant, we need to show that $\mathrm{Var}(U)$ is given by Eq. (10.8.6). Once again,
write

\[ U = \sum_{i} \sum_{j} Z_{i,j}, \]

where $Z_{i,j} = 1$ if $X_i > Y_j$ and $Z_{i,j} = 0$ otherwise. Hence,

\[ \mathrm{Var}(U) = \sum_{i} \sum_{j} \mathrm{Var}(Z_{i,j}) + \sum_{(i',j') \ne (i,j)} \mathrm{Cov}(Z_{i,j}, Z_{i',j'}). \]

The first sum is $mn[\Pr(X_1 > Y_1) - \Pr(X_1 > Y_1)^2]$. The second sum can be broken into three parts:


• The terms with $i' = i$ but $j' \ne j$.
• The terms with $j' = j$ but $i' \ne i$.
• The terms with both $i' \ne i$ and $j' \ne j$.


For the last set of terms $\mathrm{Cov}(Z_{i,j}, Z_{i',j'}) = 0$ since $(X_i, Y_j)$ is independent of $(X_{i'}, Y_{j'})$. For each term
in the first set

\[ E(Z_{i,j} Z_{i,j'}) = \Pr(X_1 > Y_1, X_1 > Y_2), \]

so the covariances are

\[ \mathrm{Cov}(Z_{i,j}, Z_{i,j'}) = \Pr(X_1 > Y_1, X_1 > Y_2) - \Pr(X_1 > Y_1)^2. \]

There are $mn(n - 1)$ terms of this sort. Similarly, for the second set of terms

\[ \mathrm{Cov}(Z_{i,j}, Z_{i',j}) = \Pr(X_1 > Y_1, X_2 > Y_1) - \Pr(X_1 > Y_1)^2. \]

There are $nm(m - 1)$ of these terms. The variance is then

\[ nm\left[\Pr(X_1 > Y_1) + (n - 1)\Pr(X_1 > Y_1, X_1 > Y_2) + (m - 1)\Pr(X_1 > Y_1, X_2 > Y_1) - (m + n - 1)\Pr(X_1 > Y_1)^2\right]. \]
14. When $F = G$, $\Pr(X_1 > Y_1) = 1/2$, so Eq. (10.8.5) yields

\[ E(S) = \frac{mn}{2} + \frac{m(m+1)}{2} = \frac{m(m+n+1)}{2}, \]

which is the same as (10.8.3). When $F = G$,

\[ \Pr(X_1 > Y_1, X_1 > Y_2) = 1/3 = \Pr(X_1 > Y_1, X_2 > Y_1), \]

so the corrected version of (10.8.6) yields

\[ \mathrm{Var}(S) = nm\left[\frac{1}{2} - (m+n-1)\frac{1}{4} + (m+n-2)\frac{1}{3}\right] = \frac{mn}{12}\left[6 - 3m - 3n + 3 + 4m + 4n - 8\right] = \frac{mn(m+n+1)}{12}, \]

which is the same as (10.8.4).
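The algebra in Exercise 14 can be spot-checked with exact rational arithmetic. The Python sketch below plugs the values 1/2 and 1/3 into the general variance expression from Exercise 13 and compares the result with $mn(m+n+1)/12$; the function name is hypothetical and the check covers only a small grid of sample sizes.

```python
from fractions import Fraction

def var_S_when_F_equals_G(m, n):
    """Variance expression of Exercise 13 evaluated at F = G."""
    p1 = Fraction(1, 2)   # Pr(X1 > Y1)
    p2 = Fraction(1, 3)   # Pr(X1 > Y1, X1 > Y2) = Pr(X1 > Y1, X2 > Y1)
    return m * n * (p1 + (n - 1) * p2 + (m - 1) * p2 - (m + n - 1) * p1 ** 2)

# Compare with the closed form mn(m + n + 1)/12 on a small grid.
agree = all(
    var_S_when_F_equals_G(m, n) == Fraction(m * n * (m + n + 1), 12)
    for m in range(1, 30)
    for n in range(1, 30)
)
print(agree, float(var_S_when_F_equals_G(25, 15)))
```

With m = 25 and n = 15 this gives 1281.25, the value used in Exercise 4.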
15. (a) Arrange the observations so that $|D_1| < \cdots < |D_n|$. Then $D_i > 0$ if and only if $X_i > Y_i$ if and
only if $W_i = 1$. Since rank $i$ gets added into $S_W$ if and only if $D_i > 0$, we see that $\sum_{i=1}^{n} i W_i$ adds
just those ranks that correspond to positive $D_i$.
(b) Since the distribution of each $D_i$ is symmetric around 0, the magnitudes $|D_1|, \ldots, |D_n|$ are
independent of the sign indicators $W_1, \ldots, W_n$. Using the result of part (a), if we assume that the $|D_i|$
are ordered from smallest to largest, $E(S_W) = \sum_{i=1}^{n} i E(W_i)$. Since the $|D_i|$ are independent of the
$W_i$, we have $E(W_i) = 1/2$ even after we condition on the $|D_i|$ being arranged from smallest to
largest. Since $\sum_{i=1}^{n} i = n(n+1)/2$, we have $E(S_W) = n(n+1)/4$.
(c) Since the $W_i$ are independent before we condition on the $|D_i|$ and they are independent of the $|D_i|$,
then the $W_i$ are independent conditional on the $|D_i|$. Hence, $\mathrm{Var}(S_W) = \sum_{i=1}^{n} i^2 \mathrm{Var}(W_i)$. Since
$\mathrm{Var}(W_i) = 1/4$ for all $i$ and $\sum_{i=1}^{n} i^2 = n(n+1)(2n+1)/6$, we have $\mathrm{Var}(S_W) = n(n+1)(2n+1)/24$.


16. For $i = 1, \ldots, 15$, let
$D_i$ = (thickness for material A in pair $i$) − (thickness for material B in pair $i$).


(a) Of the 15 values of $D_i$, 10 are positive, 3 are negative, and 2 are zero. If we first regard the
zeroes as positive, then there are 12 positive differences with $n = 15$, and it is found from the
binomial tables that the corresponding tail area is 0.0176. If we next regard the zeroes as negative,
then there are only 10 positive differences with $n = 15$, and it is found from the tables that the
corresponding tail area is 0.1509. The results are not conclusive because of the zeroes present in
the sample.

(b) For the Wilcoxon signed-ranks test, use Table S.10.14. Two different methods have been used. In
Method (I), the differences that are equal to 0 are regarded as positive, and whenever two or more
values of $|D_i|$ are tied, the positive differences $D_i$ are assigned the largest possible ranks and the
negative differences $D_i$ are assigned the smallest ranks. In Method (II), the differences that are 0
are regarded as negative, and among tied values of $|D_i|$, the negative differences are assigned the
largest ranks and the positive differences are assigned the smallest ranks. Let $S_n$ denote the sum
of the positive ranks. Since $n = 15$, $E(S_n) = 60$ and $\mathrm{Var}(S_n) = 310$. Hence, $\sigma_n = \sqrt{310} = 17.607$.
For Method (I), $S_n = 98$. Therefore, $Z_n = (98 - 60)/17.607 = 2.158$ and it is found from a table
of the standard normal distribution that the corresponding tail area is 0.0155. For Method (II),
$S_n = 92$. Therefore, $Z_n = 1.817$ and it is found that the corresponding tail area is 0.0346. By
either method of analysis, the null hypothesis would be rejected at the 0.05 level of significance,
but not at the 0.01 level.


Table S.10.14: Computation of Wilcoxon signed-ranks test statistic for Exercise 16b in Sec. 10.8.


                 Method (I)     Method (I)     Method (II)    Method (II)
Pair    D_i      Rank of |D_i|  Signed rank    Rank of |D_i|  Signed rank

  1     −0.8          6             −6              7             −7
  2      1.6         12             12             11             11
  3     −0.5          5             −5              5             −5
  4      0.2          3              3              3              3
  5     −1.6         11            −11             13            −13
  6      0.2          4              4              4              4
  7      1.6         13             13             12             12
  8      1.0          9              9              9              9
  9      0.8          7              7              6              6
 10      0.9          8              8              8              8
 11      1.7         14             14             14             14
 12      1.2         10             10             10             10
 13      1.9         15             15             15             15
 14      0            2              2              2             −2
 15      0            1              1              1             −1


(c) The average of the pairwise differences (material A minus material B) is 0.5467. The value of $\sigma'$
computed from the differences is 1.0197, so the $t$ statistic is 2.076, and the p-value is 0.0284.
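The two signed-rank sums in part (b) can be reproduced with a short Python sketch. The differences are transcribed from Table S.10.14; the function `positive_rank_sum` and its `favor_positive` flag are hypothetical names that encode the two tie-handling conventions described above (Method (I) when the flag is true, Method (II) when it is false).

```python
import math

# Differences D_i for pairs 1..15 (from Table S.10.14).
D = [-0.8, 1.6, -0.5, 0.2, -1.6, 0.2, 1.6, 1.0, 0.8, 0.9, 1.7, 1.2, 1.9, 0.0, 0.0]

def positive_rank_sum(diffs, favor_positive):
    """Sum of ranks of |D_i| counted as positive.

    favor_positive=True  : zeros count as positive; among ties, positive
                           differences receive the larger ranks (Method I).
    favor_positive=False : zeros count as negative; among ties, negative
                           differences receive the larger ranks (Method II).
    """
    def is_positive(d):
        return d > 0 or (d == 0 and favor_positive)

    def sort_key(d):
        # Within a tie on |d|, the favored sign sorts last (larger rank).
        return (abs(d), is_positive(d) == favor_positive)

    ordered = sorted(diffs, key=sort_key)
    return sum(rank for rank, d in enumerate(ordered, start=1) if is_positive(d))

n = len(D)
mean_S = n * (n + 1) / 4
sd_S = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)

S1 = positive_rank_sum(D, favor_positive=True)
S2 = positive_rank_sum(D, favor_positive=False)
Z1 = (S1 - mean_S) / sd_S
Z2 = (S2 - mean_S) / sd_S
print(S1, round(Z1, 3), S2, round(Z2, 3))
```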


10.9 Supplementary Exercises 


Solutions to Exercises 


1. Here, $\alpha_0/2 = 0.025$. From a table of binomial probabilities we find that

\[ \sum_{x=0}^{5} \binom{20}{x} 0.5^{20} = 0.021 < 0.025 < \sum_{x=0}^{6} \binom{20}{x} 0.5^{20} = 0.058. \]

So, the sign test would reject the null hypothesis that $\theta = \theta_0$ if the number $W$ of observations with values
at most $\theta_0$ satisfies either $W \le 5$ or $W \ge 20 - 5$. Equivalently, we would accept the null hypothesis
if $6 \le W \le 14$. This, in turn, is true if and only if $\theta_0$ is strictly between the sixth and fourteenth
ordered values of the original data. These values are 141 and 175, so our 95 percent confidence interval
is (141, 175).
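The two binomial sums quoted in Exercise 1 can be recomputed without tables. The helper name `binom_cdf` in this Python sketch is illustrative.

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """Pr(W <= k) for W binomial with parameters n and p."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))

lower = binom_cdf(5, 20)   # sum over x = 0..5
upper = binom_cdf(6, 20)   # sum over x = 0..6
print(round(lower, 3), round(upper, 3))
```

Since the first sum is below 0.025 and the second is above it, 5 is the largest cutoff for which each tail has probability at most $\alpha_0/2$.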
2. It follows from Eq. (10.1.2) that

\[ Q = \sum_{i=1}^{5} \frac{(N_i - 80)^2}{80} = \frac{1}{80}\left[\sum_{i=1}^{5} N_i^2 - 2(80)(400) + 5(80)^2\right] = \frac{1}{80}\sum_{i=1}^{5} N_i^2 - 400. \]

It is found from the $\chi^2$ distribution with 4 degrees of freedom that $H_0$ should be rejected for $Q > 13.28$
or, equivalently, for $\sum_{i=1}^{5} N_i^2 > 80(413.28) = 33{,}062.4$.


3. Under $H_0$, the proportion $p_i^0$ of families with $i$ boys is as follows:


i    p_i^0    n p_i^0
0    1/8      16
1    3/8      48
2    3/8      48
3    1/8      16


Hence, it follows from Eq. (10.1.2) that

\[ Q = \frac{(26 - 16)^2}{16} + \frac{(32 - 48)^2}{48} + \frac{(40 - 48)^2}{48} + \frac{(30 - 16)^2}{16} = 25.1667. \]

Under $H_0$, $Q$ has the $\chi^2$ distribution with 3 degrees of freedom. Hence, the tail area corresponding to
$Q = 25.1667$ is less than 0.005, the smallest probability in the table in the back of the book. It follows
that $H_0$ should be rejected for any level of significance greater than this tail area.
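As a check on the arithmetic in Exercise 3, the statistic $Q$ can be recomputed from the observed counts 26, 32, 40, 30 and the null proportions in the table. This Python sketch uses illustrative variable names.

```python
# Observed numbers of families with 0, 1, 2, 3 boys (from the solution above).
observed = [26, 32, 40, 30]
# Null proportions p_i^0 under H_0.
null_probs = [1 / 8, 3 / 8, 3 / 8, 1 / 8]

n = sum(observed)
expected = [n * p for p in null_probs]

# Chi-square goodness-of-fit statistic, as in Eq. (10.1.2).
Q = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(n, expected, round(Q, 4))
```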


4. The likelihood function of $p$ based on the observed data is

\[ (q^3)^{26} (3pq^2)^{32} (3p^2q)^{40} (p^3)^{30} = (\text{const.})\, p^{202} q^{182}, \]

where $q = 1 - p$. Hence, the M.L.E. of $p$ based on these data is $\hat{p} = 202/384 = .526$. Under $H_0$, the
estimated expected proportion $\hat{p}_i^0$ of families with $i$ boys is as follows:


i    \hat{p}_i^0                          n \hat{p}_i^0
0    \hat{q}^3 = .1065                    13.632
1    3 \hat{p} \hat{q}^2 = .3545          45.376
2    3 \hat{p}^2 \hat{q} = .3935          50.368
3    \hat{p}^3 = .1455                    18.624


It follows from Eq. (10.2.4) that

\[ Q = \frac{(26 - 13.632)^2}{13.632} + \frac{(32 - 45.376)^2}{45.376} + \frac{(40 - 50.368)^2}{50.368} + \frac{(30 - 18.624)^2}{18.624} = 24.247. \]

Under $H_0$, $Q$ has the $\chi^2$ distribution with 2 degrees of freedom. The tail area corresponding to
$Q = 24.247$ is again less than 0.005. $H_0$ should be rejected for any level of significance greater than
this tail area.


5. The expected numbers of observations in each cell, as specified by Eq. (10.4.4), are presented in the
following table:

          A    B    AB    O


Group 1
Group 2
Group 3


It is now found from Eq. (10.4.3) that $Q = 6.9526$. Under the hypothesis $H_0$ that the distribution is
the same in all three groups, $Q$ will have approximately the $\chi^2$ distribution with $3 \times 2 = 6$ degrees of
freedom. It is found from the tables that the 0.9 quantile of that distribution is 10.64, so $H_0$ should
not be rejected.


6. If Table 10.47 is changed in such a way that the row and column totals remain unchanged, then the
expected numbers given in the solution of Exercise 5 will remain unchanged. If we switch one person
in group 1 from B to AB and one person in group 2 from AB to B, then all row and column totals
will be unchanged and the new observed numbers in each of the four affected cells will be further from
their expected values than before. Hence, the value of the $\chi^2$ statistic $Q$ as given by Eq. (10.4.3) is
increased. Continuing to switch persons in this way will continue to increase $Q$. There are other similar
switches that will also increase $Q$, such as switching one person in group 2 from O to A and one person
in group 3 from A to O.


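7. We have

\[ (N_{11} - \hat{E}_{11})^2 = \left(N_{11} - \frac{N_{1+} N_{+1}}{n}\right)^2 = \left[N_{11} - \frac{(N_{11} + N_{12})(N_{11} + N_{21})}{n}\right]^2 \]
\[ = \frac{1}{n^2}\left[n N_{11} - (N_{11} + N_{12})(N_{11} + N_{21})\right]^2 = \frac{1}{n^2}(N_{11} N_{22} - N_{12} N_{21})^2, \]

since $n = N_{11} + N_{12} + N_{21} + N_{22}$. Exactly the same value is obtained for $(N_{12} - \hat{E}_{12})^2$, $(N_{21} - \hat{E}_{21})^2$,
and $(N_{22} - \hat{E}_{22})^2$.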


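8. It follows from Eq. (10.3.4) and Exercise 7 that

\[ Q = \frac{1}{n^2}(N_{11} N_{22} - N_{12} N_{21})^2 \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{1}{\hat{E}_{ij}}. \]

But

\[ \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{1}{\hat{E}_{ij}} = \frac{n}{N_{1+} N_{+1}} + \frac{n}{N_{1+} N_{+2}} + \frac{n}{N_{2+} N_{+1}} + \frac{n}{N_{2+} N_{+2}} \]
\[ = \frac{n(N_{2+} N_{+2} + N_{2+} N_{+1} + N_{1+} N_{+2} + N_{1+} N_{+1})}{N_{1+} N_{2+} N_{+1} N_{+2}} = \frac{n^3}{N_{1+} N_{2+} N_{+1} N_{+2}}, \]

since $N_{1+} + N_{2+} = N_{+1} + N_{+2} = n$. Hence, $Q$ has the specified form.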


9. In this exercise, $N_{1+} = N_{2+} = N_{+1} = N_{+2} = 2n$ and $N_{11} N_{22} - N_{12} N_{21} = (n + a)^2 - (n - a)^2 = 4na$.
It now follows from Exercise 8 (after we replace $n$ by $4n$ in the expression for $Q$) that $Q = 4a^2/n$.
Since $H_0$ should be rejected if $Q > 6.635$, it follows that $H_0$ should be rejected if $a > (6.635n)^{1/2}/2$ or
$a < -(6.635n)^{1/2}/2$.


10. In this exercise $N_{1+} = N_{2+} = N_{+1} = N_{+2} = n$ and $N_{11} N_{22} - N_{12} N_{21} = (2\alpha - 1)n^2$. It now follows from
Exercise 8 (after we replace $n$ by $2n$ in the expression for $Q$) that $Q = 2n(2\alpha - 1)^2$. Since $H_0$ should
be rejected if $Q > 3.841$, it follows that $H_0$ should be rejected if either

\[ \alpha > \frac{1}{2}\left[1 + \left(\frac{3.841}{2n}\right)^{1/2}\right] \]

or

\[ \alpha < \frac{1}{2}\left[1 - \left(\frac{3.841}{2n}\right)^{1/2}\right]. \]


11. Results of this type are an example of Simpson's paradox. If there is a higher rate of respiratory diseases
among older people than among younger people, and if city A has a higher proportion of older people
than city B, then results of this type can very well occur.


12. Results of this type are another example of Simpson's paradox. If scores on the test tend to be higher
for certain classes, such as seniors and juniors, and lower for the other classes, such as freshmen and
sophomores, and if school B has a higher proportion of seniors and juniors than school A, then results
of this type can very well occur.


13. The fundamental aspect of this exercise is that it is not possible to assess the effectiveness of the
treatment without having any information about how the levels of depression of the patients would 
have changed over the three-month period if they had not received the treatment. In other words, 
without the presence of a control group of similar patients who received some other standard treatment 
or no treatment at all, there is little meaningful statistical analysis that can be carried out. We can 
compare the proportion of patients at various levels who showed improvement after the treatment with 
the proportion who remained the same or worsened, but without a control group we have no way of 
deciding whether these proportions are unusually large or small. 


14. If $Y_1 < Y_2 < Y_3$ are the order statistics of the sample, then $Y_2$ is the sample median. For $0 < y < 1$,

\[ G(y) = \Pr(Y_2 < y) \]
\[ = \Pr(\text{At least two obs.} < y) \]
\[ = \Pr(\text{Exactly two obs.} < y) + \Pr(\text{All three obs.} < y) \]
\[ = 3(y^{\theta})^2 (1 - y^{\theta}) + (y^{\theta})^3 \]
\[ = 3y^{2\theta} - 2y^{3\theta}. \]

Hence, for $0 < y < 1$, the p.d.f. of $Y_2$ is $g(y) = G'(y) = 6\theta(y^{2\theta - 1} - y^{3\theta - 1})$.


15. The c.d.f. of this distribution is $F(x) = x^{\theta}$, so the median of the distribution is the point $m$ such
that $m^{\theta} = 1/2$. Thus, $m = (1/2)^{1/\theta}$ and $f(m) = \theta 2^{1/\theta}/2$. It follows from Theorem 10.7.1 that the
asymptotic distribution of the sample median will be normal with mean $m$ and variance

\[ \frac{1}{4nf^2(m)} = \frac{1}{n\theta^2 2^{2/\theta}}. \]
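16. We know from Exercise 1 of Sec. 8.4 that the variance of the $t$ distribution is finite only for $\alpha > 2$ and
its value is $\alpha/(\alpha - 2)$. Hence, it follows from the central limit theorem that for $\alpha > 2$, the asymptotic
distribution of $\bar{X}_n$ will be normal with mean 0 and variance

\[ \sigma_1^2 = \frac{\alpha}{n(\alpha - 2)}. \]

Since the median of the $t$ distribution is 0, it follows from Theorem 10.7.1 (with $n$ replaced by $\alpha$) that
the asymptotic distribution of $\tilde{X}_n$ will be normal with mean 0 and variance

\[ \sigma_2^2 = \frac{\alpha \pi\, \Gamma^2\!\left(\frac{\alpha}{2}\right)}{4n\, \Gamma^2\!\left(\frac{\alpha + 1}{2}\right)}. \]

Thus, $\sigma_1^2 < \sigma_2^2$ if and only if

\[ \frac{\pi(\alpha - 2)\, \Gamma^2\!\left(\frac{\alpha}{2}\right)}{4\, \Gamma^2\!\left(\frac{\alpha + 1}{2}\right)} > 1. \]

The left side takes the following values:

\alpha    \Gamma(\alpha/2)    \Gamma((\alpha+1)/2)    Left side
3         \sqrt{\pi}/2        1                       \pi^2/16
4         1                   3\sqrt{\pi}/4           8/9
5         3\sqrt{\pi}/4       2                       27\pi^2/(16)^2 = 1.04


Thus, $\sigma_1^2 < \sigma_2^2$ for $\alpha = 5, 6, 7, \ldots$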
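The comparison in Exercise 16 can be explored numerically. The Python sketch below evaluates the ratio $\sigma_2^2/\sigma_1^2 = \pi(\alpha-2)\Gamma^2(\alpha/2)/[4\Gamma^2((\alpha+1)/2)]$ using `math.gamma`; the function name `variance_ratio` is illustrative, and only a finite range of integer $\alpha$ is scanned.

```python
import math

def variance_ratio(alpha):
    """Asymptotic variance of the sample median divided by that of the sample
    mean for the t distribution with alpha > 2 degrees of freedom."""
    g1 = math.gamma(alpha / 2)
    g2 = math.gamma((alpha + 1) / 2)
    return math.pi * (alpha - 2) * g1 ** 2 / (4 * g2 ** 2)

ratios = {alpha: variance_ratio(alpha) for alpha in range(3, 11)}
# Smallest integer alpha in the scanned range for which the mean has the
# smaller asymptotic variance.
first_alpha = min(a for a, r in ratios.items() if r > 1)
for alpha, r in ratios.items():
    print(alpha, round(r, 4))
```

In the scanned range the ratio first exceeds 1 at $\alpha = 5$, in line with the table above.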


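17. As shown in Exercise 5 of Sec. 10.7, $E(\bar{X}_n) = E(\tilde{X}_n) = 0$, so the M.S.E. of each of these estimators is
equal to its variance. Furthermore, $\mathrm{Var}(\bar{X}_n) = \frac{1}{n}[\alpha \cdot 1 + (1 - \alpha)\sigma^2]$ and

\[ \mathrm{Var}(\tilde{X}_n) = \frac{1}{4n[h(0)]^2}, \]

where

\[ h(0) = \frac{1}{(2\pi)^{1/2}}\left(\alpha + \frac{1 - \alpha}{\sigma}\right). \]

(a) For $\sigma^2 = 100$, $\mathrm{Var}(\tilde{X}_n) < \mathrm{Var}(\bar{X}_n)$ if and only if

\[ \frac{50\pi}{(9\alpha + 1)^2} < \alpha + 100(1 - \alpha). \]

Some numerical calculations show that this inequality is satisfied for $.031 < \alpha < .994$.


(b) For $\alpha = 1/2$, $\mathrm{Var}(\tilde{X}_n) < \mathrm{Var}(\bar{X}_n)$ if and only if $\sigma < .447$ or $\sigma > 1/.447 = 2.237$.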
18. The simplest and most intuitive way to establish this result is to note that for any fixed values
$y_1 < y_2 < \cdots < y_n$,

\[ g(y_1, \ldots, y_n)\,\Delta y_1 \cdots \Delta y_n \approx \Pr(y_1 < Y_1 < y_1 + \Delta y_1, \ldots, y_n < Y_n < y_n + \Delta y_n) \]
\[ = \Pr(\text{Exactly one observation in the interval } (y_j, y_j + \Delta y_j) \text{ for } j = 1, \ldots, n) \]
\[ = n! \prod_{j=1}^{n} [F(y_j + \Delta y_j) - F(y_j)] \approx n! \prod_{j=1}^{n} f(y_j)\,\Delta y_j, \]

where the factor $n!$ appears because there are $n!$ different arrangements of $X_1, \ldots, X_n$ such that exactly
one of them is in each of the intervals $(y_j, y_j + \Delta y_j)$, $j = 1, \ldots, n$. Another, and somewhat more
complicated, way to establish this result is to determine the general form of the joint c.d.f. $G(y_1, \ldots, y_n)$
of $Y_1, \ldots, Y_n$ for $y_1 < y_2 < \cdots < y_n$, and then to note that

\[ g(y_1, \ldots, y_n) = \frac{\partial^n G(y_1, \ldots, y_n)}{\partial y_1 \cdots \partial y_n} = n!\, f(y_1) \cdots f(y_n). \]


19. It follows from Exercise 18 that the joint p.d.f. $g(y_1, y_2, y_3) = 3!$, a constant, for $0 < y_1 < y_2 < y_3 < 1$.
Since the required conditional p.d.f. of $Y_2$ is proportional to $g(y_1, y_2, y_3)$, as a function of $y_2$ for fixed $y_1$
and $y_3$, it follows that this conditional p.d.f. is also constant. In other words, the required conditional
distribution is uniform on the interval $(y_1, y_3)$.


20. We have $Y_r < \theta < Y_{r+3}$ if and only if at least $r$ observations and at most $r + 2$ observations are below $\theta$.
Let $X$ stand for the number of observations out of the sample of size 20 that are below $\theta$. Then $X$ has
a binomial distribution with parameters 20 and 0.3. It follows that

\[ \Pr(Y_r < \theta < Y_{r+3}) = \Pr(r \le X \le r + 2). \]

For each value of $r$, we can find this probability using a binomial distribution table or a computer. By
searching through all values of $r$, we find that $r = 5$ yields a probability of 0.5348, which is the highest.
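The search over $r$ described in Exercise 20 is easy to carry out by computer. The following Python sketch is one way to do it; the names `binom_pmf` and `coverage` are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr(X = k) for X binomial with parameters n and p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.3

# Pr(r <= X <= r + 2) for every r such that Y_{r+3} exists (r + 3 <= n).
coverage = {
    r: sum(binom_pmf(x, n, p) for x in range(r, r + 3))
    for r in range(1, n - 2)
}

best_r = max(coverage, key=coverage.get)
print(best_r, round(coverage[best_r], 4))
```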


21. As shown in Exercise 10 of Sec. 10.8, we add $\theta_0$ to each observation $Y_j$ and then carry out the Wilcoxon-
Mann-Whitney test on the sum $S_{\theta_0}$ of the ranks of the $X_i$'s among these new values $Y_1 + \theta_0, \ldots, Y_n + \theta_0$.
We accept $H_0$ if and only if

\[ \frac{|S_{\theta_0} - E(S)|}{[\mathrm{Var}(S)]^{1/2}} \le c\left(1 - \frac{\alpha}{2}\right), \]

where $E(S)$ and $\mathrm{Var}(S)$ are given by (10.8.3) and (10.8.4). However, by Exercise 11 of Sec. 10.8,

\[ S_{\theta_0} = U_{\theta_0} + \frac{1}{2} m(m + 1). \]

When we make this substitution for $S_{\theta_0}$ in the above inequality, we obtain the desired result.


22. We know from general principles that the set of all values $\theta_0$ for which $H_0$ would be accepted in
Exercise 21 will form a confidence interval with the required confidence coefficient $1 - \alpha$. But if $U_{\theta_0}$, the
number of differences $X_i - Y_j$ that are greater than $\theta_0$, is greater than the lower limit given in Exercise 21
then $\theta_0$ must be less than $B$. Similarly, if $U_{\theta_0}$ is less than the upper limit given in Exercise 21, then $\theta_0$
must be greater than $A$. Hence, $A < \theta < B$ is a confidence interval.
23. (a) We know that $\theta_p = b$ if and only if $\Pr(X \le b) = p$. So, let $Y_i = 1$ if $X_i \le b$ and $Y_i = 0$ if not.
Then $Y_1, \ldots, Y_n$ are i.i.d. with a Bernoulli distribution with parameter $p$ if and only if $H_0$ is true.
Define $W = \sum_{i=1}^{n} Y_i$. To test $H_0$, reject $H_0$ if $W$ is too big or too small. For an equal tailed level
$\alpha_0$ test, choose two numbers $c_1 < c_2$ such that

\[ \sum_{w=0}^{c_1} \binom{n}{w} p^w (1 - p)^{n-w} \le \frac{\alpha_0}{2} < \sum_{w=0}^{c_1 + 1} \binom{n}{w} p^w (1 - p)^{n-w}, \]
\[ \sum_{w=c_2}^{n} \binom{n}{w} p^w (1 - p)^{n-w} \le \frac{\alpha_0}{2} < \sum_{w=c_2 - 1}^{n} \binom{n}{w} p^w (1 - p)^{n-w}. \]

Then a level $\alpha_0$ test rejects $H_0$ if $W \le c_1$ or $W \ge c_2$.


(b) For each $b$, we have shown how to construct a test of $H_{0,b} : \theta_p = b$. For given observed data
$X_1, \ldots, X_n$ find all values of $b$ such that the test constructed in part (a) accepts $H_{0,b}$. The set
of all such $b$ forms our coefficient $1 - \alpha_0$ confidence interval. It is clear from the form of the test
that, once we find three values $b_1 < b_2 < b_3$ such that $H_{0,b_2}$ is accepted and $H_{0,b_1}$ and $H_{0,b_3}$ are
rejected, we don't have to check any more values of $b < b_1$ or $b > b_3$ since all of those would be
rejected also. Similarly, if we find $b_4 < b_5$ such that both $H_{0,b_4}$ and $H_{0,b_5}$ are accepted, then so are
$H_{0,b}$ for all $b_4 < b < b_5$. This will save some time locating all of the necessary $b$ values.


Chapter 11 


Linear Statistical Models 


11.1 The Method of Least Squares 


Commentary 


If one is using the software R, the functions lsfit and lm will perform least squares. While lsfit has
simpler syntax, lm is more powerful. The first argument to lsfit is a matrix or vector with one row for each
observation and one column for each x variable in the notation of the text (call this x). The second argument
is a vector of the response values, one for each observation (call this y). By default, an intercept is fit. To
prevent an intercept from being fit, use the optional argument intercept=FALSE. To perform the fit and store
the result in regfit, use regfit=lsfit(x,y). The result regfit is a "list" which contains (among other
things) coef, the vector of coefficients β0, ..., βk in the notation of the text, and residuals, which are defined
later in the text. To access the parts of regfit, use regfit$coef, etc. To use lm, regfit=lm(y~x) will
perform least squares with an intercept and store the result in regfit. To prevent an intercept from being fit,
use regfit=lm(y~x-1). The result of lm also contains coefficients and residuals plus fitted.values,
which equals the original y minus residuals. The components of the output are accessed as above.

The plot function in R is useful for visualizing data in linear models. In the notation above, suppose
that x has only one column. Then plot(x,y) will produce a scatterplot of y versus x. The least-squares
line can be added to the plot by lines(x,regfit$fitted.values). (If one used lsfit, one can create the
fitted values by regfit$fitted.values=y-regfit$residuals.)


Solutions to Exercises 


1. First write $c_1 x_i + c_2 = c_1(x_i - \bar{x}_n) + (c_1 \bar{x}_n + c_2)$ for every $i$. Then

\[ (c_1 x_i + c_2)^2 = c_1^2 (x_i - \bar{x}_n)^2 + (c_1 \bar{x}_n + c_2)^2 + 2c_1 (x_i - \bar{x}_n)(c_1 \bar{x}_n + c_2). \]

The sum over all $i$ from 1 to $n$ of the first two terms on the right produces the formula we desire. The
sum of the last term over all $i$ is 0 because $c_1(c_1 \bar{x}_n + c_2)$ is the same for all $i$ and $\sum_{i=1}^{n} (x_i - \bar{x}_n) = 0$.


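2. (a) The result can be obtained from Eq. (11.1.1) and the following relations:

\[ \sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n) = \sum_{i=1}^{n} (x_i y_i - \bar{x}_n y_i - \bar{y}_n x_i + \bar{x}_n \bar{y}_n) \]
\[ = \sum_{i=1}^{n} x_i y_i - \bar{x}_n \sum_{i=1}^{n} y_i - \bar{y}_n \sum_{i=1}^{n} x_i + n \bar{x}_n \bar{y}_n \]
\[ = \sum_{i=1}^{n} x_i y_i - n \bar{x}_n \bar{y}_n - n \bar{x}_n \bar{y}_n + n \bar{x}_n \bar{y}_n \]
\[ = \sum_{i=1}^{n} x_i y_i - n \bar{x}_n \bar{y}_n, \quad \text{and} \]

\[ \sum_{i=1}^{n} (x_i - \bar{x}_n)^2 = \sum_{i=1}^{n} (x_i^2 - 2\bar{x}_n x_i + \bar{x}_n^2) \]
\[ = \sum_{i=1}^{n} x_i^2 - 2\bar{x}_n \sum_{i=1}^{n} x_i + n \bar{x}_n^2 \]
\[ = \sum_{i=1}^{n} x_i^2 - 2n \bar{x}_n^2 + n \bar{x}_n^2 \]
\[ = \sum_{i=1}^{n} x_i^2 - n \bar{x}_n^2. \]


(b) The result can be obtained from part (a) and the following relation:

\[ \sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n) = \sum_{i=1}^{n} (x_i - \bar{x}_n) y_i - \sum_{i=1}^{n} (x_i - \bar{x}_n) \bar{y}_n \]
\[ = \sum_{i=1}^{n} (x_i - \bar{x}_n) y_i - \bar{y}_n \sum_{i=1}^{n} (x_i - \bar{x}_n) \]
\[ = \sum_{i=1}^{n} (x_i - \bar{x}_n) y_i, \quad \text{since } \sum_{i=1}^{n} (x_i - \bar{x}_n) = 0. \]

(c) This result can be obtained from part (a) and the following relation:

\[ \sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n) = \sum_{i=1}^{n} x_i (y_i - \bar{y}_n) - \bar{x}_n \sum_{i=1}^{n} (y_i - \bar{y}_n) \]
\[ = \sum_{i=1}^{n} x_i (y_i - \bar{y}_n), \quad \text{since } \sum_{i=1}^{n} (y_i - \bar{y}_n) = 0. \]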

3. It must be shown that $\bar{y}_n = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}_n$. But this result follows immediately from the expression for $\hat{\beta}_0$
given in Eq. (11.1.1).


4. Since the values of $\beta_0$ and $\beta_1$ to be chosen must satisfy the relation $\partial Q/\partial \beta_0 = 0$, it follows from Eq.
(11.1.3) that they must satisfy the relation $\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0$. Similarly, since they must also satisfy
the relation $\partial Q/\partial \beta_1 = 0$, it follows from Eq. (11.1.4) that $\sum_{i=1}^{n} (y_i - \hat{y}_i) x_i = 0$. These two relations are
equivalent to the normal equations (11.1.5), for which $\hat{\beta}_0$ and $\hat{\beta}_1$ are the unique solution.


5. The least squares line will have the form $x = \gamma_0 + \gamma_1 y$, where $\gamma_0$ and $\gamma_1$ are defined similarly to $\hat{\beta}_0$ and
$\hat{\beta}_1$ in Eq. (11.1.1) with the roles of $x$ and $y$ interchanged. Thus,

\[ \hat{\gamma}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x}_n \bar{y}_n}{\sum_{i=1}^{n} y_i^2 - n \bar{y}_n^2} \]

and $\hat{\gamma}_0 = \bar{x}_n - \hat{\gamma}_1 \bar{y}_n$. It is found that $\hat{\gamma}_1 = 0.9394$ and $\hat{\gamma}_0 = 1.5691$. Hence, the least squares line is
$x = 1.5691 + 0.9394y$ or, equivalently, $y = -1.6703 + 1.0645x$. This line and the line $y = -0.786 + 0.685x$
given in Fig. 11.4 can now be sketched on the same graph.
6. The sum of the squares of the deviations from the least squares parabola is the minimum possible value
of $\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2)^2$ as $\beta_0$, $\beta_1$, and $\beta_2$ vary over all possible real numbers. The sum of the
squares of the deviations from the least squares line is the minimum possible value of this same sum
when $\beta_2$ is restricted to have the value 0. This minimum cannot therefore be smaller than the first
minimum.


7. (a) Here, $n = 8$, $\bar{x}_n = 2.25$, $\bar{y}_n = 42.125$, $\sum_{i=1}^{n} x_i y_i = 764$, and $\sum_{i=1}^{n} x_i^2 = 51$. Therefore, by Eq. (11.1.1),
$\hat{\beta}_1 = 0.548$ and $\hat{\beta}_0 = 40.893$. Hence, the least squares line is $y = 40.893 + 0.548x$.


(b) The normal equations (11.1.8) are found to be:

\[ 8\beta_0 + 18\beta_1 + 51\beta_2 = 337, \]
\[ 18\beta_0 + 51\beta_1 + 162\beta_2 = 764, \]
\[ 51\beta_0 + 162\beta_1 + 548.25\beta_2 = 2167.5. \]

Solving these three simultaneous linear equations, we obtain the solution:

$\hat{\beta}_0 = 38.483$, $\hat{\beta}_1 = 3.440$, and $\hat{\beta}_2 = -0.643$.
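Both parts of Exercise 7 can be checked numerically from the summary values quoted above. The Python sketch below computes the least squares line from Eq. (11.1.1) and solves the three normal equations with a small Gauss-Jordan routine; the helper `solve_linear_system` is written here for illustration and is not from the text. Its output should agree with the quoted coefficients to within about 0.001, the level of rounding in the solution.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    size = len(b)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(size):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][size] / M[i][i] for i in range(size)]

# Part (a): straight line from the summary statistics.
n, xbar, ybar, sum_xy, sum_xx = 8, 2.25, 42.125, 764.0, 51.0
b1 = (sum_xy - n * xbar * ybar) / (sum_xx - n * xbar ** 2)
b0 = ybar - b1 * xbar

# Part (b): normal equations for the parabola.
A = [[8.0, 18.0, 51.0],
     [18.0, 51.0, 162.0],
     [51.0, 162.0, 548.25]]
rhs = [337.0, 764.0, 2167.5]
beta = solve_linear_system(A, rhs)

print(round(b0, 3), round(b1, 3), [round(v, 3) for v in beta])
```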


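8. If the polynomial is to pass through the $k + 1$ given points, then the following $k + 1$ equations must be
satisfied:

\[ \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_1^k = y_1, \]
\[ \beta_0 + \beta_1 x_2 + \cdots + \beta_k x_2^k = y_2, \]
\[ \vdots \]
\[ \beta_0 + \beta_1 x_{k+1} + \cdots + \beta_k x_{k+1}^k = y_{k+1}. \]

These equations form a set of $k + 1$ simultaneous linear equations in $\beta_0, \ldots, \beta_k$. There will be a unique
polynomial having the required properties if and only if these equations have a unique solution. These
equations will have a unique solution if and only if the $(k+1) \times (k+1)$ matrix of coefficients of $\beta_0, \ldots, \beta_k$
is nonsingular (i.e., has a nonzero determinant). Thus, it must be shown that

\[ \det \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^k \\ 1 & x_2 & x_2^2 & \cdots & x_2^k \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{k+1} & x_{k+1}^2 & \cdots & x_{k+1}^k \end{pmatrix} \ne 0. \]

This determinant will be 0 if and only if the $k + 1$ columns of the matrix are linearly dependent; i.e.,
if and only if there exist constants $a_1, \ldots, a_{k+1}$, not all of which are 0, such that

\[ a_1 \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} + a_2 \begin{pmatrix} x_1 \\ \vdots \\ x_{k+1} \end{pmatrix} + a_3 \begin{pmatrix} x_1^2 \\ \vdots \\ x_{k+1}^2 \end{pmatrix} + \cdots + a_{k+1} \begin{pmatrix} x_1^k \\ \vdots \\ x_{k+1}^k \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}. \]

But if such constants exist, then the $k + 1$ distinct values $x_1, \ldots, x_{k+1}$ will all be roots of the equation

\[ a_1 + a_2 x + a_3 x^2 + \cdots + a_{k+1} x^k = 0. \]

It is impossible, however, for a polynomial of degree $k$ or less to have $k + 1$ distinct roots unless all the
coefficients $a_1, \ldots, a_{k+1}$ are 0. It now follows that the determinant cannot be 0.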
9. The normal equations (11.1.13) are found to be

\[ 10\beta_0 + 1170\beta_1 + 18\beta_2 = 1359, \]
\[ 1170\beta_0 + 138{,}100\beta_1 + 2130\beta_2 = 160{,}380, \]
\[ 18\beta_0 + 2130\beta_1 + 38\beta_2 = 2483. \]

Solving these three simultaneous linear equations, we obtain the solution $\hat{\beta}_0 = 3.7148$, $\hat{\beta}_1 = 1.1013$, and
$\hat{\beta}_2 = 1.8517$.


10. We begin by taking the partial derivative of the following sum with respect to $\beta_0$, $\beta_1$, and $\beta_2$, respectively:

\[ \sum_{i=1}^{n} (y_i - \beta_0 x_{i1} - \beta_1 x_{i2} - \beta_2 x_{i2}^2)^2. \]

By setting each of these derivatives equal to 0, we obtain the following normal equations:

\[ \beta_0 \sum_{i=1}^{n} x_{i1}^2 + \beta_1 \sum_{i=1}^{n} x_{i1} x_{i2} + \beta_2 \sum_{i=1}^{n} x_{i1} x_{i2}^2 = \sum_{i=1}^{n} x_{i1} y_i, \]
\[ \beta_0 \sum_{i=1}^{n} x_{i1} x_{i2} + \beta_1 \sum_{i=1}^{n} x_{i2}^2 + \beta_2 \sum_{i=1}^{n} x_{i2}^3 = \sum_{i=1}^{n} x_{i2} y_i, \]
\[ \beta_0 \sum_{i=1}^{n} x_{i1} x_{i2}^2 + \beta_1 \sum_{i=1}^{n} x_{i2}^3 + \beta_2 \sum_{i=1}^{n} x_{i2}^4 = \sum_{i=1}^{n} x_{i2}^2 y_i. \]

When the given numerical data are used, these equations are found to be:

\[ 138{,}100\beta_0 + 2130\beta_1 + 4550\beta_2 = 160{,}380, \]
\[ 2130\beta_0 + 38\beta_1 + 90\beta_2 = 2483, \]
\[ 4550\beta_0 + 90\beta_1 + 230\beta_2 = 5305. \]

Solving these three simultaneous linear equations, we obtain the solution $\hat{\beta}_0 = 1.0270$, $\hat{\beta}_1 = 17.2934$,
and $\hat{\beta}_2 = -4.0186$.


11. In Exercise 9, it is found that

\[ \sum_{i=1}^{10} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2})^2 = 102.28. \]

In Exercise 10, it is found that

\[ \sum_{i=1}^{10} (y_i - \hat{\beta}_0 x_{i1} - \hat{\beta}_1 x_{i2} - \hat{\beta}_2 x_{i2}^2)^2 = 42.72. \]

Therefore, a better fit is obtained in Exercise 10.




11.2 Regression 


Commentary 


The regression fallacy is an interesting issue that students ought to see. The description of the regression 
fallacy appears in Exercise 19 of this section. The discussion at the end of the section on “Design of the 
Experiment” is mostly of mathematical interest and could be skipped without disrupting the flow of material. 

If one is using the software R, the variances and covariance of the least-squares estimators can be computed
using the function ls.diag or the function summary.lm. The first takes as argument the result of lsfit, and
the second takes as argument the result of lm. Both functions return a list containing a matrix that can be
extracted via $cov.unscaled. For example, using the notation in the Commentary to Sec. 11.1 above, if we
had used lsfit, then morefit=ls.diag(regfit) would contain the matrix morefit$cov.unscaled. This
matrix, multiplied by the unknown parameter σ², would contain the variances of the least-squares estimators
on its diagonal and the covariances between them in the off-diagonal locations. (If we had used lm, then
morefit=summary.lm(regfit) would be used.)


Solutions to Exercises 


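1. After we have replaced $\beta_0$ and $\beta_1$ in (11.2.2) with $\hat{\beta}_0$ and $\hat{\beta}_1$, the maximization with respect to $\sigma^2$ is
exactly the same as the maximization carried out in Example 7.5.6 in the text for finding the M.L.E.
of $\sigma^2$.


2. Since $E(Y_i) = \beta_0 + \beta_1 x_i$, it follows from Eq. (11.2.7) that

\[ E(\hat{\beta}_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(\beta_0 + \beta_1 x_i)}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2} = \frac{\beta_0 \sum_{i=1}^{n} (x_i - \bar{x}_n) + \beta_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}_n)}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}. \]

But $\sum_{i=1}^{n} (x_i - \bar{x}_n) = 0$ and

\[ \sum_{i=1}^{n} x_i (x_i - \bar{x}_n) = \sum_{i=1}^{n} x_i (x_i - \bar{x}_n) - \bar{x}_n \sum_{i=1}^{n} (x_i - \bar{x}_n) = \sum_{i=1}^{n} (x_i - \bar{x}_n)^2. \]

It follows that $E(\hat{\beta}_1) = \beta_1$.


3. $E(\bar{Y}_n) = \frac{1}{n} \sum_{i=1}^{n} E(Y_i) = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i) = \beta_0 + \beta_1 \bar{x}_n$.

Hence, as shown near the end of the proof of Theorem 11.2.2,

\[ E(\hat{\beta}_0) = E(\bar{Y}_n) - \bar{x}_n E(\hat{\beta}_1) = (\beta_0 + \beta_1 \bar{x}_n) - \bar{x}_n \beta_1 = \beta_0. \]


4. Let $s_x^2 = \sum_{i=1}^{n} (x_i - \bar{x}_n)^2$. Then

\[ \hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1 \bar{x}_n = \sum_{i=1}^{n} \left[\frac{1}{n} - \frac{\bar{x}_n}{s_x^2}(x_i - \bar{x}_n)\right] Y_i. \]

Since $Y_1, \ldots, Y_n$ are independent and each has variance $\sigma^2$,

\[ \mathrm{Var}(\hat{\beta}_0) = \sum_{i=1}^{n} \left[\frac{1}{n} - \frac{\bar{x}_n}{s_x^2}(x_i - \bar{x}_n)\right]^2 \mathrm{Var}(Y_i) \]
\[ = \sigma^2 \sum_{i=1}^{n} \left[\frac{1}{n^2} - \frac{2\bar{x}_n}{n s_x^2}(x_i - \bar{x}_n) + \frac{\bar{x}_n^2}{s_x^4}(x_i - \bar{x}_n)^2\right] = \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}_n^2}{s_x^2}\right) = \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n s_x^2}, \]

as shown in part (a) of Exercise 2 of Sec. 11.1.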


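5. Since \(\bar Y_n = \hat\beta_0 + \hat\beta_1 \bar x_n\), then

\[
\mathrm{Var}(\bar Y_n) = \mathrm{Var}(\hat\beta_0) + \bar x_n^2 \mathrm{Var}(\hat\beta_1) + 2 \bar x_n \mathrm{Cov}(\hat\beta_0, \hat\beta_1).
\]

Therefore, if \(\bar x_n \neq 0\),

\[
\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = \frac{1}{2 \bar x_n} \left[\mathrm{Var}(\bar Y_n) - \mathrm{Var}(\hat\beta_0) - \bar x_n^2 \mathrm{Var}(\hat\beta_1)\right]
= \frac{1}{2 \bar x_n} \left[\frac{\sigma^2}{n} - \frac{\sum_{i=1}^n x_i^2}{n s_x^2} \sigma^2 - \frac{\bar x_n^2}{s_x^2} \sigma^2\right]
\]
\[
= \frac{\sigma^2}{2 \bar x_n} \left(\frac{s_x^2 - \sum_{i=1}^n x_i^2 - n \bar x_n^2}{n s_x^2}\right)
= \frac{\sigma^2}{2 \bar x_n} \left(\frac{-2 n \bar x_n^2}{n s_x^2}\right)
= -\frac{\bar x_n \sigma^2}{s_x^2}.
\]

If \(\bar x_n = 0\), then \(\hat\beta_0 = \bar Y_n = \frac{1}{n} \sum_{i=1}^n Y_i\), and

\[
\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = \mathrm{Cov}\left(\frac{1}{n} \sum_{i=1}^n Y_i, \; \frac{1}{s_x^2} \sum_{j=1}^n x_j Y_j\right)
= \frac{1}{n s_x^2} \sum_{i=1}^n \sum_{j=1}^n x_j \mathrm{Cov}(Y_i, Y_j),
\]

by Exercise 8 of Sec. 4.6. Since \(Y_1, \ldots, Y_n\) are independent and each has variance \(\sigma^2\), then \(\mathrm{Cov}(Y_i, Y_j) = 0\)
for \(i \neq j\) and \(\mathrm{Cov}(Y_i, Y_j) = \sigma^2\) for \(i = j\).

Hence,

\[
\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = \frac{\sigma^2}{n s_x^2} \sum_{i=1}^n x_i = 0.
\]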


6. Both \(\hat\beta_0\) and \(\hat\beta_1\) are linear functions of the random variables \(Y_1, \ldots, Y_n\), which are independent and
have normal distributions with a common variance. As stated in the text, it can therefore be shown that the joint
distribution of \(\hat\beta_0\) and \(\hat\beta_1\) is a bivariate normal distribution. It follows from Eq. (11.2.6) that \(\hat\beta_0\) and \(\hat\beta_1\)
are uncorrelated when \(\bar x_n = 0\). Therefore, by the property of the bivariate normal distribution discussed
in Sec. 5.10, \(\hat\beta_0\) and \(\hat\beta_1\) are independent when \(\bar x_n = 0\).


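7. (a) The M.L.E.'s \(\hat\beta_0\) and \(\hat\beta_1\) are the same as the least squares estimates found in Sec. 11.1 for Table 11.1.
The value of \(\hat\sigma^2\) can then be found from Eq. (11.2.3).

(b) Also, \(\mathrm{Var}(\hat\beta_0) = 0.2505\sigma^2\) can be determined from Eq. (11.2.5) and \(\mathrm{Var}(\hat\beta_1) = 0.0277\sigma^2\) from Eq.
(11.2.9).

(c) It can be found from Eq. (11.2.6) that \(\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = -0.0646\sigma^2\). By using the values of \(\mathrm{Var}(\hat\beta_0)\)
and \(\mathrm{Var}(\hat\beta_1)\) found in part (b), we obtain

\[
\rho(\hat\beta_0, \hat\beta_1) = \frac{\mathrm{Cov}(\hat\beta_0, \hat\beta_1)}{[\mathrm{Var}(\hat\beta_0) \mathrm{Var}(\hat\beta_1)]^{1/2}}
= \frac{-0.0646}{[(0.2505)(0.0277)]^{1/2}} \approx -0.78.
\]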


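8. \(\hat\theta = 3\hat\beta_0 - 2\hat\beta_1 + 5 = 1.272\). Since \(\hat\theta\) is an unbiased estimator of \(\theta\), the M.S.E. of \(\hat\theta\) is the same as \(\mathrm{Var}(\hat\theta)\)
and

\[
\mathrm{Var}(\hat\theta) = 9 \mathrm{Var}(\hat\beta_0) + 4 \mathrm{Var}(\hat\beta_1) - 12 \mathrm{Cov}(\hat\beta_0, \hat\beta_1) = 3.140\sigma^2.
\]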


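9. The unbiased estimator is \(\hat\theta = 3\hat\beta_0 + c_1 \hat\beta_1\). The M.S.E. of an unbiased estimator is its variance, and

\[
\mathrm{Var}(\hat\theta) = 9 \mathrm{Var}(\hat\beta_0) + 6 c_1 \mathrm{Cov}(\hat\beta_0, \hat\beta_1) + c_1^2 \mathrm{Var}(\hat\beta_1).
\]

Using the values in Exercise 7, we get

\[
\mathrm{Var}(\hat\theta) = \sigma^2 [9 \times 0.2505 - 6 c_1 (0.0646) + c_1^2 (0.0277)].
\]

We can minimize this by taking the derivative with respect to \(c_1\) and setting the derivative equal to 0.
We get \(c_1 = 6.996\).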
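
The minimization can be checked numerically. The sketch below is only an illustration in Python, using the rounded values quoted from Exercise 7; the names scaled_variance and c1_star are ours and do not appear in the text.

```python
# Unscaled variances and covariance from Exercise 7 (each is a multiple of sigma^2).
var_b0, var_b1, cov_b01 = 0.2505, 0.0277, -0.0646


def scaled_variance(c1):
    """Var(3*b0_hat + c1*b1_hat) divided by sigma^2."""
    return 9 * var_b0 + 6 * c1 * cov_b01 + c1 ** 2 * var_b1


# Setting the derivative 6*cov_b01 + 2*c1*var_b1 equal to zero gives the minimizer.
c1_star = -3 * cov_b01 / var_b1
```

Because the variance is a convex quadratic in c1, the stationary point is the unique minimum.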


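10. The prediction is \(\hat Y = \hat\beta_0 + 2\hat\beta_1 = 0.584\). The M.S.E. of this prediction is

\[
\mathrm{Var}(\hat Y) + \mathrm{Var}(Y) = \mathrm{Var}(\hat\beta_0) + 4 \mathrm{Var}(\hat\beta_1) + 4 \mathrm{Cov}(\hat\beta_0, \hat\beta_1) + \sigma^2 = 1.103\sigma^2.
\]

Alternatively, the M.S.E. of \(\hat Y\) could be calculated from Eq. (11.2.11) with \(x = 2\).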


11. By Eq. (11.2.11), the M.S.E. is

\[
\left[\frac{1}{n s_x^2} \sum_{i=1}^n (x_i - x)^2 + 1\right] \sigma^2.
\]

We know that \(\sum_{i=1}^n (x_i - x)^2\) will be a minimum (and, hence, the M.S.E. will be a minimum) when
\(x = \bar x_n\).


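12. The M.L.E.'s \(\hat\beta_0\) and \(\hat\beta_1\) have the same values as the least squares estimates found in part (a) of
Exercise 7 of Sec. 11.1. The value of \(\hat\sigma^2\) can then be found from Eq. (11.2.3). Also, \(\mathrm{Var}(\hat\beta_0)\) can be
determined from Eq. (11.2.5) and \(\mathrm{Var}(\hat\beta_1)\) from Eq. (11.2.9).

13. It can be found from Eq. (11.2.6) that \(\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = -0.214\sigma^2\). By using the values of \(\mathrm{Var}(\hat\beta_0)\) and
\(\mathrm{Var}(\hat\beta_1)\) found in Exercise 12, we obtain

\[
\rho(\hat\beta_0, \hat\beta_1) = \frac{\mathrm{Cov}(\hat\beta_0, \hat\beta_1)}{[\mathrm{Var}(\hat\beta_0) \mathrm{Var}(\hat\beta_1)]^{1/2}} = -0.851.
\]

14. \(\hat\theta = 5 - 4\hat\beta_0 + \hat\beta_1 = -158.024\). Since \(\hat\theta\) is an unbiased estimator of \(\theta\), the M.S.E. is the same as \(\mathrm{Var}(\hat\theta)\)
and

\[
\mathrm{Var}(\hat\theta) = 16 \mathrm{Var}(\hat\beta_0) + \mathrm{Var}(\hat\beta_1) - 8 \mathrm{Cov}(\hat\beta_0, \hat\beta_1) = 11.524\sigma^2.
\]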


15. This exercise is similar to Exercise 9. \(\mathrm{Var}(\hat\theta)\) attains its minimum value when \(c_1 = -\bar x_n\).


16. The prediction is \(\hat Y = \hat\beta_0 + 3.25\hat\beta_1 = 42.673\). The M.S.E. of this prediction is

\[
\mathrm{Var}(\hat Y) + \mathrm{Var}(Y) = \mathrm{Var}(\hat\beta_0) + (3.25)^2 \mathrm{Var}(\hat\beta_1) + 6.50\, \mathrm{Cov}(\hat\beta_0, \hat\beta_1) + \sigma^2 = 1.220\sigma^2.
\]

Alternatively, the M.S.E. of \(\hat Y\) could be calculated from Eq. (11.2.11) with \(x = 3.25\).

17. It was shown in Exercise 11 that the M.S.E. of \(\hat Y\) will be a minimum when \(x = \bar x_n = 2.25\).


18. (a) It is easiest to use a computer to find the least-squares coefficients. These are \(\hat\beta_0 = -1.234\) and
\(\hat\beta_1 = 2.702\).
(b) The predicted 1980 selling price for a species that sold for \(x = 21.4\) in 1970 is
\(\hat\beta_0 + \hat\beta_1 x = -1.234 + 2.702 \times 21.4 = 56.59\).
(c) The average of the \(x_i\) values is 41.1, and \(s_x^2 = 18430\). Use Eq. (11.2.11) to compute the M.S.E. as

\[
\sigma^2 \left[1 + \frac{1}{14} + \frac{(21.4 - 41.1)^2}{18430}\right] \approx 1.09\sigma^2.
\]


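19. The formula for \(E(X_2 | x_1)\) is Eq. (5.10.6), which we repeat here for the case in which \(\mu_1 = \mu_2 = \mu\) and
\(\sigma_1 = \sigma_2 = \sigma\):

\[
E(X_2 | x_1) = \mu + \rho \sigma \left(\frac{x_1 - \mu}{\sigma}\right) = \mu + \rho (x_1 - \mu).
\]

We are asked to show that \(|E(X_2 | x_1) - \mu| < |x_1 - \mu|\) for all \(x_1\). Since \(0 < \rho < 1\),

\[
|E(X_2 | x_1) - \mu| = |\mu + \rho (x_1 - \mu) - \mu| = \rho |x_1 - \mu| < |x_1 - \mu|.
\]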


11.3 Statistical Inference in Simple Linear Regression 


Commentary 


Computation and plotting of residuals is really only feasible with the help of a computer, except in problems
that are so small that you can't learn much from residuals anyway. There is a subsection at the end of this
section on joint inference about \(\beta_0\) and \(\beta_1\). This material is mathematically more challenging than the rest
of the section and might be suitable only for special sets of students.

If one is using the software R, both lm and lsfit provide the residuals. These can then be plotted 
against any other available variables using plot. Normal quantile plots are done easily using qqnorm with 
one argument being the residuals. The function qqline (with the same argument) will add a straight line to 
the plot to help identify curvature and outliers.

Solutions to Exercises


1. It is found from Table 11.9 that \(\bar x_n = 0.42\), \(\bar y_n = 0.33\), \(\sum_{i=1}^n x_i^2 = 10.16\), \(\sum_{i=1}^n x_i y_i = 5.04\), \(\hat\beta_1 = 0.435\) and
\(\hat\beta_0 = 0.147\) by Eq. (11.1.1), and \(S^2 = 0.451\) by Eq. (11.3.9). Therefore, from Eq. (11.3.19) with \(n = 10\)
and \(\beta_0^* = 0.7\), it is found that \(U_0 = -6.695\). It is found from a table of the t distribution with \(n - 2 = 8\)
degrees of freedom that to carry out a test at the 0.05 level of significance, \(H_0\) should be rejected if
\(|U_0| > 2.306\). Therefore \(H_0\) is rejected.


2. In this exercise, we must test the following hypotheses:

\[
H_0: \beta_0 = 0, \qquad H_1: \beta_0 \neq 0.
\]

Hence, \(\beta_0^* = 0\) and it is found from Eq. (11.3.19) that \(U_0 = 1.783\). Since \(|U_0| < 2.306\), the critical value
found in Exercise 1, we should not reject \(H_0\).


3. It follows from Eq. (11.3.22), with \(\beta_1^* = 1\), that \(U_1 = -6.894\). Since \(|U_1| > 2.306\), the critical value
found in Exercise 1, we should reject \(H_0\).


4. In this exercise, we want to test the following hypotheses:

\[
H_0: \beta_1 = 0, \qquad H_1: \beta_1 \neq 0.
\]

Hence, \(\beta_1^* = 0\) and it is found from Eq. (11.3.22) that \(U_1 = 5.313\). Since \(|U_1| > 2.306\), we should reject
\(H_0\).


5. The hypotheses to be tested are:

\[
H_0: 5\beta_0 - \beta_1 = 0, \qquad H_1: 5\beta_0 - \beta_1 \neq 0.
\]

Hence, in the notation of (11.3.13), \(c_0 = 5\), \(c_1 = -1\), and \(c_* = 0\). It is found that \(\sum_{i=1}^n (c_0 x_i - c_1)^2 = 306\)
and, from Eq. (11.3.14), that \(U_{01} = 0.664\). It is found from a table of the t distribution with \(n - 2 = 8\)
degrees of freedom that to carry out a test at the 0.10 level of significance, \(H_0\) should be rejected if
\(|U_{01}| > 1.860\). Therefore, \(H_0\) is not rejected.


6. The hypotheses to be tested are:

\[
H_0: \beta_0 + \beta_1 = 1, \qquad H_1: \beta_0 + \beta_1 \neq 1.
\]

Therefore, \(c_0 = c_1 = c_* = 1\). It is found that \(\sum_{i=1}^n (c_0 x_i - c_1)^2 = 11.76\) and, from Eq. (11.3.14), that
\(U_{01} = -4.701\). It is found from a table of the t distribution with \(n - 2 = 8\) degrees of freedom that to
carry out a test at the 0.01 level of significance, \(H_0\) should be rejected if \(|U_{01}| > 3.355\). Therefore, \(H_0\)
is rejected.


7. With \(D = \hat\beta_0 + \hat\beta_1 \bar x_n\),

\[
\mathrm{Cov}(\hat\beta_1, D) = \mathrm{Cov}(\hat\beta_1, \hat\beta_0 + \hat\beta_1 \bar x_n)
= \mathrm{Cov}(\hat\beta_1, \hat\beta_0) + \bar x_n \mathrm{Cov}(\hat\beta_1, \hat\beta_1)
= \mathrm{Cov}(\hat\beta_0, \hat\beta_1) + \bar x_n \mathrm{Var}(\hat\beta_1)
= 0,
\]

by Eqs. (11.2.9) and (11.2.6).

Since \(\hat\beta_0\) and \(\hat\beta_1\) have a bivariate normal distribution, it follows from Exercise 10 of Sec. 5.10 that \(D\)
and \(\hat\beta_1\) will also have a bivariate normal distribution. Therefore, as discussed in Sec. 5.10, since \(D\) and
\(\hat\beta_1\) are uncorrelated they are also independent.


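8. (a) We shall add \(n \bar x_n^2 (\hat\beta_1 - \beta_1^*)^2\) and subtract the same amount on the right side of \(Q^2\), as given by
Eq. (11.3.30). Then \(Q^2\) can be rewritten as follows:

\[
Q^2 = \left(\sum_{i=1}^n x_i^2 - n \bar x_n^2\right)(\hat\beta_1 - \beta_1^*)^2
+ n\left[(\hat\beta_0 - \beta_0^*)^2 + 2 \bar x_n (\hat\beta_0 - \beta_0^*)(\hat\beta_1 - \beta_1^*) + \bar x_n^2 (\hat\beta_1 - \beta_1^*)^2\right]
\]
\[
= \sigma^2 \frac{(\hat\beta_1 - \beta_1^*)^2}{\mathrm{Var}(\hat\beta_1)} + n\left[(\hat\beta_0 - \beta_0^*) + \bar x_n (\hat\beta_1 - \beta_1^*)\right]^2.
\]

Hence,

\[
\frac{Q^2}{\sigma^2} = \frac{(\hat\beta_1 - \beta_1^*)^2}{\mathrm{Var}(\hat\beta_1)} + \frac{n}{\sigma^2}(D - \beta_0^* - \beta_1^* \bar x_n)^2.
\]

It remains to show that \(\mathrm{Var}(D) = \sigma^2 / n\). But

\[
\mathrm{Var}(D) = \mathrm{Var}(\hat\beta_0) + \bar x_n^2 \mathrm{Var}(\hat\beta_1) + 2 \bar x_n \mathrm{Cov}(\hat\beta_0, \hat\beta_1).
\]

The desired result can now be obtained from Eqs. (11.2.9), (11.2.5), and (11.2.6).


(b) It follows from Exercise 7 that the random variables \(\hat\beta_1\) and \(D\) are independent and each has a
normal distribution. When \(H_0\) is true, \(E(\hat\beta_1) = \beta_1^*\) and \(E(D) = \beta_0^* + \beta_1^* \bar x_n\). Hence, when \(H_0\) is true,
each of the two summands on the right side of the equation given in part (a) is the square of a
random variable having a standard normal distribution.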


9. Here, \(\beta_0^* = 0\) and \(\beta_1^* = 1\). It is found that \(Q^2 = 2.759\), \(S^2 = 0.451\), and \(U^2 = 24.48\). It is found from a
table of the F distribution with 2 and 8 degrees of freedom that to carry out a test at the 0.05 level of
significance, \(H_0\) should be rejected if \(U^2 > 4.46\). Therefore, \(H_0\) is rejected.


10. To attain a confidence coefficient of 0.95, it is found from a table of the t distribution with 8 degrees of
freedom that the confidence interval will contain all values of \(\beta_0^*\) for which \(|U_0| < 2.306\). When we use
the numerical values found in Exercise 1, we find that this is the interval of all values of \(\beta_0^*\) such that
\(-2.306 < 12.111(0.147 - \beta_0^*) < 2.306\) or, equivalently, \(-0.043 < \beta_0^* < 0.338\). This interval is, therefore,
the confidence interval for \(\beta_0\).


11. The solution here is analogous to the solution of Exercise 10. Since the confidence coefficient is again
0.95, the confidence interval will contain all values of \(\beta_1^*\) for which \(|U_1| < 2.306\) or, equivalently, for
which \(-2.306 < 12.207(0.435 - \beta_1^*) < 2.306\). The interval is, therefore, found to be \(0.246 < \beta_1 < 0.624\).


12. We shall first determine a confidence interval for \(5\beta_0 - \beta_1\) with confidence coefficient 0.90. It is found
from a table of the t distribution with 8 degrees of freedom (as in Exercise 5) that this confidence
interval will contain all values of \(c_*\) for which \(|U_{01}| < 1.860\) or, equivalently, for which \(-1.860 <
2.207(0.301 - c_*) < 1.860\). This interval reduces to \(-0.542 < c_* < 1.144\). Since this is a confidence
interval for \(5\beta_0 - \beta_1\), the corresponding confidence interval for \(5\beta_0 - \beta_1 + 4\) is the interval with end
points \((-0.542) + 4 = 3.458\) and \((1.144) + 4 = 5.144\).


13. We must determine a confidence interval for \(y = \beta_0 + \beta_1\) with confidence coefficient 0.99. It is found from
a table of the t distribution with 8 degrees of freedom (as in Exercise 6) that this confidence interval will
contain all values of \(c_*\) for which \(|U_{01}| < 3.355\) or, equivalently, for which \(-3.355 < 11.257(0.582 - c_*) <
3.355\). This interval reduces to \(0.284 < c_* < 0.880\). This interval is, therefore, the confidence interval
for \(y\).


14. We must determine a confidence interval for \(y = \beta_0 + 0.42\beta_1\). Since the confidence coefficient is again
0.99, as in Exercise 13, this interval will again contain all values of \(c_*\) for which \(|U_{01}| < 3.355\). Since
\(c_0 = 1\) and \(c_1 = 0.42 = \bar x_n\) in this exercise, the value of \(\sum_{i=1}^n (c_0 x_i - c_1)^2\), which is needed in determining
\(U_{01}\), is equal to \(\sum_{i=1}^n (x_i - \bar x_n)^2 = 8.396\). Also, \(c_0 \hat\beta_0 + c_1 \hat\beta_1 = \hat\beta_0 + \hat\beta_1 \bar x_n = \bar y_n = 0.33\). Hence it is found
that the confidence interval for \(y\) contains all values of \(c_*\) for which \(-3.355 < 13.322(0.33 - c_*) < 3.355\)
or, equivalently, for which \(0.078 < c_* < 0.582\).
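
The intervals in Exercises 10 through 14 all follow one recipe, which can be sketched in Python as below. This is an illustrative sketch only: the function name linear_combination_interval is ours, the t quantiles are typed in from the table rather than computed, and the summary values are the rounded ones quoted in Exercise 1.

```python
import math


def linear_combination_interval(c0, c1, b0_hat, b1_hat, n, xbar, sx2, s2, t_quantile):
    """Confidence interval for c0*beta0 + c1*beta1 in simple linear regression.

    s2 is S^2, the residual sum of squares; sx2 is s_x^2; t_quantile is the
    appropriate quantile of the t distribution with n - 2 degrees of freedom.
    """
    sigma_prime = math.sqrt(s2 / (n - 2))
    scale = sigma_prime * math.sqrt(c0 ** 2 / n + (c0 * xbar - c1) ** 2 / sx2)
    center = c0 * b0_hat + c1 * b1_hat
    half_width = t_quantile * scale
    return center - half_width, center + half_width


# Rounded summary values from Exercise 1 of this section.
summary = dict(b0_hat=0.147, b1_hat=0.435, n=10, xbar=0.42, sx2=8.396, s2=0.451)

interval_b0 = linear_combination_interval(1, 0, t_quantile=2.306, **summary)   # Exercise 10
interval_sum = linear_combination_interval(1, 1, t_quantile=3.355, **summary)  # Exercise 13
```

Small differences in the third decimal place relative to the printed answers are to be expected, because the inputs here are rounded.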


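15. Let \(q\) be the \(1 - \alpha_0/2\) quantile of the t distribution with \(n - 2\) degrees of freedom. A confidence interval for
\(\beta_0 + \beta_1 x\) contains all values of \(c_*\) for which \(|U_{01}| < q\), where \(c_0 = 1\) and \(c_1 = x\) in Eq. (11.3.14). The
inequality \(|U_{01}| < q\) can be reduced to the following form:

\[
\hat\beta_0 + x \hat\beta_1 - q\, \sigma' \left[\frac{\sum_{i=1}^n (x_i - x)^2}{n s_x^2}\right]^{1/2}
< c_* <
\hat\beta_0 + x \hat\beta_1 + q\, \sigma' \left[\frac{\sum_{i=1}^n (x_i - x)^2}{n s_x^2}\right]^{1/2},
\]

where \(\sigma' = [S^2/(n-2)]^{1/2}\). The length of this interval is

\[
2 q\, \sigma' \left[\frac{\sum_{i=1}^n (x_i - x)^2}{n s_x^2}\right]^{1/2}.
\]

The length will, therefore, be a minimum for the value of \(x\) which minimizes \(\sum_{i=1}^n (x_i - x)^2\). We know
that this quantity is a minimum when \(x = \bar x_n\).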
16. It is known from elementary calculus that the set of points \((x, y)\) which satisfy an inequality of the
form \(Ax^2 + Bxy + Cy^2 < c^2\) will be an ellipse (with center at the origin) if and only if \(B^2 - 4AC < 0\).
It follows from Eqs. (11.3.30) and (11.3.32) that \(U^2 < \gamma\) if and only if

\[
n(\beta_0^* - \hat\beta_0)^2 + 2 n \bar x_n (\beta_0^* - \hat\beta_0)(\beta_1^* - \hat\beta_1) + \left(\sum_{i=1}^n x_i^2\right)(\beta_1^* - \hat\beta_1)^2 < \frac{2}{n-2}\, \gamma S^2.
\]

Hence, the set of points \((\beta_0^*, \beta_1^*)\) which satisfy this inequality will be an ellipse [with center at \((\hat\beta_0, \hat\beta_1)\)]
if and only if

\[
(2 n \bar x_n)^2 - 4 n \sum_{i=1}^n x_i^2 < 0
\]

or, equivalently, if and only if

\[
\sum_{i=1}^n x_i^2 - n \bar x_n^2 > 0.
\]

Since the left side of this relation is equal to \(\sum_{i=1}^n (x_i - \bar x_n)^2\), it must be positive, assuming that the
numbers \(x_1, \ldots, x_n\) are not all the same.
17. To attain a confidence coefficient of 0.95, it is found from a table of the F distribution with 2 and
8 degrees of freedom (as in Exercise 9) that the confidence ellipse for \((\beta_0, \beta_1)\) will contain all points
\((\beta_0^*, \beta_1^*)\) for which \(U^2 < 4.46\). Hence, it will contain all points for which

\[
10(\beta_0^* - 0.147)^2 + 8.4(\beta_0^* - 0.147)(\beta_1^* - 0.435) + 10.16(\beta_1^* - 0.435)^2 < 0.503.
\]


18. (a) The upper and lower limits of the confidence band are defined by (11.3.33). In this exercise, \(n = 10\)
and \((2\gamma)^{1/2} = 2.987\). The values of \(\hat\beta_0\), \(\hat\beta_1\), and \(S^2\) have been found in Exercise 1.

Numerical computation yields the following points on the upper and lower limits of the confidence
band:

    x                   Upper limit   Lower limit
    -2                     -.090        -1.356
    -1                      .124         -.700
     0                      .395         -.101
    \bar x_n = 0.42         .554          .106
     1                      .848          .316
     2                     1.465          .569

The upper and lower limits containing these points are shown as the solid curves in Fig. S.11.1.


(b) The upper and lower limits are now given by (11.3.25), where \(T_{n-2}^{-1}(1 - \alpha_0/2) = 2.306\). The
corresponding values of these upper and lower limits are as follows:

    x                   Upper limit   Lower limit
    -2                     -.234        -1.212
    -1                      .030         -.606
     0                      .338         -.044
    \bar x_n = 0.42         .503          .157
     1                      .787          .377
     2                     1.363          .671

These upper and lower limits are shown as the dashed curves in Fig. S.11.1.


19. If \(S^2\) is defined by Eq. (11.3.9), then \(S^2/\sigma^2\) has a \(\chi^2\) distribution with \(n - 2\) degrees of freedom.
Therefore, \(E(S^2/\sigma^2) = n - 2\), \(E(S^2) = (n - 2)\sigma^2\), and \(E(S^2/[n - 2]) = \sigma^2\).


20. (a) The prediction is \(\hat\beta_0 + \hat\beta_1 X = 68.17 - 1.112 \times 24 = 41.482\).


(b) The 95% prediction interval is centered at the prediction from part (a) and has half-width equal
to

\[
T_{30}^{-1}(0.975)\, 4.281 \left[1 + \frac{1}{32} + \frac{(24 - 30.91)^2}{2054.8}\right]^{1/2} = 8.978.
\]

So, the interval is \(41.482 \pm 8.978 = [32.50, 50.46]\).


Figure S.11.1: Confidence bands and intervals for Exercise 18a in Sec. 11.3.


21. (a) A computer is useful to perform the regressions and plots for this exercise. The two plots for parts
(a) and (b) are side-by-side in Fig. S.11.2. The plot for part (a) shows residuals that are more
spread out for larger values of 1970 price than they are for smaller values. This suggests that the
variance of Y is not constant as X changes.


(b) The plot for part (b) in Fig. S.11.2 has more uniform spread in the residuals as 1970 price varies.
However, there appear to be two points that are not fit very well.
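
The numbers behind such residual plots and normal quantile plots can be sketched without any plotting library. The Python sketch below is illustrative only: the function names and the small data set are hypothetical, and the plotting positions (i - 0.5)/n are one common convention that need not match what R's qqnorm uses for small samples.

```python
from statistics import NormalDist


def least_squares_residuals(x, y):
    """Residuals y_i - (b0_hat + b1_hat * x_i) from a simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sx2 = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sx2
    b0 = ybar - b1 * xbar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def normal_quantile_pairs(residuals):
    """Pairs (theoretical normal quantile, ordered residual) for a quantile plot."""
    n = len(residuals)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(residuals)))


# Hypothetical data, used only to exercise the functions.
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [1.1, 1.9, 3.2, 3.9, 5.1]
residuals_demo = least_squares_residuals(x_demo, y_demo)
qq_pairs_demo = normal_quantile_pairs(residuals_demo)
```

Plotting the residuals against the predictor shows changes in spread, while plotting the quantile pairs shows curvature and outliers.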


Figure S.11.2: Residual plots for Exercise 21a of Sec. 11.3. The plot on the left is for part (a) and the plot
on the right is for part (b).


22. In this problem we are asked to regress logarithm of 1980 fish price on the 1970 fish price. (It would
have made more sense to regress on the logarithm of 1970 fish price, but the problem didn't ask for
that.) The summary of the regression fit is \(\hat\beta_0 = 3.099\), \(\hat\beta_1 = 0.0266\), \(\sigma' = 0.6641\), \(\bar x_n = 41.1\), and
\(s_x^2 = 18430\).


(a) The test statistic is given in Eq. (11.3.22),

\[
U = s_x \frac{\hat\beta_1 - 2}{\sigma'} = 135.8\, \frac{0.0266 - 2}{0.6641} = -403.5.
\]

We would reject \(H_0\) at level 0.01 if \(U\) is greater than the 0.99 quantile of the t distribution with
12 degrees of freedom. We do not reject the null hypothesis at level 0.01.


(b) A 90% confidence interval is centered at 0.0266 and has half-width equal to

\[
T_{12}^{-1}(0.95)\, \frac{\sigma'}{s_x} = 1.782\, \frac{0.6641}{135.8} = 0.00872.
\]

So, the interval is \(0.0266 \pm 0.00872 = [0.0179, 0.0353]\).


(c) A 90% prediction interval for the logarithm of 1980 price is centered at \(3.099 + 0.0266 \times 21.4 = 3.668\)
and has half-width equal to

\[
T_{12}^{-1}(0.95)\, \sigma' \left[1 + \frac{1}{n} + \frac{(x - \bar x_n)^2}{s_x^2}\right]^{1/2}
= 1.782 \times 0.6641 \left[1 + \frac{1}{14} + \frac{(21.4 - 41.1)^2}{18430}\right]^{1/2} = 1.237.
\]

So, the interval for the logarithm of 1980 price is \(3.668 \pm 1.237 = [2.431, 4.905]\). To convert this
to 1980 price take e to the power of both endpoints to get [11.37, 134.96].
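
The arithmetic in part (c) can be reproduced with a short Python sketch. It is an illustration under stated assumptions: the function name prediction_interval is ours, the t quantile 1.782 is typed in from the table, and the inputs are the rounded summary values given above, so the last digit can differ slightly from the printed answer.

```python
import math


def prediction_interval(x_new, b0_hat, b1_hat, sigma_prime, n, xbar, sx2, t_quantile):
    """Prediction interval for a new response at x_new in simple linear regression."""
    center = b0_hat + b1_hat * x_new
    half_width = t_quantile * sigma_prime * math.sqrt(
        1 + 1 / n + (x_new - xbar) ** 2 / sx2
    )
    return center - half_width, center + half_width


# Interval for the logarithm of 1980 price at a 1970 price of 21.4.
log_lo, log_hi = prediction_interval(21.4, 3.099, 0.0266, 0.6641, 14, 41.1, 18430, 1.782)

# Back-transform the endpoints to the price scale.
price_lo, price_hi = math.exp(log_lo), math.exp(log_hi)
```

Exponentiating the endpoints is valid because the exponential function is increasing, so the coverage probability of the interval is unchanged.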


If we had been asked to regress on the logarithm of 1970 fish price, the summary results would have been
\(\hat\beta_0 = 1.132\), \(\hat\beta_1 = 0.9547\), \(\sigma' = 0.2776\), \(\bar x_n = 3.206\), and \(s_x^2 = 19.11\). The test statistic for part (a) would
have been \(4.371(0.9547 - 2)/0.2776 = -16.46\). Still, we would not reject the null hypothesis at level
0.01. The confidence interval for \(\beta_1\) would have been \(0.9547 \pm 1.782(0.2776/4.371) = [0.8415, 1.067]\).
The prediction interval for the logarithm of price would have been

\[
1.132 + 0.9547 \log(21.4) \pm 1.782 \times 0.2776 \left[1 + \frac{1}{14} + \frac{(\log(21.4) - 3.206)^2}{19.11}\right]^{1/2} = [3.544, 4.569].
\]

The interval for 1980 price would then be [34.62, 96.44].
23. Define

\[
W_{01} = \left[\frac{c_0^2}{n} + \frac{(c_0 \bar x_n - c_1)^2}{s_x^2}\right]^{-1/2} \frac{c_0(\hat\beta_0 - \beta_0) + c_1(\hat\beta_1 - \beta_1)}{\sigma'},
\]

which has the t distribution with \(n - 2\) degrees of freedom. Hence

\[
\Pr(W_{01} \geq T_{n-2}^{-1}(1 - \alpha_0)) = \alpha_0.
\]

Suppose that \(c_0 \beta_0 + c_1 \beta_1 < c_*\). Because \([(c_0^2/n) + (c_0 \bar x_n - c_1)^2/s_x^2]^{-1/2}/\sigma' > 0\), it follows that \(W_{01} > U_{01}\).
Finally, the probability of type I error is

\[
\Pr(U_{01} \geq T_{n-2}^{-1}(1 - \alpha_0)) \leq \Pr(W_{01} \geq T_{n-2}^{-1}(1 - \alpha_0)) = \alpha_0,
\]

where the first inequality follows from \(W_{01} > U_{01}\).


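24. (a) When \(c_0 = 1\) and \(c_1 = x > 0\), the smallest possible value of \(\beta_0^* + x \beta_1^*\) occurs at the smallest values
of \(\beta_0^*\) and \(\beta_1^*\) simultaneously. These values are

\[
\hat\beta_0 - \sigma' \left[\frac{1}{n} + \frac{\bar x_n^2}{s_x^2}\right]^{1/2} T_{n-2}^{-1}\left(1 - \frac{\alpha_0}{4}\right),
\qquad
\hat\beta_1 - \frac{\sigma'}{s_x} T_{n-2}^{-1}\left(1 - \frac{\alpha_0}{4}\right).
\]

Similarly, the largest values occur when \(\beta_0^*\) and \(\beta_1^*\) both take their largest possible values, namely

\[
\hat\beta_0 + \sigma' \left[\frac{1}{n} + \frac{\bar x_n^2}{s_x^2}\right]^{1/2} T_{n-2}^{-1}\left(1 - \frac{\alpha_0}{4}\right),
\qquad
\hat\beta_1 + \frac{\sigma'}{s_x} T_{n-2}^{-1}\left(1 - \frac{\alpha_0}{4}\right).
\]

The confidence interval is then

\[
\left(\hat\beta_0 + \hat\beta_1 x - \sigma' \left[\left(\frac{1}{n} + \frac{\bar x_n^2}{s_x^2}\right)^{1/2} + \frac{x}{s_x}\right] T_{n-2}^{-1}\left(1 - \frac{\alpha_0}{4}\right),
\;\;
\hat\beta_0 + \hat\beta_1 x + \sigma' \left[\left(\frac{1}{n} + \frac{\bar x_n^2}{s_x^2}\right)^{1/2} + \frac{x}{s_x}\right] T_{n-2}^{-1}\left(1 - \frac{\alpha_0}{4}\right)\right).
\]

(b) When \(c_0 = 1\) and \(c_1 = x < 0\), the smallest possible value of \(\beta_0^* + x \beta_1^*\) occurs when \(\beta_0^*\) takes its
smallest possible value and \(\beta_1^*\) takes its largest possible value. Similarly, the largest possible value
of \(\beta_0^* + x \beta_1^*\) occurs when \(\beta_0^*\) takes its largest possible value and \(\beta_1^*\) takes its smallest possible value.
All of these extreme values are given in part (a). The resulting interval is then

\[
\left(\hat\beta_0 + \hat\beta_1 x - \sigma' \left[\left(\frac{1}{n} + \frac{\bar x_n^2}{s_x^2}\right)^{1/2} - \frac{x}{s_x}\right] T_{n-2}^{-1}\left(1 - \frac{\alpha_0}{4}\right),
\;\;
\hat\beta_0 + \hat\beta_1 x + \sigma' \left[\left(\frac{1}{n} + \frac{\bar x_n^2}{s_x^2}\right)^{1/2} - \frac{x}{s_x}\right] T_{n-2}^{-1}\left(1 - \frac{\alpha_0}{4}\right)\right).
\]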


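25. (a) The simultaneous intervals are the same as (11.3.33) with \([2F_{2,n-2}^{-1}(1 - \alpha_0)]^{1/2}\) replaced by \(T_{n-2}^{-1}(1 - \alpha_0/4)\),
namely for \(i = 0, 1\),

\[
\hat\beta_0 + \hat\beta_1 x_i \pm T_{n-2}^{-1}(1 - \alpha_0/4)\, \sigma' \left[\frac{1}{n} + \frac{(x_i - \bar x_n)^2}{s_x^2}\right]^{1/2}.
\]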


(b) Set \(x = \alpha x_0 + (1 - \alpha) x_1\), and solve for \(\alpha\). The result is, by straightforward algebra,

\[
\alpha(x) = \frac{x - x_1}{x_0 - x_1}.
\]

(c) First, notice that for all \(x\),

\[
\beta_0 + \beta_1 x = \alpha(x)[\beta_0 + \beta_1 x_0] + [1 - \alpha(x)][\beta_0 + \beta_1 x_1]. \qquad \text{(S.11.1)}
\]

That is, each parameter for which we want a confidence interval is a weighted combination, with weights
\(\alpha(x)\) and \(1 - \alpha(x)\) that sum to 1, of the parameters for which we already have confidence intervals.
(It is a convex combination when \(0 \leq \alpha(x) \leq 1\).)

Suppose that \(C\) occurs. There are three cases that depend on where \(\alpha(x)\) lies relative to the interval
[0, 1]. The first case is when \(0 \leq \alpha(x) \leq 1\). In this case, the smallest of the four numbers defining
\(L(x)\) and \(U(x)\) is \(L(x) = \alpha(x) A_0 + [1 - \alpha(x)] A_1\) and the largest is \(U(x) = \alpha(x) B_0 + [1 - \alpha(x)] B_1\),
because both \(\alpha(x)\) and \(1 - \alpha(x)\) are nonnegative. For all such \(x\), \(A_0 < \beta_0 + \beta_1 x_0 < B_0\) and
\(A_1 < \beta_0 + \beta_1 x_1 < B_1\) together imply that

\[
\alpha(x) A_0 + [1 - \alpha(x)] A_1 < \alpha(x)[\beta_0 + \beta_1 x_0] + [1 - \alpha(x)][\beta_0 + \beta_1 x_1] < \alpha(x) B_0 + [1 - \alpha(x)] B_1.
\]

Combining this with (S.11.1) and the formulas for \(L(x)\) and \(U(x)\) yields \(L(x) < \beta_0 + \beta_1 x < U(x)\)
as desired. The other two cases are similar, so we shall do only one of them. If \(\alpha(x) < 0\),
then \(1 - \alpha(x) > 0\). In this case, the smallest of the four numbers defining \(L(x)\) and \(U(x)\) is
\(L(x) = \alpha(x) B_0 + [1 - \alpha(x)] A_1\), and the largest is \(U(x) = \alpha(x) A_0 + [1 - \alpha(x)] B_1\). For all such \(x\),
\(A_0 < \beta_0 + \beta_1 x_0 < B_0\) and \(A_1 < \beta_0 + \beta_1 x_1 < B_1\) together imply that

\[
\alpha(x) B_0 + [1 - \alpha(x)] A_1 < \alpha(x)[\beta_0 + \beta_1 x_0] + [1 - \alpha(x)][\beta_0 + \beta_1 x_1] < \alpha(x) A_0 + [1 - \alpha(x)] B_1.
\]

Combining this with (S.11.1) and the formulas for \(L(x)\) and \(U(x)\) yields \(L(x) < \beta_0 + \beta_1 x < U(x)\)
as desired.
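
The construction in parts (b) and (c) can be sketched in Python as follows. This is a minimal illustration that assumes, as the solution describes, that L(x) and U(x) are the smallest and largest of the four weighted combinations of interval endpoints; the function names and the numbers in the usage lines are hypothetical.

```python
def alpha(x, x0, x1):
    """Weight alpha(x) satisfying x = alpha*x0 + (1 - alpha)*x1."""
    return (x - x1) / (x0 - x1)


def interpolated_limits(x, x0, x1, a0, b0, a1, b1):
    """Return (L(x), U(x)) from the intervals (a0, b0) at x0 and (a1, b1) at x1.

    The four candidates combine one endpoint from each interval with
    weights alpha(x) and 1 - alpha(x); L is their minimum and U their maximum.
    """
    w = alpha(x, x0, x1)
    candidates = [w * e0 + (1 - w) * e1 for e0 in (a0, b0) for e1 in (a1, b1)]
    return min(candidates), max(candidates)


# Hypothetical intervals (0, 1) at x0 = 0 and (2, 4) at x1 = 1.
limits_between = interpolated_limits(0.5, 0.0, 1.0, 0.0, 1.0, 2.0, 4.0)
limits_outside = interpolated_limits(2.0, 0.0, 1.0, 0.0, 1.0, 2.0, 4.0)
```

Taking the minimum and maximum over all four candidates handles the three cases for alpha(x) at once, so no case analysis is needed in the code.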


11.4 Bayesian Inference in Simple Linear Regression 


Commentary 


This section only discusses Bayesian analysis with improper priors. There are a couple of reasons for this. 
First, the posterior distribution that results from the improper prior makes many of the Bayesian inferences 
strikingly similar to their non-Bayesian counterparts. Second, the derivation of the posterior distribution 
from a proper prior is mathematically much more difficult than the derivation given here, and I felt that 
this would distract the reader from the real purpose of this section, namely to illustrate Bayesian posterior 
inference. This section describes some inferences that are similar to non-Bayesian inferences as well as some 
that are uniquely Bayesian. 


Solutions to Exercises 


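1. The posterior distribution of \(\beta_1\) is given as a special case of (11.4.1), namely that \(U = s_x(\beta_1 - \hat\beta_1)/\sigma'\)
has the t distribution with \(n - 2\) degrees of freedom. The coefficient \(1 - \alpha_0\) confidence interval from
Sec. 11.3 has endpoints \(\hat\beta_1 \pm T_{n-2}^{-1}(1 - \alpha_0/2)\sigma'/s_x\). So, we can compute the posterior probability that
\(\beta_1\) is in the interval as follows:

\[
\Pr\left(\hat\beta_1 - T_{n-2}^{-1}(1 - \alpha_0/2)\frac{\sigma'}{s_x} < \beta_1 < \hat\beta_1 + T_{n-2}^{-1}(1 - \alpha_0/2)\frac{\sigma'}{s_x}\right)
= \Pr\left(-T_{n-2}^{-1}(1 - \alpha_0/2) < s_x \frac{\beta_1 - \hat\beta_1}{\sigma'} < T_{n-2}^{-1}(1 - \alpha_0/2)\right). \qquad \text{(S.11.2)}
\]

Since the t distributions are symmetric around 0, \(-T_{n-2}^{-1}(1 - \alpha_0/2) = T_{n-2}^{-1}(\alpha_0/2)\). Also, the random
variable between the inequalities on the right side of (S.11.2) is \(U\), which has the t distribution with
\(n - 2\) degrees of freedom. Hence the right side of (S.11.2) equals

\[
\Pr(U < T_{n-2}^{-1}(1 - \alpha_0/2)) - \Pr(U \leq T_{n-2}^{-1}(\alpha_0/2)) = 1 - \alpha_0/2 - \alpha_0/2 = 1 - \alpha_0.
\]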


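2. The posterior distribution of \(c_0 \beta_0 + c_1 \beta_1\) is given in (11.4.1), namely that

\[
U = \left[\frac{c_0^2}{n} + \frac{(c_0 \bar x_n - c_1)^2}{s_x^2}\right]^{-1/2} \frac{c_0 \beta_0 + c_1 \beta_1 - [c_0 \hat\beta_0 + c_1 \hat\beta_1]}{\sigma'}
\]

has the t distribution with \(n - 2\) degrees of freedom. The coefficient \(1 - \alpha_0\) confidence interval from
Sec. 11.3 has endpoints

\[
c_0 \hat\beta_0 + c_1 \hat\beta_1 \pm T_{n-2}^{-1}(1 - \alpha_0/2)\, \sigma' \left[\frac{c_0^2}{n} + \frac{(c_0 \bar x_n - c_1)^2}{s_x^2}\right]^{1/2}.
\]

So, we can compute the posterior probability that \(c_0 \beta_0 + c_1 \beta_1\) is in the interval as follows:

\[
\Pr\left(c_0 \hat\beta_0 + c_1 \hat\beta_1 - T_{n-2}^{-1}(1 - \alpha_0/2)\, \sigma' \left[\frac{c_0^2}{n} + \frac{(c_0 \bar x_n - c_1)^2}{s_x^2}\right]^{1/2} < c_0 \beta_0 + c_1 \beta_1
< c_0 \hat\beta_0 + c_1 \hat\beta_1 + T_{n-2}^{-1}(1 - \alpha_0/2)\, \sigma' \left[\frac{c_0^2}{n} + \frac{(c_0 \bar x_n - c_1)^2}{s_x^2}\right]^{1/2}\right)
\]
\[
= \Pr\left(-T_{n-2}^{-1}(1 - \alpha_0/2) < U < T_{n-2}^{-1}(1 - \alpha_0/2)\right).
\]

As in the proof of Exercise 1, this equals \(1 - \alpha_0\).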


3. The joint distribution of \((\beta_0, \beta_1)\) given \(\tau\) is a bivariate normal distribution as specified in Theorem 11.4.1.
Using the means, variances, and correlation given in that theorem, we compute the mean of \(\beta_0 + \beta_1 x\)
as \(\hat\beta_0 + \hat\beta_1 x = \hat Y\). The variance of \(\beta_0 + \beta_1 x\) given \(\tau\) is

\[
\frac{1}{\tau}\left[\frac{\sum_{i=1}^n x_i^2}{n s_x^2} + \frac{x^2}{s_x^2} - \frac{2 x \bar x_n}{s_x^2}\right].
\]

Use the fact that \(1/n + \bar x_n^2/s_x^2 = \sum_{i=1}^n x_i^2/[n s_x^2]\) to simplify the above variance to the expression

\[
\frac{1}{\tau}\left[\frac{1}{n} + \frac{(x - \bar x_n)^2}{s_x^2}\right].
\]

It follows that the conditional distribution of \(\tau^{1/2}(\beta_0 + \beta_1 x - \hat Y)\) is the normal distribution with mean
0 and variance as stated in the exercise.


4. The summaries from a simple linear regression are \(\hat\beta_0 = 0.1472\), \(\hat\beta_1 = 0.4352\), \(\sigma' = 0.2374\), \(\bar x_n = 0.42\),
\(n = 10\) and \(s_x^2 = 8.396\).


(a) The posterior distribution of the parameters is given in Theorem 11.4.1. With the numerical summaries
above (recall that \(\sum_{i=1}^n x_i^2 = s_x^2 + n \bar x_n^2 = 10.16\)), we get the following posterior. Conditional
on \(\tau\), \((\beta_0, \beta_1)\) has a bivariate normal distribution with mean vector (0.1472, 0.4352), correlation
-0.4167, and variances \(0.1210/\tau\) and \(0.1191/\tau\). The distribution of \(\tau\) is a gamma distribution
with parameters 4 and 0.2254.


(b) The interval is centered at 0.4352 with half-width equal to \(T_8^{-1}(0.95)\) times \(0.2374/8.396^{1/2} =
0.0819\). So, the interval is [0.2828, 0.5876].




(c) The posterior distribution of \(\beta_0\) is that \(U = (\beta_0 - 0.1472)/(0.2374 \times 0.1210^{1/2})\) has the t distribution with
8 degrees of freedom. So the probability that \(\beta_0\) is between 0 and 2 is the probability that \(U\) is
between \((0 - 0.1472)/0.0826 = -1.782\) and \((2 - 0.1472)/0.0826 = 22.43\). The probability that a
t random variable with 8 degrees of freedom is between these two numbers can be found using a
computer program, and it equals approximately 0.944.
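
One possible form of such a computer program is sketched below in Python. It is only an illustration: the function names are ours, and the probability is obtained by numerically integrating the t density with Simpson's rule rather than by calling a statistical library.

```python
import math


def t_density(u, df):
    """Density of the t distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + u * u / df) ** (-(df + 1) / 2)


def t_interval_probability(a, b, df, steps=2000):
    """Approximate Pr(a < T < b) by composite Simpson's rule (steps must be even)."""
    h = (b - a) / steps
    total = t_density(a, df) + t_density(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(a + k * h, df)
    return total * h / 3


# Sanity check against the table value T_8^{-1}(0.975) = 2.306.
central_95 = t_interval_probability(-2.306, 2.306, 8)
```

The posterior probability in part (c) is then t_interval_probability evaluated at the two standardized endpoints with df = 8.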


5. The summary data are in the solution to Exercise 4.
(a) According to Theorem 11.4.1, the posterior distribution of \(\beta_1\) is that \(2.898(\beta_1 - 0.4352)/0.2374\)
has the t distribution with eight degrees of freedom.
(b) According to Theorem 11.4.1, the posterior distribution of \(\beta_0 + \beta_1\) is that

\[
\left[0.1 + \frac{(0.42 - 1)^2}{2.898^2}\right]^{-1/2} \frac{\beta_0 + \beta_1 - 0.5824}{0.2374}
\]

has the t distribution with eight degrees of freedom.


6. The summary information from the regression is \(\hat\beta_0 = 1.132\), \(\hat\beta_1 = 0.9547\), \(\sigma' = 0.2776\), \(\bar x_n = 3.206\),
\(n = 14\), and \(s_x^2 = 19.11\).


(a) The posterior distribution of \(\beta_1\) is that \(U = 19.11^{1/2}(\beta_1 - 0.9547)/0.2776\) has the t distribution
with 12 degrees of freedom.


(b) The probability that \(\beta_1 \leq 2\) is the same as the probability that \(U \leq 19.11^{1/2}(2 - 0.9547)/0.2776 =
16.46\), which is essentially 1.


(c) The interval for log-price will be centered at \(1.132 + 0.9547 \times \log(21.4) = 4.057\) and have half-width
\(T_{12}^{-1}(0.975)\) times \(0.2776[1 + 1/14 + (3.206 - \log(21.4))^2/19.11]^{1/2} = 0.2875\). So, the interval
for log-price is [3.431, 4.683]. The interval for 1980 price is e to the power of the endpoints,
[30.90, 108.1].


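7. The conditional mean of \(\beta_0\) given \(\beta_1\) can be computed using results from Sec. 5.10. In particular,

\[
E(\beta_0 | \beta_1) = \hat\beta_0 - \frac{n \bar x_n}{\left(n \sum_{i=1}^n x_i^2\right)^{1/2}} \left(\frac{1/n + \bar x_n^2/s_x^2}{1/s_x^2}\right)^{1/2} (\beta_1 - \hat\beta_1).
\]

Now, use the fact that \(\sum_{i=1}^n x_i^2 = s_x^2 + n \bar x_n^2\). The result is

\[
E(\beta_0 | \beta_1) = \hat\beta_0 + \bar x_n (\hat\beta_1 - \beta_1).
\]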


11.5 The General Linear Model and Multiple Regression 


Commentary 


If one is using the software R, the commands to fit multiple linear regression models are the same as those 
that fit simple linear regression as described in the Commentaries to Secs. 11.1-11.3. One need only put the 
additional predictor variables into additional columns of the x matrix.

Solutions to Exercises


1. After we have replaced \(\beta_0, \ldots, \beta_{p-1}\) in (11.5.4) by their M.L.E.'s \(\hat\beta_0, \ldots, \hat\beta_{p-1}\), the maximization with respect
to \(\sigma^2\) is exactly the same as the maximization carried out in Example 7.5.6 in the text for finding the
M.L.E. of \(\sigma^2\) or the maximization carried out in Exercise 1 of Sec. 11.2.


2. (The statement of the exercise should say that \(S^2/\sigma^2\) has a \(\chi^2\) distribution.) According to Eq. (11.5.8),

\[
\sigma'^2 = \frac{S^2}{n - p}.
\]

Since we assume that \(S^2/\sigma^2\) has a \(\chi^2\) distribution with \(n - p\) degrees of freedom, the mean of \(S^2\) is
\(\sigma^2(n - p)\), hence the mean of \(\sigma'^2\) is \(\sigma^2\), and \(\sigma'^2\) is unbiased.


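3. This problem is a special case of the general linear model with \(p = 1\). The design matrix \(Z\) defined by
Eq. (11.5.9) has dimension \(n \times 1\) and is specified as follows:

\[
Z = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}.
\]

Therefore, \(Z'Z = \sum_{i=1}^n x_i^2\) and \((Z'Z)^{-1} = \dfrac{1}{\sum_{i=1}^n x_i^2}\).
It follows from Eq. (11.5.10) that

\[
\hat\beta = \frac{\sum_{i=1}^n x_i Y_i}{\sum_{i=1}^n x_i^2}.
\]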


4. From Theorem 11.5.3, \(E(\hat\beta) = \beta\) and \(\mathrm{Var}(\hat\beta) = \sigma^2 / \sum_{i=1}^n x_i^2\).


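5. It is found that \(\sum_{i=1}^n x_i y_i = 342.4\) and \(\sum_{i=1}^n x_i^2 = 66.8\). Therefore, from Exercises 3 and 4, \(\hat\beta = 5.126\)
and \(\mathrm{Var}(\hat\beta) = 0.0150\sigma^2\). Also, \(S^2 = \sum_{i=1}^n (y_i - \hat\beta x_i)^2 = 169.94\). Therefore, by Eq. (11.5.7), \(\hat\sigma^2 =
(169.94)/10 = 16.994\).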


6. By Eq. (11.5.21), the following statistic will have the t distribution with 9 degrees of freedom when \(H_0\)
is true:

\[
U = \left[\frac{9}{(0.0150)(169.94)}\right]^{1/2} (\hat\beta - \beta^*),
\]

where \(\beta^*\) is the value of \(\beta\) specified by \(H_0\). The corresponding two-sided tail area is smaller than 0.01, the smallest two-sided tail area available
from the table in the back of the book.

7. The values \(\hat\beta_0\), \(\hat\beta_1\), and \(\hat\beta_2\) were determined in Example 11.1.3.
By Eq. (11.5.7),

\[
\hat\sigma^2 = \frac{1}{10} \sum_{i=1}^{10} (y_i - \hat\beta_0 - \hat\beta_1 x_i - \hat\beta_2 x_i^2)^2 = \frac{1}{10}(9.37) = 0.937.
\]

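8. The design matrix \(Z\) has the following form:

\[
Z = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}.
\]

Therefore, \(Z'Z\) is the \(3 \times 3\) matrix of coefficients on the left side of the three equations in (11.1.14):

\[
Z'Z = \begin{bmatrix} 10 & 23.3 & 90.37 \\ 23.3 & 90.37 & 401 \\ 90.37 & 401 & 1892.7 \end{bmatrix}.
\]

It will now be found that

\[
(Z'Z)^{-1} = \begin{bmatrix} 0.400 & -0.307 & 0.046 \\ -0.307 & 0.421 & -0.074 \\ 0.046 & -0.074 & 0.014 \end{bmatrix}.
\]

The elements of \((Z'Z)^{-1}\), multiplied by \(\sigma^2\), are the variances and covariances of \(\hat\beta_0\), \(\hat\beta_1\), and \(\hat\beta_2\).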
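
The inverse can be checked with a few lines of Python. The sketch below is illustrative only: it uses plain Gauss-Jordan elimination with partial pivoting, assumes the matrix is nonsingular, and the function name invert is ours. The entries of Z'Z are the ones quoted in this solution.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting.

    Assumes the matrix is nonsingular; no singularity check is performed.
    """
    n = len(matrix)
    # Augment each row with the corresponding row of the identity matrix.
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Move the row with the largest pivot candidate into position.
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


ztz = [[10, 23.3, 90.37], [23.3, 90.37, 401], [90.37, 401, 1892.7]]
ztz_inv = invert(ztz)
```

Rounded to three decimal places, the entries of ztz_inv should agree with the matrix displayed above.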


9. By Eq. (11.5.21), the following statistic will have the t distribution with 7 degrees of freedom when \(H_0\)
is true:

\[
U_2 = \left[\frac{7}{(0.014)(9.37)}\right]^{1/2} \hat\beta_2 \approx 0.095.
\]

The corresponding two-sided tail area is greater than 0.90. The null hypothesis would not be rejected
at any reasonable level of significance.


10. By Eq. (11.5.21), the following statistic will have the t distribution with 7 degrees of freedom when \(H_0\)
is true:

\[
U_1 = \left[\frac{7}{(0.421)(9.37)}\right]^{1/2} (\hat\beta_1 - 4) = -4.51.
\]

The corresponding two-sided tail area is less than 0.01.

11. It is found that \(\sum_{i=1}^n (y_i - \bar y_n)^2 = 26.309\). Therefore,

\[
R^2 = 1 - \frac{9.37}{26.309} = 0.644.
\]


12. The values of \(\hat\beta_0\), \(\hat\beta_1\) and \(\hat\beta_2\) were determined in Example 11.1.5.
By Eq. (11.5.7),

\[
\hat\sigma^2 = \frac{1}{10} \sum_{i=1}^{10} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2})^2 = \frac{1}{10}(8.865) = 0.8865.
\]


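13. The design matrix \(Z\) has the following form:

\[
Z = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} \end{bmatrix}.
\]

Therefore, \(Z'Z\) is the \(3 \times 3\) matrix of coefficients on the left side of the three equations in (11.1.14):

\[
Z'Z = \begin{bmatrix} 10 & 23.3 & 650 \\ 23.3 & 90.37 & 1563.6 \\ 650 & 1563.6 & 42{,}334 \end{bmatrix}.
\]

It will now be found that

\[
(Z'Z)^{-1} = \begin{bmatrix} 222.7 & 4.832 & -3.598 \\ 4.832 & 0.1355 & -0.0792 \\ -3.598 & -0.0792 & 0.0582 \end{bmatrix}.
\]

The elements of \((Z'Z)^{-1}\), multiplied by \(\sigma^2\), are the variances and covariances of \(\hat\beta_0\), \(\hat\beta_1\), and \(\hat\beta_2\).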


14. By Eq. (11.5.21), the following statistic will have the t distribution with 7 degrees of freedom when
\(H_0\) is true:

\[
U_1 = \left[\frac{7}{(0.1355)(8.865)}\right]^{1/2} \hat\beta_1 \approx 1.087.
\]

The corresponding two-sided tail area is between 0.30 and 0.40.


15. By Eq. (11.5.21), the following statistic will have the t distribution with 7 degrees of freedom when \(H_0\)
is true:

\[
U_2 = \left[\frac{7}{(0.0582)(8.865)}\right]^{1/2} (\hat\beta_2 + 1) = 4.319.
\]

The corresponding two-sided tail area is less than 0.01.


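16. Just as in Exercise 11, \(\sum_{i=1}^n (y_i - \bar y_n)^2 = 26.309\). Therefore,

\[
R^2 = 1 - \frac{8.865}{26.309} = 0.663.
\]

17. \[
\mathrm{Cov}(\hat\beta_j, A_{ij}) = \mathrm{Cov}\left(\hat\beta_j, \hat\beta_i - \frac{\zeta_{ij}}{\zeta_{jj}} \hat\beta_j\right)
= \mathrm{Cov}(\hat\beta_j, \hat\beta_i) - \frac{\zeta_{ij}}{\zeta_{jj}} \mathrm{Cov}(\hat\beta_j, \hat\beta_j)
\]
\[
= \mathrm{Cov}(\hat\beta_i, \hat\beta_j) - \frac{\zeta_{ij}}{\zeta_{jj}} \mathrm{Var}(\hat\beta_j)
= \zeta_{ij} \sigma^2 - \frac{\zeta_{ij}}{\zeta_{jj}} \zeta_{jj} \sigma^2 = 0.
\]

Just as in simple linear regression, it can be shown that the joint distribution of two estimators \(\hat\beta_i\)
and \(\hat\beta_j\) will be a bivariate normal distribution. Since \(A_{ij}\) is a linear function of \(\hat\beta_i\) and \(\hat\beta_j\), the joint
distribution of \(A_{ij}\) and \(\hat\beta_j\) will also be a bivariate normal distribution. Therefore, since \(A_{ij}\) and \(\hat\beta_j\) are
uncorrelated, they are also independent.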


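18. \[
\mathrm{Var}(A_{ij}) = \mathrm{Var}\left(\hat\beta_i - \frac{\zeta_{ij}}{\zeta_{jj}} \hat\beta_j\right)
= \mathrm{Var}(\hat\beta_i) + \left(\frac{\zeta_{ij}}{\zeta_{jj}}\right)^2 \mathrm{Var}(\hat\beta_j) - 2 \frac{\zeta_{ij}}{\zeta_{jj}} \mathrm{Cov}(\hat\beta_i, \hat\beta_j)
\]
\[
= \zeta_{ii} \sigma^2 + \frac{\zeta_{ij}^2}{\zeta_{jj}} \sigma^2 - 2 \frac{\zeta_{ij}^2}{\zeta_{jj}} \sigma^2
= \left[\zeta_{ii} - \frac{\zeta_{ij}^2}{\zeta_{jj}}\right] \sigma^2.
\]

Now consider the right side of the equation given in the hint for this exercise.

\[
[A_{ij} - E(A_{ij})]^2 = \left[\hat\beta_i - \beta_i - \frac{\zeta_{ij}}{\zeta_{jj}} (\hat\beta_j - \beta_j)\right]^2
= (\hat\beta_i - \beta_i)^2 - \frac{2 \zeta_{ij}}{\zeta_{jj}} (\hat\beta_i - \beta_i)(\hat\beta_j - \beta_j) + \left(\frac{\zeta_{ij}}{\zeta_{jj}}\right)^2 (\hat\beta_j - \beta_j)^2.
\]

If each of the two terms on the right side of the equation given in the hint is put over the least common
denominator \((\zeta_{ii} \zeta_{jj} - \zeta_{ij}^2)\sigma^2\), the right side can be reduced to the form given for \(W^2\) in the text of
the exercise. In the equation for \(W^2\) given in the hint, \(W^2\) has been represented as the sum of two
independent random variables, each of which is the square of a variable having a standard normal
distribution. Therefore, \(W^2\) has a \(\chi^2\) distribution with 2 degrees of freedom.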


(a) Since $W^2$ is a function only of $\hat\beta_i$ and $\hat\beta_j$, it follows that $W^2$ and $S^2$ are independent. Also, $W^2$ has
a $\chi^2$ distribution with 2 degrees of freedom and $S^2/\sigma^2$ has a $\chi^2$ distribution with $n - p$ degrees of
freedom. Therefore,
\[ \frac{W^2/2}{S^2/[\sigma^2(n - p)]} \]
has the $F$ distribution with 2 and $n - p$ degrees of freedom.

(b) If we replace $\beta_i$ and $\beta_j$ in $W^2$ by their hypothesized values $\beta_i^*$ and $\beta_j^*$, then the statistic given in
part (a) will have the $F$ distribution with 2 and $n - p$ degrees of freedom when $H_0$ is true and will
tend to be larger when $H_0$ is not true. Therefore, we should reject $H_0$ if that statistic exceeds some
constant $C$, where $C$ can be chosen to obtain any specified level of significance $\alpha_0$ $(0 < \alpha_0 < 1)$.
20. In this problem $i = 2$, $j = 3$, $\beta_1^* = \beta_2^* = 0$ and, from the values found in Exercises 7 and 8,

\[ W^2 = \frac{(0.014)(0.616)^2 + (0.421)(0.013)^2 + 2(0.074)(0.616)(0.013)}{[(0.421)(0.014) - (0.074)^2]\sigma^2} = \frac{16.7}{\sigma^2}. \]

Also, $S^2 = 9.37$, as found in the solution of Exercise 7. Hence, the value of the $F$ statistic with 2 and 7
degrees of freedom is $(7/2)(16.7/9.37) = 6.23$. The corresponding tail area is between 0.025 and 0.05.


21. In this problem, $i = 2$, $j = 3$, $\beta_1^* = 1$, $\beta_2^* = 0$ and, from the values found in Exercises 12 and 13,

\[ W^2 = \frac{(0.0582)(0.4503 - 1)^2 + (0.1355)(0.1725)^2 + 2(0.0792)(0.4503 - 1)(0.1725)}{[(0.1355)(0.0582) - (0.0792)^2]\sigma^2} = \frac{4.091}{\sigma^2}. \]

Also, $S^2 = 8.865$, as found in the solution of Exercise 12. Hence, the value of the $F$ statistic with 2 and
7 degrees of freedom is $(7/2)(4.091/8.865) = 1.615$. The corresponding tail area is greater than 0.05.
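The arithmetic in Exercise 21 can be checked with a few lines of Python. This is only a sketch: the variable names are hypothetical, and the negative sign on `zeta_ij` is an assumption inferred from the plus sign on the cross-product term in the printed formula.

```python
# Numbers copied from the solution of Exercise 21; names are illustrative only.
zeta_ii, zeta_jj = 0.1355, 0.0582
zeta_ij = -0.0792          # assumed sign, inferred from the "+2(0.0792)..." term
d_i = 0.4503 - 1.0         # estimate minus hypothesized value 1
d_j = 0.1725 - 0.0         # estimate minus hypothesized value 0
s2 = 8.865                 # residual sum of squares S^2
df_resid = 7               # n - p


def w2_times_sigma2(z_ii, z_jj, z_ij, di, dj):
    """Return sigma^2 * W^2, the quadratic form in the two deviations."""
    numerator = z_jj * di**2 + z_ii * dj**2 - 2.0 * z_ij * di * dj
    denominator = z_ii * z_jj - z_ij**2
    return numerator / denominator


w2 = w2_times_sigma2(zeta_ii, zeta_jj, zeta_ij, d_i, d_j)
# F statistic: (W^2 / 2) / (S^2 / [sigma^2 (n - p)]); sigma^2 cancels.
f_stat = (w2 / 2.0) / (s2 / df_resid)
print(round(w2, 3), round(f_stat, 3))
```

Because the inputs are rounded to four decimals, the computed values agree with 4.091 and 1.615 only to about two or three decimal places.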


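22. $S^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$. Since $\hat\beta_0 = \bar y_n - \hat\beta_1 \bar x_n$,

\begin{align*}
S^2 &= \sum_{i=1}^n [(y_i - \bar y_n) - \hat\beta_1 (x_i - \bar x_n)]^2 \\
&= \sum_{i=1}^n (y_i - \bar y_n)^2 + \hat\beta_1^2 \sum_{i=1}^n (x_i - \bar x_n)^2 - 2\hat\beta_1 \sum_{i=1}^n (x_i - \bar x_n)(y_i - \bar y_n).
\end{align*}

Since
\[ \hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x_n)(y_i - \bar y_n)}{\sum_{i=1}^n (x_i - \bar x_n)^2} \quad\text{and}\quad R^2 = 1 - \frac{S^2}{\sum_{i=1}^n (y_i - \bar y_n)^2}, \]
the desired result can now be obtained.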
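23. We have the following relations:
\begin{align*}
E(X + Y) &= E\begin{pmatrix} X_1 + Y_1 \\ \vdots \\ X_n + Y_n \end{pmatrix}
= \begin{pmatrix} E(X_1 + Y_1) \\ \vdots \\ E(X_n + Y_n) \end{pmatrix}
= \begin{pmatrix} E(X_1) + E(Y_1) \\ \vdots \\ E(X_n) + E(Y_n) \end{pmatrix} \\
&= \begin{pmatrix} E(X_1) \\ \vdots \\ E(X_n) \end{pmatrix} + \begin{pmatrix} E(Y_1) \\ \vdots \\ E(Y_n) \end{pmatrix} = E(X) + E(Y).
\end{align*}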


24. The element in row $i$ and column $j$ of the $n \times n$ matrix $\mathrm{Cov}(X + Y)$ is $\mathrm{Cov}(X_i + Y_i, X_j + Y_j) =
\mathrm{Cov}(X_i, X_j) + \mathrm{Cov}(X_i, Y_j) + \mathrm{Cov}(Y_i, X_j) + \mathrm{Cov}(Y_i, Y_j)$. Since $X$ and $Y$ are independent, $\mathrm{Cov}(X_i, Y_j) = 0$
and $\mathrm{Cov}(Y_i, X_j) = 0$. Therefore, this element reduces to $\mathrm{Cov}(X_i, X_j) + \mathrm{Cov}(Y_i, Y_j)$. But $\mathrm{Cov}(X_i, X_j)$
is the element in row $i$ and column $j$ of $\mathrm{Cov}(X)$, and $\mathrm{Cov}(Y_i, Y_j)$ is the corresponding element in
$\mathrm{Cov}(Y)$. Hence, the sum of these two covariances is the element in row $i$ and column $j$ of $\mathrm{Cov}(X) +
\mathrm{Cov}(Y)$. Thus, we have shown that the element in row $i$ and column $j$ of $\mathrm{Cov}(X + Y)$ is equal to the
corresponding element of $\mathrm{Cov}(X) + \mathrm{Cov}(Y)$.


25. We know that $\mathrm{Var}(3Y_1 + Y_2 - 2Y_3 + 8) = \mathrm{Var}(3Y_1 + Y_2 - 2Y_3)$. By Theorem 11.5.2, with $p = 1$,

\[ \mathrm{Var}(3Y_1 + Y_2 - 2Y_3) = (3, 1, -2)\,\mathrm{Cov}(Y) \begin{pmatrix} 3 \\ 1 \\ -2 \end{pmatrix} = 87. \]

26. (a) We can see that $\sum_j c_j \hat\beta_j$ is equal to $c'\hat\beta$, where $c$ is defined in part (b) and $\hat\beta$ is the least-squares
regression coefficient vector. If $Y$ is the vector in Eq. (11.5.13), then we can write $c'\hat\beta = a'Y$,
where $a' = c'(Z'Z)^{-1}Z'$. It follows from Theorem 11.3.1 that $a'Y$ has a normal distribution, and
it follows from Theorem 11.5.2 that the mean of $a'Y$ is $c'\beta$ and the variance is
\[ \sigma^2 a'a = \sigma^2 c'(Z'Z)^{-1}c. \qquad \text{(S.11.3)} \]

If $H_0$ is true, then $c'\hat\beta$ has the normal distribution with mean $c_*$ and variance given by (S.11.3).
It follows that the following random variable $Z$ has a standard normal distribution:

\[ Z = \frac{c'\hat\beta - c_*}{\sigma\,(c'(Z'Z)^{-1}c)^{1/2}}. \]

Also, recall that $(n - p)\sigma'^2/\sigma^2$ has a $\chi^2$ distribution with $n - p$ degrees of freedom and is independent
of $Z$. So, if we divide $Z$ by $\sigma'/\sigma$, we get a random variable that has a $t$ distribution with $n - p$
degrees of freedom, which also happens to equal $U$.


To test $H_0$ at level $\alpha_0$, we can reject $H_0$ if $|U| > T_{n-p}^{-1}(1 - \alpha_0/2)$. If $H_0$ is true,
\[ \Pr\left(|U| > T_{n-p}^{-1}(1 - \alpha_0/2)\right) = \alpha_0, \]
so this test will have level $\alpha_0$.


27. In a simple linear regression, the fitted value $\hat Y_i$ is the same linear function of $X_i$ for all $i$. If $\hat\beta_1 > 0$, then every unit
increase in $X$ corresponds to an increase of $\hat\beta_1$ in $\hat Y$. So, a plot of residuals against $\hat Y$ will look the same
as a plot of residuals against $X$ except that the horizontal axis will be labeled differently. If $\hat\beta_1 < 0$,
then a unit increase in $X$ corresponds to a decrease of $-\hat\beta_1$ in $\hat Y$, so a plot of residuals against fitted
values is a mirror image of a plot of residuals against $X$. (The plot is flipped horizontally around a
vertical line.)


28. Since $R^2$ is a decreasing function of the residual sum of squares, we shall show that the residual sum
of squares is at least as large when using $Z'$ as when using $Z$. Let $Z$ have $p$ columns and let $Z'$ have
$q < p$ columns. Let $\hat\beta_*$ be the least-squares coefficients that we get when using design matrix $Z'$. For
each column that was deleted from $Z$ to get $Z'$, insert an additional coordinate equal to 0 into the
$q$-dimensional vector $\hat\beta_*$ to produce the $p$-dimensional vector $\tilde\beta$. This vector $\tilde\beta$ is one of the possible
vectors in the solution of the minimization problem to find the least-squares estimates with the design
matrix $Z$. Furthermore, since $\tilde\beta$ has 0's for all of the extra columns that are in $Z$ but not in $Z'$, it
follows that the residual sum of squares when using $\tilde\beta$ with design matrix $Z$ is identical to the residual
sum of squares when using $\hat\beta_*$ with design matrix $Z'$. Hence the minimum residual sum of squares
available with design matrix $Z$ must be no larger than the residual sum of squares using $\hat\beta_*$ with design
matrix $Z'$.


29. In Example 11.5.5, we are told that $\sigma' = 352.9$, so the residual sum of squares is 2864383. We can
calculate $\sum_{i=1}^n (y_i - \bar y_n)^2$ directly from the data in Table 11.13. It equals 26844478. It follows that

\[ R^2 = 1 - \frac{2864383}{26844478} = 0.893. \]
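As a quick check of Exercise 29, the residual sum of squares can be recovered from $\sigma'$ and then turned into $R^2$. In the sketch below the residual degrees of freedom (23) are an assumption, inferred from the fact that $352.9^2 \times 23$ reproduces the stated 2864383; the variable names are illustrative.

```python
sigma_prime = 352.9        # from Example 11.5.5, as quoted in the solution
df_resid = 23              # assumed n - p, inferred from the stated residual SS
total_ss = 26844478.0      # sum of (y_i - ybar)^2, as quoted in the solution

# sigma'^2 = S^2 / (n - p), so S^2 = sigma'^2 * (n - p).
resid_ss = sigma_prime**2 * df_resid
r2 = 1.0 - resid_ss / total_ss
print(round(resid_ss), round(r2, 3))
```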
30. Use the notation in the solution of Exercise 26. Suppose that $c'\beta = d$. Then $U$ has a noncentral $t$
distribution with $n - p$ degrees of freedom and noncentrality parameter $(d - c_*)/[\sigma(c'(Z'Z)^{-1}c)^{1/2}]$. The
argument is the same as any of those in the book that derived noncentral $t$ distributions.


11.6 Analysis of Variance 


Commentary


If one is using the software R, there is a more direct way to fit an analysis of variance model than to construct
the design matrix. Let y contain the observed responses arranged as in Eq. (11.6.1). Suppose also that x is
a vector of the same dimension as y, each of whose values is the first subscript i of Y_ij in Eq. (11.6.1), so that
the value of x identifies which of the p samples each observation comes from. Specifically, the first n_1 elements
of x would be 1, the next n_2 would be 2, etc. Then aovfit=lm(y~factor(x)) will fit the analysis of variance
model. The function factor converts the vector of integers into category identifiers. Then anova(aovfit)
will print the ANOVA table.


Solutions to Exercises 


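1. By analogy with Eq. (11.6.2),

\[ Z'Z = \begin{pmatrix} n_1 & 0 & \cdots & 0 \\ 0 & n_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & n_p \end{pmatrix}
\quad\text{and}\quad
(Z'Z)^{-1} = \begin{pmatrix} 1/n_1 & 0 & \cdots & 0 \\ 0 & 1/n_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/n_p \end{pmatrix}. \]
Also,
\[ Z'Y = \begin{pmatrix} \sum_{j=1}^{n_1} Y_{1j} \\ \vdots \\ \sum_{j=1}^{n_p} Y_{pj} \end{pmatrix}
\quad\text{and}\quad
(Z'Z)^{-1}Z'Y = \begin{pmatrix} \bar Y_{1+} \\ \vdots \\ \bar Y_{p+} \end{pmatrix}. \]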
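2. Let $A$ be the orthogonal matrix whose first row is $u$ in the statement of the problem. Define
\[ Y = \begin{pmatrix} n_1^{1/2}\bar Y_{1+} \\ \vdots \\ n_p^{1/2}\bar Y_{p+} \end{pmatrix}. \]

Let $v'$ be a vector that is orthogonal to $u$ (like all the other rows of $A$). Then $v'X = v'Y/\sigma$. Define

\begin{align*}
U &= AX = (U_1, \ldots, U_p)', \\
V &= AY = (V_1, \ldots, V_p)'.
\end{align*}

We just showed that $V_i/\sigma = U_i$ for $i = 2, \ldots, p$. Now,

\begin{align*}
X'X &= (AX)'(AX) = \sum_{i=1}^p U_i^2, \\
Y'Y &= (AY)'(AY) = \sum_{i=1}^p V_i^2.
\end{align*}

Notice that $V_1 = \sum_{i=1}^p n_i \bar Y_{i+}/n^{1/2} = n^{1/2}\bar Y_{++}$, hence

\[ \frac{S^2_{\mathrm{Betw}}}{\sigma^2} = \frac{1}{\sigma^2}\left(\sum_{i=1}^p n_i \bar Y_{i+}^2 - n\bar Y_{++}^2\right) = \frac{1}{\sigma^2}(Y'Y - V_1^2) = \sum_{i=2}^p \left(\frac{V_i}{\sigma}\right)^2 = \sum_{i=2}^p U_i^2. \]

Now, the coordinates of $X$ are i.i.d. standard normal random variables, so the coordinates of $U$ are
also i.i.d. standard normal random variables. Hence $\sum_{i=2}^p U_i^2 = S^2_{\mathrm{Betw}}/\sigma^2$ has a $\chi^2$ distribution with $p - 1$
degrees of freedom.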


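3. Use the definitions of $\bar Y_{i+}$ and $\bar Y_{++}$ in the text to compute

\begin{align*}
\sum_{i=1}^p n_i(\bar Y_{i+} - \bar Y_{++})^2 &= \sum_{i=1}^p n_i \bar Y_{i+}^2 - 2\bar Y_{++}\sum_{i=1}^p n_i \bar Y_{i+} + n\bar Y_{++}^2 \\
&= \sum_{i=1}^p n_i \bar Y_{i+}^2 - n\bar Y_{++}^2.
\end{align*}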


4. (a) It is found that $\bar Y_{1+} = 6.6$, $\bar Y_{2+} = 9.0$, and $\bar Y_{3+} = 10.2$. Also,

\[ \sum_{j=1}^{n_1} (Y_{1j} - \bar Y_{1+})^2 = 1.90, \quad \sum_{j=1}^{n_2} (Y_{2j} - \bar Y_{2+})^2 = 17.8, \quad \sum_{j=1}^{n_3} (Y_{3j} - \bar Y_{3+})^2 = 5.54. \]

Hence, by Eq. (11.6.4), $\hat\sigma^2 = (1.90 + 17.8 + 5.54)/13 = 1.942$.

(b) It is found that $\bar Y_{++} = 8.538$. Hence, by Eq. (11.6.9), $U^2 = 10(24.591)/[2(25.24)] = 4.871$. When
$H_0$ is true, the statistic $U^2$ has the $F$ distribution with 2 and 10 degrees of freedom. Therefore,
the tail area corresponding to the value $U^2 = 4.871$ is between 0.025 and 0.05.


5. In this problem $n_i = 10$ for $i = 1, 2, 3, 4$, and $\bar Y_{1+} = 105.7$, $\bar Y_{2+} = 102.0$, $\bar Y_{3+} = 93.5$, $\bar Y_{4+} = 110.8$, and
$\bar Y_{++} = 103$. Also,

\begin{align*}
\sum_{j=1}^{10} (Y_{1j} - \bar Y_{1+})^2 &= 303, & \sum_{j=1}^{10} (Y_{2j} - \bar Y_{2+})^2 &= 544, \\
\sum_{j=1}^{10} (Y_{3j} - \bar Y_{3+})^2 &= 250, & \sum_{j=1}^{10} (Y_{4j} - \bar Y_{4+})^2 &= 364.
\end{align*}

Therefore, by Eq. (11.6.9),

\[ U^2 = \frac{36(1593.8)}{3(1461)} = 13.09. \]

When $H_0$ is true, the statistic $U^2$ has the $F$ distribution with 3 and 36 degrees of freedom. The tail
area corresponding to the value $U^2 = 13.09$ is found to be less than 0.025.
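The computation in Exercise 5 uses only the group sizes, the group means, and the within-group sums of squares. A minimal sketch of that calculation follows; the function name `one_way_f` is hypothetical, and the first within-group sum of squares is taken to be 303, the value consistent with the residual total 1461 used in the solution.

```python
def one_way_f(group_sizes, group_means, within_ss):
    """Between SS, residual SS, and the F statistic U^2 for a one-way layout."""
    n = sum(group_sizes)
    p = len(group_sizes)
    grand_mean = sum(ni * m for ni, m in zip(group_sizes, group_means)) / n
    between = sum(ni * (m - grand_mean) ** 2
                  for ni, m in zip(group_sizes, group_means))
    resid = sum(within_ss)
    # U^2 = [S_Betw^2 / (p - 1)] / [S_Resid^2 / (n - p)]
    u2 = (between / (p - 1)) / (resid / (n - p))
    return between, resid, u2


between, resid, u2 = one_way_f(
    [10, 10, 10, 10],
    [105.7, 102.0, 93.5, 110.8],
    [303, 544, 250, 364],
)
print(round(between, 1), resid, round(u2, 2))
```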
6. The random variables $Q_1, \ldots, Q_p$ are independent, because each $Q_i$ is a function of a different group
of the observations in the sample. It is known that $Q_i/\sigma^2$ has a $\chi^2$ distribution with $n_i - 1$ degrees of
freedom. Therefore, $(Q_1 + \cdots + Q_p)/\sigma^2$ will have a $\chi^2$ distribution with $\sum_{i=1}^p (n_i - 1) = n - p$ degrees of
freedom. Since $Q_1$ and $Q_p$ are independent,
\[ \frac{Q_1/[\sigma^2(n_1 - 1)]}{Q_p/[\sigma^2(n_p - 1)]} \]
will have the $F$ distribution with $n_1 - 1$
and $n_p - 1$ degrees of freedom. In other words, $(n_p - 1)Q_1/[(n_1 - 1)Q_p]$ will have that $F$ distribution.


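7. If $U$ is defined by Eq. (9.6.3), then

\[ U^2 = \frac{(m + n - 2)(\bar X_m - \bar Y_n)^2}{\left(\frac{1}{m} + \frac{1}{n}\right)(S_X^2 + S_Y^2)}. \]

The correspondence between the notation of Sec. 9.6 and our present notation is as follows:

Notation of Sec. 9.6     Present notation
m                        n_1
n                        n_2
\bar X_m                 \bar Y_{1+}
\bar Y_n                 \bar Y_{2+}
S_X^2                    \sum_{j=1}^{n_1} (Y_{1j} - \bar Y_{1+})^2
S_Y^2                    \sum_{j=1}^{n_2} (Y_{2j} - \bar Y_{2+})^2

Since $p = 2$, $\bar Y_{++} = n_1\bar Y_{1+}/n + n_2\bar Y_{2+}/n$. Therefore,
\begin{align*}
n_1(\bar Y_{1+} - \bar Y_{++})^2 + n_2(\bar Y_{2+} - \bar Y_{++})^2
&= n_1\left(\frac{n_2}{n}\right)^2 (\bar Y_{1+} - \bar Y_{2+})^2 + n_2\left(\frac{n_1}{n}\right)^2 (\bar Y_{1+} - \bar Y_{2+})^2 \\
&= \frac{n_1 n_2}{n}(\bar Y_{1+} - \bar Y_{2+})^2,
\end{align*}
since $n_1 + n_2 = n$.

Also, since $m + n$ in the notation of Sec. 9.6 is simply $n$ in our present notation, we can now rewrite
the expression for $U^2$ as follows:

\[ U^2 = \frac{(n - 2)\sum_{i=1}^2 n_i(\bar Y_{i+} - \bar Y_{++})^2}{\sum_{i=1}^2 \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_{i+})^2}. \]

This expression reduces to the expression for $U^2$ given in Eq. (11.6.9), with $p = 2$.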


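8.
\begin{align*}
E\left[\frac{1}{n - p}\sum_{i=1}^p \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_{i+})^2\right]
&= \frac{1}{n - p}\sum_{i=1}^p E\left[\sum_{j=1}^{n_i} (Y_{ij} - \bar Y_{i+})^2\right] \\
&= \frac{1}{n - p}\sum_{i=1}^p (n_i - 1)\sigma^2, \quad\text{by the results of Sec. 8.7,} \\
&= \frac{1}{n - p}(n - p)\sigma^2 = \sigma^2.
\end{align*}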


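9. Each of the three given random variables is a linear function of the independent random variables
$Y_{rs}$ ($s = 1, \ldots, n_r$ and $r = 1, \ldots, p$).

Let $W_1 = \sum_{r,s} a_{rs}Y_{rs}$, $W_2 = \sum_{r,s} b_{rs}Y_{rs}$, and $W_3 = \sum_{r,s} c_{rs}Y_{rs}$. We have $a_{ij} = 1 - \frac{1}{n_i}$, $a_{is} = -\frac{1}{n_i}$ for
$s \ne j$, and $a_{rs} = 0$ for $r \ne i$. Also, $b_{i's} = \frac{1}{n_{i'}} - \frac{1}{n}$ and $b_{rs} = -\frac{1}{n}$ for $r \ne i'$. Finally, $c_{rs} = \frac{1}{n}$ for all
values of $r$ and $s$.

Now,
\[ \mathrm{Cov}(W_1, W_2) = \mathrm{Cov}\left(\sum_{r,s} a_{rs}Y_{rs},\ \sum_{r',s'} b_{r's'}Y_{r's'}\right) = \sum_{r,s}\sum_{r',s'} a_{rs}b_{r's'}\,\mathrm{Cov}(Y_{rs}, Y_{r's'}). \]

But $\mathrm{Cov}(Y_{rs}, Y_{r's'}) = 0$ unless $r = r'$ and $s = s'$, since any two distinct $Y$'s are independent. Also,
$\mathrm{Cov}(Y_{rs}, Y_{rs}) = \mathrm{Var}(Y_{rs}) = \sigma^2$.

Therefore, $\mathrm{Cov}(W_1, W_2) = \sigma^2 \sum_{r,s} a_{rs}b_{rs}$. If $i = i'$,

\[ \sum_{r,s} a_{rs}b_{rs} = a_{ij}b_{ij} + \sum_{s \ne j} a_{is}b_{is} + 0 = \left(1 - \frac{1}{n_i}\right)\left(\frac{1}{n_i} - \frac{1}{n}\right) + (n_i - 1)\left(-\frac{1}{n_i}\right)\left(\frac{1}{n_i} - \frac{1}{n}\right) = 0. \]

If $i \ne i'$,
\[ \sum_{r,s} a_{rs}b_{rs} = \left(1 - \frac{1}{n_i}\right)\left(-\frac{1}{n}\right) + (n_i - 1)\left(-\frac{1}{n_i}\right)\left(-\frac{1}{n}\right) = 0. \]
Similarly,

\[ \mathrm{Cov}(W_1, W_3) = \sigma^2 \sum_{r,s} a_{rs}c_{rs} = \sigma^2 \frac{1}{n}\left(a_{ij} + \sum_{s \ne j} a_{is}\right) = 0. \]

Finally,

\[ \mathrm{Cov}(W_2, W_3) = \sigma^2 \sum_{r,s} b_{rs}c_{rs} = \frac{\sigma^2}{n}\sum_{r,s} b_{rs} = 0. \]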


10. (a) The three group averages are 825.8, 845.0, and 775.0. The residual sum of squares is 1671. The
ANOVA table is then

Source of variation   Degrees of freedom   Sum of squares   Mean square
Between samples        2                   15703            7851
Residuals             15                    1671            111.4
Total                 17                   17374

(b) The F statistic is 7851/111.4 = 70.48. Comparing this to the F distribution with 2 and 15 degrees
of freedom, we get a p-value of essentially 0.

11. Write $S^2_{\mathrm{Betw}} = \sum_{i=1}^p n_i \bar Y_{i+}^2 - n\bar Y_{++}^2$. Recall that the mean of the square of a normal random variable with
mean $\mu$ and variance $\sigma^2$ is $\mu^2 + \sigma^2$. The distribution of $\bar Y_{i+}$ is the normal distribution with mean $\mu_i$
and variance $\sigma^2/n_i$, while the distribution of $\bar Y_{++}$ is the normal distribution with mean $\bar\mu$ and variance
$\sigma^2/n$. Hence the mean of $S^2_{\mathrm{Betw}}$ is

\[ \sum_{i=1}^p n_i(\mu_i^2 + \sigma^2/n_i) - n(\bar\mu^2 + \sigma^2/n). \]

If we collect terms in this sum, we get $(p - 1)\sigma^2 + \sum_{i=1}^p n_i\mu_i^2 - n\bar\mu^2$. This simplifies to the expression
stated in the exercise.

12.

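If the null hypothesis is true, the likelihood is

\[ (2\pi)^{-n/2}\sigma^{-n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^p \sum_{j=1}^{n_i} (Y_{ij} - \mu)^2\right). \]

This is maximized by choosing $\mu = \bar Y_{++}$ and

\[ \sigma^2 = \frac{1}{n}\sum_{i=1}^p \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_{++})^2 = \frac{1}{n}S^2_{\mathrm{Tot}}. \]

The resulting maximum value of the likelihood is
\[ \frac{n^{n/2}}{(2\pi)^{n/2}(S^2_{\mathrm{Tot}})^{n/2}}\exp\left(-\frac{n}{2}\right). \qquad \text{(S.11.4)} \]

If the null hypothesis is false, the likelihood is

\[ (2\pi)^{-n/2}\sigma^{-n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^p \sum_{j=1}^{n_i} (Y_{ij} - \mu_i)^2\right). \]

This is maximized by choosing $\mu_i = \bar Y_{i+}$ and
\[ \sigma^2 = \frac{1}{n}\sum_{i=1}^p \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_{i+})^2 = \frac{1}{n}S^2_{\mathrm{Resid}}. \]
The resulting maximum value of the likelihood is
\[ \frac{n^{n/2}}{(2\pi)^{n/2}(S^2_{\mathrm{Resid}})^{n/2}}\exp\left(-\frac{n}{2}\right). \qquad \text{(S.11.5)} \]

The ratio of (S.11.5) to (S.11.4) is

\[ \left(\frac{S^2_{\mathrm{Tot}}}{S^2_{\mathrm{Resid}}}\right)^{n/2} = \left(1 + \frac{S^2_{\mathrm{Betw}}}{S^2_{\mathrm{Resid}}}\right)^{n/2}. \]

Rejecting $H_0$ when this ratio is greater than $k$ is equivalent to rejecting when $U^2 > k'$ for some other
constant $k'$. In order for the test to have level $\alpha_0$, $k'$ must equal the $1 - \alpha_0$ quantile of the distribution
of $U^2$ when $H_0$ is true. We saw in the text that this distribution is the $F$ distribution with $p - 1$ and
$n - p$ degrees of freedom.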
13. We know that $S^2_{\mathrm{Tot}} = S^2_{\mathrm{Betw}} + S^2_{\mathrm{Resid}}$. We also know that $S^2_{\mathrm{Betw}}/\sigma^2$ and $S^2_{\mathrm{Resid}}/\sigma^2$ are independent. If
$H_0$ is true, then they both have $\chi^2$ distributions, one with $p - 1$ degrees of freedom (see Exercise 2)
and the other with $n - p$ degrees of freedom as in the text. The sum of two independent $\chi^2$ random
variables has a $\chi^2$ distribution with the sum of the degrees of freedom, in this case $n - 1$.


14. (a) (The exercise should have asked you to prove that $\sum_{i=1}^p n_i\alpha_i = 0$.) We see that
$\sum_{i=1}^p n_i\alpha_i = \sum_{i=1}^p n_i\mu_i - n\mu = 0$.

(b) The M.L.E. of $\mu_i$ is $\bar Y_{i+}$ and the M.L.E. of $\mu$ is $\bar Y_{++}$, so the M.L.E. of $\alpha_i$ is $\bar Y_{i+} - \bar Y_{++}$.

(c) Notice that all of the $\mu_i$ equal each other if and only if they all equal $\mu$, if and only if all $\alpha_i = 0$.

(d) This fact was proven in Exercise 11 with slightly different notation.


11.7 The Two-Way Layout 


Commentary


If one is using the software R, one can fit a two-way layout using lm with two factor variables. As in the
Commentary to Sec. 11.6 above, let y contain the observed responses, and let x1 and x2 be two factor
variables giving the levels of the two factors in the layout. Then aovfit=lm(y~x1+x2) will fit the model,
and anova(aovfit) will print the ANOVA table.


Solutions to Exercises 


1. Write $S_A^2 = J\sum_{i=1}^I \bar Y_{i+}^2 - IJ\bar Y_{++}^2$. Recall that the mean of the square of a normal random variable with
mean $\mu$ and variance $\sigma^2$ is $\mu^2 + \sigma^2$. The distribution of $\bar Y_{i+}$ is the normal distribution with mean $\mu_i$
and variance $\sigma^2/J$, while the distribution of $\bar Y_{++}$ is the normal distribution with mean $\mu$ and variance
$\sigma^2/(IJ)$. Hence the mean of $S_A^2$ is $J\sum_{i=1}^I (\mu_i^2 + \sigma^2/J) - IJ(\mu^2 + \sigma^2/[IJ])$. If we collect terms in this sum,
we get

\[ (I - 1)\sigma^2 + J\sum_{i=1}^I \mu_i^2 - IJ\mu^2 = (I - 1)\sigma^2 + J\sum_{i=1}^I (\mu_i - \mu)^2 = (I - 1)\sigma^2 + J\sum_{i=1}^I \alpha_i^2. \]


2. In each part of this exercise, let $\mu_{ij}$ denote the element in row $i$ and column $j$ of the given matrix.

(a) The effects are not additive because $\mu_{21} - \mu_{11} = 5 \ne \mu_{22} - \mu_{12} = 7$.

(b) The effects are additive because each element in the second row is 1 unit larger than the
corresponding element in the first row. Alternatively, we could say that the effects are additive because
each element in the second column is 3 units larger than the corresponding element in the first
column.

(c) The effects are additive because each element in the first row is 5 units smaller than the
corresponding element in the second row and is 1 unit smaller than the corresponding element in the
third row.

(d) The effects are not additive because, for example, $\mu_{21} - \mu_{11} = 1 \ne \mu_{22} - \mu_{12} = 2$.
3. If the effects are additive, then there exist numbers $\Theta_i$ and $\Psi_j$ such that Eq. (11.7.1) is satisfied for
$i = 1, \ldots, I$ and $j = 1, \ldots, J$. Let $\bar\Theta = \frac{1}{I}\sum_{i=1}^I \Theta_i$ and $\bar\Psi = \frac{1}{J}\sum_{j=1}^J \Psi_j$, and define

\begin{align*}
\mu &= \bar\Theta + \bar\Psi, \\
\alpha_i &= \Theta_i - \bar\Theta \quad\text{for } i = 1, \ldots, I, \\
\beta_j &= \Psi_j - \bar\Psi \quad\text{for } j = 1, \ldots, J.
\end{align*}

Then it follows that Eqs. (11.7.2) and (11.7.3) will be satisfied.

It remains to be shown that no other set of values of $\mu$, $\alpha_i$, and $\beta_j$ will satisfy Eqs. (11.7.2) and (11.7.3).
Suppose that $\mu'$, $\alpha_i'$, and $\beta_j'$ are another such set of values. Then $\mu + \alpha_i + \beta_j = \mu' + \alpha_i' + \beta_j'$ for all $i$
and $j$. By summing both sides of this relation over $i$ and $j$, we obtain the relation $IJ\mu = IJ\mu'$. Hence,
$\mu = \mu'$. It follows, therefore, that $\alpha_i + \beta_j = \alpha_i' + \beta_j'$ for all $i$ and $j$. By summing both sides of this
relation over $j$, we obtain the result $\alpha_i = \alpha_i'$ for every value of $i$. Similarly, by summing both sides over
$i$, we obtain the relation $\beta_j = \beta_j'$ for every value of $j$.


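4. If the effects are additive, so that Eq. (11.7.1) is satisfied, and we denote the elements of the
matrix by $\mu_{ij} = \Theta_i + \Psi_j$, then $\bar\mu_{++} = \bar\Theta + \bar\Psi$, $\bar\mu_{i+} = \Theta_i + \bar\Psi$, and $\bar\mu_{+j} = \bar\Theta + \Psi_j$. Therefore,
$\mu_{ij} = \bar\mu_{++} + (\bar\mu_{i+} - \bar\mu_{++}) + (\bar\mu_{+j} - \bar\mu_{++})$. Hence, it can be verified that $\mu = \bar\mu_{++}$, $\alpha_i = \bar\mu_{i+} - \bar\mu_{++}$, and
$\beta_j = \bar\mu_{+j} - \bar\mu_{++}$. In this exercise,

\begin{align*}
\bar\mu_{++} &= \tfrac{1}{4}(3 + 6 + 4 + 7) = 5, \\
\bar\mu_{1+} &= \tfrac{1}{2}(3 + 6) = 4.5, & \bar\mu_{2+} &= \tfrac{1}{2}(4 + 7) = 5.5, \\
\bar\mu_{+1} &= \tfrac{1}{2}(3 + 4) = 3.5, & \bar\mu_{+2} &= \tfrac{1}{2}(6 + 7) = 6.5.
\end{align*}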
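5. In this exercise,

\begin{align*}
\bar\mu_{++} &= \frac{1}{12}\sum_{i=1}^3 \sum_{j=1}^4 \mu_{ij} = \frac{1}{12}(39) = 3.25, \\
\bar\mu_{1+} &= \frac{5}{4} = 1.25, \quad \bar\mu_{2+} = \frac{25}{4} = 6.25, \quad \bar\mu_{3+} = \frac{9}{4} = 2.25, \\
\bar\mu_{+1} &= \frac{15}{3} = 5, \quad \bar\mu_{+2} = \frac{3}{3} = 1, \quad \bar\mu_{+3} = \frac{6}{3} = 2, \quad \bar\mu_{+4} = \frac{15}{3} = 5.
\end{align*}

It follows that $\alpha_1 = 1.25 - 3.25 = -2$, $\alpha_2 = 6.25 - 3.25 = 3$, and $\alpha_3 = 2.25 - 3.25 = -1$. Also,
$\beta_1 = 5 - 3.25 = 1.75$, $\beta_2 = 1 - 3.25 = -2.25$, $\beta_3 = 2 - 3.25 = -1.25$, and $\beta_4 = 5 - 3.25 = 1.75$.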
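The decomposition used in Exercises 4 and 5 (grand mean, row effects, column effects) is easy to automate. The sketch below uses a hypothetical helper `additive_decomposition` and, as an assumption, the 2 x 2 matrix with rows (3, 6) and (4, 7) that the averages in Exercise 4 imply; it also reports whether the matrix is exactly additive.

```python
def additive_decomposition(matrix):
    """Return (mu, alpha, beta, is_additive) for a rectangular matrix of means."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    mu = sum(sum(row) for row in matrix) / (n_rows * n_cols)
    # Row effects: row average minus grand average.
    alpha = [sum(row) / n_cols - mu for row in matrix]
    # Column effects: column average minus grand average.
    beta = [sum(matrix[i][j] for i in range(n_rows)) / n_rows - mu
            for j in range(n_cols)]
    # Effects are additive when every cell equals mu + alpha_i + beta_j.
    is_additive = all(
        abs(matrix[i][j] - (mu + alpha[i] + beta[j])) < 1e-9
        for i in range(n_rows) for j in range(n_cols)
    )
    return mu, alpha, beta, is_additive


# Assumed matrix for Exercise 4, inferred from the averages in its solution.
mu, alpha, beta, is_additive = additive_decomposition([[3, 6], [4, 7]])
print(mu, alpha, beta, is_additive)
```

By construction the returned row effects and column effects each sum to zero, which is the constraint in Eq. (11.7.3).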


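6. $\sum_{i=1}^I \hat\alpha_i = \sum_{i=1}^I (\bar Y_{i+} - \bar Y_{++}) = \frac{1}{J}\sum_{i=1}^I \sum_{j=1}^J Y_{ij} - \frac{1}{J}\sum_{i=1}^I \sum_{j=1}^J Y_{ij} = 0$. A similar argument shows that $\sum_{j=1}^J \hat\beta_j = 0$.
Also, since $E(Y_{ij}) = \mu + \alpha_i + \beta_j$,
\begin{align*}
E(\hat\mu) &= E(\bar Y_{++}) = \frac{1}{IJ}\sum_{i=1}^I \sum_{j=1}^J E(Y_{ij}) = \frac{1}{IJ}\sum_{i=1}^I \sum_{j=1}^J (\mu + \alpha_i + \beta_j) = \frac{1}{IJ}(IJ\mu + 0 + 0) = \mu, \\
E(\hat\alpha_i) &= E(\bar Y_{i+} - \bar Y_{++}) = E\left(\frac{1}{J}\sum_{j=1}^J Y_{ij}\right) - \mu = \frac{1}{J}\sum_{j=1}^J (\mu + \alpha_i + \beta_j) - \mu = \alpha_i.
\end{align*}

A similar argument shows that $E(\hat\beta_j) = \beta_j$.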
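7. $\mathrm{Var}(\hat\mu) = \mathrm{Var}\left(\frac{1}{IJ}\sum_{i=1}^I \sum_{j=1}^J Y_{ij}\right) = \frac{1}{I^2J^2}\sum_{i=1}^I \sum_{j=1}^J \mathrm{Var}(Y_{ij}) = \frac{1}{I^2J^2}IJ\sigma^2 = \frac{\sigma^2}{IJ}$.

The estimator $\hat\alpha_i$ is a linear function of the $IJ$ independent random variables $Y_{rs}$ ($r = 1, \ldots, I$ and
$s = 1, \ldots, J$). If we let $\hat\alpha_i = \sum_{r=1}^I \sum_{s=1}^J a_{rs}Y_{rs}$, then it can be found that $a_{is} = \frac{1}{J} - \frac{1}{IJ}$ for $s = 1, \ldots, J$
and $a_{rs} = -\frac{1}{IJ}$ for $r \ne i$. Therefore,

\[ \mathrm{Var}(\hat\alpha_i) = \sum_{r,s} a_{rs}^2\sigma^2 = \sigma^2\left[J\left(\frac{1}{J} - \frac{1}{IJ}\right)^2 + (I - 1)J\left(\frac{1}{IJ}\right)^2\right] = \frac{I - 1}{IJ}\sigma^2. \]

The value of $\mathrm{Var}(\hat\beta_j)$ can be found similarly.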


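8. If the square on the right of Eq. (11.7.9) is expanded and all the summations are then performed, we
will obtain the right side of Eq. (11.7.8) provided that the summation of each of the cross product
terms is 0. We shall now establish this result. Note that

\[ \sum_{j=1}^J (Y_{ij} - \bar Y_{i+} - \bar Y_{+j} + \bar Y_{++}) = J\bar Y_{i+} - J\bar Y_{i+} - J\bar Y_{++} + J\bar Y_{++} = 0. \]

Similarly, $\sum_{i=1}^I (Y_{ij} - \bar Y_{i+} - \bar Y_{+j} + \bar Y_{++}) = 0$. Therefore,
\begin{align*}
\sum_{i=1}^I \sum_{j=1}^J (\bar Y_{i+} - \bar Y_{++})(Y_{ij} - \bar Y_{i+} - \bar Y_{+j} + \bar Y_{++}) &= \sum_{i=1}^I (\bar Y_{i+} - \bar Y_{++})\sum_{j=1}^J (Y_{ij} - \bar Y_{i+} - \bar Y_{+j} + \bar Y_{++}) = 0, \\
\sum_{i=1}^I \sum_{j=1}^J (\bar Y_{+j} - \bar Y_{++})(Y_{ij} - \bar Y_{i+} - \bar Y_{+j} + \bar Y_{++}) &= \sum_{j=1}^J (\bar Y_{+j} - \bar Y_{++})\sum_{i=1}^I (Y_{ij} - \bar Y_{i+} - \bar Y_{+j} + \bar Y_{++}) = 0.
\end{align*}
Finally,
\[ \sum_{i=1}^I \sum_{j=1}^J (\bar Y_{i+} - \bar Y_{++})(\bar Y_{+j} - \bar Y_{++}) = \sum_{i=1}^I (\bar Y_{i+} - \bar Y_{++})\sum_{j=1}^J (\bar Y_{+j} - \bar Y_{++}) = 0 \times 0. \]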


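9. Each of the four given random variables is a linear function of the independent random variables
$Y_{rs}$ ($r = 1, \ldots, I$ and $s = 1, \ldots, J$). Let

\[ W_1 = \sum_{r=1}^I \sum_{s=1}^J a_{rs}Y_{rs}, \qquad W_2 = \sum_{r=1}^I \sum_{s=1}^J b_{rs}Y_{rs}, \qquad W_3 = \sum_{r=1}^I \sum_{s=1}^J c_{rs}Y_{rs}. \]

Then
\begin{align*}
a_{ij} &= 1 - \frac{1}{J} - \frac{1}{I} + \frac{1}{IJ}, & a_{rj} &= -\frac{1}{I} + \frac{1}{IJ} \ \text{ for } r \ne i, \\
a_{is} &= -\frac{1}{J} + \frac{1}{IJ} \ \text{ for } s \ne j, & a_{rs} &= \frac{1}{IJ} \ \text{ for } r \ne i \text{ and } s \ne j.
\end{align*}
Also,
\begin{align*}
b_{i's} &= \frac{1}{J} - \frac{1}{IJ} & \text{and} \quad b_{rs} &= -\frac{1}{IJ} \ \text{ for } r \ne i', \\
c_{rj'} &= \frac{1}{I} - \frac{1}{IJ} & \text{and} \quad c_{rs} &= -\frac{1}{IJ} \ \text{ for } s \ne j'.
\end{align*}

As in Exercise 12 of Sec. 11.5, $\mathrm{Cov}(W_1, W_2) = \sigma^2 \sum_{r,s} a_{rs}b_{rs}$. If $i = i'$,

\begin{align*}
\sum_{r,s} a_{rs}b_{rs} &= \left(1 - \frac{1}{J} - \frac{1}{I} + \frac{1}{IJ}\right)\left(\frac{1}{J} - \frac{1}{IJ}\right) + (J - 1)\left(-\frac{1}{J} + \frac{1}{IJ}\right)\left(\frac{1}{J} - \frac{1}{IJ}\right) \\
&\quad + (I - 1)\left(-\frac{1}{I} + \frac{1}{IJ}\right)\left(-\frac{1}{IJ}\right) + (I - 1)(J - 1)\left(\frac{1}{IJ}\right)\left(-\frac{1}{IJ}\right) = 0.
\end{align*}

If $i \ne i'$,

\begin{align*}
\sum_{r,s} a_{rs}b_{rs} &= \left(1 - \frac{1}{J} - \frac{1}{I} + \frac{1}{IJ}\right)\left(-\frac{1}{IJ}\right) && (i, j \text{ term}) \\
&\quad + (J - 1)\left(-\frac{1}{J} + \frac{1}{IJ}\right)\left(-\frac{1}{IJ}\right) && (i, s \text{ terms for } s \ne j) \\
&\quad + \left(-\frac{1}{I} + \frac{1}{IJ}\right)\left(\frac{1}{J} - \frac{1}{IJ}\right) && (i', j \text{ term}) \\
&\quad + (J - 1)\left(\frac{1}{IJ}\right)\left(\frac{1}{J} - \frac{1}{IJ}\right) && (i', s \text{ terms for } s \ne j) \\
&\quad + (I - 2)\left(-\frac{1}{I} + \frac{1}{IJ}\right)\left(-\frac{1}{IJ}\right) && (r, j \text{ terms for } r \ne i, i') \\
&\quad + (I - 2)(J - 1)\left(\frac{1}{IJ}\right)\left(-\frac{1}{IJ}\right) && (r, s \text{ terms for } r \ne i, i' \text{ and } s \ne j) \\
&= 0.
\end{align*}

Similarly, the covariance between any other pair of the four variables can be shown to be 0.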


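10. $\sum_{i=1}^I (\bar Y_{i+} - \bar Y_{++})^2 = \sum_{i=1}^I \bar Y_{i+}^2 - 2\bar Y_{++}\sum_{i=1}^I \bar Y_{i+} + I\bar Y_{++}^2 = \sum_{i=1}^I \bar Y_{i+}^2 - 2I\bar Y_{++}^2 + I\bar Y_{++}^2 = \sum_{i=1}^I \bar Y_{i+}^2 - I\bar Y_{++}^2$.

The other part of this exercise is proved similarly.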


11.
\begin{align*}
\sum_{i=1}^I \sum_{j=1}^J (Y_{ij} - \bar Y_{i+} - \bar Y_{+j} + \bar Y_{++})^2
&= \sum_{i}\sum_{j} Y_{ij}^2 + J\sum_{i} \bar Y_{i+}^2 + I\sum_{j} \bar Y_{+j}^2 + IJ\bar Y_{++}^2 \\
&\quad - 2\sum_{i} \bar Y_{i+}\sum_{j} Y_{ij} - 2\sum_{j} \bar Y_{+j}\sum_{i} Y_{ij} + 2\bar Y_{++}\sum_{i}\sum_{j} Y_{ij} \\
&\quad + 2\sum_{i} \bar Y_{i+}\sum_{j} \bar Y_{+j} - 2J\bar Y_{++}\sum_{i} \bar Y_{i+} - 2I\bar Y_{++}\sum_{j} \bar Y_{+j} \\
&= \sum_{i}\sum_{j} Y_{ij}^2 + J\sum_{i} \bar Y_{i+}^2 + I\sum_{j} \bar Y_{+j}^2 + IJ\bar Y_{++}^2 \\
&\quad - 2J\sum_{i} \bar Y_{i+}^2 - 2I\sum_{j} \bar Y_{+j}^2 + 2IJ\bar Y_{++}^2 \\
&\quad + 2IJ\bar Y_{++}^2 - 2IJ\bar Y_{++}^2 - 2IJ\bar Y_{++}^2 \\
&= \sum_{i}\sum_{j} Y_{ij}^2 - J\sum_{i} \bar Y_{i+}^2 - I\sum_{j} \bar Y_{+j}^2 + IJ\bar Y_{++}^2.
\end{align*}

12. It is found that $\bar Y_{1+} = 17.4$, $\bar Y_{2+} = 15.94$, $\bar Y_{3+} = 17.08$, $\bar Y_{+1} = 15.1$, $\bar Y_{+2} = 14.6$, $\bar Y_{+3} = 15.5334$, $\bar Y_{+4} =
19.5667$, $\bar Y_{+5} = 19.2333$, and $\bar Y_{++} = 16.8067$. The values of $\hat\mu$, $\hat\alpha_i$ for $i = 1, 2, 3$, and $\hat\beta_j$ for $j = 1, \ldots, 5$, can
now be obtained from Eq. (11.7.6).

13. The estimate of $E(Y_{ij})$ is $\hat\mu + \hat\alpha_i + \hat\beta_j$. From the values given in the solution of Exercise 12, we therefore
obtain the following table of estimated expectations:

       1         2         3         4        5
1   15.6933   15.1933   16.1267   20.16   19.8267
2   14.2333   13.7333   14.6667   18.7    18.3667
3   15.3733   14.8733   15.8067   19.84   19.5067

Furthermore, Theorem 11.7.1 says that the M.L.E. of $\sigma^2$ is
\[ \hat\sigma^2 = \frac{1}{15}(29.470667) = 1.9647. \]

14. It is found from Eq. (11.7.12) that

\[ U_A^2 = \frac{20(1.177865)}{29.470667} = 0.799. \]

When the null hypothesis is true, $U_A^2$ will have the $F$ distribution with $I - 1 = 2$ and $(I - 1)(J - 1) = 8$
degrees of freedom. The tail area corresponding to the value just calculated is found to be greater than
0.05.

15. It is found from Eq. (11.7.13) that

\[ U_B^2 = \frac{6(22.909769)}{29.470667} = 4.664. \]

When the null hypothesis is true, $U_B^2$ will have the $F$ distribution with $J - 1 = 4$ and $(I - 1)(J - 1) = 8$
degrees of freedom. The tail area corresponding to the value just calculated is found to be between
0.025 and 0.05.

16. If the null hypothesis in (11.7.15) is true, then all $Y_{ij}$ have the same mean $\mu$. The random variables
$S_A^2/\sigma^2$, $S_B^2/\sigma^2$, and $S^2_{\mathrm{Resid}}/\sigma^2$ are independent, and their distributions are $\chi^2$ with $I - 1$, $J - 1$, and
$(I - 1)(J - 1)$ degrees of freedom respectively. Hence $(S_A^2 + S_B^2)/\sigma^2$ has the $\chi^2$ distribution with $I + J - 2$
degrees of freedom and is independent of $\sigma'$. The conclusion now follows directly from the definition of
the $F$ distribution.
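The numbers in the two-way layout solutions above (the table of estimated expectations, the M.L.E. of $\sigma^2$, and the two $F$ statistics) can be reproduced from the row and column averages alone. The sketch below is illustrative: the variable names are hypothetical, and the grand mean is recomputed as the average of the three row averages rather than copied from the text.

```python
row_means = [17.4, 15.94, 17.08]
col_means = [15.1, 14.6, 15.5334, 19.5667, 19.2333]
resid_ss = 29.470667                      # residual sum of squares from the solution
n_rows, n_cols = len(row_means), len(col_means)

grand_mean = sum(row_means) / n_rows

# Estimated E(Y_ij) = mu_hat + alpha_hat_i + beta_hat_j
#                   = row mean + column mean - grand mean.
fitted = [[r + c - grand_mean for c in col_means] for r in row_means]

sigma2_mle = resid_ss / (n_rows * n_cols)

# S_A^2 = J * sum_i (row mean - grand mean)^2; S_B^2 = I * sum_j (col mean - grand mean)^2.
s_a2 = n_cols * sum((r - grand_mean) ** 2 for r in row_means)
s_b2 = n_rows * sum((c - grand_mean) ** 2 for c in col_means)
df_resid = (n_rows - 1) * (n_cols - 1)
u2_a = (s_a2 / (n_rows - 1)) / (resid_ss / df_resid)
u2_b = (s_b2 / (n_cols - 1)) / (resid_ss / df_resid)

print([round(x, 4) for x in fitted[0]])
print(round(sigma2_mle, 4), round(u2_a, 3), round(u2_b, 3))
```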
11.8 The Two-Way Layout with Replications


Commentary 


This section is optional. There is reference to some of the material in this section in Chapter 12. In 
particular, Example 12.3.4 shows how to use simulation to compute the size of the two-stage test procedure 
that is described in Sec. 11.8. 

If one is using the software R, and if one has factor variables x1 and x2 (as in the Commentary to
Sec. 11.7) giving the levels of the two factors in a two-way layout with replication, then the following
commands will fit the model and print the ANOVA table:
aovfit=lm(y~x1*x2)
anova(aovfit)


Solutions to Exercises 


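1. Let $\mu = \bar\theta_{++}$, $\alpha_i = \bar\theta_{i+} - \bar\theta_{++}$, $\beta_j = \bar\theta_{+j} - \bar\theta_{++}$, and $\gamma_{ij} = \theta_{ij} - \bar\theta_{i+} - \bar\theta_{+j} + \bar\theta_{++}$ for $i = 1, \ldots, I$
and $j = 1, \ldots, J$. Then it can be verified that Eqs. (11.8.4) and (11.8.5) are satisfied. It remains to be
shown that no other set of values of $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_{ij}$ will satisfy Eqs. (11.8.4) and (11.8.5).

Suppose that $\mu'$, $\alpha_i'$, $\beta_j'$, and $\gamma_{ij}'$ are another such set of values. Then, for all values of $i$ and $j$,

\[ \mu + \alpha_i + \beta_j + \gamma_{ij} = \mu' + \alpha_i' + \beta_j' + \gamma_{ij}'. \]

By summing both sides of this equation over $i$ and $j$, we obtain the relation $IJ\mu = IJ\mu'$. Hence, $\mu = \mu'$.
It follows, therefore, that for all values of $i$ and $j$,

\[ \alpha_i + \beta_j + \gamma_{ij} = \alpha_i' + \beta_j' + \gamma_{ij}'. \]

By summing both sides of this equation over $j$, we obtain the relation $J\alpha_i = J\alpha_i'$. By summing both
sides of this equation over $i$, we obtain the relation $I\beta_j = I\beta_j'$. Hence, $\alpha_i = \alpha_i'$ and $\beta_j = \beta_j'$. It also
follows, therefore, that $\gamma_{ij} = \gamma_{ij}'$.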


2. Since $\bar Y_{ij+}$ is the M.L.E. of $\theta_{ij}$ for each $i$ and $j$, it follows from the definitions in Exercise 1 that the
M.L.E.'s of the various linear functions defined in Exercise 1 are the corresponding linear functions of
$\bar Y_{ij+}$. These are precisely the values in (11.8.6) and (11.8.7).

3. The values of $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_{ij}$ can be determined in each part of this problem from the given values of
$\theta_{ij}$ by applying the definitions given in the solution of Exercise 1.


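4.
\begin{align*}
\sum_{i=1}^I \hat\alpha_i &= \sum_{i=1}^I (\bar Y_{i++} - \bar Y_{+++}) = I\bar Y_{+++} - I\bar Y_{+++} = 0, \\
\sum_{i=1}^I \hat\gamma_{ij} &= \sum_{i=1}^I (\bar Y_{ij+} - \bar Y_{i++} - \bar Y_{+j+} + \bar Y_{+++}) = I\bar Y_{+j+} - I\bar Y_{+++} - I\bar Y_{+j+} + I\bar Y_{+++} = 0.
\end{align*}

The proofs that $\sum_{j=1}^J \hat\beta_j = 0$ and $\sum_{j=1}^J \hat\gamma_{ij} = 0$ are similar.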
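5.
\begin{align*}
E(\hat\mu) &= \frac{1}{IJK}\sum_{i,j,k} E(Y_{ijk}) = \frac{1}{IJK}\sum_{i,j,k} \theta_{ij} = \frac{1}{IJ}\sum_{i,j} \theta_{ij} = \bar\theta_{++} = \mu, \text{ by Exercise 1;} \\
E(\hat\alpha_i) &= \frac{1}{JK}\sum_{j,k} E(Y_{ijk}) - E(\bar Y_{+++}) \\
&= \frac{1}{JK}\sum_{j,k} \theta_{ij} - \bar\theta_{++}, \text{ by the first part of this exercise,} \\
&= \frac{1}{J}\sum_{j} \theta_{ij} - \bar\theta_{++} = \bar\theta_{i+} - \bar\theta_{++} = \alpha_i, \text{ by Exercise 1;} \\
E(\hat\beta_j) &= \beta_j, \text{ by a similar argument;} \\
E(\hat\gamma_{ij}) &= \frac{1}{K}\sum_{k} E(Y_{ijk}) - E(\hat\mu + \hat\alpha_i + \hat\beta_j), \text{ by Eq. (11.8.7),} \\
&= \theta_{ij} - \bar\theta_{i+} - \bar\theta_{+j} + \bar\theta_{++}, \text{ by the previous parts of this exercise,} \\
&= \gamma_{ij}, \text{ by Exercise 1.}
\end{align*}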


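6. The $IJK$ random variables $Y_{ijk}$ are independent and each has variance $\sigma^2$. Hence,
\[ \mathrm{Var}(\hat\mu) = \mathrm{Var}\left(\frac{1}{IJK}\sum_{i,j,k} Y_{ijk}\right) = \frac{1}{(IJK)^2}\sum_{i,j,k} \mathrm{Var}(Y_{ijk}) = \frac{IJK\sigma^2}{(IJK)^2} = \frac{\sigma^2}{IJK}. \]

The estimator $\hat\alpha_i$ is a linear function of the observations $Y_{ijk}$ of the form $\hat\alpha_i = \sum_{r,s,t} a_{rst}Y_{rst}$, where

\[ a_{ist} = \frac{1}{JK} - \frac{1}{IJK} \quad\text{and}\quad a_{rst} = -\frac{1}{IJK} \]

for $r \ne i$. Hence,

\[ \mathrm{Var}(\hat\alpha_i) = \sum_{r,s,t} a_{rst}^2\sigma^2 = \left[JK\left(\frac{1}{JK} - \frac{1}{IJK}\right)^2 + (I - 1)JK\left(\frac{1}{IJK}\right)^2\right]\sigma^2 = \frac{I - 1}{IJK}\sigma^2. \]

$\mathrm{Var}(\hat\beta_j)$ can be determined similarly. Finally, if we represent $\hat\gamma_{ij}$ in the form $\hat\gamma_{ij} = \sum_{r,s,t} c_{rst}Y_{rst}$, then
it follows from Eq. (11.8.7) that

\begin{align*}
c_{ijt} &= \frac{1}{K} - \frac{1}{JK} - \frac{1}{IK} + \frac{1}{IJK} && \text{for } t = 1, \ldots, K, \\
c_{rjt} &= -\frac{1}{IK} + \frac{1}{IJK} && \text{for } r \ne i \text{ and } t = 1, \ldots, K, \\
c_{ist} &= -\frac{1}{JK} + \frac{1}{IJK} && \text{for } s \ne j \text{ and } t = 1, \ldots, K, \\
c_{rst} &= \frac{1}{IJK} && \text{for } r \ne i,\ s \ne j, \text{ and } t = 1, \ldots, K.
\end{align*}
Therefore,
\begin{align*}
\mathrm{Var}(\hat\gamma_{ij}) &= \sum_{r,s,t} c_{rst}^2\sigma^2 \\
&= \left[K\left(\frac{1}{K} - \frac{1}{JK} - \frac{1}{IK} + \frac{1}{IJK}\right)^2 + (I - 1)K\left(-\frac{1}{IK} + \frac{1}{IJK}\right)^2\right. \\
&\quad \left. + (J - 1)K\left(-\frac{1}{JK} + \frac{1}{IJK}\right)^2 + (I - 1)(J - 1)K\left(\frac{1}{IJK}\right)^2\right]\sigma^2 \\
&= \frac{(I - 1)(J - 1)}{IJK}\sigma^2.
\end{align*}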
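7. First, write $S^2_{\mathrm{Tot}}$ as

\[ \sum_{i,j,k} \left[(Y_{ijk} - \bar Y_{ij+}) + (\bar Y_{i++} - \bar Y_{+++}) + (\bar Y_{+j+} - \bar Y_{+++}) + (\bar Y_{ij+} - \bar Y_{i++} - \bar Y_{+j+} + \bar Y_{+++})\right]^2. \qquad \text{(S.11.6)} \]

The sums of squares of the grouped terms in the summation in (S.11.6) are $S^2_{\mathrm{Resid}}$, $S_A^2$, $S_B^2$, and $S^2_{\mathrm{Int}}$.
Hence, to verify Eq. (11.8.9), it must be shown that the sum of each of the pairs of cross-products
of the grouped terms is 0. This is verified in a manner similar to what was done in the solution to
Exercise 8 in Sec. 11.7. Each of the sums of $(Y_{ijk} - \bar Y_{ij+})$ times one of the other terms is 0 by summing
over $k$ first. The sum of the product of the second and third grouped terms is 0 because the sum factors into
sums over $i$ and $j$ that are each 0. The other two sums are similar, and we shall illustrate this one:

\[ \sum_{i,j,k} (\bar Y_{i++} - \bar Y_{+++})(\bar Y_{ij+} - \bar Y_{i++} - \bar Y_{+j+} + \bar Y_{+++}). \]

Summing over $j$ first produces 0 in this sum. For the other one, sum over $i$ first.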


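8. Each of the five given random variables is a linear function of the $IJK$ independent observations $Y_{rst}$.
Let

\begin{align*}
\hat\alpha_{i_1} &= \sum_{r,s,t} a_{rst}Y_{rst}, \\
\hat\beta_{j_1} &= \sum_{r,s,t} b_{rst}Y_{rst}, \\
\hat\gamma_{i_2 j_2} &= \sum_{r,s,t} c_{rst}Y_{rst}, \\
Y_{ijk} - \bar Y_{ij+} &= \sum_{r,s,t} d_{rst}Y_{rst}.
\end{align*}

Of course, $\hat\mu = \sum_{r,s,t} Y_{rst}/[IJK]$. The values of $a_{rst}$, $b_{rst}$, and $c_{rst}$ were given in the solution of Exercise 6,
and
\begin{align*}
d_{ijk} &= 1 - \frac{1}{K}, \\
d_{ijt} &= -\frac{1}{K} \quad\text{for } t \ne k, \\
d_{rst} &= 0 \quad\text{otherwise.}
\end{align*}

To show that $\hat\alpha_{i_1}$ and $\hat\beta_{j_1}$ are uncorrelated, for example, it must be shown, as in Exercise 9 of Sec. 11.7,
that $\sum_{r,s,t} a_{rst}b_{rst} = 0$. We have

\begin{align*}
\sum_{r,s,t} a_{rst}b_{rst} &= K\left(\frac{1}{JK} - \frac{1}{IJK}\right)\left(\frac{1}{IK} - \frac{1}{IJK}\right) + (J - 1)K\left(\frac{1}{JK} - \frac{1}{IJK}\right)\left(-\frac{1}{IJK}\right) \\
&\quad + (I - 1)K\left(-\frac{1}{IJK}\right)\left(\frac{1}{IK} - \frac{1}{IJK}\right) + (I - 1)(J - 1)K\left(-\frac{1}{IJK}\right)\left(-\frac{1}{IJK}\right) \\
&= 0.
\end{align*}

Similarly, to show that $Y_{ijk} - \bar Y_{ij+}$ and $\hat\gamma_{i_2 j_2}$ are uncorrelated, it must be shown that $\sum_{r,s,t} c_{rst}d_{rst} = 0$.
Suppose first that $i = i_2$ and $j = j_2$. Then

\[ \sum_{r,s,t} c_{rst}d_{rst} = \left(\frac{1}{K} - \frac{1}{JK} - \frac{1}{IK} + \frac{1}{IJK}\right)\left(1 - \frac{1}{K}\right) + (K - 1)\left(\frac{1}{K} - \frac{1}{JK} - \frac{1}{IK} + \frac{1}{IJK}\right)\left(-\frac{1}{K}\right) = 0. \]

Suppose next that $i = i_2$ and $j \ne j_2$. Then

\[ \sum_{r,s,t} c_{rst}d_{rst} = \left(-\frac{1}{JK} + \frac{1}{IJK}\right)\left(1 - \frac{1}{K}\right) + (K - 1)\left(-\frac{1}{JK} + \frac{1}{IJK}\right)\left(-\frac{1}{K}\right) = 0. \]

The other cases can be treated similarly, and it can be shown that the five given random variables are
uncorrelated with one another regardless of whether any of the values $j$, $j_1$, and $j_2$ are equal.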


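9. We shall first show that the numerators are equal:

\begin{align*}
\sum_{i,j} (\bar Y_{ij+} - \bar Y_{i++} - \bar Y_{+j+} + \bar Y_{+++})^2
&= \sum_{i,j} \left(\bar Y_{ij+}^2 + \bar Y_{i++}^2 + \bar Y_{+j+}^2 + \bar Y_{+++}^2 - 2\bar Y_{ij+}\bar Y_{i++}\right. \\
&\qquad - 2\bar Y_{ij+}\bar Y_{+j+} + 2\bar Y_{ij+}\bar Y_{+++} + 2\bar Y_{i++}\bar Y_{+j+} \\
&\qquad \left. - 2\bar Y_{i++}\bar Y_{+++} - 2\bar Y_{+j+}\bar Y_{+++}\right) \\
&= \sum_{i,j} \bar Y_{ij+}^2 + J\sum_{i} \bar Y_{i++}^2 + I\sum_{j} \bar Y_{+j+}^2 + IJ\bar Y_{+++}^2 \\
&\quad - 2J\sum_{i} \bar Y_{i++}^2 - 2I\sum_{j} \bar Y_{+j+}^2 + 2IJ\bar Y_{+++}^2 \\
&\quad + 2IJ\bar Y_{+++}^2 - 2IJ\bar Y_{+++}^2 - 2IJ\bar Y_{+++}^2 \\
&= \sum_{i,j} \bar Y_{ij+}^2 - J\sum_{i} \bar Y_{i++}^2 - I\sum_{j} \bar Y_{+j+}^2 + IJ\bar Y_{+++}^2.
\end{align*}

Next, we shall show that the denominators are equal:

\begin{align*}
\sum_{i,j,k} (Y_{ijk} - \bar Y_{ij+})^2 &= \sum_{i,j,k} (Y_{ijk}^2 - 2Y_{ijk}\bar Y_{ij+} + \bar Y_{ij+}^2) \\
&= \sum_{i,j} \left(\sum_{k} Y_{ijk}^2 - 2K\bar Y_{ij+}^2 + K\bar Y_{ij+}^2\right) \\
&= \sum_{i,j,k} Y_{ijk}^2 - K\sum_{i,j} \bar Y_{ij+}^2.
\end{align*}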

10. In this problem, $I = 3$, $J = 4$, and $K = 2$. The values of the estimates can be calculated directly from
Eqs. (11.8.6), (11.8.7), and (11.8.3).

11. It is found from Eq. (11.8.12) that $U_{AB}^2 = 0.7047$. When the hypothesis is true, $U_{AB}^2$ has the $F$
distribution with $(I - 1)(J - 1) = 6$ and $IJ(K - 1) = 12$ degrees of freedom. The tail area corresponding
to the value just calculated is found to be greater than 0.05.

12. Since the hypothesis in Exercise 11 was not rejected, we proceed to test the hypotheses (11.8.13). It is
found from Eq. (11.8.14) that $U_A^2 = 7.5245$. When the hypothesis is true, $U_A^2$ has the $F$ distribution
with $(I - 1)J = 8$ and 12 degrees of freedom. The tail area corresponding to the value just calculated
is found to be less than 0.025.

13. It is found from Eq. (11.8.18) that $U_B^2 = 9.0657$. When the hypothesis is true, $U_B^2$ has the $F$ distribution
with $I(J - 1) = 9$ and 12 degrees of freedom. The tail area corresponding to the value just calculated
is found to be less than 0.025.

14. The estimator $\hat\mu$ has the normal distribution with mean $\mu$ and, by Exercise 6, variance $\sigma^2/24$. Also,
$\sum_{i,j,k} (Y_{ijk} - \bar Y_{ij+})^2/\sigma^2$ has a $\chi^2$ distribution with 12 degrees of freedom, and these two random variables
are independent. Therefore, when $H_0$ is true, the following statistic $V$ will have the $t$ distribution with
12 degrees of freedom:

\[ V = \frac{\sqrt{24}\,(\hat\mu - 8)}{\left[\frac{1}{12}\sum_{i,j,k} (Y_{ijk} - \bar Y_{ij+})^2\right]^{1/2}}. \]

We could test the given hypotheses by carrying out a two-sided $t$ test using the statistic $V$. Equivalently,
as described in Sec. 9.7, $V^2$ will have the $F$ distribution with 1 and 12 degrees of freedom. It is found
that

\[ V^2 = \frac{24(0.7708)^2}{\frac{1}{12}(10.295)} = 16.6221. \]

The corresponding tail area is less than 0.025.

15. The estimator $\hat\alpha_2$ has the normal distribution with mean $\alpha_2$ and, by Exercise 6, variance $\sigma^2/12$. Hence,
as in the solution of Exercise 14, when $\alpha_2 = 1$, the following statistic $V$ will have the $t$ distribution
with 12 degrees of freedom:

\[ V = \frac{\sqrt{12}\,(\hat\alpha_2 - 1)}{\left[\frac{1}{12}\sum_{i,j,k} (Y_{ijk} - \bar Y_{ij+})^2\right]^{1/2}}. \]

The null hypothesis $H_0$ should be rejected if $V > c$, where $c$ is an appropriate constant. It is found
that

\[ V = \frac{\sqrt{12}\,(0.7667)}{\left[\frac{1}{12}(10.295)\right]^{1/2}} = 2.8673. \]

The corresponding tail area is between 0.005 and 0.01.
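The two $t$ statistics in Exercises 14 and 15 share one form: the estimator's deviation from its hypothesized value, scaled by the square root of the inverse variance factor, divided by the residual standard deviation. A small sketch follows (the function and argument names are hypothetical; the inputs are the rounded values quoted in the solutions).

```python
import math


def t_statistic(deviation, inverse_var_factor, resid_ss, df_resid):
    """t statistic when Var(estimator) = sigma^2 / inverse_var_factor."""
    sigma_prime = math.sqrt(resid_ss / df_resid)
    return math.sqrt(inverse_var_factor) * deviation / sigma_prime


# Exercise 14: mu_hat - 8 = 0.7708 and Var(mu_hat) = sigma^2 / 24.
v14 = t_statistic(0.7708, 24, 10.295, 12)
# Exercise 15: alpha_hat_2 - 1 = 0.7667 and Var(alpha_hat_2) = sigma^2 / 12.
v15 = t_statistic(0.7667, 12, 10.295, 12)
print(round(v14**2, 3), round(v15, 3))
```

Because the inputs are rounded, the results match 16.6221 and 2.8673 only to about three significant digits.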
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17. 


18. 


19. 
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Since E(Y¥j;~) = w + a4 + By + Yaz, then E(¥ij+) = pu+ajit+ Bj + yj. The desired results can now be 
obtained from Eq. (11.8.19) and Eq. (11.8.5). 


I Ieoegd 
1 = 
5 a;=—)S 5 Yiuj4 —-Ip = 1p —Ip=0 and 
= a i em at bt Lt b 


It can be shown similarly that 


J . J 
> = 0 and :> Vij =) 
y=1 


j=l 
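18. Both $\hat\mu$ and $\hat\alpha_i$ are linear functions of the $\sum_{i=1}^{I}\sum_{j=1}^{J}K_{ij}$ independent random variables $Y_{ijk}$. Let $\hat\mu = \sum_{r,s,t} m_{rst}Y_{rst}$ and $\hat\alpha_i = \sum_{r,s,t} a_{rst}Y_{rst}$. Then it is found from Eq. (11.8.19) that

$$m_{rst} = \frac{1}{IJK_{rs}} \quad\text{for all values of } r, s, \text{ and } t,$$

and

$$a_{rst} = \begin{cases} \dfrac{1}{JK_{is}} - \dfrac{1}{IJK_{is}} & \text{for } r = i,\\[2mm] -\dfrac{1}{IJK_{rs}} & \text{for } r \ne i.\end{cases}$$

As in the solution of Exercise 8,

$$\begin{aligned}
\mathrm{Cov}(\hat\mu, \hat\alpha_i) &= \sigma^2\sum_{r,s,t} m_{rst}a_{rst}\\
&= \sigma^2\left[\sum_{s=1}^{J}\sum_{t=1}^{K_{is}}\frac{1}{IJK_{is}}\left(\frac{1}{JK_{is}} - \frac{1}{IJK_{is}}\right) - \sum_{r\ne i}\sum_{s=1}^{J}\sum_{t=1}^{K_{rs}}\frac{1}{IJK_{rs}}\cdot\frac{1}{IJK_{rs}}\right]\\
&= \frac{\sigma^2}{I^2J^2}\left[(I-1)\sum_{s=1}^{J}\frac{1}{K_{is}} - \sum_{r\ne i}\sum_{s=1}^{J}\frac{1}{K_{rs}}\right]\\
&= \frac{\sigma^2}{I^2J^2}\left[I\sum_{s=1}^{J}\frac{1}{K_{is}} - \sum_{r=1}^{I}\sum_{s=1}^{J}\frac{1}{K_{rs}}\right].
\end{aligned}$$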


19. Notice that we cannot reject the second null hypothesis unless we accept the first null hypothesis, since we don't even test the second hypothesis if we reject the first one. The probability that the two-stage procedure rejects at least one of the two hypotheses is then

Pr(reject first null hypothesis)
+ Pr(reject second null hypothesis and accept first null hypothesis).

The first term above is $\alpha_0$, and the second term can be rewritten as

Pr(reject second null hypothesis | accept first null hypothesis)
× Pr(accept first null hypothesis).

This product equals $\beta_0(1 - \alpha_0)$, hence the overall probability is $\alpha_0 + \beta_0(1 - \alpha_0)$.


20. (a) The three additional cell averages are 822.5, 821.7, and 770. The ANOVA table for the combined samples is

Source of variation       Degrees of freedom   Sum of squares   Mean square
Main effects of filter             1                 1003           1003
Main effects of size               2                25817          12908
Interactions                       2                  739          369.4
Residuals                         30                 1992          66.39
Total                             35                29551


(b) The F statistic for the test of no interaction is 369.4/66.39 = 5.56. Comparing this to the F 
distribution with 2 and 30 degrees of freedom, we get a p-value of 0.009. 


(c) If we use the one-stage test procedure in which both the main effects of size and the interactions are hypothesized to be 0 together, we get an F statistic equal to [(25817 + 739)/4]/66.39 = 100 with 4 and 30 degrees of freedom. The p-value is essentially 0.


(d) If we use the one-stage test procedure in which both the main effects of filter and the interactions are hypothesized to be 0 together, we get an F statistic equal to [(1003 + 739)/3]/66.39 = 8.75 with 3 and 30 degrees of freedom. The p-value is 0.0003.
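
The F statistics in parts (b)-(d) are simple ratios of mean squares, so they can be rechecked directly from the ANOVA table. The following is a minimal Python sketch (Python and the helper name `f_stat` are assumptions for illustration, not part of the text); it uses unrounded mean squares, so the results differ slightly from the rounded values quoted above.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table in part (a).
ss = {"filter": 1003, "size": 25817, "interaction": 739, "residual": 1992}
df = {"filter": 1, "size": 2, "interaction": 2, "residual": 30}

# Residual mean square (about 66.4).
ms_resid = ss["residual"] / df["residual"]

def f_stat(sources):
    """Return (F statistic, numerator df) for testing that all listed effects are 0."""
    num_ss = sum(ss[s] for s in sources)
    num_df = sum(df[s] for s in sources)
    return (num_ss / num_df) / ms_resid, num_df

f_int, df_int = f_stat(["interaction"])                 # part (b)
f_size, df_size = f_stat(["size", "interaction"])       # part (c)
f_filter, df_filter = f_stat(["filter", "interaction"]) # part (d)

print(round(f_int, 2), df_int)
print(round(f_size, 1), df_size)
print(round(f_filter, 2), df_filter)
```

The numerator degrees of freedom are the sums of the degrees of freedom of the pooled sources, which is why part (c) uses 4 and part (d) uses 3.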


11.9 Supplementary Exercises 


Solutions to Exercises 


1. The necessary calculations were done in Example 11.3.6. The least-squares coefficients are $\hat\beta_0 = -0.9709$ and $\hat\beta_1 = 0.0206$, with $\sigma' = 8.730 \times 10^{-3}$, and $n = 17$. We also can compute $s_x^2 = 530.8$ and $\bar{x}_n = 203.0$.


(a) A 90% confidence interval for $\beta_1$ is $\hat\beta_1 \pm T_{n-2}^{-1}(0.95)\,\sigma'/s_x$. This becomes (0.01996, 0.02129).
(b) Since 0 is not in the 90% interval in part (a), we would reject $H_0$ at level $\alpha_0 = 0.1$.


(c) A 90% prediction interval for log-pressure at boiling-point equal to $x$ is

$$\hat\beta_0 + \hat\beta_1 x \pm T_{n-2}^{-1}(0.95)\,\sigma'\left(1 + \frac{1}{n} + \frac{(x - \bar{x}_n)^2}{s_x^2}\right)^{1/2}.$$

With the data we have, this gives (3.233, 3.264). Converting this to pressure gives (25.35, 26.16).


2. This result follows directly from the expressions for $\hat\rho$, $\hat\sigma_1$, and $\hat\sigma_2$ given in Exercise 24 of Sec. 7.6 and the expression for $\hat\beta_1$ given in Exercise 2a of Sec. 11.1.


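3. The conditional distribution of $Y_i$ given $X_i = x_i$ has mean $\beta_0 + \beta_1 x_i$, where

$$\beta_0 = \mu_2 - \frac{\rho\sigma_2}{\sigma_1}\mu_1 \quad\text{and}\quad \beta_1 = \frac{\rho\sigma_2}{\sigma_1},$$

and variance $(1 - \rho^2)\sigma_2^2$. Since $T = \hat\beta_1$, as given in Exercise 2b of Sec. 11.1, it follows that $E(T) = \beta_1 = \rho\sigma_2/\sigma_1$ and

$$\mathrm{Var}(T) = \frac{(1 - \rho^2)\sigma_2^2}{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}.$$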

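4. The least squares estimates will be the values of $\theta_1$, $\theta_2$, and $\theta_3$ that minimize $Q = \sum_{i=1}^{3}(y_i - \theta_i)^2$, where $\theta_3 = 180 - \theta_1 - \theta_2$. If we solve the equations $\partial Q/\partial\theta_1 = 0$ and $\partial Q/\partial\theta_2 = 0$, we obtain the relations

$$y_1 - \theta_1 = y_2 - \theta_2 = y_3 - \theta_3.$$

Since $\sum_{i=1}^{3} y_i = 186$ and $\sum_{i=1}^{3}\theta_i = 180$, it follows that $\hat\theta_i = y_i - 2$ for $i = 1, 2, 3$. Hence $\hat\theta_1 = 81$, $\hat\theta_2 = 45$, and $\hat\theta_3 = 54$.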


5. This result can be established from the formulas for the least squares line given in Sec. 11.1 or directly from the following reasoning: Let $x_1 = a$ and $x_2 = b$. The data contain one observation $(a, y_1)$ at $x = a$ and $n - 1$ observations $(b, y_2), \ldots, (b, y_n)$ at $x = b$. Let $u$ denote the average of the $n - 1$ values $y_2, \ldots, y_n$, and let $h_a$ and $h_b$ denote the height of the least squares line at $x = a$ and $x = b$, respectively. Then the value of $Q$, as given by Eq. (11.1.2), is

$$Q = (y_1 - h_a)^2 + \sum_{j=2}^{n}(y_j - h_b)^2.$$

The first term is minimized by taking $h_a = y_1$ and the summation is minimized by taking $h_b = u$. Hence, $Q$ is minimized by passing the straight line through the two points $(a, y_1)$ and $(b, u)$. But $(a, y_1)$ is the point $(x_1, y_1)$.


6. The first line is the usual least squares line $y = \hat\beta_0 + \hat\beta_1 x$, where $\hat\beta_1$ is given in Exercise 2a of Sec. 11.1. In the second line, the roles of $x$ and $y$ are interchanged, so it is $x = \hat\alpha_1 + \hat\alpha_2 y$, where

$$\hat\alpha_2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n}(y_i - \bar{y}_n)^2}.$$

Both lines pass through the point $(\bar{x}_n, \bar{y}_n)$, so they will coincide if and only if they have the same slope; i.e., if and only if $\hat\beta_1 = 1/\hat\alpha_2$. This condition reduces to the condition that $\hat\rho^2 = 1$, where $\hat\rho$ is given in Exercise 24 of Sec. 7.6 and is the sample correlation coefficient. But $\hat\rho^2 = 1$ if and only if the $n$ points lie exactly on a straight line. Hence, the two least squares lines will coincide if and only if all $n$ points lie exactly on a straight line.


7. It is found from standard calculus texts that the sum of the squared distances from the points to the line is

$$Q = \frac{\sum_{i=1}^{n}(y_i - \beta_1 - \beta_2 x_i)^2}{1 + \beta_2^2}.$$

The equation $\partial Q/\partial\beta_1 = 0$ reduces to the relation $\beta_1 = \bar{y}_n - \beta_2\bar{x}_n$. If we replace $\beta_1$ in the equation $\partial Q/\partial\beta_2 = 0$ by this quantity, we obtain the relation:

$$(1 + \beta_2^2)\sum_{i=1}^{n}[(y_i - \bar{y}_n) - \beta_2(x_i - \bar{x}_n)]x_i + \beta_2\sum_{i=1}^{n}[(y_i - \bar{y}_n) - \beta_2(x_i - \bar{x}_n)]^2 = 0.$$

Note that we can replace the factor $x_i$ in the first summation by $x_i - \bar{x}_n$ without changing the value of the summation. If we then let $x_i' = x_i - \bar{x}_n$ and $y_i' = y_i - \bar{y}_n$, and expand the final squared term, we obtain the following relation after some algebra:

$$(\beta_2^2 - 1)\sum_{i=1}^{n}x_i'y_i' + \beta_2\sum_{i=1}^{n}(x_i'^2 - y_i'^2) = 0.$$

Hence

$$\hat\beta_2 = \frac{\sum_{i=1}^{n}(y_i'^2 - x_i'^2) \pm \left\{\left[\sum_{i=1}^{n}(y_i'^2 - x_i'^2)\right]^2 + 4\left(\sum_{i=1}^{n}x_i'y_i'\right)^2\right\}^{1/2}}{2\sum_{i=1}^{n}x_i'y_i'}.$$

Either the plus sign or the minus sign should be used, depending on whether the optimal line has positive or negative slope.
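
As a numerical check of the closed-form slope, the sketch below evaluates both roots of the quadratic in $\beta_2$ and picks the one with the smaller value of $Q$. It is only an illustration in Python: the function names and the small data set are made up for this purpose, and it assumes $\sum x_i'y_i' \ne 0$.

```python
import math

def centered_sums(xs, ys):
    """Return (xbar, ybar, sum x'y', sum (y'^2 - x'^2)) for centered data."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    d = sum((y - ybar) ** 2 - (x - xbar) ** 2 for x, y in zip(xs, ys))
    return xbar, ybar, sxy, d

def candidate_slopes(xs, ys):
    """Both roots of (b^2 - 1) * sxy - b * d = 0; assumes sxy != 0."""
    _, _, sxy, d = centered_sums(xs, ys)
    root = math.sqrt(d * d + 4 * sxy * sxy)
    return (d + root) / (2 * sxy), (d - root) / (2 * sxy)

def q(xs, ys, b2):
    """Sum of squared perpendicular distances for slope b2 and the optimal intercept."""
    xbar, ybar, _, _ = centered_sums(xs, ys)
    b1 = ybar - b2 * xbar
    return sum((y - b1 - b2 * x) ** 2 for x, y in zip(xs, ys)) / (1 + b2 ** 2)

# Hypothetical data, roughly along a line with positive slope.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

s_plus, s_minus = candidate_slopes(xs, ys)
best = min((s_plus, s_minus), key=lambda b: q(xs, ys, b))
print(s_plus, s_minus, best)
```

The two roots always multiply to -1, so they describe perpendicular lines; one minimizes $Q$ and the other maximizes it among lines through $(\bar{x}_n, \bar{y}_n)$.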


8. This phenomenon was discussed in Exercise 19 of Sec. 11.2. The conditional expectation $E(X_2|X_1)$ of the sister's score $X_2$ given the first twin's score $X_1$ can be derived from Eq. (5.10.6) with $\mu_1 = \mu_2 = \mu$ and $\sigma_1 = \sigma_2 = \sigma$. Hence,

$$E(X_2|X_1) = \mu + \rho(X_1 - \mu) = (1 - \rho)\mu + \rho X_1,$$

which is between $\mu$ and $X_1$. The same holds with subscripts 1 and 2 switched.


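9.
$$\begin{aligned}
v^2 &= \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_{i+} + \bar{x}_{i+} - \bar{x}_{++})^2\\
&= \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_{i+})^2 + \frac{1}{n}\sum_{i=1}^{k}n_i(\bar{x}_{i+} - \bar{x}_{++})^2\\
&= \frac{1}{n}\sum_{i=1}^{k}n_i\left[v_i^2 + (\bar{x}_{i+} - \bar{x}_{++})^2\right].
\end{aligned}$$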


10. In the notation of Sec. 11.5, the design matrix is 
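For $t = w, x, Y$ and $u = w, x, Y$, let $S_{tu} = \sum_{i=1}^{n}t_iu_i$. Then

$$Z'Z = \begin{bmatrix} S_{ww} & S_{wx}\\ S_{wx} & S_{xx}\end{bmatrix}, \qquad (Z'Z)^{-1} = \frac{1}{S_{ww}S_{xx} - S_{wx}^2}\begin{bmatrix} S_{xx} & -S_{wx}\\ -S_{wx} & S_{ww}\end{bmatrix}, \qquad Z'Y = \begin{bmatrix} S_{wY}\\ S_{xY}\end{bmatrix}.$$

Hence,

$$(Z'Z)^{-1}Z'Y = \frac{1}{S_{ww}S_{xx} - S_{wx}^2}\begin{bmatrix} S_{xx}S_{wY} - S_{wx}S_{xY}\\ S_{ww}S_{xY} - S_{wx}S_{wY}\end{bmatrix}.$$

The first component on the right side of this equation is $\hat\beta_0$ and the second is $\hat\beta_1$.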


11. It was shown in Sec. 11.8 that the quantity $S^2_{\mathrm{Resid}}/\sigma^2$ given in Eq. (11.8.10) has a $\chi^2$ distribution with $IJ(K - 1)$ degrees of freedom. Hence, the random variable $S^2_{\mathrm{Resid}}/[IJ(K - 1)]$ is an unbiased estimator of $\sigma^2$.


12. It follows from Table 11.23 that if $\alpha_i = \beta_j = 0$ for $i = 1, \ldots, I$ and $j = 1, \ldots, J$, and $Q = S_A^2 + S_B^2$, then $Q/\sigma^2$ will have a $\chi^2$ distribution with $(I - 1) + (J - 1) = I + J - 2$ degrees of freedom. Furthermore, regardless of the values of $\alpha_i$ and $\beta_j$, $R = S^2_{\mathrm{Resid}}/\sigma^2$ will have a $\chi^2$ distribution with $(I - 1)(J - 1)$ degrees of freedom, and $Q$ and $R$ will be independent. Hence, under $H_0$, the statistic

$$U = \frac{(I - 1)(J - 1)\,Q}{(I + J - 2)\,S^2_{\mathrm{Resid}}}$$

will have the $F$ distribution with $I + J - 2$ and $(I - 1)(J - 1)$ degrees of freedom. The null hypothesis $H_0$ should be rejected if $U \ge c$.


13. Suppose that $\alpha_i = \beta_j = \gamma_{ij} = 0$ for all values of $i$ and $j$. Then it follows from Table 11.28 that $(S_A^2 + S_B^2 + S_{\mathrm{Int}}^2)/\sigma^2$ will have a $\chi^2$ distribution with $(I - 1) + (J - 1) + (I - 1)(J - 1) = IJ - 1$ degrees of freedom. Furthermore, regardless of the values of $\alpha_i$, $\beta_j$, and $\gamma_{ij}$, $S^2_{\mathrm{Resid}}/\sigma^2$ will have a $\chi^2$ distribution with $IJ(K - 1)$ degrees of freedom, and $S_A^2 + S_B^2 + S_{\mathrm{Int}}^2$ and $S^2_{\mathrm{Resid}}$ will be independent. Hence, under $H_0$, the statistic

$$U = \frac{IJ(K - 1)(S_A^2 + S_B^2 + S_{\mathrm{Int}}^2)}{(IJ - 1)\,S^2_{\mathrm{Resid}}}$$

will have the $F$ distribution with $IJ - 1$ and $IJ(K - 1)$ degrees of freedom. The null hypothesis $H_0$ should be rejected if $U \ge c$.


14. The design in this exercise is a two-way layout with two levels of each factor and $K$ observations in each cell. The hypothesis $H_0$ is precisely the hypothesis $H_0$ given in (11.8.11) that the effects of the two factors are additive and all interactions are 0. Hence, $H_0$ should be rejected if $U^2_{AB} \ge c$, where $U^2_{AB}$ is given by (11.8.12) with $I = J = 2$, and $U^2_{AB}$ has an $F$ distribution with 1 and $4(K - 1)$ degrees of freedom when $H_0$ is true.
15. Let $Y_1 = W_1$, $Y_2 = W_2 - 5$, and $Y_3 = \frac{1}{2}W_3$. Then the random vector

$$Y = \begin{bmatrix} Y_1\\ Y_2\\ Y_3\end{bmatrix}$$

satisfies the conditions of the general linear model as described in Sec. 11.5 with

$$Z = \begin{bmatrix} 1 & 1\\ 1 & 1\\ 1 & -1\end{bmatrix}.$$

Thus,

$$Z'Z = \begin{bmatrix} 3 & 1\\ 1 & 3\end{bmatrix}, \qquad (Z'Z)^{-1} = \begin{bmatrix} 3/8 & -1/8\\ -1/8 & 3/8\end{bmatrix},$$

and

$$\hat\theta = \begin{bmatrix}\hat\theta_1\\ \hat\theta_2\end{bmatrix} = (Z'Z)^{-1}Z'Y = \begin{bmatrix}\frac{1}{4}Y_1 + \frac{1}{4}Y_2 + \frac{1}{2}Y_3\\[1mm] \frac{1}{4}Y_1 + \frac{1}{4}Y_2 - \frac{1}{2}Y_3\end{bmatrix}.$$

Also,

$$\hat\sigma^2 = \frac{1}{3}(Y - Z\hat\theta)'(Y - Z\hat\theta) = \frac{1}{3}\left[(Y_1 - \hat\theta_1 - \hat\theta_2)^2 + (Y_2 - \hat\theta_1 - \hat\theta_2)^2 + (Y_3 - \hat\theta_1 + \hat\theta_2)^2\right].$$

The following distributional properties of these M.L.E.'s are known from Sec. 11.5: $(\hat\theta_1, \hat\theta_2)$ and $\hat\sigma^2$ are independent; $(\hat\theta_1, \hat\theta_2)$ has a bivariate normal distribution with mean vector $(\theta_1, \theta_2)$ and covariance matrix

$$\sigma^2(Z'Z)^{-1} = \begin{bmatrix} 3/8 & -1/8\\ -1/8 & 3/8\end{bmatrix}\sigma^2;$$

$3\hat\sigma^2/\sigma^2$ has a $\chi^2$ distribution with one degree of freedom.


16. Direct application of the theory of least squares would require choosing $\alpha$ and $\beta$ to minimize

$$Q_1 = \sum_{i=1}^{n}(y_i - \alpha x_i^{\beta})^2.$$

This minimization must be carried out numerically since the solution cannot be found in closed form. However, if we express the required curve in the form $\log y = \log\alpha + \beta\log x$, and then apply the method of least squares, we must choose $\beta_0$ and $\beta_1$ to minimize

$$Q_2 = \sum_{i=1}^{n}(\log y_i - \beta_0 - \beta_1\log x_i)^2,$$

where $\beta_0 = \log\alpha$ and $\beta_1 = \beta$. The least squares estimates $\hat\beta_0$ and $\hat\beta_1$ can now be found as in Sec. 11.1, based on the values of $\log y_i$ and $\log x_i$. Estimates of $\alpha$ and $\beta$ can then be obtained from the relations $\log\hat\alpha = \hat\beta_0$ and $\hat\beta = \hat\beta_1$. It should be emphasized that these values will not be the same as the least squares estimates found by minimizing $Q_1$ directly.

The appropriateness of each of these methods depends on the appropriateness of minimizing $Q_1$ and $Q_2$. The first method is appropriate if $Y_i = \alpha x_i^{\beta} + \varepsilon_i$, where $\varepsilon_i$ has a normal distribution with mean 0 and variance $\sigma^2$. The second method is appropriate if $Y_i = \alpha x_i^{\beta}\varepsilon_i$, where $\log\varepsilon_i$ has the normal distribution just described.
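
The second (log-transform) method can be sketched in a few lines. This Python example is an illustration only: the function name and the data are hypothetical, and the data are generated exactly on the curve $y = 2x^{1.5}$ so that the recovered values are easy to check. With noisy data, the result would generally differ from the direct minimizer of $Q_1$, as noted above.

```python
import math

def fit_power_curve_via_logs(xs, ys):
    """Least squares fit of log y = b0 + b1 * log x; returns (alpha_hat, beta_hat)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    lx_bar, ly_bar = sum(lx) / n, sum(ly) / n
    sxy = sum((a - lx_bar) * (b - ly_bar) for a, b in zip(lx, ly))
    sxx = sum((a - lx_bar) ** 2 for a in lx)
    b1 = sxy / sxx                 # slope on the log scale, estimates beta
    b0 = ly_bar - b1 * lx_bar      # intercept on the log scale, estimates log(alpha)
    return math.exp(b0), b1

# Hypothetical noise-free data on the curve y = 2 * x**1.5.
xs = [1.0, 2.0, 3.0, 5.0, 8.0]
ys = [2.0 * x ** 1.5 for x in xs]

alpha_hat, beta_hat = fit_power_curve_via_logs(xs, ys)
print(alpha_hat, beta_hat)
```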


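17. It follows from the expressions for $\hat\beta_0$ and $\hat\beta_1$ given by Eqs. (11.1.1) and (11.2.7) that

$$\begin{aligned}
e_i &= Y_i - \hat\beta_0 - \hat\beta_1 x_i = Y_i - \bar{Y}_n - \hat\beta_1(x_i - \bar{x}_n)\\
&= Y_i - \frac{1}{n}\sum_{j=1}^{n}Y_j - \frac{x_i - \bar{x}_n}{s_x^2}\sum_{j=1}^{n}(x_j - \bar{x}_n)Y_j\\
&= Y_i\left[1 - \frac{1}{n} - \frac{(x_i - \bar{x}_n)^2}{s_x^2}\right] - \sum_{j\ne i}Y_j\left[\frac{1}{n} + \frac{(x_i - \bar{x}_n)(x_j - \bar{x}_n)}{s_x^2}\right],
\end{aligned}$$

where $s_x^2 = \sum_{j=1}^{n}(x_j - \bar{x}_n)^2$. Since $Y_1, \ldots, Y_n$ are independent and each has variance $\sigma^2$, it follows that

$$\mathrm{Var}(e_i) = \sigma^2\left[1 - \frac{1}{n} - \frac{(x_i - \bar{x}_n)^2}{s_x^2}\right]^2 + \sigma^2\sum_{j\ne i}\left[\frac{1}{n} + \frac{(x_i - \bar{x}_n)(x_j - \bar{x}_n)}{s_x^2}\right]^2.$$

Let $Q_i = \frac{1}{n} + \frac{(x_i - \bar{x}_n)^2}{s_x^2}$. Then

$$\begin{aligned}
\mathrm{Var}(e_i) &= \sigma^2(1 - Q_i)^2 + \sigma^2\left\{\sum_{j=1}^{n}\left[\frac{1}{n} + \frac{(x_i - \bar{x}_n)(x_j - \bar{x}_n)}{s_x^2}\right]^2 - Q_i^2\right\}\\
&= \sigma^2[(1 - Q_i)^2 + Q_i - Q_i^2]\\
&= \sigma^2(1 - Q_i).
\end{aligned}$$

(This result could also have been obtained from the more general result to be obtained next in Exercise 18.) Since $Q_i$ is an increasing function of $(x_i - \bar{x}_n)^2$, it follows that $\mathrm{Var}(e_i)$ is a decreasing function of $(x_i - \bar{x}_n)^2$ and, hence, of the distance between $x_i$ and $\bar{x}_n$.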


18. (a) Since $\hat\beta$ has the form given in Eq. (11.5.10), it follows directly that $Y - Z\hat\beta$ has the specified form.


(b) Let $A = Z(Z'Z)^{-1}Z'$. It can be verified directly that $A$ is idempotent, i.e., $AA = A$. Since $D = I - A$, it now follows that

$$DD = (I - A)(I - A) = II - AI - IA + AA = I - A - A + A = I - A = D.$$


(c) As stated in Eq. (11.5.15), $\mathrm{Cov}(Y) = \sigma^2 I$. Hence, by Theorem 11.5.2, $\mathrm{Cov}(W) = \mathrm{Cov}(DY) = D\,\mathrm{Cov}(Y)\,D' = D(\sigma^2 I)D = \sigma^2(DD) = \sigma^2 D$.
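
The idempotence of $D$ and its link to Exercise 17 can be checked numerically for simple linear regression. The sketch below is a Python illustration under stated assumptions: the $x$ values are hypothetical, and $A$ is built from the standard closed form of its entries for a design matrix with rows $(1, x_i)$, namely $1/n + (x_i - \bar{x}_n)(x_j - \bar{x}_n)/s_x^2$.

```python
# Hypothetical predictor values for a simple linear regression.
xs = [1.0, 2.0, 4.0, 7.0, 11.0]
n = len(xs)
xbar = sum(xs) / n
sx2 = sum((x - xbar) ** 2 for x in xs)

# A = Z (Z'Z)^{-1} Z' in closed form, and D = I - A.
A = [[1 / n + (xi - xbar) * (xj - xbar) / sx2 for xj in xs] for xi in xs]
D = [[(1.0 if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)]

def matmul(p, q):
    """Product of two square matrices stored as lists of rows."""
    size = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

DD = matmul(D, D)
max_diff = max(abs(DD[i][j] - D[i][j]) for i in range(n) for j in range(n))

# Q_i from Exercise 17; the diagonal of D should equal 1 - Q_i.
Q = [1 / n + (x - xbar) ** 2 / sx2 for x in xs]
diag_gap = max(abs(D[i][i] - (1 - Q[i])) for i in range(n))

print(max_diff, diag_gap)
```

Since $\mathrm{Cov}(W) = \sigma^2 D$, the diagonal entries of $D$ multiplied by $\sigma^2$ are the residual variances, which is how part (c) reproduces $\mathrm{Var}(e_i) = \sigma^2(1 - Q_i)$.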


19. Let $\bar\theta = \sum_{i=1}^{I}v_i\theta_i/v_+$ and $\bar\psi = \sum_{j=1}^{J}w_j\psi_j/w_+$, and define $\mu = \bar\theta + \bar\psi$, $\alpha_i = \theta_i - \bar\theta$, and $\beta_j = \psi_j - \bar\psi$. Then $E(Y_{ij}) = \theta_i + \psi_j = \mu + \alpha_i + \beta_j$ and $\sum_{i=1}^{I}v_i\alpha_i = \sum_{j=1}^{J}w_j\beta_j = 0$. To establish uniqueness, suppose that $\mu'$, $\alpha_i'$, and $\beta_j'$ are another set of values satisfying the required conditions. Then

$$\mu + \alpha_i + \beta_j = \mu' + \alpha_i' + \beta_j' \quad\text{for all } i \text{ and } j.$$

If we multiply both sides by $v_iw_j$ and sum over $i$ and $j$, we find that $\mu = \mu'$. Hence, $\alpha_i + \beta_j = \alpha_i' + \beta_j'$. If we now multiply both sides by $v_i$ and sum over $i$, we find that $\beta_j = \beta_j'$. Similarly, if we multiply both sides by $w_j$ and sum over $j$, we find that $\alpha_i = \alpha_i'$.


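20. The values of $\mu$, $\alpha_i$, and $\beta_j$ must be chosen to minimize

$$Q = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K_{ij}}(Y_{ijk} - \mu - \alpha_i - \beta_j)^2.$$

The equation $\partial Q/\partial\mu = 0$ reduces to

$$Y_{+++} - n\mu - \sum_{i=1}^{I}K_{i+}\alpha_i - \sum_{j=1}^{J}K_{+j}\beta_j = 0,$$

where $n = K_{++}$ is the total number of observations in the two-way layout. Next we shall calculate $\partial Q/\partial\alpha_i$ for $i = 1, \ldots, I - 1$, keeping in mind that $\sum_{i=1}^{I}K_{i+}\alpha_i = 0$. Hence, $\partial\alpha_I/\partial\alpha_i = -K_{i+}/K_{I+}$. It can be found that the equation $\partial Q/\partial\alpha_i = 0$ reduces to the following equation for $i = 1, \ldots, I - 1$:

$$Y_{i++} - K_{i+}\mu - K_{i+}\alpha_i - \sum_{j}K_{ij}\beta_j = \frac{K_{i+}}{K_{I+}}\left(Y_{I++} - K_{I+}\mu - K_{I+}\alpha_I - \sum_{j}K_{Ij}\beta_j\right).$$

In other words, the following quantity must have the same value for $i = 1, \ldots, I$:

$$\frac{1}{K_{i+}}\left(Y_{i++} - K_{i+}\mu - K_{i+}\alpha_i - \sum_{j}K_{ij}\beta_j\right).$$

Similarly, the set of equations $\partial Q/\partial\beta_j = 0$ for $j = 1, \ldots, J - 1$ reduces to the requirement that the following quantity have the same value for $j = 1, \ldots, J$:

$$\frac{1}{K_{+j}}\left(Y_{+j+} - K_{+j}\mu - \sum_{i}K_{ij}\alpha_i - K_{+j}\beta_j\right).$$

It can be verified by direct substitution that the values of $\hat\mu$, $\hat\alpha_i$, and $\hat\beta_j$ given in the exercise satisfy all these requirements and, hence, are the least squares estimators.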


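21. As in the solution of Exercise 18 of Sec. 11.8, let

$$\hat\mu = \sum_{r,s,t}m_{rst}Y_{rst}, \qquad \hat\alpha_i = \sum_{r,s,t}a_{rst}Y_{rst}, \qquad \hat\beta_j = \sum_{r,s,t}b_{rst}Y_{rst}.$$

To show that $\mathrm{Cov}(\hat\mu, \hat\alpha_i) = 0$, we must show that $\sum_{r,s,t}m_{rst}a_{rst} = 0$. But $m_{rst} = \frac{1}{n}$ for all $r$, $s$, $t$, and

$$a_{rst} = \begin{cases}\dfrac{1}{K_{i+}} - \dfrac{1}{n} & \text{for } r = i,\\[2mm] -\dfrac{1}{n} & \text{for } r \ne i.\end{cases}$$

Hence, it is found that $\sum_{r,s,t}m_{rst}a_{rst} = 0$. Similarly,

$$b_{rst} = \begin{cases}\dfrac{1}{K_{+j}} - \dfrac{1}{n} & \text{for } s = j,\\[2mm] -\dfrac{1}{n} & \text{for } s \ne j,\end{cases}$$

and $\sum_{r,s,t}m_{rst}b_{rst} = 0$, so $\mathrm{Cov}(\hat\mu, \hat\beta_j) = 0$.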


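22. We must show that $\sum_{r,s,t}a_{rst}b_{rst} = 0$, where $a_{rst}$ and $b_{rst}$ are given in the solution of Exercise 21:

$$\begin{aligned}
\sum_{r,s,t}a_{rst}b_{rst} ={}& \sum_{t=1}^{K_{ij}}\left(\frac{1}{K_{i+}} - \frac{1}{n}\right)\left(\frac{1}{K_{+j}} - \frac{1}{n}\right) - \sum_{s\ne j}\sum_{t=1}^{K_{is}}\left(\frac{1}{K_{i+}} - \frac{1}{n}\right)\frac{1}{n}\\
& - \sum_{r\ne i}\sum_{t=1}^{K_{rj}}\left(\frac{1}{K_{+j}} - \frac{1}{n}\right)\frac{1}{n} + \frac{1}{n^2}\sum_{r\ne i}\sum_{s\ne j}K_{rs}.
\end{aligned}$$

Since $nK_{ij} = K_{i+}K_{+j}$, it can be verified that this sum is 0.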


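23. Consider the expression for $\theta_{ijk}$ given in this exercise. If we sum both sides of this expression over $i$, $j$, and $k$, then it follows from the constraints on the $\alpha$'s, $\beta$'s, and $\gamma$'s that $\mu = \bar\theta_{+++}$. If we substitute this value for $\mu$ and sum both sides of the expression over $j$ and $k$, we can solve the result for $\alpha_i^A$. Similarly, $\alpha_j^B$ can be found by summing over $i$ and $k$, and $\alpha_k^C$ by summing over $i$ and $j$. After these values have been found, we can determine $\beta_{ij}^{AB}$ by summing both sides over $k$, and determine $\beta_{ik}^{AC}$ and $\beta_{jk}^{BC}$ similarly. Finally, $\gamma_{ijk}$ is determined by taking its value to be whatever is necessary to satisfy the required expression for $\theta_{ijk}$. In this way, we obtain the following values:

$$\begin{aligned}
\mu &= \bar\theta_{+++},\\
\alpha_i^A &= \bar\theta_{i++} - \bar\theta_{+++},\\
\alpha_j^B &= \bar\theta_{+j+} - \bar\theta_{+++},\\
\alpha_k^C &= \bar\theta_{++k} - \bar\theta_{+++},\\
\beta_{ij}^{AB} &= \bar\theta_{ij+} - \bar\theta_{i++} - \bar\theta_{+j+} + \bar\theta_{+++},\\
\beta_{ik}^{AC} &= \bar\theta_{i+k} - \bar\theta_{i++} - \bar\theta_{++k} + \bar\theta_{+++},\\
\beta_{jk}^{BC} &= \bar\theta_{+jk} - \bar\theta_{+j+} - \bar\theta_{++k} + \bar\theta_{+++},\\
\gamma_{ijk} &= \theta_{ijk} - \bar\theta_{ij+} - \bar\theta_{i+k} - \bar\theta_{+jk} + \bar\theta_{i++} + \bar\theta_{+j+} + \bar\theta_{++k} - \bar\theta_{+++}.
\end{aligned}$$

It can be verified that these quantities satisfy all the specified constraints. They are unique by the method of their construction, since they were derived as the only values that could possibly satisfy the constraints.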


24. (a) The plot of Buchanan vote against total county vote is in Fig. S.11.3. Palm Beach county is plotted with the symbol P.

[Figure S.11.3: Figure for Exercise 24a in Sec. 11.9. Vertical axis: Buchanan Vote; horizontal axis: Total County Vote.]


(b) The summary of the fitted regression is $\hat\beta_0 = 83.69$, $\hat\beta_1 = 0.00153$, $\bar{x}_n = 8.254 \times 10^4$, $s_x^2 = 1.035 \times 10^{12}$, $n = 67$, and $\sigma' = 120.1$.


(c) The plot of residuals is in Fig. S.11.4. Notice that the residuals are much more spread out at the right side of the plot than at the left. There also appears to be a bit of a curve to the plot.

[Figure S.11.4: Figure for Exercise 24c in Sec. 11.9. Vertical axis: Residual; horizontal axis: Total County Vote.]


(d) The summary of the fitted regression is $\hat\beta_0 = -2.746$, $\hat\beta_1 = 0.7263$, $\bar{x}_n = 10.32$, $s_x^2 = 151.5$, $n = 67$, and $\sigma' = 0.4647$.


(e) The new residual plot is in Fig. S.11.5. The spread is much more uniform from right to left and the curve is no longer evident.


[Figure S.11.5: Figure for Exercise 24e in Sec. 11.9. Vertical axis: Residual; horizontal axis: Log-Total County Vote.]


(f) The quantile we need is $T_{65}^{-1}(0.995) = 2.654$. The logarithm of total vote for Palm Beach county is 12.98. A prediction interval for the logarithm of Buchanan vote when $X = 12.98$ is

$$-2.746 + 12.98 \times 0.7263 \pm 2.654 \times 0.4647\left(1 + \frac{1}{67} + \frac{(12.98 - 10.32)^2}{151.5}\right)^{1/2} = (5.419, 7.949).$$

Converting to Buchanan vote, we take $e$ to the power of each endpoint and get the interval (225.6, 2812).


(g) The official Gore total was 2912253, while the official Bush total was 2912790. Suppose that 2812 
people in Palm Beach county had actually voted for Buchanan and the other 3411 — 2812 = 599 
had voted for Gore. Then the Gore total would have been 2912852, enough to make him the 
winner. 
Simulation


All exercises that involve simulation will produce different answers when run repeatedly. Hence, one cannot expect numerical results to match perfectly with any answers given here.


For all exercises that require simulation, students will need access to software that will do some of the 
work for them. At a minimum they will need software to simulate uniform pseudo-random numbers on the 
interval [0,1]. Some of the exercises require software to simulate all of the famous distributions and compute 
the c.d.f.’s and quantile functions of the famous distributions. 


Some of the simulations require a nonnegligible programming effort. In particular, Markov chain Monte 
Carlo (MCMC) requires looping through all of the coordinates inside of the iteration loop. Assessing conver- 
gence and simulation standard error for a MCMC result requires running several chains in parallel. Students 
who do not have a lot of programming experience might need some help with these exercises. 


If one is using the software R, the function runif will return uniform pseudo-random numbers; its argument is how many you want. For other named distributions, as mentioned earlier in this manual, one can use the functions rbinom, rhyper, rpois, rnbinom, rgeom, rnorm, rlnorm, rgamma, rexp, rbeta, and rmultinom.


Most simulations require calculation of averages and sample variances. The functions mean, median, and 
var compute the average, sample median, and sample variance respectively of their first argument. Each of 
these has an optional argument na.rm, which can be set either to TRUE or to FALSE (the default). If true, 
na.rm causes missing values to be ignored in the calculation. Missing values in simulations should be rare if 
calculations are being done correctly. Other useful functions for simulations are sort and sort.list. They 
both take a vector argument. The first returns its argument sorted algebraically from smallest to largest (or 
largest to smallest with optional argument decreasing=TRUE.) The second returns a list of integers giving the 
locations (in the vector) of the ordered values of its argument. The functions min and max give the smallest 
and largest values of their argument. 


For looping, one can use
for(i in 1:n){ ... }
to perform all of the functions between { and } once for each value of i from 1 to n. For an indeterminate number of iterations, one can use
while(expression){ ... }
where expression stands for a logical expression that changes value from TRUE to FALSE at some point during the iterations.


A long series of examples appears at the end. 
12.1 What is Simulation?


Solutions to Exercises


1. Simulate a large number of exponential random variables with parameter 1, and take their average.


2. We would expect that every so often one of the simulated random variables would be much larger than the others and the sample average would go up significantly when that random variable got included in the average. The more simulations we do, the more such large observations we would expect, and the average should keep getting larger.


3. We would expect to get a lot of very large positive observations and a lot of very large negative observations. Each time we got one, the average would either jump up (when we get a positive one) or jump down (when we get a negative one). As we sampled more and more observations, the average should bounce up and down quite a bit and never settle anywhere.


4. We could count how many Bernoulli's we had to sample to get a success (1) and call that the first observation of a geometric random variable. Starting with the next Bernoulli, start counting again until the next 1, and call that the second geometric, etc. Average all the observed geometrics to approximate the mean.
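
The counting scheme just described can be sketched as follows. This is an illustrative Python version, not the text's own code: the function name, the success probability 0.25, and the seed are assumptions, and the sketch records the number of failures before each success (the convention under which the mean is $(1 - p)/p$; add 1 to each value to count trials instead).

```python
import random

def geometric_from_bernoullis(p, n_obs, seed):
    """Turn a stream of Bernoulli(p) trials into n_obs geometric observations.

    Each observation is the number of failures (0s) seen before the next success (1).
    """
    rng = random.Random(seed)
    observations = []
    failures = 0
    while len(observations) < n_obs:
        if rng.random() < p:          # success: record the count and start over
            observations.append(failures)
            failures = 0
        else:                         # failure: keep counting
            failures += 1
    return observations

obs = geometric_from_bernoullis(p=0.25, n_obs=50_000, seed=3)
mean_failures = sum(obs) / len(obs)
print(mean_failures)  # should be near (1 - 0.25) / 0.25 = 3
```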


5. (a) Simulate three exponentials at a time. Call the sum of the first two $X$ and call the third one $Y$. For each triple, record whether $X < Y$ or not. The proportion of times that $X < Y$ in a large sample of triples approximates $\Pr(X < Y)$.


(b) Let $Z_1, Z_2, Z_3$ be i.i.d. having the exponential distribution with parameter $\beta$, and let $W_1, W_2, W_3$ be i.i.d. having the exponential distribution with parameter 1. Then $Z_1 + Z_2 < Z_3$ if and only if $\beta Z_1 + \beta Z_2 < \beta Z_3$. But $(\beta Z_1, \beta Z_2, \beta Z_3)$ has precisely the same joint distribution as $(W_1, W_2, W_3)$. So, the probability that $Z_1 + Z_2 < Z_3$ is the same as the probability that $W_1 + W_2 < W_3$, and it doesn't matter which parameter we use for the exponential distribution. All simulations will approximate the same quantity as we would approximate using parameter 1.


(c) We know that $X$ and $Y$ are independent and that $X$ has the gamma distribution with parameters 2 and 0.4. The joint p.d.f. is

$$f(x, y) = 0.4^2\,x\exp(-0.4x)\;0.4\exp(-0.4y), \quad\text{for } x, y > 0.$$

The integral to compute the probability is

$$\Pr(X < Y) = \int_0^\infty\int_x^\infty 0.4^3\,x\exp(-0.4[x + y])\,dy\,dx.$$

There is also a version with the $x$ integral on the inside.

$$\Pr(X < Y) = \int_0^\infty\int_0^y 0.4^3\,x\exp(-0.4[x + y])\,dx\,dy.$$
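
Parts (a) and (b) can be illustrated with a short simulation. The Python sketch below is an assumption-laden illustration (the function name, sample sizes, and seeds are arbitrary); it runs the scheme of part (a) once with parameter 1 and once with parameter 0.4. Evaluating the integral in part (c) gives $0.4^2/0.8^2 = 1/4$, so both estimates should be close to 0.25.

```python
import random

def estimate_prob(rate, n_triples, seed):
    """Proportion of triples with X < Y, where X is the sum of two exponentials
    and Y is a third exponential, all with the same rate parameter."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_triples):
        x = rng.expovariate(rate) + rng.expovariate(rate)
        y = rng.expovariate(rate)
        if x < y:
            hits += 1
    return hits / n_triples

p_rate_1 = estimate_prob(1.0, 100_000, seed=1)
p_rate_04 = estimate_prob(0.4, 100_000, seed=2)
print(p_rate_1, p_rate_04)
```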


12.2 Why Is Simulation Useful? 


Commentary 


This section introduces the fundamental concepts of simulation and illustrates the basic calculations that 
underlie almost all simulations. Instructors should stress the need for assessing the variability in a simulation 
result. For complicated simulations, it can be difficult to assess variability, but students need to be aware 
that a highly variable simulation may be no better than an educated guess. 
The lengthy examples (12.2.13 and 12.2.14) at the end of the section and the exercises (15 and 16) that go with them are mainly illustrative of the power of simulation. These would only be covered in a course that devoted a lot of time to simulation.


Solutions to Exercises 


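1. Since $E(Z) = \mu$, the Chebyshev inequality says that $\Pr(|Z - \mu| \le \varepsilon) \ge 1 - \mathrm{Var}(Z)/\varepsilon^2$. Since $Z$ is the average of $v$ independent random variables with variance $\sigma^2$, $\mathrm{Var}(Z) = \sigma^2/v$. It follows that

$$\Pr(|Z - \mu| \le \varepsilon) \ge 1 - \frac{\sigma^2}{v\varepsilon^2}.$$

Now, suppose that $v \ge \sigma^2/[\varepsilon^2(1 - \gamma)]$. Then $\sigma^2/(v\varepsilon^2) \le 1 - \gamma$, and hence $\Pr(|Z - \mu| \le \varepsilon) \ge \gamma$.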


2. In Example 12.2.11, we are approximating $\sigma$ by 0.3892. According to Eq. (12.2.6), we need

$$v \ge \frac{0.3892^2}{0.01^2 \times 0.01} = 151476.64.$$

So, $v$ must be at least 151477.
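
The arithmetic in this solution is the Chebyshev-based bound $v \ge \sigma^2/[\varepsilon^2(1 - \gamma)]$ from Exercise 1 with $\sigma = 0.3892$, $\varepsilon = 0.01$, and $\gamma = 0.99$. A small Python sketch (the function name is an assumption for illustration) makes the rounding step explicit:

```python
import math

def chebyshev_sim_size(sigma, eps, gamma):
    """Smallest integer v with v >= sigma^2 / (eps^2 * (1 - gamma))."""
    return math.ceil(sigma ** 2 / (eps ** 2 * (1 - gamma)))

v_needed = chebyshev_sim_size(sigma=0.3892, eps=0.01, gamma=0.99)
print(v_needed)
```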


3. We could simulate a lot (say $v_0$) of standard normal random variables $W_1, \ldots, W_{v_0}$ and let $X_i = 7W_i + 2$. Then each $X_i$ has the distribution of $X$. Let $Y_i = \log(|X_i| + 1)$. We could then compute $Z$ equal to the average of the $Y_i$'s as an estimate of $E(\log(|X| + 1))$. If we needed our estimate to be close to $E(\log(|X| + 1))$ with high probability, we could estimate the variance of $Y_i$ by the sample variance and then use (12.2.5) to choose a possibly larger simulation size.


4. Simulate 15 random variables $U_1, \ldots, U_{15}$ with uniform distribution on the interval [0, 1]. For $i = 1, \ldots, 13$, let $X_i = 2(U_i - 0.5)$ and for $i = 14, 15$, let $X_i = 20(U_i - 0.5)$. Then $X_1, \ldots, X_{15}$ have the desired distribution. In most of my simulations, the median or the sample average was the closest to 0. The first simulation led to the following six values:

                        Trimmed mean
Estimator | Average   k=1      k=2      k=3      k=4      Median
Estimate  | 0.5634    0.3641   0.2205   0.2235   0.2359   0.1836
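
One way to produce a row like the one above is sketched below in Python. It assumes that the trimmed mean for a given $k$ drops the $k$ smallest and $k$ largest observations before averaging; the function names and the seed are illustrative, and the numbers it prints will not match the table, which came from a different random sample.

```python
import random
import statistics

def trimmed_mean(values, k):
    """Average after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def simulate_sample(rng):
    """13 observations uniform on [-1, 1] and 2 observations uniform on [-10, 10]."""
    small = [2 * (rng.random() - 0.5) for _ in range(13)]
    large = [20 * (rng.random() - 0.5) for _ in range(2)]
    return small + large

rng = random.Random(4)
sample = simulate_sample(rng)

estimates = {"average": statistics.mean(sample)}
for k in range(1, 5):
    estimates[f"trimmed k={k}"] = trimmed_mean(sample, k)
estimates["median"] = statistics.median(sample)

for name, value in estimates.items():
    print(name, round(value, 4))
```

Repeating the last block over many samples and averaging the squared estimates gives the M.S.E. comparison asked for in Exercise 5.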


5. (a) In my ten samples, the sample median was closest to 0 nine times, and the $k = 3$ trimmed mean was closest to 0 one time.


(b) Although the $k = 2$ trimmed mean was never closest to 0, it was also never very far from 0, and it had the smallest average squared distance from 0. The $k = 3$ trimmed mean was a close second. Here are the six values for my first 10 simulations:

                        Trimmed mean
Estimator | Average   k=1      k=2      k=3      k=4      Median
M.S.E.    | 0.4425    0.1354   0.0425   0.0450   0.0509   0.0508

These rankings were also reflected in a much larger simulation.
6. (a) Simulate lots (say $v_0$) of random variables $X_1, \ldots, X_{v_0}$ and $Y_1, \ldots, Y_{v_0}$, where the $X_i$ have the beta distribution with parameters 3.5 and 2.7 while the $Y_i$ have the beta distribution with parameters 1.8 and 4.2. Let $Z_i = X_i/(X_i + Y_i)$. The sample average of $Z_1, \ldots, Z_{v_0}$ should be close to the mean of $X/(X + Y)$ if $v_0$ is large enough.


(b) We could calculate the sample variance of $Z_1, \ldots, Z_{v_0}$ and use this as an estimate of $\sigma^2$ in Eq. (12.2.5) with $\gamma = 0.98$ and $\varepsilon = 0.01$ to obtain a new simulation size.

7. (a) The distribution of $X$ is the contaminated normal distribution with p.d.f. given in Eq. (10.7.2) with $\sigma = 1$, $\mu = 0$.


(b) To calculate a number in Table 10.40, we should simulate lots of samples of size 20 from the distribution in part (a) with the desired $\varepsilon$ (0.05 in this case). For each sample, compute the desired estimator (the sample median in this case). Then compute the average of the squares of the estimators (since $\mu = 0$ in our samples) and multiply by 20. As an example, we did two simulations of size 10000 each and got 1.617 and 1.621.


8. (a) The description is the same as in Exercise 7(b) with “sample median” replaced by “trimmed mean 
for k = 2” and 0.05 replaced by 0.1. 


(b) We did two simulations of size 10000 each and got 2.041 and 2.088. It would appear that this 
simulation is slightly more variable than the one in Exercise 7. 


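9. The marginal p.d.f. of $X$ is

$$\int_0^\infty\frac{\mu^3}{2}\exp(-\mu(x + 1))\,d\mu = \frac{3}{(x + 1)^4},$$

for $x > 0$. The c.d.f. of $X$ is then

$$F(x) = \int_0^x\frac{3}{(t + 1)^4}\,dt = 1 - \frac{1}{(x + 1)^3},$$

for $x > 0$, and $F(x) = 0$ for $x \le 0$. The median is that $x$ such that $F(x) = 1/2$, which is easily seen to be $2^{1/3} - 1 = 0.2599$.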
10. (a) The c.d.f. of each $X_i$ is $F(x) = 1 - \exp(-\lambda x)$, for $x > 0$. The median is $\log(2)/\lambda$.


(b) Let $Y_i = X_i\lambda$, and let $M'$ be the sample median of $Y_1, \ldots, Y_{21}$. Then the $Y_i$'s have the exponential distribution with parameter 1, the median of $Y_i$ is $\log(2)$, and $M' = M\lambda$. The M.S.E. of $M$ is then

$$E\left[\left(M - \frac{\log(2)}{\lambda}\right)^2\right] = \frac{1}{\lambda^2}E\left[(M' - \log(2))^2\right] = \frac{\theta}{\lambda^2}.$$

(c) Simulate a lot (say $21v_0$) of random variables $X_1, \ldots, X_{21v_0}$ having the exponential distribution with parameter 1. For $i = 1, \ldots, v_0$, let $M_i$ be the sample median of $X_{21(i-1)+1}, \ldots, X_{21i}$. Let $Y_i = (M_i - \log(2))^2$, and compute the sample average $Z = \frac{1}{v_0}\sum_{i=1}^{v_0}Y_i$ as an estimate of $\theta$. If you want to see how good an estimate it is, compute the simulation standard error.
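
The procedure in part (c) can be sketched as follows. This Python version is illustrative only: the function name, the seed, and the choice $v_0 = 2000$ are assumptions, exponentials are generated as $-\log(1 - U)$ from uniform draws, and the simulation standard error is taken here to be the sample standard deviation of the $Y_i$ divided by $\sqrt{v_0}$.

```python
import math
import random
import statistics

def simulate_median_mse(v0, seed, sample_size=21):
    """Estimate theta = E[(M' - log 2)^2] for the median of 21 Exp(1) variables.

    Returns (estimate, simulation standard error).
    """
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(v0):
        # Inverse-c.d.f. method: -log(1 - U) is exponential with parameter 1.
        sample = [-math.log(1.0 - rng.random()) for _ in range(sample_size)]
        m = statistics.median(sample)
        squared_errors.append((m - math.log(2)) ** 2)
    z = statistics.mean(squared_errors)
    sim_std_err = statistics.stdev(squared_errors) / math.sqrt(v0)
    return z, sim_std_err

z, se = simulate_median_mse(v0=2000, seed=5)
print(z, se)
```

By part (b), dividing the estimate by $\lambda^2$ gives the corresponding M.S.E. for any other value of the parameter.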


11. In Example 12.2.4, $\mu_x$ and $\mu_y$ are independent with $(\mu_x - \mu_{x1})/(\beta_{x1}/[\alpha_{x1}\lambda_{x1}])^{1/2}$ having the $t$ distribution with $2\alpha_{x1}$ degrees of freedom and $(\mu_y - \mu_{y1})/(\beta_{y1}/[\alpha_{y1}\lambda_{y1}])^{1/2}$ having the $t$ distribution with $2\alpha_{y1}$ degrees of freedom. We should simulate lots (say $v$) of $t$ random variables $T_x^{(1)}, \ldots, T_x^{(v)}$ with $2\alpha_{x1}$ degrees of freedom and just as many $t$ random variables $T_y^{(1)}, \ldots, T_y^{(v)}$ with $2\alpha_{y1}$ degrees of freedom. Then let

$$\mu_x^{(i)} = \mu_{x1} + T_x^{(i)}\left(\frac{\beta_{x1}}{\alpha_{x1}\lambda_{x1}}\right)^{1/2}, \qquad \mu_y^{(i)} = \mu_{y1} + T_y^{(i)}\left(\frac{\beta_{y1}}{\alpha_{y1}\lambda_{y1}}\right)^{1/2},$$

for $i = 1, \ldots, v$. Then the values $\mu_x^{(i)} - \mu_y^{(i)}$ form a sample from the posterior distribution of $\mu_x - \mu_y$.

12. To the level of approximation in Eq. (12.2.7), we have

$$Z = g(E(Y), E(W)) + g_1(E(Y), E(W))[\bar{Y} - E(Y)] + g_2(E(Y), E(W))[\bar{W} - E(W)].$$

The variance of $Z$ would then be

$$g_1(E(Y), E(W))^2\,\mathrm{Var}(\bar{Y}) + g_2(E(Y), E(W))^2\,\mathrm{Var}(\bar{W}) + 2g_1(E(Y), E(W))\,g_2(E(Y), E(W))\,\mathrm{Cov}(\bar{Y}, \bar{W}). \tag{S.12.1}$$

Now substitute the entries of $\Sigma/v$ for the variances and covariance.


13. The function $g$ in this exercise is $g(y, w) = w - y^2$ with partial derivatives

$$g_1(y, w) = -2y, \qquad g_2(y, w) = 1.$$

In the formula for $\mathrm{Var}(Z)$ given in Exercise 12, make the following substitutions:

Exercise 12 | This exercise

where $Z$, $V$, and $C$ are defined in Example 12.2.10. The result is $[(2\bar{Y})^2Z + V - 4\bar{Y}C]/v$, which simplifies to (12.2.3).


14. Let $Y_1, \ldots, Y_v$ be a large sample from the distribution of $Y$. Let $\bar{Y}$ be the sample average, and let $V$ be the sample variance. For each $i$, define $W_i = (Y_i - \bar{Y})^3/V^{3/2}$. Estimate the skewness by the sample average of the $W_i$'s. Use the sample variance of the $W_i$'s to compute a simulation standard error to see if the simulation size is large enough.


15. (a) Since $S_u = S_0\exp(\alpha u + W_u)$, we have that

$$E(S_u) = S_0\exp(\alpha u)E(\exp(W_u)) = S_0\exp(\alpha u)\psi(1).$$

In order for this mean to be $S_0\exp(ru)$, it is necessary and sufficient that $\psi(1) = \exp(u[r - \alpha])$, or equivalently, $\alpha = r - \log(\psi(1))/u$.

(b) First, simulate lots (say $v$) of random variables $W^{(1)}, \ldots, W^{(v)}$ with the distribution of $W_u$. Define the function $h(s)$ as in Example 12.2.13. Define $Y^{(i)} = \exp(-ru)h(S_0\exp[\alpha u + W^{(i)}])$, where $r$ is the risk-free interest rate and $\alpha$ is the number found in part (a). The sample average of the $Y^{(i)}$'s would estimate the appropriate price for the option. One should compute a simulation standard error to see if the simulation size is large enough.


16. We can model our solution on Example 12.2.14. We should simulate a large number of operations of the queue up to time $t$. For each simulated operation of the queue, count how many customers are in the queue (including any being served). In order to simulate one instance of the queue operation up to time $t$, we can proceed as follows. Simulate interarrival times $X_1, X_2, \ldots$ as exponential random variables with parameter $\lambda$. Define $T_j = \sum_{i=1}^{j}X_i$ for $j = 1, 2, \ldots$. Stop simulating at the first $k$ such that $T_k > t$. Start the queue with $W_0 = 0$, where $W_j$ stands for the time that customer $j$ leaves the queue. In what follows, $S_j \in \{1, 2\}$ will stand for which server serves customer $j$, and $Z_j$ will stand for the time at which customer $j$ begins being served.

For $j = 1, \ldots, k - 1$ and each $i < j$, define

$$U_{i,j} = \begin{cases}1 & \text{if } W_i \ge T_j,\\ 0 & \text{otherwise.}\end{cases}$$

The number of customers in the queue when customer $j$ arrives is $r = \sum_{i=0}^{j-1}U_{i,j}$.

• If $r = 0$, simulate $U$ with a uniform distribution on the interval [0, 1]. Set $S_j = 1$ if $U < 1/2$ and $S_j = 2$ if $U \ge 1/2$. Set $Z_j = T_j$.
• If $r = 1$, find the value $i$ such that $W_i \ge T_j$ and set $S_j = 3 - S_i$ so that customer $j$ goes to the other server. Set $Z_j = T_j$.
• If $r \ge 2$, simulate $U$ with a uniform distribution on the interval [0, 1], and let customer $j$ leave if $U < p_r$. If customer $j$ leaves, set $W_j = T_j$. If customer $j$ does not leave, find the second highest value $W_i$ out of $W_1, \ldots, W_{j-1}$ and set $S_j = S_i$ and $Z_j = W_i$.

For each customer that does not leave, simulate a service time $Y_j$ having an exponential distribution with parameter $\mu_{S_j}$, and set $W_j = Z_j + Y_j$. The number of customers in the queue at time $t$ is the number of $j \in \{1, \ldots, k - 1\}$ such that $W_j \ge t$.


12.3 Simulating Specific Distributions 


Commentary 


This section is primarily of mathematical interest. Most distributions with which students are familiar can be 
simulated directly with existing statistical software. Instructors who wish to steer away from the theoretical 
side of simulation should look over the examples before skipping this section in case they contain some points 
that they would like to make. For example, a method is given for computing simulation standard error when 
the simulation result is an entire sample c.d.f. (see page 811). This relies on results from Sec. 10.6. 


Solutions to Exercises 


1.

(a) Here we are being asked to perform the simulation outlined in the solution to Exercise 10 in
Sec. 12.2 with $v_0 = 2000$ simulations. Each $Y_i$ (in the notation of that solution) can be simulated
by taking a random variable $U_i$ having the uniform distribution on the interval [0,1] and setting
$Y_i = -\log(1 - U_i)$. In addition to the run whose answers are in the back of the text, here are
the results of two additional simulations: Approximation = 0.0536, sim. std. err. = 0.0023 and
Approximation = 0.0492, sim. std. err. = 0.0019.

(b) For the two additional simulations in part (a), the values of $v$ needed to achieve the desired goal are 706
and 459.


2. Let $V_i = a + (b-a)U_i$. Then the p.d.f. of $V_i$ is easily seen to be $1/(b-a)$ for $a < v < b$, so $V_i$ has the
desired uniform distribution.


3. The c.d.f. corresponding to $g_1$ is

$$G_1(x) = \int_0^x \frac{1}{2t^{1/2}}\,dt = x^{1/2}, \quad \text{for } 0 < x < 1.$$

The quantile function is then $G_1^{-1}(p) = p^2$ for $0 < p < 1$. To simulate a random variable with the
p.d.f. $g_1$, simulate $U$ with a uniform distribution on the interval [0,1] and let $X = U^2$. The c.d.f.
corresponding to $g_2$ is

$$G_2(x) = \int_0^x \frac{1}{2(1-t)^{1/2}}\,dt = 1 - (1-x)^{1/2}, \quad \text{for } 0 < x < 1.$$

The quantile function is then $G_2^{-1}(p) = 1 - (1-p)^2$ for $0 < p < 1$. To simulate a random variable with
the p.d.f. $g_2$, simulate $U$ with a uniform distribution on the interval [0,1] and let $X = 1 - (1-U)^2$.


4. The c.d.f. of a Cauchy random variable is

$$F(x) = \int_{-\infty}^{x} \frac{dt}{\pi(1+t^2)} = \frac{1}{\pi}\arctan(x) + \frac{1}{2}.$$

The quantile function is $F^{-1}(p) = \tan(\pi[p - 1/2])$. So, if $U$ has a uniform distribution on the interval
[0,1], then $\tan(\pi[U - 1/2])$ has a Cauchy distribution.
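
As a quick check of the quantile method, the following Python sketch draws Cauchy values as $\tan(\pi[U - 1/2])$. The function names are illustrative only and are not taken from the text.

```python
import math
import random


def cauchy_quantile(p):
    """Quantile function F^{-1}(p) = tan(pi * (p - 1/2)) of the Cauchy distribution."""
    return math.tan(math.pi * (p - 0.5))


def sample_cauchy(n, rng):
    """Draw n Cauchy values by applying the quantile function to uniforms."""
    return [cauchy_quantile(rng.random()) for _ in range(n)]
```

Because the Cauchy distribution has no mean, a sensible check uses quantiles: the sample median should be near 0, and about half of the draws should fall in $[-1, 1]$.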


5. The probability of acceptance on each attempt is $1/k$. Since the attempts (trials) are independent, the
number of failures $X$ until the first acceptance is a geometric random variable with parameter $1/k$. The
number of iterations until the first acceptance is $X + 1$. The mean of $X$ is $(1 - 1/k)/(1/k) = k - 1$, so
the mean of $X + 1$ is $k$.


6. (a) The c.d.f. of the Laplace distribution is

$$F(x) = \int_{-\infty}^{x} \frac{1}{2}\exp(-|t|)\,dt = \begin{cases} \frac{1}{2}\exp(x) & \text{if } x < 0, \\ 1 - \frac{1}{2}\exp(-x) & \text{if } x \geq 0. \end{cases}$$

The quantile function is then

$$F^{-1}(p) = \begin{cases} \log(2p) & \text{if } 0 < p < 1/2, \\ -\log(2[1-p]) & \text{if } 1/2 \leq p < 1. \end{cases}$$

Simulate a uniform random variable $U$ on the interval [0,1] and let $X = F^{-1}(U)$.

(b) Define

$$f(x) = \frac{1}{(2\pi)^{1/2}}\exp(-x^2/2), \qquad g(x) = \frac{1}{2}\exp(-|x|).$$

We need to find a constant $k$ such that $kg(x) \geq f(x)$ for all $x$. Equivalently, we need a constant $c$
such that

$$c \geq \frac{\exp(-x^2/2)}{\exp(-|x|)}, \qquad \text{(S.12.2)}$$

for all $x$. Then we can set $k = c(2/\pi)^{1/2}$. The smallest $c$ that satisfies (S.12.2) is the supremum
of $\exp(|x| - x^2/2)$. This function is symmetric around 0, so we can look for $\sup_{x \geq 0} \exp(x - x^2/2)$. To
maximize this, we can maximize $x - x^2/2$ instead. The maximum of $x - x^2/2$ occurs at $x = 1$, so
$c = \exp(1/2)$. Now, use acceptance/rejection with $k = \exp(1/2)(2/\pi)^{1/2} = 1.315$.
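
Parts (a) and (b) fit together naturally: Laplace proposals come from the quantile function, and each is accepted with probability $f(x)/[kg(x)]$. The Python sketch below illustrates this under the constants derived above; the function names are hypothetical.

```python
import math
import random

K = math.exp(0.5) * math.sqrt(2.0 / math.pi)  # about 1.315


def laplace_quantile(p):
    """Quantile function of the Laplace distribution, valid for 0 < p < 1."""
    if p < 0.5:
        return math.log(2.0 * p)
    return -math.log(2.0 * (1.0 - p))


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def laplace_pdf(x):
    return 0.5 * math.exp(-abs(x))


def sample_normal(rng):
    """Return (draw, number of attempts) using acceptance/rejection."""
    attempts = 0
    while True:
        p = rng.random()
        if p == 0.0:  # avoid log(0); this has negligible probability
            continue
        attempts += 1
        x = laplace_quantile(p)
        # Accept when U * k * g(x) <= f(x).
        if rng.random() * K * laplace_pdf(x) <= normal_pdf(x):
            return x, attempts
```

By the result in Exercise 5, the average number of attempts per accepted draw should be close to $k = 1.315$.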


7. Simulate a random sample $X_1, \ldots, X_{11}$ from the standard normal distribution. Then $\sum_{i=1}^{4} X_i^2$ has the
$\chi^2$ distribution with 4 degrees of freedom and is independent of $\sum_{i=5}^{11} X_i^2$, which has the $\chi^2$ distribution
with 7 degrees of freedom. It follows that

$$\frac{7\sum_{i=1}^{4} X_i^2}{4\sum_{i=5}^{11} X_i^2}$$

has the $F$ distribution with 4 and 7 degrees of freedom.


8. (a) I did five simulations of the type requested and got the estimates 1.325, 1.385, 1.369, 1.306, and
1.329. There seems to be quite a bit of variability if we want three significant digits. 
(b) The five variance estimates were 1.333, 1.260, 1.217, 1.366, and 1.200. 


(c) The required sample sizes varied from 81000 to 91000, suggesting that we do not yet have a very 
precise estimate. 


9. The simplest acceptance/rejection algorithm would use a uniform distribution on the interval [0,2].
That is, let $g(x) = 0.5$ for $0 < x < 2$. Then $(4/3)g(x) \geq f(x)$ for all $x$, i.e., $k = 4/3$. We could simulate
$U$ and $V$ both having a uniform distribution on the interval [0,1]. Then let $X = 2V$ if $2f(2V) \geq (4/3)U$
and reject otherwise.


10. Using the prior distribution stated in the exercise, the posterior distributions for the probabilities of no
relapse in the four treatment groups are beta distributions with the following parameters:

Group          α     β
Imipramine     23    19
Lithium        26    15
Combination    17    22
Placebo        11    25

We then simulate 5000 vectors of four beta random variables with the above parameters. Then we
see what proportion of those 5000 vectors have the imipramine parameter the largest. We did five
such simulations and got the proportions 0.1598, 0.1626, 0.1668, 0.1650, and 0.1650. The sample sizes
required to achieve the desired accuracy are all around 5300.


11. The $\chi^2$ distribution with $m$ degrees of freedom is the same as the gamma distribution with parameters
$m/2$ and $1/2$. So, we should simulate $Y^{(i)}$ having the $\chi^2$ distribution with $n - p$ degrees of freedom
and set $\tau^{(i)} = Y^{(i)}/S^2_{\mathrm{Resid}}$.


12. We did a simulation of size v = 2000. 


(a) The plot of the sample c.d.f. of the absolute differences between the group means is in Fig. S.12.1.

Figure S.12.1: Sample c.d.f. of the absolute difference between group means for Exercise 12a in Sec. 12.3.

(b) The histogram of the ratios of calcium supplement precision to placebo precision is given in
Fig. S.12.2. Only 12% of the simulated log-ratios were positive and 37% were less than $-1$.

Figure S.12.2: Histogram of the log of the ratio of calcium supplement precision to placebo precision for Exercise 12b in Sec. 12.3.

There seems to be a sizeable probability that the two precisions (hence the variances) are unequal.

13. Let $X = F^{-1}(U)$, where $F^{-1}$ is defined in Eq. (12.3.7) and $U$ has a uniform distribution on the interval
[0,1]. Let $G$ be the c.d.f. of $X$. We need to show that $G = F$, where $F$ is defined in Eq. (12.3.6). Since
$F^{-1}$ only takes the values $t_1, \ldots, t_n$, it follows that $G$ has jumps at those values and is flat everywhere
else. Since $F$ also has jumps at $t_1, \ldots, t_n$ and is flat everywhere else, we only need to show that
$F(x) = G(x)$ for $x \in \{t_1, \ldots, t_n\}$. Let $q_n = 1$. Then $X \leq t_i$ if and only if $U \leq q_i$ for $i = 1, \ldots, n$. Since
$\Pr(U \leq q_i) = q_i$, it follows that $G(t_i) = q_i$ for $i = 1, \ldots, n$. That is, $G(x) = F(x)$ for $x \in \{t_1, \ldots, t_n\}$.


14. First, establish the Bonferroni inequality. Let $A_1, \ldots, A_k$ be events. Then

$$\Pr\left(\bigcap_{i=1}^{k} A_i\right) = 1 - \Pr\left(\bigcup_{i=1}^{k} A_i^c\right) \geq 1 - \sum_{i=1}^{k}\Pr(A_i^c) = 1 - \sum_{i=1}^{k}[1 - \Pr(A_i)].$$

Now, let $k = 3$ and

$$A_i = \{|G_{v,i}(x) - G_i(x)| \leq 0.0082, \text{ for all } x\},$$

for $i = 1, 2, 3$. The event stated in the exercise is $\bigcap_{i=1}^{3} A_i$. According to the arguments in Sec. 10.6,

$$\Pr\left(60000^{1/2}|G_{v,i}(x) - G_i(x)| \leq 2, \text{ for all } x\right) \approx 0.9993.$$

Since $2/60000^{1/2} = 0.0082$, we have $\Pr(A_i) \approx 0.9993$ for $i = 1, 2, 3$. The Bonferroni inequality then
says that $\Pr(\bigcap_{i=1}^{3} A_i)$ is approximately $1 - 3(0.0007) = 0.9979$ or more.


15. The proof is exactly what the hint says. All joint p.d.f.’s should be considered joint p.f./p.d.f.’s and
the p.d.f.’s of X and Y should be considered p.f.’s instead. The only integral over x in the proof is in 
the second displayed equation in the proof. The outer integral in that equation should be replaced by 
a sum over all possible x values. The rest of the proof is identical to the proof of Theorem 12.3.1. 


16. Let $p_i = \exp(-\theta)\theta^i/i!$ for $i = 0, 1, \ldots$ and let $q_k = \sum_{i=0}^{k} p_i$. Let $U$ have a uniform distribution on the
interval [0,1]. Let $Y$ be the smallest $k$ such that $U \leq q_k$. Then $Y$ has a Poisson distribution with mean
$\theta$.
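
The search for the smallest $k$ with $U \leq q_k$ can be done by accumulating the probabilities one term at a time. The following Python sketch is one way to do it; the function name is illustrative, and the guard on `p > 0.0` is an added safeguard against floating-point round-off rather than part of the method.

```python
import math
import random


def sample_poisson(theta, rng):
    """Return the smallest k such that U <= q_k, where q_k = p_0 + ... + p_k."""
    u = rng.random()
    k = 0
    p = math.exp(-theta)  # p_0
    q = p                 # q_0
    # Stop if p underflows to zero, so that round-off in q cannot cause an endless loop.
    while u > q and p > 0.0:
        k += 1
        p *= theta / k    # p_k = p_{k-1} * theta / k
        q += p
    return k
```

The recursion $p_k = p_{k-1}\theta/k$ avoids computing factorials directly.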


17. Let $\{x_1, \ldots, x_m\}$ be the set of values that have positive probability under at least one of $g_1, \ldots, g_n$.
That is, for each $j = 1, \ldots, m$ there is at least one $i$ such that $g_i(x_j) > 0$ and for each $i = 1, \ldots, n$,
$\sum_{j=1}^{m} g_i(x_j) = 1$. Then, the law of total probability says that

$$\Pr(X = x_j) = \sum_{i=1}^{n} \Pr(X = x_j \mid I = i)\Pr(I = i).$$

Since $\Pr(I = i) = 1/n$ for $i = 1, \ldots, n$ and $\Pr(X = x_j \mid I = i) = g_i(x_j)$, it follows that

$$\Pr(X = x_j) = \sum_{i=1}^{n} \frac{1}{n} g_i(x_j). \qquad \text{(S.12.3)}$$

Since $x_1, \ldots, x_m$ are the only values that $X$ can take, Eq. (S.12.3) specifies the entire p.f. of $X$ and we
see that Eq. (S.12.3) is the same as Eq. (12.3.8).


18. The Poisson probabilities with mean 5 from the table in the text are

x       0      1      2      3      4      5      6      7      8
f(x)  .0067  .0337  .0842  .1404  .1755  .1755  .1462  .1044  .0653

x       9     10     11     12     13     14     15     16
f(x)  .0363  .0181  .0082  .0034  .0013  .0005  .0002  .0001

where we have put the remainder probability under $x = 16$. In this case we have $n = 17$ different possible
values. Since $1/17 = .0588$, we can use $x_1 = 0$ and $y_1 = 2$. Then $g_1(0) = .1139$ and $g_1(2) = .8861$.
Then $f_1^*(2) = .0842 - (1 - .1139)/17 = .0321$. Next, take $x_2 = 1$ and $y_2 = 3$. Then $g_2(1) = .5729$
and $g_2(3) = .4271$. This makes $f_2^*(3) = .1153$. Next, take $x_3 = 2$ and $y_3 = 3$ so that $g_3(2) = .5453$,
$g_3(3) = .4547$, and $f_3^*(3) = .0885$. Next, take $x_4 = 9$ and $y_4 = 3$, etc. The result of 16 such iterations
is summarized in Table S.12.1.


Table S.12.1: Result of alias method in Exercise 18 of Sec. 12.3

 i    x_i   g_i(x_i)   y_i   g_i(y_i)
 1     0     .1139      2     .8861
 2     1     .5729      3     .4271
 3     2     .5453      3     .4547
 4     9     .6171      3     .3829
 5    10     .3077      3     .6923
 6     3     .4298      4     .5702
 7    11     .1394      4     .8606
 8    12     .0578      4     .9422
 9     4     .6105      5     .3895
10    13     .0221      5     .9779
11    14     .0085      5     .9915
12     5     .6246      6     .3754
13    15     .0034      6     .9966
14    16     .0017      6     .9983
15     6     .1151      7     .8849
16     7     .8899      8     .1101
17     8     1


The alias method is not unique. For example, we could have started with $x_1 = 1$ and $y_1 = 3$ or many
other possible combinations. Each choice would lead to a different version of Table S.12.1.
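
Because the choice of $(x_i, y_i)$ at each step is not unique, a program needs some fixed selection rule. The Python sketch below assumes one such rule, which is not the one used for Table S.12.1: take $x_i$ to be the value with the smallest remaining probability and $y_i$ the value with the largest. It therefore produces a different, equally valid table. All names are illustrative.

```python
import random

# Poisson(5) probabilities from the solution above, remainder placed under x = 16.
POISSON5 = [.0067, .0337, .0842, .1404, .1755, .1755, .1462, .1044, .0653,
            .0363, .0181, .0082, .0034, .0013, .0005, .0002, .0001]


def build_alias_table(f):
    """f maps each value to its probability. Return a list of (x, g(x), y) triples."""
    n = len(f)
    remaining = dict(f)
    table = []
    while len(remaining) > 1:
        # Assumed rule: smallest remaining probability paired with the largest.
        x = min(remaining, key=remaining.get)
        g_x = n * remaining.pop(x)           # g_i(x_i) = n * f*(x_i)
        y = max(remaining, key=remaining.get)
        remaining[y] -= (1.0 - g_x) / n      # f*(y_i) loses (1 - g_i(x_i)) / n
        table.append((x, g_x, y))
    last = next(iter(remaining))
    table.append((last, 1.0, last))          # final row puts all mass on one value
    return table


def sample_alias(table, rng):
    """Pick a row uniformly, then return x with probability g(x), else y."""
    x, g_x, y = table[rng.randrange(len(table))]
    return x if rng.random() < g_x else y
```

A useful correctness check is that averaging the $n$ two-point distributions reproduces the original p.f., as in Eq. (S.12.3).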


19. For $k = 1, \ldots, n$, $I = k$ if and only if $k \leq nY + 1 < k + 1$. Hence

$$\Pr(I = k) = \Pr\left(\frac{k-1}{n} \leq Y < \frac{k}{n}\right) = \frac{1}{n}.$$

The conditional c.d.f. of $U$ given $I = k$ is

$$\begin{aligned}
\Pr(U \leq t \mid I = k) &= \Pr(nY + 1 - I \leq t \mid I = k) \\
&= \frac{\Pr(nY + 1 - k \leq t,\ I = k)}{\Pr(I = k)} \\
&= \frac{\Pr\left(Y \leq \frac{t+k-1}{n},\ \frac{k-1}{n} \leq Y < \frac{k}{n}\right)}{1/n} \\
&= n\Pr\left(\frac{k-1}{n} \leq Y \leq \frac{t+k-1}{n}\right) = t,
\end{aligned}$$

for $0 < t < 1$. So, the conditional distribution of $U$ given $I = k$ is uniform on the interval [0,1] for all
$k$. Since the conditional distribution is the same for all $k$, $U$ and $I$ are independent and the marginal
distribution of $U$ is uniform on the interval [0,1].


12.4 Importance Sampling


Commentary 


Users of importance sampling might forget to check whether the importance function leads to a finite variance 
estimator. If the ratio of the function being integrated to the importance function is not bounded, one might 
have an infinite variance estimator. This doesn’t happen in the examples in the text, but students should 
be made aware of the possibility. This section ends with an introduction to stratified importance sampling. 
This is an advanced topic that is quite useful, but might be skipped in a first pass. The last five exercises in 
this section introduce two additional variance reduction techniques, control variates and antithetic variates. 
These can be useful in many types of simulation problems, but those problems can be difficult to identify. 


Solutions to Exercises 


1. We want to approximate the integral $\int_a^b g(x)\,dx$. Suppose that we use importance sampling with $f$
being the p.d.f. of the uniform distribution on the interval $[a,b]$. Then $g(x)/f(x) = (b-a)g(x)$ for
$a < x < b$. Now, (12.4.1) is the same as (12.4.2).


2. First, we shall describe the second method in the exercise. We wish to approximate the integral
$\int g(x)f(x)\,dx$ using importance sampling with importance function $f$. We should then simulate values
$X^{(i)}$ with p.d.f. $f$ and compute

$$Y^{(i)} = \frac{g(X^{(i)})f(X^{(i)})}{f(X^{(i)})} = g(X^{(i)}).$$

The importance sampling estimate is the average of the $Y^{(i)}$ values. Notice that this is precisely the
same as the first method in the exercise.


3. (a) This is a distribution for which the quantile function is easy to compute. The c.d.f. is $F(x) =
1 - (c/x)^{n/2}$ for $x > c$, so the quantile function is $F^{-1}(p) = c/(1-p)^{2/n}$. So, simulate $U$ having a
uniform distribution on the interval [0,1] and let $X = c/(1-U)^{2/n}$. Then $X$ has the p.d.f. $f$.

(b) Let

$$a = \frac{\Gamma\left(\frac{m+n}{2}\right)m^{m/2}n^{n/2}}{\Gamma\left(\frac{m}{2}\right)\Gamma\left(\frac{n}{2}\right)}.$$

Then the p.d.f. of $Y$ is $g(x) = ax^{(m/2)-1}/(mx+n)^{(m+n)/2}$, for $x > 0$. Hence,

$$\Pr(Y > c) = \int_c^{\infty} \frac{ax^{(m/2)-1}}{(mx+n)^{(m+n)/2}}\,dx.$$

We could approximate this by sampling lots of values $X^{(i)}$ with the p.d.f. $f$ from part (a) and
then averaging the values $g(X^{(i)})/f(X^{(i)})$.

(c) The ratio $g(x)/f(x)$ is, for $x > c$,

$$\frac{g(x)}{f(x)} = \frac{ax^{(m+n)/2}}{c^{n/2}(n/2)(mx+n)^{(m+n)/2}} = \frac{a}{c^{n/2}(n/2)(m+n/x)^{(m+n)/2}}.$$

This function is fairly flat for large $x$. Since we are only interested in $x > c$ in this exercise,
importance sampling will have us averaging random variables $g(X^{(i)})/f(X^{(i)})$ that are nearly
constant, hence the average should have small variance.


4. (a) If our 10000 exponentials are $X^{(1)}, \ldots, X^{(10000)}$, then our approximation is the average of the values
$\log(1 + X^{(i)})$. In two example simulations, I got averages of 0.5960 and 0.5952 with simulation
standard errors of 0.0042 both times.

(b) Using importance sampling with the importance function being the gamma p.d.f. with parameters
1.5 and 1, I got estimates of 0.5965 and 0.5971 with simulation standard errors of 0.0012 both
times.

(c) The reason that the simulations in part (b) have smaller simulation standard error is that the gamma
importance function is a constant times $x^{1/2}\exp(-x)$. The ratio of the integrand to the importance
function is a constant times $\log(1+x)x^{-1/2}$, which is nearly constant itself.
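
The comparison in parts (a) and (b) can be reproduced with a short Python sketch. It assumes the integral being approximated is $\int_0^\infty \log(1+x)\exp(-x)\,dx$, which is what the two estimators described above target; the function names are illustrative. Note that `random.gammavariate` takes a shape and a scale, and the scale is 1 here.

```python
import math
import random
import statistics


def mean_and_se(values):
    """Sample mean and simulation standard error of a list of values."""
    return statistics.fmean(values), statistics.stdev(values) / math.sqrt(len(values))


def estimate_exponential(n, rng):
    """Part (a): average log(1 + X) over exponential(1) draws."""
    return mean_and_se([math.log1p(rng.expovariate(1.0)) for _ in range(n)])


def estimate_gamma(n, rng):
    """Part (b): importance sampling with the gamma(1.5, 1) p.d.f."""
    c = math.gamma(1.5)
    ys = []
    for _ in range(n):
        x = rng.gammavariate(1.5, 1.0)
        # integrand / importance p.d.f. = log(1 + x) * Gamma(1.5) / x^{1/2}
        ys.append(math.log1p(x) * c / math.sqrt(x))
    return mean_and_se(ys)
```

With 10000 draws each, the second estimator should show a noticeably smaller simulation standard error, in line with part (c).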


Let U have a uniform distribution on the interval [0,1], and let W be defined by Eq. (12.4.6). The 
inverse transformation is 


o (“ = He) 
oN We 
02 


The derivative of the inverse transformation is 


— 


1 1 


(2n)'/2o,0 (2—H2) — (-s3 7 H2)*) (8.12.4) 


Since the p.d.f. of U is constant, the p.d.f. of W is (S.12.4), which is the same as (12.4.5). 


6. (a) We can simulate truncated normals as follows. If $U$ has a uniform distribution on the interval
[0,1], then $X = \Phi^{-1}(\Phi(1) + U[1 - \Phi(1)])$ has the truncated normal distribution in the exercise. If
$X^{(1)}, \ldots, X^{(1000)}$ are our simulated values, then the estimate is the average of the $(1 - \Phi(1))X^{(i)2}$
values. Three simulations of size 1000 each produced the estimates 0.4095, 0.3878, and 0.4060.

(b) If $Y$ has an exponential distribution with parameter 0.5, and $X = (1+Y)^{1/2}$, then we can find
the p.d.f. of $X$. The inverse transformation is $y = x^2 - 1$ with derivative $2x$. The p.d.f. of $X$ is
then $2x \cdot 0.5\exp(-0.5x^2 + 0.5)$. If $X^{(1)}, \ldots, X^{(1000)}$ are our simulated values, then the estimate is
the average of the $X^{(i)}\exp(-0.5)/(2\pi)^{1/2}$ values. Three simulations of size 1000 each produced
the estimates 0.3967, 0.3980, and 0.4016.

(c) The simulation standard errors of the simulations in part (a) were close to 0.008, while those from
part (b) were about 0.004, half as large. The reason is that the random variables averaged in
part (b) are closer to constant than those in part (a) since $x$ is closer to constant than $x^2$.
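
Both estimators can be sketched in Python using `statistics.NormalDist` for $\Phi$ and $\Phi^{-1}$. The sketch assumes the target is $\int_1^\infty x^2\phi(x)\,dx$, where $\phi$ is the standard normal p.d.f., which is consistent with the averages described in parts (a) and (b). Function names are illustrative.

```python
import math
import random
import statistics

STD_NORMAL = statistics.NormalDist()
PHI_1 = STD_NORMAL.cdf(1.0)
TAIL = 1.0 - PHI_1  # 1 - Phi(1)


def mean_and_se(values):
    return statistics.fmean(values), statistics.stdev(values) / math.sqrt(len(values))


def estimate_truncated_normal(n, rng):
    """Part (a): X = Phi^{-1}(Phi(1) + U[1 - Phi(1)]), average (1 - Phi(1)) X^2."""
    ys = []
    for _ in range(n):
        p = PHI_1 + rng.random() * TAIL
        x = STD_NORMAL.inv_cdf(min(p, 1.0 - 1e-15))  # keep p strictly below 1
        ys.append(TAIL * x * x)
    return mean_and_se(ys)


def estimate_shifted_exponential(n, rng):
    """Part (b): X = (1 + Y)^{1/2}, Y exponential(0.5), average X exp(-0.5)/(2 pi)^{1/2}."""
    ys = []
    for _ in range(n):
        x = math.sqrt(1.0 + rng.expovariate(0.5))
        ys.append(x * math.exp(-0.5) / math.sqrt(2.0 * math.pi))
    return mean_and_se(ys)
```

Integration by parts gives the exact value $\phi(1) + 1 - \Phi(1) \approx 0.4006$, which agrees with the simulated estimates reported above.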


7. (a) We can simulate bivariate normals by simulating one of the marginals first and then simulating the
second coordinate conditional on the first one. For example, if we simulate $X_1^{(i)}$ and $U^{(i)}$ as independent
normal random variables with mean 0 and variance 1, we can simulate $X_2^{(i)} = 0.5X_1^{(i)} + (0.75)^{1/2}U^{(i)}$.
Three simulations of size 10000 each produced estimates of 0.8285, 0.8308, and 0.8316 with
simulation standard errors of 0.0037 each time.

(b) Using the method of Example 12.4.3, we did three simulations of size 10000 each and got estimates
of 0.8386, 0.8387, and 0.8386 with simulation standard errors of about $3.4 \times 10^{-5}$, about 0.01 as
large as those from part (a). Also, notice how much closer the three simulations are in part (b)
compared to the three in part (a).


8. The random variables that are averaged to compute the importance sampling estimator are $Y^{(i)} =
g(X^{(i)})/f(X^{(i)})$ where the $X^{(i)}$'s have the p.d.f. $f$. Since $g/f$ is bounded, $Y^{(i)}$ has finite variance.


9. The inverse transformation is $v = F(x)$ with derivative $f(x)$. So, the p.d.f. of $X$ is $f(x)/(b-a)$ for
those $x$ that can arise as values of $F^{-1}(V)$, namely $F^{-1}(a) < x < F^{-1}(b)$.


10. For part (a), the stratified importance samples can be found by replacing $U$ in the formula used in
Exercise 6(a) by $a + U(b-a)$ where $(a,b)$ is one of the pairs (0, .2), (.2, .4), (.4, .6), (.6, .8), or (.8, 1).
For part (b), replace $Y$ by $-\log(1 - [a + U(b-a)])$ in the formula $X = (1+Y)^{1/2}$ using the same five
$(a,b)$ pairs. Three simulations using five intervals with 200 samples each produced estimates of 0.4018,
0.4029, and 0.2963 in part (a) and 0.4022, 0.4016, and 0.4012 in part (b). The simulation standard
errors were about 0.0016 in part (a) and 0.0006 in part (b). Both parts have simulation standard errors
about 1/5 or 1/6 the size of those in Exercise 6.


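11. Since the conditional p.d.f. of $X^*$ given $J = j$ is $f_j$, and $\Pr(J = j) = 1/k$ for each $j$, the marginal p.d.f. of $X^*$ is

$$f^*(x) = \sum_{j=1}^{k} \Pr(J = j)f_j(x) = \frac{1}{k}\sum_{j=1}^{k} f_j(x).$$

Since $f_j(x) = kf(x)$ for $q_{j-1} \leq x < q_j$, for each $x$ there is one and only one $f_j(x) > 0$. Hence,
$f^*(x) = f(x)$ for all $x$.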


12. (a) The m.g.f. of a Laplace distribution with parameters 0 and $\sigma$ is

$$\psi(t) = \int_{-\infty}^{\infty} \exp(tx)\frac{1}{2\sigma}\exp(-|x|/\sigma)\,dx.$$

The integral from $-\infty$ to 0 is finite if and only if $t > -1/\sigma$. The integral from 0 to $\infty$ is finite
if and only if $t < 1/\sigma$. So the integral is finite if and only if $-1/\sigma < t < 1/\sigma$. The value of the
integral is

$$\frac{1}{2\sigma}\left[\frac{1}{t + 1/\sigma} + \frac{1}{-t + 1/\sigma}\right] = \frac{1}{1 - t^2\sigma^2}.$$

Plugging $\sigma^2 = u/100$ into this gives the expression in the exercise.

(b) With $u = 1$, $\psi(1) = 1/0.99$. With $r = 0.06$, we get $\alpha = 0.06 + \log(0.99) = 0.04995$. We ran
three simulations of size 100000 each using the method described in the solution to Exercise 15 in
Sec. 12.2. The estimated prices were $S_0$ times 0.0844, 0.0838, and 0.0843. The simulation standard
errors were all about $3.659 \times 10^{-4}$.

(c) $S_u > S_0$ if and only if $W_u > -\alpha u$; in this case $\alpha u = 0.04995$. The conditional c.d.f. of $W_u$ given
that $W_u > -0.04995$ is

$$F(w) = 1.4356 \times \begin{cases} 0.5[\exp(10w) - 0.6068] & \text{if } -0.04995 < w \leq 0, \\ 1 - 0.5[\exp(-10w) + 0.6068] & \text{if } w > 0. \end{cases}$$

The quantile function is then

$$F^{-1}(p) = \begin{cases} 0.1\log(1.3931p + 0.6068) & \text{if } 0 < p < 0.2822, \\ -0.1\log(2[1 - 0.6966p] - 0.6068) & \text{if } 0.2822 \leq p < 1. \end{cases}$$

When we use samples from this conditional distribution, we need to divide the average by 1.4356,
which is the ratio of the conditional p.d.f. to the unconditional p.d.f. We ran three more simulations
of size 100000 each and got estimates of $S_0$ times 0.0845, 0.0846, and 0.0840 with simulation
standard errors of about 2.66.59 $\times 10^{-4}$. The simulation standard error is only a little smaller than
it was in part (b).


13. (a) $E(Z) = E(Y^{(i)}) + kc = E(W^{(i)}) - kE(V^{(i)}) + kc$. By the usual importance sampling argument,
$E(W^{(i)}) = \int g(x)\,dx$ and $E(V^{(i)}) = c$, so $E(Z) = \int g(x)\,dx$.

(b) $\mathrm{Var}(Z) = \frac{1}{v}[\sigma_W^2 + k^2\sigma_V^2 - 2k\rho\sigma_W\sigma_V]$. This is a quadratic in $k$ that is minimized when $k = \rho\sigma_W/\sigma_V$.


14. (a) We know that $\int_0^1 (1+x^2)^{-1}\,dx = \pi/4$. We shall use $f(x) = \exp(-x)/(1 - \exp(-1))$ for $0 < x < 1$.
We shall simulate $X^{(1)}, \ldots, X^{(10000)}$ with this p.d.f. and compute

$$W^{(i)} = \frac{1 - \exp(-1)}{1 + X^{(i)2}}, \qquad V^{(i)} = \frac{\exp[X^{(i)}](1 - \exp(-1))}{1 + X^{(i)2}}, \qquad Y^{(i)} = W^{(i)} - kV^{(i)}.$$

We ran three simulations of 10000 each and got estimates of the integral equal to 0.5248, 0.5262,
and 0.5244 with simulation standard errors around 0.00135. This compares to 0.00097 in
Example 12.4.1. We shall see what went wrong in part (b).

(b) We use the samples in our simulation to estimate $\sigma_W$ at 0.0964, $\sigma_V$ at 0.0710, and $\rho$ at $-0.8683$.
Since the correlation appears to be negative, we should have used a negative value of $k$ to multiply
our control variate. Based on our estimates, we might use $k = \rho\sigma_W/\sigma_V = -1.1789$. Additional simulations
using this value of $k$ produce simulation standard errors around $4.8 \times 10^{-4}$.
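
The effect of choosing $k$ from the data can be seen in a small Python sketch. It assumes the target integral is $\int_0^1 \exp(-x)/(1+x^2)\,dx$, which matches the estimates near 0.525 reported above, and it estimates $k = \rho\sigma_W/\sigma_V$ from the same sample that is averaged, which is a simplification. Names are illustrative.

```python
import math
import random
import statistics

NORM = 1.0 - math.exp(-1.0)  # normalizing constant of f(x) = exp(-x)/NORM on (0, 1)
C = math.pi / 4.0            # known integral of the control function 1/(1 + x^2)


def sample_f(rng):
    """Draw from f via the quantile function x = -log(1 - u * NORM)."""
    return -math.log(1.0 - rng.random() * NORM)


def control_variate_estimate(n, rng):
    """Return (estimate, k, plain std. error, control-variate std. error)."""
    ws, vs = [], []
    for _ in range(n):
        x = sample_f(rng)
        ws.append(NORM / (1.0 + x * x))                # W = g(X)/f(X)
        vs.append(math.exp(x) * NORM / (1.0 + x * x))  # V = h(X)/f(X)
    mw, mv = statistics.fmean(ws), statistics.fmean(vs)
    cov = sum((w - mw) * (v - mv) for w, v in zip(ws, vs)) / (n - 1)
    k = cov / statistics.variance(vs)                  # rho * sigma_W / sigma_V
    ys = [w - k * v for w, v in zip(ws, vs)]
    z = statistics.fmean(ys) + k * C
    se_plain = statistics.stdev(ws) / math.sqrt(n)
    se_cv = statistics.stdev(ys) / math.sqrt(n)
    return z, k, se_plain, se_cv
```

With 10000 draws, the estimated $k$ should come out negative and the control-variate standard error should be roughly half of the plain importance sampling standard error.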


15. (a) Since $U^{(i)}$ and $1 - U^{(i)}$ both have uniform distributions on the interval [0,1], $X^{(i)} = F^{-1}(U^{(i)})$
and $T^{(i)} = F^{-1}(1 - U^{(i)})$ have the same distribution.

(b) Since $X^{(i)}$ and $T^{(i)}$ have the same distribution, so do $W^{(i)}$ and $V^{(i)}$, so the means of $W^{(i)}$ and $V^{(i)}$
are both the same and they are both $\int g(x)\,dx$, according to the importance sampling argument.

(c) Since $F^{-1}$ is a monotone increasing function, we know that $X^{(i)}$ and $T^{(i)}$ are decreasing functions
of each other. If $g(x)/f(x)$ is monotone, then $W^{(i)}$ and $V^{(i)}$ will also be decreasing functions of
each other. As such they ought to be negatively correlated since one is small when the other is
large.

(d) $\mathrm{Var}(Z) = \mathrm{Var}(Y^{(i)})/v$, and

$$\mathrm{Var}(Y^{(i)}) = 0.25[\mathrm{Var}(W^{(i)}) + \mathrm{Var}(V^{(i)}) + 2\mathrm{Cov}(W^{(i)}, V^{(i)})] = 0.5(1+\rho)\mathrm{Var}(W^{(i)}).$$

Without antithetic variates, we get a variance of $\mathrm{Var}(W^{(i)})/[2v]$. If $\rho < 0$, then $0.5(1+\rho) < 0.5$
and $\mathrm{Var}(Z)$ is smaller than we get without antithetic variates.


16. Using the method outlined in Exercise 15, we did three simulations of size 5000 each and got estimates
of 0.5250, 0.5247, and 0.5251 with estimates of $\mathrm{Var}(Y^{(i)})^{1/2}$ of about 0.0238, approximately 1/4 of $\hat{\sigma}_3$
from Example 12.4.1.
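
A minimal Python sketch of the antithetic-variates calculation follows. It assumes the same integral and importance function as in Exercise 14, namely $\int_0^1 \exp(-x)/(1+x^2)\,dx$ with $f(x) = \exp(-x)/(1 - \exp(-1))$; the function names are illustrative.

```python
import math
import random
import statistics

NORM = 1.0 - math.exp(-1.0)


def quantile_f(u):
    """Quantile function of f(x) = exp(-x)/NORM on (0, 1)."""
    return -math.log(1.0 - u * NORM)


def antithetic_estimate(n_pairs, rng):
    """Return (estimate, std. dev. of Y, std. dev. of W) using pairs U and 1 - U."""
    ys, ws = [], []
    for _ in range(n_pairs):
        u = rng.random()
        w = NORM / (1.0 + quantile_f(u) ** 2)        # W from X = F^{-1}(U)
        v = NORM / (1.0 + quantile_f(1.0 - u) ** 2)  # V from T = F^{-1}(1 - U)
        ws.append(w)
        ys.append(0.5 * (w + v))
    return statistics.fmean(ys), statistics.stdev(ys), statistics.stdev(ws)
```

Since $g(x)/f(x)$ is monotone here, the standard deviation of $Y^{(i)}$ should be well below that of $W^{(i)}$ divided by $2^{1/2}$, the value it would take if $W^{(i)}$ and $V^{(i)}$ were uncorrelated.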


17. In Exercise 3(c), $g(x)/f(x)$ is a monotone function of $x$, so antithetic variates should help. In Exercise 4(b),
we could use control variates with $h(x) = \exp(-x)$. In Exercises 6(a) and 6(b) the ratios $g(x)/f(x)$
are monotone, so antithetic variates should help. Control variates with $h(x) = x\exp(-x^2/2)$ could also
help in Exercise 6(a). Exercise 10 involves the same function, so the same methods could also be used
in the stratified versions.


12.5 Markov Chain Monte Carlo


Commentary 


Markov chain Monte Carlo (MCMC) is primarily used to simulate parameters in a Bayesian analysis. Im- 
plementing Gibbs sampling in all but the simplest problems is generally a nontrivial programming task. 
Instructors should keep this in mind when assigning exercises. The less experience students have had with 
programming, the more help they will need in implementing Gibbs sampling. The theoretical justification 
given in the text relies on the material on Markov chains from Sec. 3.10, which might have been skipped 
earlier in the course. This material is not necessary for actually performing MCMC. 

If one is using the software R, there is no substitute for old-fashioned programming. (There is a package
called BUGS, http://www.mrc-bsu.cam.ac.uk/bugs/, but I will not describe it here.) After the solutions, there is R code
to do the calculations in Examples 12.5.6 and 12.5.7 in the text.


Solutions to Exercises 


1. The conditional p.d.f. of $X_1$ given $X_2 = x_2$ is

$$g_1(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f_2(x_2)} = \frac{c\,g(x_1, x_2)}{f_2(x_2)} = \frac{c\,h_2(x_1)}{f_2(x_2)}.$$

Let $c_2 = c/f_2(x_2)$, which does not depend on $x_1$.


2. Let $f_2(x_2) = \int f(x_1, x_2)\,dx_1$ stand for the marginal p.d.f. of $X_2$, and let $g_1(x_1|x_2) = f(x_1, x_2)/f_2(x_2)$
stand for the conditional p.d.f. of $X_1$ given $X_2 = x_2$. We are supposing that $X_2^{(i)}$ has the marginal
distribution with p.d.f. $f_2$. In step 2 of the Gibbs sampling algorithm, after $X_2^{(i)} = x_2$ is observed,
$X_1^{(i+1)}$ is sampled from the distribution with p.d.f. $g_1(x_1|x_2)$. Hence, the joint p.d.f. of $(X_1^{(i+1)}, X_2^{(i)})$
is $f_2(x_2)g_1(x_1|x_2) = f(x_1, x_2)$. In particular, $X_1^{(i+1)}$ has the same marginal distribution as $X_1$, and the
same argument we just gave (with subscripts 1 and 2 switched and applying step 3 instead of 2 in the
Gibbs sampling algorithm) shows that $(X_1^{(i+1)}, X_2^{(i+1)})$ has the same joint distribution as $(X_1^{(i)}, X_2^{(i)})$.


3. Let $h(z)$ stand for the p.f. or p.d.f. of the stationary distribution and let $g(z|z')$ stand for the conditional
p.d.f. or p.f. of $Z_{i+1}$ given $Z_i = z'$, which is assumed to be the same for all $i$. Suppose that $Z_i$ has
the stationary distribution for some $i$. Then $(Z_i, Z_{i+1})$ has the joint p.f. or p.d.f. $h(z_i)g(z_{i+1}|z_i)$. Since
$Z_1$ does have the stationary distribution, $(Z_1, Z_2)$ has the joint p.f. or p.d.f. $h(z_1)g(z_2|z_1)$. Hence,
$(Z_1, Z_2)$ has the same distribution as $(Z_i, Z_{i+1})$ whenever $Z_i$ has the stationary distribution. The proof
is complete if we can show that $Z_i$ has the stationary distribution for every $i$. We shall show this by
induction. We know that it is true for $i = 1$ (that is, $Z_1$ has the stationary distribution). Assume that
each of $Z_1, \ldots, Z_k$ has the stationary distribution, and prove that $Z_{k+1}$ has the stationary distribution.
Since $h$ is the p.d.f. or p.f. of the stationary distribution, it follows that the marginal p.d.f. or p.f. of
$Z_{k+1}$ is $\int h(z_k)g(z_{k+1}|z_k)\,dz_k$ or $\sum_{z_k} h(z_k)g(z_{k+1}|z_k)$, either of which is $h(z_{k+1})$ by the definition of
stationary distribution. Hence $Z_{k+1}$ also has the stationary distribution, and the induction proof is
complete.


4. $\mathrm{Var}(\bar{X}) = \sigma^2/n$ and

$$\mathrm{Var}(\bar{Y}) = \frac{\sigma^2}{n} + \frac{1}{n^2}\sum_{i \neq j}\mathrm{Cov}(Y_i, Y_j).$$

Since $\mathrm{Cov}(Y_i, Y_j) > 0$, $\mathrm{Var}(\bar{Y}) > \sigma^2/n = \mathrm{Var}(\bar{X})$.


5. The sample average of all 30 observations is 1.442, and the value of $s_n^2$ is 2.671. The posterior
hyperparameters are then

$\alpha_1 = 15.5$, $\lambda_1 = 31$, $\mu_1 = 1.4277$, and $\beta_1 = 1.930$.

The method described in Example 12.5.1 says to simulate values of $\mu$ having the normal distribution
with mean 1.4277 and variance $(31\tau)^{-1}$ and to simulate values of $\tau$ having the gamma distribution with
parameters 16 and $1.930 + 0.5(31)(\mu - 1.4277)^2$. In my particular simulation, I used five Markov chains with
the following starting values for $\mu$: 0.4, 1.0, 1.4, 1.8, and 2.2. The convergence criterion was met very
quickly, but we did 100 burn-in iterations anyway. The estimated mean of $(\sqrt{\tau}\mu)^{-1}$ was 0.2542 with simulation
standard error $4.71 \times 10^{-4}$.
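
The two conditional draws alternate in a simple loop. The Python sketch below is an illustration only: it assumes the conditional distribution of $\tau$ given $\mu$ is gamma with parameters $\alpha_1 + 1/2$ and $\beta_1 + 0.5\lambda_1(\mu - \mu_1)^2$, it omits the convergence check, and its names are hypothetical. Note that `random.gammavariate` takes a shape and a scale, so the rate parameter is inverted.

```python
import math
import random

# Posterior hyperparameters from the solution above.
MU1, LAMBDA1, ALPHA1, BETA1 = 1.4277, 31.0, 15.5, 1.930


def gibbs_chain(mu_start, n_iter, burn_in, rng):
    """Run one Gibbs chain and return the (mu, tau) pairs kept after burn-in."""
    mu = mu_start
    draws = []
    for it in range(burn_in + n_iter):
        # tau | mu: gamma with shape alpha_1 + 1/2 and the rate below.
        rate = BETA1 + 0.5 * LAMBDA1 * (mu - MU1) ** 2
        tau = rng.gammavariate(ALPHA1 + 0.5, 1.0 / rate)
        # mu | tau: normal with mean mu_1 and variance 1 / (lambda_1 * tau).
        mu = rng.gauss(MU1, 1.0 / math.sqrt(LAMBDA1 * tau))
        if it >= burn_in:
            draws.append((mu, tau))
    return draws
```

Running five chains from the starting values listed above and averaging $(\sqrt{\tau}\mu)^{-1}$ over the retained draws should give a value close to the 0.2542 reported.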


6. The data summaries that we need to follow the pattern of Example 12.5.4 are the following:

$\bar{x}_1 = 12.5$, $\bar{x}_2 = 47.89$, $\bar{y} = 2341.4$,
$s_{11} = 59020$, $s_{12} = 16737$, $s_{22} = 61990.47$,
$s_{1y} = 927865$, $s_{2y} = 3132934$, $s_{yy} = 169378608$,
and $n = 26$.


(a) The histogram of 0? | values is in Fig. $.12.3. 


Sample d.f. 


20 40 60 80 100 120 


Figure $.12.3: Sample c.d.f. of (a0? values for Exercise 6a in Sec. 12.5. 


(b) i. The histogram of $\beta_0^{(i)} + 26\beta_1^{(i)} + 67.2\beta_2^{(i)}$ values is in Fig. S.12.4.
ii. Let $z' = (1, 26, 67.2)$ as in Example 11.5.7 of the text. To create the predictions, we take
each of the values in the histogram in Fig. S.12.4 and add a pseudo-random normal variable
to each with mean 0 and variance


1/2 
[i+2(Z'Z) 12]? OW, 
We then use the sample 0.05 and 0.95 quantiles as the endpoints of our interval. In three
separate simulations, I got the following intervals: (3652, 5107), (3650, 5103), and (3666, 5131).
These are all slightly wider than the interval in Example 11.5.7.

Figure S.12.4: Histogram of $\beta_0^{(i)} + 26\beta_1^{(i)} + 67.2\beta_2^{(i)}$ values for Exercise 6(b)i in Sec. 12.5.

Figure S.12.5: Histogram of predicted values for 1986 sales in Exercise 6(b)iii in Sec. 12.5.

iii. The histogram of the sales figures used in Exercise 6(b)ii is in Fig. S.12.5. This histogram
has more spread in it than the one in Fig. S.12.4 because the 1986 predictions equal the 1986
parameters plus independent random variables (as described in part (b)ii). The addition of
the independent random variables increases the variance.


7. There are $n_i = 6$ observations in each of $p = 3$ groups. The sample averages are 825.83, 845.0, and
775.0. The $w_i$ values are 570.83, 200.0, and 900.0. In three separate simulations of size 10000 each, I
got the following three vectors of posterior mean estimates: (826.8, 843.2, 783.3), (826.8, 843.2, 783.1),
and (826.8, 843.2, 783.2).


8. (a) To prove that the two models are the same, we need to prove that we get Model 1 when we
integrate $\tau_1, \ldots, \tau_n$ out of Model 2. Since the $\tau_i$'s are independent, the $Y_i$'s remain independent
after integrating out the $\tau_i$'s. In Model 2, $[Y_i - (\beta_0 + \beta_1 x_i)]\tau_i^{1/2}$ has the standard normal distribution
given $\tau_i$, and is therefore independent of $\tau_i$. Also, $\tau_i a\eta$ has the $\chi^2$ distribution with $a$ degrees of
freedom, so

$$\frac{[Y_i - (\beta_0 + \beta_1 x_i)]\tau_i^{1/2}}{(\tau_i\eta)^{1/2}} = \frac{Y_i - (\beta_0 + \beta_1 x_i)}{\eta^{1/2}}$$

has the $t$ distribution with $a$ degrees of freedom, which is the same as Model 1.


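(b) The prior p.d.f. is a constant times

$$\eta^{b/2-1}\exp(-f\eta/2)\prod_{i=1}^{n}\left[\eta^{a/2}\tau_i^{a/2-1}\exp(-a\eta\tau_i/2)\right],$$

while the likelihood is

$$\prod_{i=1}^{n}\left[\tau_i^{1/2}\exp(-[y_i - \beta_0 - \beta_1 x_i]^2\tau_i/2)\right].$$

The product of these two produces Eq. (12.5.4).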


(c) As a function of $\eta$, we have $\eta$ to the power $(na+b)/2 - 1$ times $e$ to the power of $-\eta/2$ times
$f + a\sum_{i=1}^{n}\tau_i$. This is, aside from a constant factor, the p.d.f. of the gamma distribution with
parameters $(na+b)/2$ and $(f + a\sum_{i=1}^{n}\tau_i)/2$. As a function of $\tau_i$, we have $\tau_i$ to the power $(a+1)/2 - 1$
times $e$ to the power $-\tau_i[a\eta + (y_i - \beta_0 - \beta_1 x_i)^2]/2$, which is (aside from a constant factor) the
p.d.f. of the gamma distribution with parameters $(a+1)/2$ and $[a\eta + (y_i - \beta_0 - \beta_1 x_i)^2]/2$. As a
function of $\beta_0$, we have a constant times $e$ to the power

$$-\sum_{i=1}^{n}\tau_i(\beta_0 - [y_i - \beta_1 x_i])^2/2 = -\frac{\sum_{i=1}^{n}\tau_i}{2}\left(\beta_0 - \frac{\sum_{i=1}^{n}\tau_i(y_i - \beta_1 x_i)}{\sum_{i=1}^{n}\tau_i}\right)^2 + c,$$

where $c$ does not depend on $\beta_0$. (Use the method of completing the square.) This is a constant
times the p.d.f. of the normal distribution stated in the exercise. Completing the square as a
function of $\beta_1$ produces the result stated for $\beta_1$ in the exercise.


9. In three separate simulations of size 10000 each I got posterior mean estimates for (69, 61,7) of 
(—0.9526, 0.02052, 1.124 x 10-°), (—0.9593, 0.02056, 1.143 x 10-°), and (—0.9491, 0.02050, 1.138 x 10~°). 
It appears we need more than 10000 samples to get a good estimate of the posterior mean of 89. The esti- 
mated posterior standard deviations from the three simulations were (1.503 x 1077, 7.412 x 107°, 7.899 x 
10~°), (2.388 x 1077, 1.178 x 107+, 5.799 x 10~®), and (2.287 x 10~?, 1.274 x 1074, 6.858 x 107°). 


10. Let the proper prior have hyperparameters $\mu_0$, $\lambda_0$, $\alpha_0$, and $\beta_0$. Conditional on the $Y_i$'s, those $X_i$'s that
have $Y_i = 1$ are an i.i.d. sample of size $\sum_{i=1}^{n} Y_i$ from the normal distribution with mean $\mu$ and precision
$\tau$.

(a) The conditional distribution of $\mu$ given all else is the normal distribution with mean equal to

$$\frac{\lambda_0\mu_0 + \sum_{i=1}^{n} Y_iX_i}{\lambda_0 + \sum_{i=1}^{n} Y_i},$$

and precision equal to $\tau\left(\lambda_0 + \sum_{i=1}^{n} Y_i\right)$.

(b) The conditional distribution of $\tau$ given all else is the gamma distribution with parameters $\alpha_0 +
\sum_{i=1}^{n} Y_i/2 + 1/2$ and

$$\beta_0 + \frac{1}{2}\left[\lambda_0(\mu - \mu_0)^2 + \sum_{i=1}^{n} Y_i(X_i - \mu)^2\right].$$

(c) Given everything except $Y_j$,


(d) To use Gibbs sampling, we need starting values for all but one of the unknowns. For example,
we could randomly assign the data values to the two distributions with probabilities 1/2 each or
randomly split the data into two equal-sized subsets. Given starting values for the $Y_i$'s, we could
start $\mu$ and $\tau$ at their posterior means given the observations that came from the distribution
with unknown parameters. We would then cycle through simulating random variables with the
distributions in parts (a)-(c). After burn-in and a large simulation run, estimate the posterior
means by the averages of the sample parameters in the large simulation run.


(e) The posterior mean of Y; is the posterior probability that Y; = 1. Since Y; = 1 is the same as the 
event that X; came from the distribution with unknown mean and variance, the posterior mean 
of Y; is the posterior probability that X; came from the distribution with unknown mean and 
variance. 


11. For this exercise, I ran five Markov chains for 10000 iterations each. For each iteration, I obtain a 
vector of 10 Y; values. Our estimated probability that X; came from the distribution with unknown 
mean and variance equals the average of the 50000 Y; values for each 7 = 1,...,10. The ten estimated 
probabilities for each of my three runs are listed below: 


Run   Estimated Probabilities
1     0.291 0.292 0.302 0.339 0.370 0.281 0.651 0.374 0.943 0.816
2     0.285 0.286 0.302 0.339 0.375 0.280 0.656 0.371 0.945 0.819
3     0.283 0.286 0.301 0.340 0.373 0.280 0.651 0.370 0.945 0.820


12. Note that yo should be the precision rather than the variance of the prior distribution of pu. 


(a) The prior p.d.f. times the likelihood equals a constant times

\[
\tau^{\alpha_0 + n/2 - 1}
\exp\left(-\frac{\tau}{2}\left\{n[\bar{x}_n - \mu]^2 + s_n^2\right\}\right)
\exp\left(-\frac{\gamma_0}{2}[\mu - \mu_0]^2\right)
\exp\left(-\frac{\tau\beta_0}{2}\right),
\]

where $s_n^2 = \sum_{i=1}^n (x_i - \bar{x}_n)^2$. As a function of $\tau$ this looks like the p.d.f. of the gamma distribution
with parameters $\alpha_0 + n/2$ and $[n(\bar{x}_n - \mu)^2 + s_n^2 + \beta_0]/2$. As a function of $\mu$ (by completing the
square) it looks like the p.d.f. of the normal distribution with mean $(n\tau\bar{x}_n + \mu_0\gamma_0)/(n\tau + \gamma_0)$ and
variance $1/(n\tau + \gamma_0)$.


(b) The data summaries are $n = 18$, $\bar{x}_n = 182.17$, and $s_n^2 = 88678.5$. I ran five chains of length 10000
each for three separate simulations. For each simulation, I obtained 50000 parameter pairs. To
obtain the interval, I sorted the 50000 $\mu$ values and chose the 1250th and 48750th values. For the
three simulations, I got the intervals (154.2, 216.2), (154.6, 216.5), and (154.7, 216.2).
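
The two conditional distributions in part (a) are all that a Gibbs sampler needs. The following Python sketch alternates between them for a single chain. The data summaries are the ones quoted in part (b), but the hyperparameter values (`mu0`, `gamma0`, `alpha0`, `beta0`), the burn-in length, and the function name `gibbs_normal` are placeholders chosen only for illustration, so the interval it produces is not expected to reproduce the ones above.

```python
import random

def gibbs_normal(n, xbar, s2, mu0, gamma0, alpha0, beta0, n_iter, burn_in, rng):
    """Alternate between the two conditionals from part (a); return (mu, tau) draws."""
    mu = xbar  # starting value for mu (an arbitrary but reasonable choice)
    draws = []
    for it in range(burn_in + n_iter):
        # tau | mu: gamma with shape alpha0 + n/2 and rate [n(xbar-mu)^2 + s2 + beta0]/2
        rate = (n * (xbar - mu) ** 2 + s2 + beta0) / 2.0
        tau = rng.gammavariate(alpha0 + n / 2.0, 1.0 / rate)  # 2nd argument is a scale
        # mu | tau: normal with the mean and variance given in part (a)
        precision = n * tau + gamma0
        mean = (n * tau * xbar + mu0 * gamma0) / precision
        mu = rng.gauss(mean, precision ** -0.5)
        if it >= burn_in:
            draws.append((mu, tau))
    return draws

rng = random.Random(1)
# Hyperparameters below are hypothetical placeholders, not the values from the exercise.
draws = gibbs_normal(n=18, xbar=182.17, s2=88678.5, mu0=200.0, gamma0=1e-4,
                     alpha0=1.0, beta0=1000.0, n_iter=10000, burn_in=100, rng=rng)
mus = sorted(m for m, _ in draws)
interval = (mus[249], mus[9749])  # 250th and 9750th of 10000 sorted mu values
print(interval)
```

With 10000 retained draws, the 250th and 9750th sorted values play the role that the 1250th and 48750th values play for 50000 draws.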


13. In part (a), the exponent in the displayed formula should have been $-1/2$.

(a) The conditional distribution of $(\mu - \mu_0)\gamma^{1/2}$ given $\gamma$ is standard normal, hence it is independent
of $\gamma$. Also, the distribution of $2b_0\gamma$ is the $\chi^2$ distribution with $2a_0$ degrees of freedom. It follows
that $(\mu - \mu_0)/(b_0/a_0)^{1/2}$ has the $t$ distribution with $2a_0$ degrees of freedom.

(b) The marginal prior distributions of $\tau$ are in the same form with the same hyperparameters in
Exercise 12 and in Sec. 8.6. The marginal prior distributions of $\mu$ are in the same form also, but
the hyperparameters are not identical. We need $a_0 = \alpha_0$ to make the degrees of freedom match,
and we need $b_0 = \beta_0/\lambda_0$ in order to make the scale factor match.


(c) The prior p.d.f. times the likelihood equals a constant times

\[
\tau^{\alpha_0 + n/2 - 1}\,\gamma^{a_0 + 1/2 - 1}
\exp\left(-\frac{\tau}{2}\left\{n[\bar{x}_n - \mu]^2 + s_n^2 + \beta_0\right\}
- \frac{\gamma}{2}[\mu - \mu_0]^2 - \gamma b_0\right).
\]

As a function of $\tau$ this is the same as in Exercise 12. As a function of $\mu$, it is also the same
as Exercise 12 if we replace $\gamma_0$ by $\gamma$. As a function of $\gamma$, it looks like the p.d.f. of the gamma
distribution with parameters $a_0 + 1/2$ and $b_0 + (\mu - \mu_0)^2/2$.


(d) This time, I ran 10 chains of length 10000 each for three different simulations. The three intervals
are found by sorting the $\mu$ values and using the 2500th and 97500th values. The intervals are
(154.4, 216.3), (154.6, 215.8), and (154.4, 215.9).


14. The exercise should have included that the prior hyperparameters are $\alpha_0 = 0.5$, $\mu_0 = 0$, $\lambda_0 = 1$, and
$\beta_0 = 0.5$.


(a) I used 10 chains of length 10000 each. 


(b) The histogram of predicted values is in Fig. S.12.6. There are two main differences between this
histogram and the one in Fig. 12.10 in the text. First, the distribution of log-arsenic is centered
at slightly higher values in this histogram. Second, the distribution is much less spread out in this
histogram. (Notice the difference in horizontal scales between the two figures.)

Figure S.12.6: Histogram of Log-arsenic predictions for Exercise 14b in Sec. 12.5.
[Horizontal axis: Predicted Log(Arsenic) Values (micrograms per liter). Vertical axis: Count.]


(c) The median of predicted arsenic concentration is 1.525 in my simulation, compared to the smaller 
value 1.231 in Example 12.5.8, about 24% higher. 


15. (a) For each censored observation $X_{n+i}$, we observe only that $X_{n+i} < c$. The probability of $X_{n+i} < c$
given $\theta$ is $1 - \exp(-c\theta)$. The likelihood times prior is a constant times

\[
\theta^{n+\alpha-1}\,[1 - \exp(-c\theta)]^{m}\exp\left(-\theta\sum_{i=1}^{n} x_i\right). \tag{S.12.5}
\]

We can treat the unobserved values $X_{n+1}, \ldots, X_{n+m}$ as parameters. The conditional distribution
of $X_{n+i}$ given $\theta$ and given that $X_{n+i} < c$ has the p.d.f.

\[
g(x|\theta) = \frac{\theta\exp(-\theta x)}{1 - \exp(-c\theta)}, \quad \text{for } 0 < x < c. \tag{S.12.6}
\]

If we multiply the conditional p.d.f. of $(X_{n+1}, \ldots, X_{n+m})$ given $\theta$ times Eq. (S.12.5), we get

\[
\theta^{n+m+\alpha-1}\exp\left(-\theta\sum_{i=1}^{n+m} x_i\right),
\]

for $\theta > 0$ and $0 < x_i < c$ for $i = n+1, \ldots, n+m$. As a function of $\theta$, this looks like the p.d.f. of
the gamma distribution with parameters $n + m + \alpha$ and $\sum_{i=1}^{n+m} x_i$. As a function of $X_{n+i}$, it looks
like the p.d.f. in Eq. (S.12.6). So, Gibbs sampling can work as follows. Pick a starting value for $\theta$,
such as one over the average of the uncensored values. Then simulate the censored observations
with p.d.f. (S.12.6). This can be done using the quantile function

\[
G^{-1}(p) = -\frac{\log(1 - p[1 - \exp(-c\theta)])}{\theta}.
\]

Then, simulate a new $\theta$ from the gamma distribution mentioned above to complete one iteration.
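
One iteration of the scheme in part (a) can be sketched in Python as follows. The function names, the data values in `observed`, and the settings `m`, `c`, and `alpha` are all hypothetical and serve only to make the sketch runnable. The imputation step uses the quantile function of p.d.f. (S.12.6), and the second step draws $\theta$ from the gamma distribution with parameters $n + m + \alpha$ and the sum of all $n + m$ values.

```python
import math
import random

def draw_censored_below_c(theta, c, rng):
    """Inverse-c.d.f. draw from p.d.f. (S.12.6): an exponential truncated to (0, c)."""
    p = rng.random()
    return -math.log(1.0 - p * (1.0 - math.exp(-c * theta))) / theta

def gibbs_iteration(observed, m, c, alpha, theta, rng):
    """Impute the m censored values given theta, then redraw theta given all values."""
    imputed = [draw_censored_below_c(theta, c, rng) for _ in range(m)]
    shape = len(observed) + m + alpha
    rate = sum(observed) + sum(imputed)
    new_theta = rng.gammavariate(shape, 1.0 / rate)  # gammavariate expects a scale
    return new_theta, imputed

rng = random.Random(0)
observed = [0.8, 1.9, 0.4, 2.7, 1.1]   # hypothetical uncensored values
m, c, alpha = 3, 0.3, 1.0              # hypothetical number censored, cutoff, prior shape
theta = 1.0 / (sum(observed) / len(observed))  # one over the average, as suggested above
for _ in range(1000):
    theta, imputed = gibbs_iteration(observed, m, c, alpha, theta, rng)
print(theta, imputed)
```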


(b) For each censored observation $X_{n+i}$, we observe only that $X_{n+i} > c$. The probability of $X_{n+i} > c$
given $\theta$ is $\exp(-c\theta)$. The likelihood times prior is a constant times

\[
\theta^{n+\alpha-1}\exp\left(-\theta\left[mc + \sum_{i=1}^{n} x_i\right]\right). \tag{S.12.7}
\]

We could treat the unobserved values $X_{n+1}, \ldots, X_{n+m}$ as parameters. The conditional distribution
of $X_{n+i}$ given $\theta$ and given that $X_{n+i} > c$ has the p.d.f.

\[
g(x|\theta) = \theta\exp(-\theta[x - c]), \quad \text{for } x > c. \tag{S.12.8}
\]

If we multiply the conditional p.d.f. of $(X_{n+1}, \ldots, X_{n+m})$ given $\theta$ times Eq. (S.12.7), we get

\[
\theta^{n+m+\alpha-1}\exp\left(-\theta\sum_{i=1}^{n+m} x_i\right),
\]

for $\theta > 0$ and $x_i > c$ for $i = n+1, \ldots, n+m$. As a function of $\theta$, this looks like the p.d.f. of
the gamma distribution with parameters $n + m + \alpha$ and $\sum_{i=1}^{n+m} x_i$. As a function of $X_{n+i}$, it looks
like the p.d.f. in Eq. (S.12.8). So, Gibbs sampling can work as follows. Pick a starting value for
$\theta$, such as the M.L.E., $n/(mc + \sum_{i=1}^{n} x_i)$. Then simulate the censored observations with p.d.f. (S.12.8).
This can be done using the quantile function

\[
G^{-1}(p) = c - \frac{\log(1 - p)}{\theta}.
\]

Then, simulate a new $\theta$ from the gamma distribution mentioned above to complete one iteration.
In this part of the exercise, Gibbs sampling is not really needed because the posterior distribution
of $\theta$ is available in closed form. Notice that (S.12.7) is a constant times the p.d.f. of the gamma
distribution with parameters $n + \alpha$ and $mc + \sum_{i=1}^{n} x_i$, which is then the posterior distribution of $\theta$.

16. (a) The joint p.d.f. of $(X_i, Z_i)$ can be found from the joint p.d.f. of $(X_i, Y_i)$ and the transformation
$h(x, y) = (x, x + y)$. The joint p.d.f. of $(X_i, Y_i)$ is $f(x, y) = \lambda\mu\exp(-x\lambda - y\mu)$ for $x, y > 0$. The
inverse transformation is $h^{-1}(x, z) = (x, z - x)$, with Jacobian equal to 1. So, the joint p.d.f. of
$(X_i, Z_i)$ is

\[
g(x, z) = f(x, z - x) = \lambda\mu\exp(-x[\lambda - \mu] - z\mu), \quad \text{for } 0 < x < z.
\]

The marginal p.d.f. of $Z_i$ is the integral of this over $x$, namely

\[
g_2(z) = \frac{\lambda\mu}{\lambda - \mu}\,[1 - \exp(-z[\lambda - \mu])]\exp(-z\mu),
\]

for $z > 0$. The conditional p.d.f. of $X_i$ given $Z_i = z$ is the ratio

\[
\frac{g(x, z)}{g_2(z)} = \frac{\lambda - \mu}{1 - \exp(-z[\lambda - \mu])}\exp(-x[\lambda - \mu]), \quad \text{for } 0 < x < z. \tag{S.12.9}
\]

The conditional c.d.f. of $X_i$ given $Z_i = z$ is the integral of this, which is the formula in the text.

(b) The likelihood times prior is

\[
\frac{\lambda^{n+a-1}\mu^{n+b-1}}{(\lambda - \mu)^{n-k}}
\exp\left(-\lambda\sum_{i=1}^{k} x_i - \mu\left[\sum_{i=1}^{k} y_i + \sum_{i=k+1}^{n} z_i\right]\right)
\prod_{i=k+1}^{n}\left[1 - \exp(-[\lambda - \mu]z_i)\right]. \tag{S.12.10}
\]

We can treat the unobserved pairs $(X_i, Y_i)$ for $i = k+1, \ldots, n$ as parameters. Since we observe
$X_i + Y_i = Z_i$, we shall just treat $X_i$ as a parameter. The conditional p.d.f. of $X_i$ given the other
parameters and $Z_i$ is in (S.12.9). Multiplying the product of those p.d.f.'s for $i = k+1, \ldots, n$
times (S.12.10) gives

\[
\lambda^{n+a-1}\mu^{n+b-1}
\exp\left(-\lambda\sum_{i=1}^{n} x_i - \mu\left[\sum_{i=1}^{k} y_i + \sum_{i=k+1}^{n} (z_i - x_i)\right]\right), \tag{S.12.11}
\]

where $0 < x_i < z_i$ for $i = k+1, \ldots, n$. As a function of $\lambda$, (S.12.11) looks like the p.d.f. of the
gamma distribution with parameters $n + a$ and $\sum_{i=1}^{n} x_i$. As a function of $\mu$ it looks like the p.d.f. of
the gamma distribution with parameters $n + b$ and $\sum_{i=1}^{n} y_i$, where $y_i = z_i - x_i$ for $i = k+1, \ldots, n$.
As a function of $x_i$ ($i = k+1, \ldots, n$), it looks like the p.d.f. in (S.12.9). So, Gibbs sampling can
work as follows. Pick starting values for $\lambda$ and $\mu$, such as one over the averages of the observed values
of the $X_i$'s and $Y_i$'s, respectively. Then simulate the unobserved $X_i$ values for $i = k+1, \ldots, n$ using
the probability integral transform. Then simulate new $\lambda$ and $\mu$ values using the gamma distributions
mentioned above to complete one iteration.


12.6 The Bootstrap 


Commentary 


The bootstrap has become a very popular technique for solving non-Bayesian problems that are not amenable 
to analysis. The nonparametric bootstrap can be implemented without much of the earlier material in this 
chapter. Indeed, one need only know how to simulate from a discrete uniform distribution (Example 12.3.11) 
and compute simulation standard errors (Sec. 12.2). 

The software R has a function boot that is available after issuing the command library(boot). The
first three arguments to boot are a vector data containing the original sample, a function f to compute the
statistic whose distribution is being bootstrapped, and the number of bootstrap samples to create. For the
nonparametric bootstrap, the function f must have at least two arguments. The first will always be data, and
the second will be a vector inds of integers of the same dimension as data. This vector inds will choose the
bootstrap sample. The function should return the desired statistic computed from the sample data[inds].
Any additional arguments to f can be passed to boot by setting them explicitly at the end of the argument
list. For the parametric bootstrap, boot needs the optional arguments sim="parametric" and ran.gen.
The function ran.gen tells how to generate the bootstrap samples, and it takes two arguments. The first
argument will be data. The second argument is anything else that you need to generate the samples, for
example, estimates of parameters based on the original data. Also, f needs at least one argument, which will
be a simulated data set. Any additional arguments can be passed explicitly to boot.
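
The calling convention described above, a statistic that receives the original data together with a vector of indices, is easy to mimic outside R. The following Python sketch is only an analogue of that convention and is not the boot function itself. The names `nonparametric_bootstrap` and `median_stat` and the data values are hypothetical, and Python indices start at 0 rather than 1.

```python
import random
import statistics

def nonparametric_bootstrap(data, stat, n_boot, rng):
    """Call stat(data, inds) for n_boot index vectors drawn with replacement."""
    n = len(data)
    results = []
    for _ in range(n_boot):
        inds = [rng.randrange(n) for _ in range(n)]  # the bootstrap sample, as indices
        results.append(stat(data, inds))
    return results

def median_stat(data, inds):
    """Statistic computed from the resampled values data[i] for i in inds."""
    return statistics.median(data[i] for i in inds)

rng = random.Random(12)
data = [12.0, 15.5, 9.8, 20.1, 14.2, 11.7, 18.3]  # hypothetical sample
medians = nonparametric_bootstrap(data, median_stat, 2000, rng)
bias_estimate = statistics.mean(medians) - statistics.median(data)
variance_estimate = statistics.variance(medians)
print(bias_estimate, variance_estimate)
```

Passing indices rather than resampled values lets the same statistic function work on paired or multivariate data, because every coordinate can be subset with the same index vector.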


Solutions to Exercises 


1. We could start by estimating $\theta$ by the M.L.E., $1/\bar{X}$. Then we would use the exponential distribution
with parameter $1/\bar{X}$ for the distribution $\hat{F}$ in the bootstrap. The bootstrap estimate of the variance
of $\bar{X}$ is the variance of a sample average $\bar{X}^*$ of a sample of size $n$ from the distribution $\hat{F}$, i.e., the
exponential distribution with parameter $1/\bar{X}$. The variance of $\bar{X}^*$ is $1/n$ times the variance of a single
observation from $\hat{F}$, which equals $\bar{X}^2$. So, the bootstrap estimate is $\bar{X}^2/n$.


2. The numbers $x_1, \ldots, x_n$ are known when we sample from $F_n$. Let $i_1, \ldots, i_n \in \{1, \ldots, n\}$. Since
$X_j^* = x_{i_j}$ if and only if $J_j = i_j$, we can compute

\[
\Pr(X_1^* = x_{i_1}, \ldots, X_n^* = x_{i_n}) = \Pr(J_1 = i_1, \ldots, J_n = i_n)
= \prod_{j=1}^{n}\Pr(J_j = i_j) = \prod_{j=1}^{n}\Pr(X_j^* = x_{i_j}).
\]

The second equality follows from the fact that $J_1, \ldots, J_n$ are a random sample with replacement from
the set $\{1, \ldots, n\}$.


3. Let $n = 2k + 1$. The sample median of a nonparametric bootstrap sample is the $(k+1)$st smallest
observation in the bootstrap sample. Let $x$ denote the smallest observation in the original sample.
Assume that there are $\ell$ observations from the original sample that equal $x$. (Usually $\ell = 1$, but it
is not necessary.) The sample median from the bootstrap sample equals $x$ from the original data set
if and only if at least $k + 1$ observations in the bootstrap sample equal $x$. Since each observation
in the bootstrap sample equals $x$ with probability $\ell/n$ and the bootstrap observations are independent, the
probability that at least $k + 1$ of them equal $x$ is

\[
\sum_{i=k+1}^{n}\binom{n}{i}\left(\frac{\ell}{n}\right)^{i}\left(1 - \frac{\ell}{n}\right)^{n-i}.
\]
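
The binomial tail probability above is simple to evaluate numerically. The short Python sketch below does so. The function name `prob_median_is_smallest` is hypothetical, and the case $k = 1$, $\ell = 1$ (so $n = 3$) is included only as a hand-checkable example, since at least two of three draws must hit the smallest value, which has probability $3(1/3)^2(2/3) + (1/3)^3 = 7/27$.

```python
from math import comb

def prob_median_is_smallest(k, ell=1):
    """P(at least k+1 of n = 2k+1 bootstrap draws equal the smallest value)."""
    n = 2 * k + 1
    p = ell / n  # chance that a single bootstrap draw equals the smallest value
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1, n + 1))

print(prob_median_is_smallest(1))   # n = 3, equals 7/27
print(prob_median_is_smallest(7))   # n = 15, already very small
```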


4. For each bootstrap sample, compute the sample median. The bias estimate is the average of all of these
sample medians minus the original sample median, 201.3. I started with a pilot sample of size 2000
and estimated the bias as 0.545. The sample variance of the 2000 sample medians was 3.435. This led
me to estimate the necessary simulation size as

\[
\left[\Phi^{-1}\left(\frac{1 + \gamma}{2}\right)\frac{3.435^{1/2}}{0.02}\right]^2,
\]

where $\gamma$ is the probability specified in the exercise. So, I did 30000 bootstrap samples. The new
estimate of bias was 0.5564, with a simulation standard error of 0.011.


5. This exercise is performed in a manner similar to Exercise 4.

(a) In this case, I did three simulations of size 50000 each. The three estimates of bias were -1.684,
-1.688, and -1.608.

(b) Each time, the estimated sample size needed to achieve the desired accuracy was between 48000
and 49000.


6. (a) For each bootstrap sample, compute the sample median. The estimate we want is the sample
variance of these values. I did a pilot simulation of size 2000 and got a sample variance of 18.15.
I did another simulation of size 10000 and got a sample variance of 18.87.

(b) To achieve the desired accuracy, we would need a simulation of size

\[
\left[\Phi^{-1}\left(\frac{1 + 0.95}{2}\right)\frac{18.87^{1/2}}{0.005}\right]^2 = 2899533.
\]

That is, we would need about three million bootstrap samples.
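
The arithmetic behind the figure of about three million can be checked with a few lines of Python. The sketch assumes that the required accuracy is 0.005 and the required probability is 0.95, which are the values that reproduce the total 2899533 quoted above. The function name `simulation_size` is hypothetical.

```python
from statistics import NormalDist

def simulation_size(sigma2, eps, gamma):
    """Size v with [Phi^{-1}((1 + gamma)/2) * sigma / eps]^2, sigma2 being the variance."""
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    return (z * sigma2 ** 0.5 / eps) ** 2

v = simulation_size(sigma2=18.87, eps=0.005, gamma=0.95)
print(round(v))  # about 2.9 million
```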


7. (a) Each bootstrap sample consists of $\bar{X}^*$ having a normal distribution with mean 0 and variance
31.65/11, $\bar{Y}^*$ having the normal distribution with mean 0 and variance 68.8/10, $S_X^{2*}$ being 31.65
times a $\chi^2$ random variable with 10 degrees of freedom, and $S_Y^{2*}$ being 68.8 times a $\chi^2$ random
variable with 9 degrees of freedom. For each sample, we compute the statistic $U$ displayed in
Example 12.6.10 in the text. We then compute what proportion of the absolute values of the
10000 statistics exceed the 0.95 quantile of the $t$ distribution with 19 degrees of freedom, 1.729.
In three separate simulations, I got proportions of 0.1101, 0.1078, and 0.1115.

(b) To correct the level of the test, we need the 0.9 quantile of the distribution of $|U|$. For each
simulation, we sort the 10000 $|U|$ values and select the 9000th value. In my three simulations,
this value was 1.773, 1.777, and 1.788.

(c) To compute the simulation standard error of the sample quantile, I chose to split the 10000 samples
into eight sets of size 1250. For each set, I sort the $|U|$ values and choose the 1125th one. The
simulation standard error is then the square-root of one-eighth of the sample variance of
these eight values. In my three simulations, I got the values 0.0112, 0.0136, and 0.0147.
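
The batching idea in part (c) can be sketched generically in Python. Because the statistic $U$ of Example 12.6.10 is not reproduced in this manual, the sketch uses absolute values of standard normal draws as a stand-in for the $|U|$ values. That substitution and the function name `batch_quantile_se` are assumptions made purely for illustration.

```python
import random
import statistics

def batch_quantile_se(values, n_batches, q):
    """Standard error of a sample quantile from the spread of per-batch quantiles."""
    size = len(values) // n_batches
    batch_quantiles = []
    for b in range(n_batches):
        batch = sorted(values[b * size:(b + 1) * size])
        batch_quantiles.append(batch[round(q * size) - 1])  # e.g. 1125th of 1250
    # square-root of (1 / n_batches) times the sample variance of the batch quantiles
    return (statistics.variance(batch_quantiles) / n_batches) ** 0.5

rng = random.Random(5)
abs_u = [abs(rng.gauss(0.0, 1.0)) for _ in range(10000)]  # stand-in for the |U| values
se = batch_quantile_se(abs_u, n_batches=8, q=0.9)
print(se)
```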


8. The correlation is the ratio of the covariance to the square-root of the product of the variances. The
mean of $X^*$ is $E(X^*) = \bar{X}$, and the mean of $Y^*$ is $E(Y^*) = \bar{Y}$. The variance of $X^*$ is
$\sum_{i=1}^{n}(X_i - \bar{X})^2/n$, and the variance of $Y^*$ is $\sum_{i=1}^{n}(Y_i - \bar{Y})^2/n$. The covariance is

\[
E[(X^* - \bar{X})(Y^* - \bar{Y})] = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}).
\]

Dividing this by the square-root of the product of the variances yields (12.6.2).


9. (a) For each bootstrap sample, compute the sample correlation $R^{(i)}$. Then compute the sample
variance of $R^{(1)}, \ldots, R^{(1000)}$. This is the approximation to the bootstrap estimate of the variance
of the sample correlation. I did three separate simulations and got sample variances of $4.781 \times 10^{-4}$,
$4.741 \times 10^{-4}$, and $4.986 \times 10^{-4}$.

(b) The approximation to the bootstrap bias estimate is the sample average of $R^{(1)}, \ldots, R^{(1000)}$ minus
the original sample correlation, 0.9670. In my three simulations, I got the values -0.0030, -0.0022,
and -0.0026. It looks like 1000 is not enough bootstrap samples to get a good estimate of this
bias.

(c) For the simulation standard error of the variance estimate, we use the square-root of Eq. (12.2.3)
where each $Y^{(i)}$ in (12.2.3) is $R^{(i)}$ in this exercise. In my three simulations, I got the values
$2.231 \times 10^{-5}$, $2.734 \times 10^{-5}$, and $3.228 \times 10^{-5}$. For the simulation standard error of the bias
estimate, we just note that the bias estimate is an average, so we need only calculate the
square-root of 1/1000 times the sample variance of $R^{(1)}, \ldots, R^{(1000)}$. In my simulations, I got
$6.915 \times 10^{-4}$, $6.886 \times 10^{-4}$, and $7.061 \times 10^{-4}$.


10. For both parts (a) and (b), we need 10000 bootstrap samples. From each bootstrap sample, we compute
the sample median. Call these values $M^{*(i)}$, for $i = 1, \ldots, 10000$. The median of the original data is
$M = 152.5$.

(a) Sort the $M^{*(i)}$ values from smallest to largest. The percentile interval just runs from the 500th
sorted value to the 9500th sorted value. I ran three simulations and got the following three
intervals: [148, 175], [148, 175], and [146.5, 175].

(b) Choose a measure of spread and compute it from the original sample. Call the value $Y$. For
each bootstrap sample, compute the same measure of spread $Y^{*(i)}$. I chose the median absolute
deviation, which is $Y = 19$ for this data set. Then sort the values $(M^{*(i)} - M)/Y^{*(i)}$. Find the
500th and 9500th sorted values $Z_{(500)}$ and $Z_{(9500)}$. The percentile-t confidence interval is then
computed from $M$, $Y$, and these two sorted values. In my three simulations, I got the intervals
[143, 181], [142.6, 181], and [141.9, 181].

(c) The sample average of the beef hot dog values is 156.9, and the value of $\sigma'$ is 22.64. The confidence
interval based on the normal distribution uses the $t$ distribution quantile $T_{19}^{-1}(0.95) = 1.729$ and
equals $156.9 \pm 1.729 \times 22.64/20^{1/2}$, or [148.1, 165.6]. This interval is considerably shorter than
either of the bootstrap intervals.


11. If $X^*$ has the distribution $F_n$, then $\mu = E(X^*) = \bar{X}$,

\[
\sigma^2 = \mathrm{Var}(X^*) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2, \quad \text{and} \quad
E([X^* - \mu]^3) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^3.
\]

Plugging these values into the formula for skewness (see Definition 4.4.1) yields the formula for
$M_3$ given in this exercise.

The summary statistics of the 1970 fish price data are $\bar{X} = 41.1$, $\sum_{i=1}^{n}(X_i - \bar{X})^2/n = 1316.5$, and
$\sum_{i=1}^{n}(X_i - \bar{X})^3/n = 58176$, so the sample skewness is $M_3 = 1.218$. For each bootstrap sample,
we also compute the sample skewness $M_3^{*(j)}$ for $j = 1, \ldots, 1000$. The bias of $M_3$ is estimated by
the sample average of the $M_3^{*(j)}$'s minus $M_3$. I did three simulations and got the values -0.2537,
-0.2936, and -0.2888. To estimate the standard deviation of $M_3$, compute the sample standard
deviation of the $M_3^{*(j)}$'s. In my three simulations, I got 0.5480, 0.5590, and 0.5411.
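
As a quick check of the arithmetic, the sample skewness is the third central moment divided by the 3/2 power of the second central moment, both computed with divisor $n$. The Python sketch below evaluates this from the two summary statistics quoted above. The function names are hypothetical, and `sample_skewness` shows how the same quantity would be computed from raw data or from a bootstrap sample.

```python
def skewness_from_moments(m2, m3):
    """Skewness from the second and third central moments (both with divisor n)."""
    return m3 / m2 ** 1.5

def sample_skewness(xs):
    """Sample skewness M3 of a list of values, using divisor n throughout."""
    n = len(xs)
    xbar = sum(xs) / n
    m2 = sum((x - xbar) ** 2 for x in xs) / n
    m3 = sum((x - xbar) ** 3 for x in xs) / n
    return skewness_from_moments(m2, m3)

m3_fish = skewness_from_moments(1316.5, 58176.0)
print(round(m3_fish, 3))
```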


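12. We want to show that the distribution of $R$ is the same for all parameter vectors $(\mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho)$
that share the same value of $\rho$. Let $\theta_1 = (\mu_{x1}, \mu_{y1}, \sigma_{x1}^2, \sigma_{y1}^2, \rho)$ and $\theta_2 = (\mu_{x2}, \mu_{y2}, \sigma_{x2}^2, \sigma_{y2}^2, \rho)$ be two
parameter vectors that share the same value of $\rho$. Let $a_x = \sigma_{x2}/\sigma_{x1}$, $a_y = \sigma_{y2}/\sigma_{y1}$,
$b_x = \mu_{x2}/a_x - \mu_{x1}$, and $b_y = \mu_{y2}/a_y - \mu_{y1}$. For $i = 1, 2$, let $W_i$ be a sample of size $n$ from a
bivariate normal distribution with parameter vector $\theta_i$, and let $R_i$ be the sample correlation. We want
to show that $R_1$ and $R_2$ have the same distribution. Write $W_i = [(X_{i1}, Y_{i1}), \ldots, (X_{in}, Y_{in})]$ for $i = 1, 2$.
Define $X_{2j}' = a_x(X_{1j} + b_x)$ and $Y_{2j}' = a_y(Y_{1j} + b_y)$ for $j = 1, \ldots, n$. Then it is easy to see that
$W_2' = [(X_{21}', Y_{21}'), \ldots, (X_{2n}', Y_{2n}')]$ has the same distribution as $W_2$, because each pair is bivariate
normal with the means, variances, and correlation in $\theta_2$. Let $R_2'$ be the sample correlation computed
from $W_2'$. Then $R_2'$ and $R_2$ have the same distribution. We complete the proof by showing that
$R_2' = R_1$. Hence $R_2'$ and $R_1$ and $R_2$ all have the same distribution. To see that $R_2' = R_1$, let
$\bar{X}_1 = \frac{1}{n}\sum_{j=1}^{n} X_{1j}$ and similarly $\bar{X}_2'$, $\bar{Y}_1$, and $\bar{Y}_2'$. Then $\bar{X}_2' = a_x(\bar{X}_1 + b_x)$ and
$\bar{Y}_2' = a_y(\bar{Y}_1 + b_y)$. So, for each $j$, $X_{2j}' - \bar{X}_2' = a_x(X_{1j} - \bar{X}_1)$ and $Y_{2j}' - \bar{Y}_2' = a_y(Y_{1j} - \bar{Y}_1)$.
Since $a_x, a_y > 0$, it follows that

\[
R_2' = \frac{\sum_{j=1}^{n}(X_{2j}' - \bar{X}_2')(Y_{2j}' - \bar{Y}_2')}
{\left(\left[\sum_{j=1}^{n}(X_{2j}' - \bar{X}_2')^2\right]\left[\sum_{j=1}^{n}(Y_{2j}' - \bar{Y}_2')^2\right]\right)^{1/2}}
= \frac{a_x a_y\sum_{j=1}^{n}(X_{1j} - \bar{X}_1)(Y_{1j} - \bar{Y}_1)}
{\left(a_x^2 a_y^2\left[\sum_{j=1}^{n}(X_{1j} - \bar{X}_1)^2\right]\left[\sum_{j=1}^{n}(Y_{1j} - \bar{Y}_1)^2\right]\right)^{1/2}}
= R_1.
\]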
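
The key algebraic step, that the sample correlation is unchanged when each coordinate is shifted and then multiplied by a positive constant, can be confirmed numerically. In the Python sketch below the simulated data and the constants `a_x`, `b_x`, `a_y`, `b_y` are arbitrary illustrative choices, and the function name `sample_correlation` is hypothetical.

```python
import random

def sample_correlation(xs, ys):
    """Sample correlation of paired values."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(3)
x1 = [rng.gauss(0.0, 1.0) for _ in range(50)]
y1 = [0.6 * x + 0.8 * rng.gauss(0.0, 1.0) for x in x1]   # correlated with x1

a_x, b_x, a_y, b_y = 2.5, -1.0, 0.4, 7.0   # arbitrary constants with a_x, a_y > 0
x2 = [a_x * (x + b_x) for x in x1]
y2 = [a_y * (y + b_y) for y in y1]

r1 = sample_correlation(x1, y1)
r2 = sample_correlation(x2, y2)
print(r1, r2)  # equal up to rounding error
```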


12.7 Supplementary Exercises 


Solutions to Exercises 


1. For the random number generator that I have been using for these solutions, Fig. S.12.7 contains
one such normal quantile plot. It looks fairly straight. On the horizontal axis I plotted the sorted
pseudo-normal values and on the vertical axis, I plotted the values $\Phi^{-1}(i/10001)$ for $i = 1, \ldots, 10000$.

Figure S.12.7: Normal quantile plot for Exercise 1 in Sec. 12.7. A straight line has been added for reference.
[Axis labels: Simulated values; Normal quantiles.]


2. The plots for this exercise are formed the same way as that in Exercise 1 except we replace the normal
pseudo-random values by the appropriate gamma pseudo-random values and we replace $\Phi^{-1}$ by the
quantile function of the appropriate gamma distribution. Two of the plots are in Fig. S.12.8. The plots
are pretty straight except in the extreme upper tail, where things are expected to be highly variable.

Figure S.12.8: Gamma quantile plots for Exercise 2 in Sec. 12.7. The left plot has parameters 0.5 and 1 and
the right plot has parameters 10 and 1. Straight lines have been added for reference.
[Axis labels: Simulated values; Gamma(0.5) quantiles (left) and Gamma(10) quantiles (right).]


3. Once again, the plots are drawn in a fashion similar to Exercise 1. This time, we notice that the plot
with one degree of freedom has some really serious non-linearity. This is the Cauchy distribution, which
has very long tails. The extreme observations from a Cauchy sample are very variable. Two of the
plots are in Fig. S.12.9.

Figure S.12.9: Two t quantile plots for Exercise 3 in Sec. 12.7. The left plot has 1 degree of freedom, and
the right plot has 20 degrees of freedom. Straight lines have been added for reference.
[Axis labels: Simulated values; t quantiles with 1 degree of freedom (left) and t quantiles with 20 degrees
of freedom (right).]


4. (a) I simulated 1000 pairs three times and got the following average values: 1.478, 1.462, 1.608. It
looks like 1000 is not enough to be very confident of getting the average within 0.01.

(b) Using the same three sets of 1000, I computed the sample variance each time and got 1.8521,
1.6857, and 2.5373.

(c) Using (12.2.5), it appears that we need from 120000 to 170000 simulations.


5. (a) To simulate a noncentral $t$ random variable, we can simulate independent $Z$ and $W$ with $Z$ having
the normal distribution with mean 1.936 and variance 1, and $W$ having the $\chi^2$ distribution with
14 degrees of freedom. Then set $T = Z/(W/14)^{1/2}$.

(b) I did three separate simulations of size 1000 each and got the following three proportions with
$T > 1.761$: 0.571, 0.608, 0.577. The simulation standard errors were 0.01565, 0.01544, and 0.01562.

(c) Using (12.2.5), we find that we need a bit more than 16000 simulated values.
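
The construction $T = Z/(W/14)^{1/2}$ translates directly into code. The Python sketch below simulates it and estimates the proportion with $T > 1.761$ together with its simulation standard error. The seed, the simulation size of 20000, and the function name `noncentral_t_draw` are illustrative choices, so the estimate will not match any of the three values above exactly.

```python
import random

def noncentral_t_draw(rng, mean=1.936, df=14):
    """One draw of T = Z / (W/df)^(1/2), with Z ~ normal(mean, 1) and W ~ chi-square(df)."""
    z = rng.gauss(mean, 1.0)
    w = rng.gammavariate(df / 2.0, 2.0)  # gamma(shape df/2, scale 2) is chi-square(df)
    return z / (w / df) ** 0.5

rng = random.Random(2024)
n_sim = 20000
hits = sum(noncentral_t_draw(rng) > 1.761 for _ in range(n_sim))
prop = hits / n_sim
se = (prop * (1 - prop) / n_sim) ** 0.5  # standard error of a simulated proportion
print(prop, se)
```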


6. (a) For each sample, we compute the numbers of observations in each of the four intervals $(-\infty, 3.575)$,
$[3.575, 3.912)$, $[3.912, 4.249)$, and $[4.249, \infty)$. Then we compute the $Q$ statistic as we did in
Example 10.1.6. We then compare each $Q$ statistic to the three critical values 7.779, 9.488, and 13.277.
We compute what proportion of the 10000 $Q$'s is above each of these three critical values. I did
three separate simulations of size 10000 each and got the proportions 0.0495, 0.0536, and 0.0514
for the 0.9 critical value (7.779). I got 0.0222, 0.0247, and 0.0242 for the 0.95 critical value (9.488).
I got 0.0025, 0.0021, and 0.0029 for the 0.99 critical value (13.277). It looks like the test whose
nominal level is 0.1 has size closer to 0.05, while the test whose nominal level is 0.05 has size
closer to 0.025.

(b) For the power calculation, we perform exactly the same calculations with samples from the different
normal distribution. I performed three simulations of size 10000 each for this exercise also. I got
the proportions 0.5653, 0.5767, and 0.5796 for the 0.9 critical value (7.779). I got 0.4560, 0.4667,
and 0.4675 for the 0.95 critical value (9.488). I got 0.2224, 0.2280, and 0.2333 for the 0.99 critical
value (13.277).


7. (a) We need to compute the same $Q$ statistics as in Exercise 6(b) using samples from ten different
normal distributions. For each of the ten distributions, we also compute the 0.9, 0.95 and 0.99
sample quantiles of the 10000 $Q$ statistics. Here is a table of the simulated quantiles:

                  Quantile
 $\mu$   $\sigma^2$ |  0.9    0.95   0.99
 3.8   0.25 | 3.891  4.976  7.405
 3.8   0.80 | 4.295  5.333  8.788
 3.9   0.25 | 3.653  4.764  6.405
 3.9   0.80 | 4.142  5.133  7.149
 4.0   0.25 | 3.825  5.104  7.405
 4.0   0.80 | 4.554  5.541  8.635
 4.1   0.25 | 3.861  5.255  8.305
 4.1   0.80 | 4.505  5.658  8.637
 4.2   0.25 | 4.193  5.352  8.260
 4.2   0.80 | 4.087  4.981  7.677

The quantiles change a bit as the distributions change, but they are remarkably stable.

(b) Instead of starting with normal samples, we start with samples having a $t$ distribution as described
in the exercise. We compute the $Q$ statistic for each sample and see what proportion of our 10000
$Q$ statistics is greater than 5.2. In three simulations of this sort I got proportions of 0.12, 0.118,
and 0.124.


8. (a) The product of likelihood times prior is

\[
\exp\left(-\frac{u_0(\psi - \psi_0)^2}{2}
- \sum_{i=1}^{p}\tau_i\left[\beta_0 + \frac{n_i(\mu_i - \bar{y}_i)^2 + w_i + \lambda(\mu_i - \psi)^2}{2}\right]\right)
\times \lambda^{p/2 + \gamma_0 - 1}\exp(-\lambda\delta_0)\prod_{i=1}^{p}\tau_i^{\alpha_0 + [n_i + 1]/2 - 1},
\]

where $w_i = \sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2$ for $i = 1, \ldots, p$.

As a function of $\mu_i$, $\tau_i$, or $\psi$ this looks the same as it did in Example 12.5.6 except that $\lambda_0$ needs
to be replaced by $\lambda$ wherever it occurs. As a function of $\lambda$, it looks like the p.d.f. of the gamma
distribution with parameters $p/2 + \gamma_0$ and $\delta_0 + \sum_{i=1}^{p}\tau_i(\mu_i - \psi)^2/2$.

(b) I ran six Markov chains for 10000 iterations each, producing 60000 parameter vectors. The
requested posterior means and simulation standard errors were

Parameter       | $\mu_1$      $\mu_2$     $\mu_3$     $\mu_4$     $1/\tau_1$  $1/\tau_2$  $1/\tau_3$  $1/\tau_4$
Posterior mean  | 156.9     158.7    118.8    160.6    486.7   598.8   479.2   548.4
Sim. std. err.  | 0.009583  0.01969  0.02096  0.01322  0.8332  0.8286  0.5481  0.9372

The code at the end of this manual was modified following the suggestions in the exercise in order
to produce the above output. The same was done in Exercise 9.


9. (a) The product of likelihood times prior is

\[
\exp\left(-\frac{u_0(\psi - \psi_0)^2}{2}
- \sum_{i=1}^{p}\tau_i\left[\beta + \frac{n_i(\mu_i - \bar{y}_i)^2 + w_i + \lambda_0(\mu_i - \psi)^2}{2}\right]\right)
\times \beta^{p\alpha_0 + \epsilon_0 - 1}\exp(-\beta\phi_0)\prod_{i=1}^{p}\tau_i^{\alpha_0 + [n_i + 1]/2 - 1},
\]

where $w_i = \sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2$ for $i = 1, \ldots, p$.

As a function of $\mu_i$, $\tau_i$, or $\psi$ this looks the same as it did in Example 12.5.6 except that $\beta_0$ needs
to be replaced by $\beta$ wherever it occurs. As a function of $\beta$, it looks like the p.d.f. of the gamma
distribution with parameters $p\alpha_0 + \epsilon_0$ and $\phi_0 + \sum_{i=1}^{p}\tau_i$.

(b) I ran six Markov chains for 10000 iterations each, producing 60000 parameter vectors. The
requested posterior means and simulation standard errors were

Parameter       | $\mu_1$     $\mu_2$     $\mu_3$     $\mu_4$     $1/\tau_1$  $1/\tau_2$  $1/\tau_3$  $1/\tau_4$
Posterior mean  | 156.6    158.3    120.6    159.7    495.1   609.2   545.3   570.4
Sim. std. err.  | 0.01576  0.01836  0.02140  0.03844  0.4176  1.194   0.8968  0.7629


10. (a) The numerator of the likelihood ratio statistic is the maximum of the likelihood function over
all parameter values in the alternative hypothesis, while the denominator is the maximum of the
likelihood over all values in the null hypothesis. Both the numerator and denominator have a factor
of $\prod_{i=1}^{k}\binom{n_i}{X_i}$ that will divide out in the ratio, so we shall ignore these factors. In this example, the
maximum over the alternative hypothesis will be the maximum over all parameter values, so we
would set $p_j = X_j/n_j$ in the likelihood to get

\[
\prod_{j=1}^{k}\left(\frac{X_j}{n_j}\right)^{X_j}\left(1 - \frac{X_j}{n_j}\right)^{n_j - X_j}
= \prod_{j=1}^{k}\frac{X_j^{X_j}(n_j - X_j)^{n_j - X_j}}{n_j^{n_j}}.
\]

For the denominator, all of the $p_i$ are equal, hence the likelihood to be maximized is
$p^{X_1 + \cdots + X_k}(1 - p)^{n_1 + \cdots + n_k - X_1 - \cdots - X_k}$. This is maximized at $p = \sum_{j=1}^{k} X_j/\sum_{j=1}^{k} n_j$, to yield

\[
\left(\frac{\sum_{j=1}^{k} X_j}{\sum_{j=1}^{k} n_j}\right)^{\sum_{j=1}^{k} X_j}
\left(\frac{\sum_{j=1}^{k}(n_j - X_j)}{\sum_{j=1}^{k} n_j}\right)^{\sum_{j=1}^{k}(n_j - X_j)}.
\]

The ratio of these two maxima is a positive constant times the statistic stated in the exercise. The
likelihood ratio test rejects the null hypothesis when the statistic is greater than a constant.


(b) Call the likelihood ratio test statistic $T$. The distribution of $T$, under the assumption that $H_0$ is
true, that is, $p_1 = \cdots = p_k$, still depends on the common value of the $p_i$'s, call it $p$. If the sample
sizes are large, the distribution should not depend very much on $p$, but it will still depend on
$p$. Let $F_p(\cdot)$ denote the c.d.f. of $T$ when $p$ is the common value of all $p_i$'s. If we reject the null
hypothesis when $T > c$, the test will be of level $\alpha_0$ so long as

\[
1 - F_p(c) \le \alpha_0, \quad \text{for all } p. \tag{S.12.12}
\]

If $c$ satisfies (S.12.12) then all larger $c$ satisfy (S.12.12), so we want the smallest $c$ that satisfies
(S.12.12). Eq. (S.12.12) is equivalent to $F_p(c) \ge 1 - \alpha_0$ for all $p$, which, in turn, is equivalent to
$c \ge F_p^{-1}(1 - \alpha_0)$ for all $p$. The smallest $c$ that satisfies this last inequality is $c = \sup_p F_p^{-1}(1 - \alpha_0)$.
To approximate $c$ by simulation, proceed as follows. Pick a collection of reasonable values of $p$ and
a large number $v$ of simulations to perform. For each value of $p$, perform $v$ simulations as follows.
Simulate $k$ independent binomial random variables with parameters $n_i$ and $p$, and compute the
value of $T$. Sort the $v$ values of $T$ and approximate $F_p^{-1}(1 - \alpha_0)$ by the $(1 - \alpha_0)v$th sorted value.
Let $c$ be the largest of these values over the different chosen values of $p$. It should be clear that
the distribution of $T$ is the same for $p$ as it is for $1 - p$, so one need only check values of $p$ between
0 and 1/2.

(c) To compute the p-value, we first find the observed value $t$ of $T$, and then find $\sup_p \Pr(T \ge t)$ under
the assumption that each $p_i = p$ for $i = 1, \ldots, k$. In Table 2.1, the $X_i$ values are $X_1 = 22$,
$X_2 = 25$, $X_3 = 16$, $X_4 = 10$, while the sample sizes are $n_1 = 40$, $n_2 = 38$, $n_3 = 38$, $n_4 = 34$. The
observed value of $T$ is

\[
\frac{22^{22}\,18^{18}\,25^{25}\,13^{13}\,16^{16}\,22^{22}\,10^{10}\,24^{24}}{73^{73}\,77^{77}} = \exp(-202.17).
\]

A pilot simulation showed that the maximum over $p$ of $1 - F_p(t)$ occurs at $p = 0.5$, so a larger
simulation was performed with $p = 0.5$. The estimated p-value is 0.01255 with a simulation
standard error of 0.0039.
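
The observed value of $T$ is far too small to compute directly without care, but its logarithm is easy to evaluate. The Python sketch below works on the log scale with the counts quoted above. The function name `log_T` is hypothetical, and the form of the statistic is read off from the displayed observed value, with $X_j^{X_j}(n_j - X_j)^{n_j - X_j}$ terms in the numerator and the pooled terms in the denominator.

```python
from math import log

def log_T(x, n):
    """Logarithm of T for success counts x and sample sizes n (same length)."""
    failures = [nj - xj for xj, nj in zip(x, n)]
    # numerator: product of X_j^X_j and (n_j - X_j)^(n_j - X_j); 0^0 is treated as 1
    numerator = sum(v * log(v) for v in x + failures if v > 0)
    total_x, total_f = sum(x), sum(failures)
    return numerator - total_x * log(total_x) - total_f * log(total_f)

x = [22, 25, 16, 10]
n = [40, 38, 38, 34]
print(round(log_T(x, n), 2))
```

Working with logarithms also makes the simulation in part (b) straightforward, because the sorted values of $\log T$ are in the same order as the sorted values of $T$.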


11. (a) We shall use the same approach as in Exercise 12 of Sec. 12.6. Let the parameter be $\theta = (\mu, \sigma_1, \sigma_2)$
(where $\mu$ is the common value of $\mu_1 = \mu_2$). Each pair of parameter values $\theta$ and $\theta'$ that have the
same value of $\sigma_2/\sigma_1$ can be obtained from each other by multiplying $\mu$, $\sigma_1$ and $\sigma_2$ by the same
positive constant and adding some other constant to the resulting $\mu$. That is, there exist $a > 0$ and
$b$ such that $\theta' = (a\mu + b, a\sigma_1, a\sigma_2)$. If $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ have the distribution determined
by $\theta$, then $X_i' = aX_i + b$ for $i = 1, \ldots, m$ and $Y_j' = aY_j + b$ for $j = 1, \ldots, n$ have the distribution
determined by $\theta'$. We need only show that the statistic $V$ in (9.6.13) has the same value when it
is computed using the $X_i$'s and $Y_j$'s as when it is computed using the $X_i'$'s and $Y_j'$'s. It is easy to
see that the numerator of $V$ computed with the $X_i'$'s and $Y_j'$'s equals $a$ times the numerator of $V$
computed using the $X_i$'s and $Y_j$'s. The same is true of the denominator, hence $V$ has the same
value either way and it must have the same distribution when the parameter is $\theta$ as when the
parameter is $\theta'$.


(b) By the same reasoning as in part (a), the value of $\nu$ is the same whether it is calculated with
the $X_i$'s and $Y_j$'s or with the $X_i'$'s and $Y_j'$'s. Hence the distribution of $\nu$ (thought of as a random
variable before observing the data) depends on the parameter only through $\sigma_2/\sigma_1$.

(c) For each simulation with ratio $r$, we can simulate $\bar{X}_m$ having the standard normal distribution and
$S_X^2$ having the $\chi^2$ distribution with 9 degrees of freedom. Then simulate $\bar{Y}_n$ having the normal
distribution with mean 0 and variance $r^2$ and $S_Y^2$ equal to $r^2$ times a $\chi^2$ random variable with 10
degrees of freedom. Make the four random variables independent when simulating. Then compute
$V$ and $\nu$. Compute the three quantiles $T_\nu^{-1}(0.9)$, $T_\nu^{-1}(0.95)$ and $T_\nu^{-1}(0.99)$ and check whether $V$
is greater than each quantile. Our estimates are the proportions of the 10000 simulations in which
the value of $V$ is greater than each quantile. Here are the results from one of my simulations:


Probability 
r 0.9 0.95 0.99 


1.0 | 0.1013 0.0474 0.0079 
1.5 | 0.0976 0.0472 0.0088 
2.0 | 0.0979 0.0506 0.0093 
3.0 | 0.0973 0.0463 0.0110 
5.0 | 0.0962 0.0476 0.0117 
10.0 | 0.1007 0.0504 0.0113 


The upper tail probabilities are very close to their nominal values. 


12. I used the same simulations as in Exercise 11 but computed the statistic $U$ from (9.6.3) instead of $V$
and compared $U$ to the quantiles of the $t$ distribution with 19 degrees of freedom. The proportions are
below:


Probability 
r 0.9 0.95 0.99 


1.0 | 0.1016 0.0478 0.0086 
1.5 | 0.0946 0.0461 0.0090 
2.0 | 0.0957 0.0483 0.0089 
3.0 | 0.0929 0.0447 0.0112 
5.0 | 0.0926 0.0463 0.0124 
10.0 | 0.0964 0.0496 0.0121 


These values are also very close to the nominal values. 


13. (a) The fact that $E(\hat\beta_1) = \beta_1$ depends only on the fact that each $Y_i$ has mean $\beta_0 + x_i\beta_1$. It does not
depend on the distribution of $Y_i$ (as long as the distribution has finite mean). Since $\hat\beta_1$ is a linear
function of $Y_1,\ldots,Y_n$, its variance depends only on the variances of the $Y_i$'s (and the fact that
they are independent). It doesn't depend on any other feature of the distribution. Indeed, we can
write
$$\hat\beta_1 = \sum_{i=1}^n a_i Y_i,$$
where $a_i = (x_i - \bar{x}_n)/\sum_{j=1}^n (x_j - \bar{x}_n)^2$. Then $\mathrm{Var}(\hat\beta_1) = \sum_{i=1}^n a_i^2\,\mathrm{Var}(Y_i)$. This depends only on the
variances of the $Y_i$'s, which do not depend on $\beta_0$ or $\beta_1$.


(b) Let $T$ have the $t$ distribution with $k$ degrees of freedom. Then $Y_i$ has the same distribution as
$\beta_0 + \beta_1 x_i + \sigma T$, whose variance is $\sigma^2\,\mathrm{Var}(T)$. Hence, $\mathrm{Var}(Y_i) = \sigma^2\,\mathrm{Var}(T)$. It follows that
$$\mathrm{Var}(\hat\beta_1) = \sigma^2\,\mathrm{Var}(T) \sum_{i=1}^n a_i^2.$$
Let $v = \mathrm{Var}(T) \sum_{i=1}^n a_i^2$.

(c) There are several possible simulation schemes to estimate $v$. The simplest might be to notice that
$$\sum_{i=1}^n a_i^2 = \frac{1}{\sum_{j=1}^n (x_j - \bar{x}_n)^2},$$
so that we only need to estimate $\mathrm{Var}(T)$. This could be done by simulating lots of $t$ random
variables with $k$ degrees of freedom and computing the sample variance. In fact, we can actually
calculate $v$ in closed form if we wish. According to Exercise 1 in Sec. 8.4, $\mathrm{Var}(T) = k/(k-2)$.
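
The scheme in part (c) can be sketched in a few lines of R. This is only an illustration: the degrees of freedom k = 10 and the x values are hypothetical, chosen so that the simulated and closed-form answers can be compared.

```r
# Sketch only: estimate Var(T) by simulation and compare with k/(k-2).
# k and x are hypothetical values, not taken from any exercise.
set.seed(13)
k <- 10
nsim <- 100000

t_draws <- rt(nsim, df = k)     # t random variables with k degrees of freedom
var_sim <- var(t_draws)         # simulation estimate of Var(T)
var_exact <- k / (k - 2)        # closed form from Exercise 1 in Sec. 8.4

x <- c(1, 2, 4, 7, 11)          # hypothetical predictor values
sum_a2 <- 1 / sum((x - mean(x))^2)

v_sim <- var_sim * sum_a2       # simulated v
v_exact <- var_exact * sum_a2   # closed-form v
c(v_sim = v_sim, v_exact = v_exact)
```

The sample variance of $t$ draws converges slowly when $k$ is small (the fourth moment is infinite for $k \le 4$), which is one more reason to prefer the closed form when it is available.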


14. As we noted in Exercise 13(c), the value of v is 


= 3.14 x 107°. 


15. (a) We are trying to approximate the value $a$ that makes $\ell(a) = E[L(\theta,a) \mid x]$ the smallest. We
have a sample $\theta^{(1)},\ldots,\theta^{(v)}$ from the posterior distribution of $\theta$, so we can approximate $\ell(a)$ by
$\hat\ell(a) = \sum_{i=1}^v L(\theta^{(i)},a)/v$. We could then do a search through many values of $a$ to find the value that
minimizes $\hat\ell(a)$. We could use either brute force or mathematical software for minimization. Of
course, we would only have the value of $a$ that minimizes $\hat\ell(a)$ rather than $\ell(a)$.


(b) To compute a simulation standard error, we could draw several (say $k$) samples from the posterior
(or split one large sample into $k$ smaller ones) and let $Z_i$ be the value of $a$ that minimizes the $i$th
version of $\hat\ell$. Then compute $S$ in Eq. (12.2.2) and let the simulation standard error be $S/k^{1/2}$.
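
Parts (a) and (b) can be sketched as follows. Everything specific here is an assumption made for illustration: the posterior sample is standard normal, the loss is an asymmetric absolute-error loss, and the helper names loss, ell_hat, and minimize_ell are hypothetical. The sketch also assumes that $S$ uses divisor $k$, which matches the standard-error formula in the mcmcse function later in this section.

```r
# Sketch only: minimize the simulated posterior expected loss and attach
# a simulation standard error. The loss and the "posterior" are hypothetical.
set.seed(15)
loss <- function(theta, a) ifelse(theta > a, 3 * (theta - a), a - theta)

theta <- rnorm(10000)                       # stand-in posterior sample

# Approximate expected loss: average of L(theta^(i), a) over the sample
ell_hat <- function(a, th) mean(loss(th, a))

# Use one-dimensional numerical minimization in place of brute force
minimize_ell <- function(th) {
  optimize(ell_hat, interval = range(th), th = th)$minimum
}
a_hat <- minimize_ell(theta)

# Part (b): split the sample into k pieces and minimize within each piece
k <- 5
pieces <- split(theta, rep(1:k, length.out = length(theta)))
z <- sapply(pieces, minimize_ell)
S <- sqrt(mean((z - mean(z))^2))            # divisor k (assumed)
sim_se <- S / sqrt(k)
c(a_hat = a_hat, sim_se = sim_se)
```

For this particular loss the minimizer is the 0.75 quantile of the posterior sample, which gives an independent check on the numerical search.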


16. (In the displayed formula, on the right side of the = sign, all $\theta$'s should have been $\mu$'s.) The posterior
hyperparameters are all given in Example 12.5.2, so we can simulate as many $\mu$ values as we want to
estimate the posterior mean of $L(\theta,a)$. We simulated 100000 $t$ random variables with 22 degrees of
freedom and multiplied each one by 15.214 and added 183.95 to get a sample of $\mu$ values. For each
value of $a$ near 183.95, we computed $\hat\ell(a)$ and found that $a = 182.644$ gave the smallest value. We then
repeated the entire exercise for a total of five times. The other four $a$ values were 182.641, 182.548,
182.57, and 182.645. The simulation standard error is then 0.0187.
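
The standard error quoted above can be checked directly from the five minimizing values. The sketch assumes that $S$ is computed with divisor $k$ (here $k = 5$); the variable names are mine.

```r
# Check of the arithmetic: simulation standard error S / sqrt(k)
# from the five minimizing values of a reported in Exercise 16.
a_vals <- c(182.644, 182.641, 182.548, 182.57, 182.645)
k <- length(a_vals)
S <- sqrt(mean((a_vals - mean(a_vals))^2))   # divisor k (assumed)
sim_se <- S / sqrt(k)
round(sim_se, 4)                             # 0.0187
```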

R Code For Two Text Examples


If you are not using R or if you are an expert, you should not bother reading this section.

The code below (with comments that start #) is used to perform the calculations in Examples 12.5.6 
and 12.5.7 in the text. The reason that the code appears to be so elaborate is that I realized that Exercises 8 
and 9 in Sec. 12.7 asked to perform essentially the same analysis, each with one additional parameter. 
Modifying the code below to handle those exercises is relatively straightforward. Significantly less coding 
would be needed if one were going to perform the analysis only once. For example, one would need the three 
functions that simulate each parameter given the others, plus the function called hierchain that could be 
used both for burn-in and the later runs. The remaining calculations could be done by typing some additional 
commands at the R prompt or in a text file to be sourced. 

In the first printing, there was an error in these examples. For some reason (my mistake, obviously) the
$w_i$ values were recorded in reverse order when the simulations were performed. That is, $w_4$ was used as if
it were $w_1$, $w_3$ was used as if it were $w_2$, etc. The $\bar{y}_i$ and $n_i$ values were in the correct order; otherwise the
error could have been fixed by reordering the hot dog type names, but no such luck. Because the $w_i$ were
such different numbers, the effect on the numerical output was substantial. Most notably, the means of the
$1/\tau_i$ are not nearly so different as stated in the first printing.

The data file hotdogs.csv contains four columns separated by commas with the data in Table 11.15 
along with a header row: 


Beef,Meat,Poultry,Specialty
186,173,129,155
181,191,132,170
176,182,102,114
149,190,106,191
184,172,94,162
190,147,102,146
158,146,87,140
139,139,99,187
175,175,107,180
148,136,113,
152,179,135,
111,153,142,
141,107,86,
153,195,143,
190,135,152,
157,140,146,
131,138,144,
149,,,
135,,,
132,,,


The commas with nothing after them indicate that the data in the next column have run out already, and
NA (not available) values will be produced in R. Most R functions have sensible ways to deal with NA values,
generally by including the optional argument na.rm=T or something similar. By default, R uses all values
(including NA's) to compute things like mean or var. Hence, the result will be NA if one does not change
the default.
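
A tiny illustration of this behavior, using a made-up vector with one missing value:

```r
# Sketch only: how NA values affect mean() with and without na.rm.
x <- c(186, 181, NA)          # hypothetical column with one missing entry
m_default <- mean(x)          # NA, because the NA is used in the sum
m_clean <- mean(x, na.rm = TRUE)   # 183.5, computed from the two observed values
n_obs <- sum(!is.na(x))       # 2, the number of non-missing values
```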

First, we list some code that sets up the data, summary statistics, and prior hyperparameters. The lines
that appear below were part of a file hotdogmcmc-example.r. They were executed by typing
source("hotdogmcmc-example.r")
at the R prompt in a command window. The source function reads a file of text and treats each line as if
it had been typed at the R prompt.


# Read the data from a comma-separated file with a header row.
hotdogs=read.table("hotdogs.csv",header=T,sep=",")
# Compute the summary statistics
# First, the sample sizes: how many are not NA?
n=apply(hotdogs,2,function(x){sum(!is.na(x))})
# Next, the sample means: remember to remove the NA values
ybar=apply(hotdogs,2,mean,na.rm=T)
# Next, the wi values (sum of squared deviations from sample mean)
w=apply(hotdogs,2,var,na.rm=T)*(n-1)
# Set the prior hyperparameters:
hyp=list(lambda0=1,alpha0=1,beta0=0.1,u0=0.001,psi0=170)
# Set the initial values of parameters. These will be perturbed to be
# used as starting values for independent Markov chains.
tau=(n-1)/w
psi=(hyp$psi0*hyp$u0+hyp$lambda0*sum(tau*ybar))/(hyp$u0+hyp$lambda0*sum(tau))
mu=(n*ybar+hyp$lambda0*psi)/(n+hyp$lambda0)
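
For readers who want to try the summary-statistic step without creating hotdogs.csv, here is a self-contained sketch that reads the same rows from a character string. Using read.table with the text argument is my substitution for reading the file; the summary calculations mirror the lines above.

```r
# Sketch only: the same data read from a string instead of hotdogs.csv.
csv_text <- "Beef,Meat,Poultry,Specialty
186,173,129,155
181,191,132,170
176,182,102,114
149,190,106,191
184,172,94,162
190,147,102,146
158,146,87,140
139,139,99,187
175,175,107,180
148,136,113,
152,179,135,
111,153,142,
141,107,86,
153,195,143,
190,135,152,
157,140,146,
131,138,144,
149,,,
135,,,
132,,,"
hotdogs <- read.table(text = csv_text, header = TRUE, sep = ",")

# Sample sizes, sample means, and sums of squared deviations per column
n <- apply(hotdogs, 2, function(x) sum(!is.na(x)))
ybar <- apply(hotdogs, 2, mean, na.rm = TRUE)
w <- apply(hotdogs, 2, var, na.rm = TRUE) * (n - 1)
n
```

The sample sizes come out as 20, 17, 17, and 9, one per hot dog type, which is a quick way to confirm that the empty fields were read as NA.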


Next, we list a series of functions that perform major parts of the calculation. The programs are written 
specifically for these examples, using variable names like ybar, n, w, mu, tau, and psi so that the reader can 
easily match what the programs are doing to the example. If one had wished to have a general hierarchical 
model program, one could have made the programs more generic at the cost of needing special routines to 
deal with the particular structure of the examples. Each of these functions is stored in a text file, and the 
source function is used to read the lines which in turn define the function for use by R. That is, after each
file has been "sourced," the function whose name appears to the left of the = sign becomes available for
use. Its arguments appear in parentheses after the word function on the first line.

First, we have the functions that simulate the next values of the parameters in each Markov chain: 


mugen=function(i,tau,psi,n,ybar,w,hyp){
#
# Simulate a new mu[i] value
#
  (n[i]*ybar[i]+hyp$lambda0*psi)/(n[i]+hyp$lambda0)+rnorm(1,0,1)/sqrt(tau[i]*
    (n[i]+hyp$lambda0))
}

taugen=function(i,mu,psi,n,ybar,w,hyp){
#
# Simulate a new tau[i] value
#
  rgamma(1,hyp$alpha0+0.5*(n[i]+1))/(hyp$beta0+0.5*(w[i]+n[i]*(mu[i]-ybar[i])^2+
    hyp$lambda0*(mu[i]-psi)^2))
}

psigen=function(mu,tau,n,ybar,w,hyp){
#
# Simulate a new psi value
#
  (hyp$psi0*hyp$u0+hyp$lambda0*sum(tau*mu))/(hyp$u0+hyp$lambda0*sum(tau))+
    rnorm(1,0,1)/sqrt(hyp$u0+hyp$lambda0*sum(tau))
}


Next is the function that does burn-in and computes the F statistics described in the text. If the F
statistics are too large, this function would have to be run again from the start with more burn-in. One could 
rewrite the function to allow it to start over from the end of the previous burn-in if one wished. (One would 
have to preserve the accumulated means and sums of squared deviations.) 
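
The accumulated means and sums of squared deviations are updated one simulation at a time, so the burn-in draws themselves never need to be stored. A minimal sketch of that update on a made-up vector (the names acca and accv echo the "...a" and "...v" suffixes used in the function below):

```r
# Sketch only: one-pass update of a running mean and a running sum of
# squared deviations, checked against mean() and var().
set.seed(7)
x <- rnorm(50)          # hypothetical sequence of simulated values
acca <- 0               # accumulated mean
accv <- 0               # accumulated sum of squared deviations
for (i in seq_along(x)) {
  # Update the sum of squares first, while acca is still the old mean
  accv <- accv + (i - 1) * (x[i] - acca)^2 / i
  acca <- acca + (x[i] - acca) / i
}
c(mean = acca, var = accv / (length(x) - 1))
```

The order of the two updates matters: the sum of squared deviations must use the mean from before the new value was added.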


burnchain=function(nburn,start,nchain,mu,tau,psi,n,ybar,w,hyp,stand){
#
# Perform "nburn" burn-in for "nchain" Markov chains and check the F statistics
# starting after "start". The initial values are "mu", "tau", "psi"
# and are perturbed by "stand" times random variables. The data are
# "n", "ybar", "w". The prior hyperparameters are "hyp".
#
# ngroup is the number of groups
  ngroup=length(ybar)
# Set up the perturbed starting values for the different chains.
# First, store 0 in all values
  muval=matrix(0,nchain,ngroup)
  tauval=muval
  psival=rep(0,nchain)
# Next, for each chain, perturb the starting values using random
# normals or lognormals
  for(l in 1:nchain){
    muval[l,]=mu+stand*rnorm(ngroup)/sqrt(tau)
    tauval[l,]=tau*exp(rnorm(ngroup)*stand)
    psival[l]=psi+stand*rnorm(1)/sqrt(hyp$u0)
  }
# Save the starting vectors for all chains just so we can see what
# they were.
  startvec=cbind(muval,tauval,psival)
# The next matrices/vectors will store the accumulated means "...a" and sums
# of squared deviations "...v" so that we don't need to store all of the
# burn-in simulations when computing the F statistics.
# See Exercise 23(b) in Sec. 7.10 of the text.
  muacca=matrix(0,nchain,ngroup)
  tauacca=muacca
  psiacca=rep(0,nchain)
  muaccv=muacca
  tauaccv=muacca
  psiaccv=psiacca
# The next matrix will store the burn-in F statistics so that we can
# see if we need more burn-in.
  fs=matrix(0,nburn-start+1,2*ngroup+1)
# Loop through the burn-in
  for(i in 1:nburn){
# Loop through the chains
    for(l in 1:nchain){
# Loop through the coordinates
      for(j in 1:ngroup){
# Generate the next mu
        muval[l,j]=mugen(j,tauval[l,],psival[l],n,ybar,w,hyp)
# Accumulate the average mu (muacca) and the sum of squared deviations (muaccv)
        muaccv[l,j]=muaccv[l,j]+(i-1)*(muval[l,j]-muacca[l,j])^2/i
        muacca[l,j]=muacca[l,j]+(muval[l,j]-muacca[l,j])/i
# Do the same for tau
        tauval[l,j]=taugen(j,muval[l,],psival[l],n,ybar,w,hyp)
        tauaccv[l,j]=tauaccv[l,j]+(i-1)*(tauval[l,j]-tauacca[l,j])^2/i
        tauacca[l,j]=tauacca[l,j]+(tauval[l,j]-tauacca[l,j])/i
      }
# Do the same for psi
      psival[l]=psigen(muval[l,],tauval[l,],n,ybar,w,hyp)
      psiaccv[l]=psiaccv[l]+(i-1)*(psival[l]-psiacca[l])^2/i
      psiacca[l]=psiacca[l]+(psival[l]-psiacca[l])/i
    }
# Once we have enough burn-in, start computing the F statistics (see
# p. 826 in the text)
    if(i>=start){
      mub=i*apply(muacca,2,var)
      muw=apply(muaccv,2,mean)/(i-1)
      taub=i*apply(tauacca,2,var)
      tauw=apply(tauaccv,2,mean)/(i-1)
      psib=i*var(psiacca)
      psiw=mean(psiaccv)/(i-1)
      fs[i-start+1,]=c(mub/muw,taub/tauw,psib/psiw)
    }
  }
# Return a list with useful information: the last value of each
# parameter for all chains, the F statistics, the input information,
# and the starting vectors. The return value will contain enough
# information to allow us to start all the Markov chains and
# simulate them as long as we wish.
  list(mu=muval,tau=tauval,psi=psival,fstat=fs,nburn=nburn,start=start,
    n=n,ybar=ybar,w=w,hyp=hyp,nchain=nchain,startvec=startvec)
}
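
To see what a single F statistic measures, here is a stripped-down sketch for one scalar parameter. The "chains" are just independent standard normal draws (a hypothetical stand-in for well-mixed Markov chains), so the statistic should be near 1 and far below the 1 + 0.44m threshold used later in this section.

```r
# Sketch only: between-chain over within-chain variance for one parameter.
# The chains are hypothetical iid normal draws, not output of burnchain.
set.seed(8)
m <- 100                                  # simulations per chain
k <- 6                                    # number of chains
draws <- matrix(rnorm(m * k), m, k)       # one column per chain

B <- m * var(colMeans(draws))             # m times the variance of the chain means
W <- mean(apply(draws, 2, var))           # average of the within-chain variances
Fstat <- B / W
c(F = Fstat, threshold = 1 + 0.44 * m)
```

In burnchain the same two quantities are built from the accumulated means and sums of squared deviations, so no matrix of past draws is needed.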


A similar, but simpler, function will simulate a single chain after we have finished burn-in: 


hierchain=function(nsim,mu,tau,psi,n,ybar,w,hyp){
#
# Run a Markov chain for "nsim" simulations from initial values "mu",
# "tau", "psi"; the data are "n", "ybar", "w"; the prior
# hyperparameters are "hyp".
#
# ngroup is the number of groups
  ngroup=length(ybar)
# Set up matrices to hold the simulated parameter values
  psiex=rep(0,nsim)
  muex=matrix(0,nsim,ngroup)
  tauex=muex
# Loop through the simulations
  for(i in 1:nsim){
# Loop through the coordinates
    for(j in 1:ngroup){
# Generate the next value of mu
      temp=mugen(j,tau,psi,n,ybar,w,hyp)
      mu[j]=temp
# Store the value of mu
      muex[i,j]=temp
# Do the same for tau
      temp=taugen(j,mu,psi,n,ybar,w,hyp)
      tau[j]=temp
      tauex[i,j]=temp
    }
# Do the same for psi
    temp=psigen(mu,tau,n,ybar,w,hyp)
    psi=temp
    psiex[i]=temp
  }
# Return a list with useful information: The simulated values
  list(mu=muex,tau=tauex,psi=psiex)
}


Next, we have a function that will run several independent chains and put the results together. It calls 
the previous function once for each chain. 


stackchains=function(burn,nsim){
#
# Starting from the information in "burn", obtained from "burnchain",
# run "nsim" additional simulations for each chain and stack the
# results on top of each other. The
# results from chain i can be extracted by using rows
# (i-1)*nsim+1 to i*nsim of each parameter matrix
#
# Set up storage for parameter values
  muex=NULL
  tauex=NULL
  psiex=NULL
# Loop through the chains
  for(l in 1:burn$nchain){
# Extract the last burn-in parameter value for chain l
    mu=burn$mu[l,]
    tau=burn$tau[l,]
    psi=burn$psi[l]
# Run the chain nsim times
    temp=hierchain(nsim,mu,tau,psi,burn$n,burn$ybar,burn$w,burn$hyp)
# Extract the simulated values from each chain and stack them.
    muex=rbind(muex,temp$mu)
    tauex=rbind(tauex,temp$tau)
    psiex=c(psiex,temp$psi)
  }
# Return a list with useful information: the simulated values, the
# number of simulations per chain, and the number of chains.
  list(mu=muex,tau=tauex,psi=psiex,nsim=nsim,nchain=burn$nchain)
}


The calculations done in Example 12.5.6 begin by applying the above functions and then manipulating 
the output. The following commands were typed at the R command prompt >. Notice that some of them 
produce output that appears in the same window in which the typing is done. The summary statistics appear 
in Table 12.4 in the text (after correcting the errors). 


> # Do the burn-in
> hotdog.burn=burnchain(100,100,6,mu,tau,psi,n,ybar,w,hyp,2)
> # Note that the F statistics are all less than 1+0.44m=45.
> hotdog.burn$fstat
         [,1]      [,2]     [,3]     [,4]      [,5]      [,6]     [,7]     [,8]
[1,] 0.799452 0.7756733 1.464278 1.831631 0.9673807 0.4030658 1.161147 2.727503
         [,9]
[1,] 0.548359
> # Now run each chain 10000 more times.
> hotdog.mcmc=stackchains(hotdog.burn,10000)
> # Obtain the data for the summary table in the text
> apply(hotdog.mcmc$mu,2,mean)
[1] 156.5894 158.2559 120.5360 159.5841
> sqrt(apply(hotdog.mcmc$mu,2,var))
[1] 4.893067 5.825234 5.552140 7.615332
> apply(1/hotdog.mcmc$tau,2,mean)
[1] 495.6348 608.4955 542.8819 568.2482
> sqrt(apply(1/hotdog.mcmc$tau,2,var))
[1] 166.0203 221.1775 201.6250 307.3618
> mean(hotdog.mcmc$psi)
[1] 151.0273
> sqrt(var(hotdog.mcmc$psi))
[1] 11.16116


Next, we source a file that computes values that we can use to assess how similar/different the four groups 
of hot dogs are. 


# Compute the six ratios of precisions (or variances)
hotdog.ratio=cbind(hotdog.mcmc$tau[,1]/hotdog.mcmc$tau[,2],
  hotdog.mcmc$tau[,1]/hotdog.mcmc$tau[,3],hotdog.mcmc$tau[,1]/hotdog.mcmc$tau[,4],
  hotdog.mcmc$tau[,2]/hotdog.mcmc$tau[,3],hotdog.mcmc$tau[,2]/hotdog.mcmc$tau[,4],
  hotdog.mcmc$tau[,3]/hotdog.mcmc$tau[,4])
# For each simulation, find the maximum ratio. We need to include one over
# each ratio also.
hotdog.rmax=apply(cbind(hotdog.ratio,1/hotdog.ratio),1,max)
# Compute the six differences between means.
hotdog.diff=cbind(hotdog.mcmc$mu[,1]-hotdog.mcmc$mu[,2],
  hotdog.mcmc$mu[,2]-hotdog.mcmc$mu[,3],hotdog.mcmc$mu[,3]-hotdog.mcmc$mu[,4],
  hotdog.mcmc$mu[,4]-hotdog.mcmc$mu[,1],hotdog.mcmc$mu[,1]-hotdog.mcmc$mu[,3],
  hotdog.mcmc$mu[,2]-hotdog.mcmc$mu[,4])
# For each simulation, find the minimum, maximum and average absolute
# differences.
hotdog.min=apply(abs(hotdog.diff),1,min)
hotdog.max=apply(abs(hotdog.diff),1,max)
hotdog.ave=apply(abs(hotdog.diff),1,mean)
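
With more than four groups, typing every pair by hand gets tedious. One possible alternative, sketched here on made-up matrices, is to let combn enumerate the pairs. This is not how the file above was written, and the matrices tau and mu below are hypothetical stand-ins for the simulated parameter values.

```r
# Sketch only: build all pairwise ratios and differences with combn().
# tau and mu are hypothetical 5 x 4 matrices (rows = simulations).
set.seed(9)
tau <- matrix(rgamma(20, shape = 2), 5, 4)
mu <- matrix(rnorm(20, 150, 10), 5, 4)

pairs <- combn(4, 2)      # 2 x 6 matrix; each column is one pair of groups

# One column per pair: ratio of precisions, and absolute difference of means
ratio <- apply(pairs, 2, function(p) tau[, p[1]] / tau[, p[2]])
adiff <- abs(apply(pairs, 2, function(p) mu[, p[1]] - mu[, p[2]]))

# Per-simulation summaries, as in the file above
rmax <- apply(cbind(ratio, 1 / ratio), 1, max)
dmin <- apply(adiff, 1, min)
dmax <- apply(adiff, 1, max)
dave <- apply(adiff, 1, mean)
```

Because each ratio is paired with its reciprocal, rmax can never fall below 1.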


Using the results of the above calculations, we now type commands at the prompt that answer various
questions. First, what proportion of the time is one of the ratios of standard deviations at least 1.5 (ratio
of variances at least 2.25)? In this calculation the vector hotdog.rmax>2.25 has coordinates that are either
TRUE (1) or FALSE (0) depending on whether the maximum ratio is greater than 2.25 or not. The mean is
then the proportion of TRUEs.


> mean(hotdog.rmax>2.25)
[1] 0.3982667 
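
The same idiom on a made-up vector of four maximum ratios, to show how the logical values turn into a proportion:

```r
# Sketch only: the mean of a logical vector is the proportion of TRUEs.
rmax <- c(1.8, 2.6, 3.1, 2.0)   # hypothetical maximum ratios
exceeds <- rmax > 2.25          # FALSE TRUE TRUE FALSE
prop <- mean(exceeds)           # 0.5, since two of the four exceed 2.25
```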


Next, compute the 0.01 quantile of the maximum absolute difference between the means, the
median of the minimum difference, and the 0.01 quantile of the average difference. In 99% of the simulations,
the difference was greater than the 0.01 quantile.


> quantile(hotdog.max,0.01)
     1%
26.3452
> median(hotdog.min)
[1] 2.224152
> quantile(hotdog.ave,0.01)
      1%
13.77761


In Example 12.5.7, we needed to simulate a pair of observations $(Y_1, Y_3)$ from each parameter vector and
then approximate the 0.05 and 0.95 quantiles of the distribution of $Y_1 - Y_3$ for a prediction interval. The
next function allows one to compute a general function of the parameters and find simulation standard errors
using Eq. (12.5.1), that is $S/k^{1/2}$.


mcmcse=function(simobj,func,entire=FALSE){
#
# Start with the result of a simulation "simobj", compute a vector function
# "func" from each chain, and then compute formula (12.5.1) for each
# coordinate as well as the covariance matrix. If "entire" is TRUE,
# it also computes the function value on the entire parameter
# matrix. This may differ from the average over the chains if "func"
# is not an average and/or if it does additional simulation.
# Also computes the average of the "func"
# values. The function "func" must take as arguments matrices of mu,
# tau, and psi values with each row from a single simulation. It
# must return a real vector. For example, if you want two
# quantiles of the distribution of f(Yi,Yj) where Yi comes from group
# i and Yj comes from group j, func should loop through the rows of
# its input and simulate a pair (Yi,Yj) for each parameter. Then it
# should return the appropriate sample quantiles of the simulated
# values of f(Yi,Yj).
#
# k is the number of chains, nsim the number of simulations
  k=simobj$nchain
  nsim=simobj$nsim
# Loop through the chains
  for(i in 1:k){
# Extract the parameters for chain i (rows (i-1)*nsim+1 to i*nsim)
    mu=simobj$mu[((i-1)*nsim+1):(i*nsim),]
    tau=simobj$tau[((i-1)*nsim+1):(i*nsim),]
    psi=simobj$psi[((i-1)*nsim+1):(i*nsim)]
# Compute the function value based on the parameters of chain i
    if(i==1){
      valf=func(mu,tau,psi)
    }else{
      valf=rbind(valf,func(mu,tau,psi))
    }
  }
# p is how many functions were computed
  p=ncol(valf)
# compute the average of each function
  ave=apply(valf,2,mean)
# compute formula (12.5.1) for each function
  se=sqrt(apply(valf,2,var)*(k-1))/k
#
# Return the average function value, formula (12.5.1), and covariance
# matrix. The covariance matrix can be useful if you want to
# compute a further function of the output and then compute a
# simulation standard error for that further function. Also computes
# the function on the entire parameter set if "entire=TRUE".
  if(entire){
    list(ave=ave,se=se,covmat=cov(valf)*(k-1)/k^2,
      entire=func(simobj$mu,simobj$tau,simobj$psi))
  }else{
    list(ave=ave,se=se,covmat=cov(valf)*(k-1)/k^2)
  }
}


The specific function func used in Example 12.5.7 is: 


hotdog13=function(mu,tau,psi){
#
# Compute the 0.05 and 0.95 quantiles of predictive distribution of Y1-Y3
#
  n=nrow(mu)
# Make a place to store the differences.
  vals=rep(0,n)
# Loop through the parameter vectors.
  for(i in 1:n){
# Simulate a difference.
    vals[i]=rnorm(1)/sqrt(tau[i,1])+mu[i,1]-(rnorm(1)/
      sqrt(tau[i,3])+mu[i,3])
  }
# Return the desired quantiles
  quantile(vals,c(0.05,0.95))
}
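
Because rnorm is vectorized, the loop can be avoided. The following variant is a sketch of that idea, not the function used for the results in this section. The name hotdog13v and the small test inputs are hypothetical. Since the random numbers are drawn in a different order, its output will not match hotdog13 draw for draw, only in distribution.

```r
# Sketch only: a vectorized variant of the predictive-quantile function.
hotdog13v <- function(mu, tau, psi) {
  n <- nrow(mu)
  # Simulate all n values of Y1 and Y3 at once, then take differences
  y1 <- rnorm(n) / sqrt(tau[, 1]) + mu[, 1]
  y3 <- rnorm(n) / sqrt(tau[, 3]) + mu[, 3]
  quantile(y1 - y3, c(0.05, 0.95))
}

# Hypothetical inputs: 1000 parameter vectors for four groups
set.seed(12)
mu <- cbind(rnorm(1000, 157, 5), 0, rnorm(1000, 120, 5), 0)
tau <- matrix(1 / 500, 1000, 4)
q <- hotdog13v(mu, tau, psi = rep(150, 1000))
q
```

The argument psi is unused here, as in hotdog13, but it is kept so that the function has the signature that mcmcse expects.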


Finally, we use the above functions to compute the prediction interval in Example 12.5.7 along with the 
simulation standard errors. 


> hotdog.pred=mcmcse(hotdog.mcmc,hotdog13,T)
> hotdog.pred
$ave
       5%       95%
-18.57540  90.20092

$se
       5%       95%
0.2228034 0.4345629

$covmat
            5%        95%
5%  0.04964136 0.07727458
95% 0.07727458 0.18884493

$entire
       5%       95%
-18.49283  90.62661

The final line hotdog.pred$entire gives the prediction interval based on the entire collection of 60,000
simulations. The one listed as hotdog.pred$ave is the average of the six intervals based on the six independent
Markov chains. There is not much difference between them. The simulation standard errors show up as
hotdog.pred$se. Remember that the numbers in the first printing don't match these because of the error
mentioned earlier.


